Question: importing gzipped files without having them decompressed
Asked 4.5 years ago by Wolfgang Maier (Germany):

Hi,
I am regularly uploading huge gzipped files (WGS fastq data) to our local instance of Galaxy. The software that we have wrapped in Galaxy can deal with gzipped fastq data, so there is no need to decompress them.
By default, gzipped files will get extracted during the import, but since these are large files I'm importing them as linked datasets anyway, which also prevents them from being decompressed.
However, Galaxy still tries to figure out the format of the imported datasets, so even though I am just trying to generate links, it inspects the contents of the file to auto-detect the format, which makes the import very slow.
In the end, it decides that it can't do anything with the dataset and sets its format to "data", but it also strips the .gz extension from the dataset name.
The naming issue can be fixed afterwards by editing the dataset information, and "data" is fine for me as a format, but is it somehow possible to:

1) declare the file format as "data" right away instead of going through auto-detection? ("data" is not offered as a format in the import dialogue, and declaring some arbitrary type just to avoid auto-detection seems odd)

2) prevent Galaxy from stripping the '.gz' from the file name?

or alternatively:

Could you define a new format that when selected prevents Galaxy from doing anything with the dataset and just makes it import the file as is?

Thanks for your help,
Wolfgang

 

Tags: data format gzip
Answered 4.5 years ago by Bjoern Gruening (Germany):

Hi Wolfgang,

You can add the following line to the datatypes_conf.xml file in your Galaxy root directory:

<datatype extension="gzip" type="galaxy.datatypes.binary:Binary" mimetype="application/octet-stream" subclass="True" display_in_upload="true" />

After restarting Galaxy, you should be able to select gzip as a binary format in the upload dialogue.
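In case it is unclear where the line goes: as far as I know, datatype declarations sit inside the <registration> element of datatypes_conf.xml. Below is a minimal sketch of the placement. The attributes on <registration> are an assumption based on the sample file shipped with Galaxy and may differ in your version, so keep whatever your file already has.

```xml
<?xml version="1.0"?>
<datatypes>
  <!-- Attribute values below are assumed from Galaxy's sample config; do not change the ones in your own file. -->
  <registration converters_path="lib/galaxy/datatypes/converters" display_path="display_applications">
    <!-- ... existing datatype entries stay unchanged ... -->

    <!-- New entry from this answer: a generic binary subclass that can be selected in the upload dialogue -->
    <datatype extension="gzip" type="galaxy.datatypes.binary:Binary" mimetype="application/octet-stream" subclass="True" display_in_upload="true" />
  </registration>
  <!-- ... the rest of the file (for example the sniffers section) stays unchanged ... -->
</datatypes>
```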

Hope that helps,

Bjoern


Hi Björn,

I just tried your suggestion and added

<datatype extension="gzip" type="galaxy.datatypes.binary:Binary" mimetype="application/octet-stream" subclass="True" display_in_upload="true" />

to our datatypes_conf.xml.

It works reasonably well for me, although it seems to be only a partial solution. I can now choose the new gzip type as the data format during an import, and this bypasses the format auto-detection, so the upload of linked files is lightning-fast now :)

However, Galaxy still removes the .gz extension from the name, so I still have to add it back manually.

Also (as I said, I don't usually do this, but I tried it now), if I add the file to Galaxy as a copy instead of linking it, it still gets extracted. Worse than before, if I declare the format to be gzip, Galaxy now claims that the extracted file is in binary format.

Any ideas on how to improve this situation?

Best,
Wolfgang

(Comment written 4.5 years ago by Wolfgang Maier)

Hi Wolfgang,

Sorry, it was meant as a temporary workaround. Please vote on this card (https://trello.com/c/3RkTDnIn); hopefully your use case will be fixed soon.

Best,

Bjoern

(Reply written 4.5 years ago by Bjoern Gruening)

Hi Björn,

Thanks for pointing me to the card; I've voted. I can live with the current situation. I just wanted to be sure that there really is no better solution that I had overlooked.

Thanks a lot for your help!

Wolfgang

(Reply written 4.5 years ago by Wolfgang Maier)