Hello,
Actually I want to know if such data should be analyzed the same way as those acquired with Illumina or ... I am trying to re-analyse this https://www.ebi.ac.uk/ena/data/view/SRX015657
Thanks for your comment
Hello,
This solution will work for your case.
Assign the datatype fastqcssanger.gz. Do this for all datasets loaded from this source that are in color-space. The assigned datatype can be modified on the Edit Attribute forms under the "Datatype" tab (reach these forms by clicking on the pencil icon, per-dataset). Be sure to save after each change. NOTE: fastqsanger.gz is different than fastqcssanger.gz, so double check for the "cs" in the naming when making the change.
Run the Fastq Groomer tool on the fastqcssanger or fastqcssanger.gz datasets, and set the option Input FASTQ quality scores type to Color Space Sanger. Leave the remaining options at default. The result will be data converted to and assigned either the fastqsanger or fastqsanger.gz format/datatype. Both can be used as inputs with most, but not all, tools (some older tools still require uncompressed inputs, example: Tophat won't accept compressed fastq data when in a dataset collection).
Thanks for your patience during testing. I wanted to make sure this would actually still work, and with your exact given data.
Jen, Galaxy team
@Jennifer Hillman Jackson Thanks a lot for your detailed explanation. I have gone through step 1 to step 3. However, when I try to use HISAT2, it does not recognise the output. Can you please tell me how I should do it? Thanks a lot
Hi -
Step 1 was not executed correctly. It is easy to mix these up due to the similarity in the datatype names.
The problem is with the datatype assigned to datasets 5 & 6. Those are now uncompressed but are assigned the datatype fastqsanger instead of fastqcssanger. This datatype mix-up confuses the Fastq Groomer tool, so it didn't really do the transformation in datasets 7 & 8, instead just passed the data through, re-detected the datatype, resulting in output given the datatype fastqcssanger (which is correct, but not accepted by tools as input).
To fix the data from where you are now, change the datatype for datasets 5 & 6 to fastqcssanger, then run the data through the Fastq Groomer tool again the same way as you did before. The output will then be transformed to fastqsanger, given that datatype, and tools will accept the data as input.
Next time, for color-space data, assign the datatype fastqcssanger.gz right after importing from EBI SRA (step 1 above). It looks like you assigned fastqsanger.gz, which is what should be done for base-space data (with no Groomer step needed). I captured the incorrect color-space datatype detection in the master EBI SRA issue ticket here: https://github.com/galaxyproject/galaxy/issues/6334#issuecomment-419205445 (includes my most recent test history for this use case)
@Jennifer Hillman Jackson I started again from the very beginning. When I use Fastq Groomer with only the uncompressed datasets, it gives an error, but when I put everything there, it starts running. Is it supposed to be like that?
Did you compare what you did with the test history I shared using one of your accessions? It follows the same steps plus a few others (I deleted any data/tools that won't work).
@Jennifer Hillman Jackson I followed exactly what you told me above. When I used the Groomer, it did not work with the uncompressed files, so I had to use 4 datasets (my 2 uncompressed ones and the 2 whose datatypes I changed). It is because they are paired-end, so I have 4. Afterwards, it gave me one file which again I cannot use for HISAT2! It does not recognise it.
I don't know which history you mean. You once gave me a history which only showed that you uploaded the files. Can you please direct me to the history again?
I'm running a test to see how the data downloads and what datatype Galaxy assigns it. Most data from this source, when imported using the Fastq > Send to Galaxy links, will be transformed to have Sanger Fastq +33 scaled quality scores (fastqsanger). My test history is here if curious (test downloads are still running right now): https://usegalaxy.org/u/jen/h/test-history-ebi-sra-solid-fastq
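For readers unsure what "Sanger Fastq +33 scaled quality scores" means in practice, here is a minimal Python sketch. It is not part of Galaxy, and the function name is hypothetical. It only shows the standard Phred+33 convention, where each quality character's ASCII code minus 33 gives the score.

```python
def sanger_quality_scores(quality_line: str) -> list:
    """Decode a Sanger (Phred+33) FASTQ quality line into integer scores.

    Hypothetical helper for illustration only; not a Galaxy function.
    """
    scores = []
    for ch in quality_line.rstrip("\r\n"):
        score = ord(ch) - 33  # Sanger encoding offsets scores by ASCII 33 ('!')
        if score < 0:
            raise ValueError(f"character {ch!r} is below the Phred+33 range")
        scores.append(score)
    return scores


if __name__ == "__main__":
    print(sanger_quality_scores("!I5"))
```

A line such as `!I5` decodes to the scores 0, 40 and 20 under this convention.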
The data is actually in a compressed format, meaning the datatype should be assigned a "[datatype].gz" format, not "[datatype]", but there is a bug with the tool that requires one to reassign the datatype before using downstream tools. In most cases, this means reassigning the datatype from fastqsanger to fastqsanger.gz. That bug and the workaround details are covered in this prior post: https://biostar.usegalaxy.org/p/28718/
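If you have a copy of the file outside Galaxy and want to confirm whether it is really gzip-compressed, the first two bytes are enough to tell. The following is a minimal sketch under that assumption. The function names are hypothetical, and it does not reproduce how Galaxy itself detects datatypes.

```python
GZIP_MAGIC = b"\x1f\x8b"  # the two-byte signature that starts every gzip file


def is_gzip_compressed(path: str) -> bool:
    """Return True if the file at `path` starts with the gzip signature."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def suggested_datatype(path: str, base_datatype: str = "fastqsanger") -> str:
    """Append ".gz" to the datatype name when the file is compressed.

    `base_datatype` would be "fastqcssanger" for color-space data.
    """
    if is_gzip_compressed(path):
        return base_datatype + ".gz"
    return base_datatype
```

For example, `suggested_datatype("reads.fastq.gz", "fastqcssanger")` would return `fastqcssanger.gz` for a compressed color-space file (the file name here is only a placeholder).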
But let's see how these tests download first, then I'll give definitive advice about how to proceed. Most of Galaxy's tools no longer work with color-space (SOLiD) data directly. But there are ways to transform from SOLiD fastq to Sanger fastq within Galaxy (with some information loss, unfortunately). If needed, we'll explain how to do that in the final reply.
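To make the color-space to base-space idea concrete, here is a naive Python sketch of the standard two-base decoding. It is an illustration only, not what any Galaxy tool is confirmed to do, and the function name is hypothetical. It assumes a read made of a leading primer base followed by colors 0-3, with "." for a missing color.

```python
BASES = "ACGT"  # with A=0, C=1, G=2, T=3, a color is the XOR of two adjacent bases


def decode_color_space(cs_read: str) -> str:
    """Naively decode a color-space read into bases.

    The leading primer base is used to start decoding and is not part of
    the returned sequence. Once a color is unknown, the remaining bases
    cannot be resolved and are reported as N.
    """
    if not cs_read or cs_read[0] not in BASES:
        raise ValueError("expected a read starting with a primer base (A/C/G/T)")
    previous = BASES.index(cs_read[0])
    colors = cs_read[1:]
    decoded = []
    for position, color in enumerate(colors):
        if color not in "0123":
            decoded.append("N" * (len(colors) - position))
            break
        previous ^= int(color)  # next base = previous base XOR color
        decoded.append(BASES[previous])
    return "".join(decoded)


if __name__ == "__main__":
    print(decode_color_space("T0123"))  # decoded read
    print(decode_color_space("T0023"))  # same read with one color changed
```

In this naive scheme a single changed color alters every base after it (compare the two printed reads), which may be one reason staying in color-space is preferred when suitable tools are available.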
To better understand different fastq types, review the help on the FastqGroomer and/or Manipulate Fastq tool forms.
Thanks! Jen, Galaxy team
Well, the downloaded data was given the datatype "fastqsanger" even though it is in color space (not base space). Color space data is not used much anymore and the datatype sniffer or possibly the EBI SRA tool itself needs to be tuned to handle this type of uploaded content.
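A quick way to check this yourself on a downloaded file is to look at the sequence lines: color-space reads typically start with a primer base followed by the digits 0-3 (and "." for missing calls), while base-space reads contain only letters. The sketch below is a rough check under those assumptions, with a hypothetical function name. It is not Galaxy's datatype sniffer, and it assumes plain four-line FASTQ records.

```python
import re

# One primer base followed by one or more colors (0-3) or missing calls (.)
COLOR_SPACE_READ = re.compile(r"^[ACGT][0-3.]+$")


def looks_like_color_space(fastq_lines, max_reads: int = 100) -> bool:
    """Return True if the first `max_reads` sequence lines all look color-space.

    `fastq_lines` is any iterable of text lines in four-line FASTQ layout,
    where the sequence is the second line of each record.
    """
    checked = 0
    for index, line in enumerate(fastq_lines):
        if index % 4 != 1:  # skip header, separator, and quality lines
            continue
        if not COLOR_SPACE_READ.match(line.strip()):
            return False
        checked += 1
        if checked >= max_reads:
            break
    return checked > 0
```

If this returns True for data that Galaxy labeled fastqsanger, the dataset most likely needs the fastqcssanger (or fastqcssanger.gz) datatype instead.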
I'm going to test a few more things to see what will transform it into base space reads with the proper quality score encoding in the fewest steps. Then I will open development ticket(s) to prevent this from happening going forward. I'll write back with a link to that and the best workaround for you to use now. Feedback will likely be on Tuesday unless I can get to it over the holiday weekend, or someone else on this forum helps you first.
Thanks for your patience, Jen
Update: Still working on this and may or may not be able to give you a solution that works at Galaxy Main https://usegalaxy.org. I'll know after the new tests I have going finish (the first round failed and those tools/wrappers are unlikely to be updated for this purpose).
Why is this a problem?
Galaxy Main no longer supports tools/indexes for color-space analysis. The technology that created this data is much older, and tools no longer process it as they originally did. Converting to an "Illumina-like" format is not the best way to process the data anyway. It is better to stay in color-space.
That said, there are wrappers in the Galaxy Tool Shed https://usegalaxy.org/toolshed that do work with color-space data directly. These could be used in a local Galaxy. Some are older, so if you decide to try this, proceed with the expectation that some testing will be needed and that the tool wrapper authors may no longer be supporting it (meaning, no changes if buggy).
More feedback soon and thanks for confirming that the bug report sent in was yours. An error message produced by a known issue with the Get Data > EBI SRA tool is identical to the one you encountered -- but the solution (if one can be found) will be different.