Question: What is AB SOLiD System 2.0 in sequencing?
12 weeks ago by
Mo40
Mo40 wrote:

Hello,

I would like to know whether such data should be analyzed the same way as data acquired with Illumina, or ... I am trying to re-analyse this: https://www.ebi.ac.uk/ena/data/view/SRX015657

Thanks for your comment

solid fastq illumina rna-seq • 338 views
ADD COMMENTlink modified 12 weeks ago by Jennifer Hillman Jackson25k • written 12 weeks ago by Mo40

I'm running a test to see how the data downloads and what datatype Galaxy assigns it. Most data from this source, when imported using the Fastq > Send to Galaxy links, will be transformed to have Sanger Fastq +33 scaled quality scores (fastqsanger). My test history is here if curious (test downloads are still running right now): https://usegalaxy.org/u/jen/h/test-history-ebi-sra-solid-fastq
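As a rough illustration of what "Sanger Fastq +33 scaled" means: each character on the quality line encodes a Phred score as its ASCII code minus 33. A minimal Python sketch follows (the function name is made up for this example, and this is not Galaxy's own code):

```python
def phred33_to_scores(quality_line):
    """Convert a Sanger (+33) FASTQ quality string to integer Phred scores."""
    return [ord(ch) - 33 for ch in quality_line.strip()]


# '!' is ASCII 33, '5' is ASCII 53, 'I' is ASCII 73
print(phred33_to_scores("!5I"))  # prints [0, 20, 40]
```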

The data is actually in a compressed format, meaning it should be assigned a "[datatype].gz" datatype, not "[datatype]". However, there is a bug in the tool, so the datatype has to be reassigned before using downstream tools. In most cases, this means reassigning the datatype from fastqsanger to fastqsanger.gz. That bug and the workaround details are covered in this prior post: https://biostar.usegalaxy.org/p/28718/

But let's see how these tests download first, then I'll give definitive advice about how to proceed. Most of Galaxy's tools no longer work with color-space (SOLiD) data directly. But there are ways to transform from SOLiD fastq to Sanger fastq within Galaxy (with some information loss, unfortunately). If needed, we'll explain how to do that in the final reply.

To better understand different fastq types, review the help on the FastqGroomer and/or Manipulate Fastq tool forms.

Thanks! Jen, Galaxy team

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by Jennifer Hillman Jackson25k

Well, the downloaded data was given the datatype "fastqsanger" even though it is in color space (not base space). Color space data is not used much anymore and the datatype sniffer or possibly the EBI SRA tool itself needs to be tuned to handle this type of uploaded content.
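To check a read by eye or by script: a color-space sequence line is a single primer base followed by digits (colors) rather than letters. The sketch below is only an illustrative check under that assumption. It is not the logic of Galaxy's datatype sniffer, and the function name is hypothetical:

```python
import re

# Assumption: a color-space read is one primer base (A/C/G/T) followed by
# colors 0-3, with "." marking a color that could not be called.
COLOR_SPACE_READ = re.compile(r"^[ACGTacgt][0-3.]+$")


def looks_like_color_space(sequence_line):
    """Return True if a FASTQ sequence line looks like a color-space read."""
    return bool(COLOR_SPACE_READ.match(sequence_line.strip()))
```

For example, `looks_like_color_space("T0120.3")` is true, while a base-space read such as `"ACGTN"` is not matched.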

I'm going to test a few more things to see what will transform it into base space reads with the proper quality score encoding in the fewest steps. Then I will open one or more development tickets to prevent this from happening going forward. I'll write back with a link to that and the best workaround for you to use now. Feedback will likely be on Tuesday unless I can get to it over the holiday weekend, or someone else on this forum helps you first.

Thanks for your patience, Jen

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by Jennifer Hillman Jackson25k

Update: Still working on this and may or may not be able to give you a solution that works at Galaxy Main https://usegalaxy.org. I'll know after the new tests I have going finish (the first round failed and those tools/wrappers are unlikely to be updated for this purpose).

Why is this a problem?

Galaxy Main no longer supports tools/indexes for color-space analysis. The technology that created this data is much older, and current tools do not process it the way they originally did. Converting to an "Illumina-like" format is not the best way to process the data anyway. It is better to stay in color-space.

That said, there are wrappers in the Galaxy Tool Shed https://usegalaxy.org/toolshed that do work with color-space data directly. These could be used in a local Galaxy. Some are older, so if you decide to try this, proceed with the expectation that some testing will be needed and that the tool wrapper authors may no longer be supporting it (meaning, no changes if buggy).

More feedback soon and thanks for confirming that the bug report sent in was yours. An error message produced by a known issue with the Get Data > EBI SRA tool is identical to the one you encountered -- but the solution (if one can be found) will be different.

ADD REPLYlink written 12 weeks ago by Jennifer Hillman Jackson25k
12 weeks ago by
United States
Jennifer Hillman Jackson25k wrote:

Hello,

This solution will work for your case.

  1. Change the datatype of the EBI SRA fastq datasets to fastqcssanger.gz. Do this for all datasets loaded from this source that are in color-space. The assigned datatype can be modified on the Edit Attributes forms under the "Datatype" tab (reach these forms by clicking on the pencil icon, per-dataset). Be sure to save after each change. NOTE: fastqsanger.gz is different from fastqcssanger.gz, so double check for the "cs" in the naming when making the change.
  2. Go back to the Edit Attributes forms and under the "Convert" tab, convert compressed to uncompressed -- and be sure to save again. This made the Fastq Groomer tool run faster for me. Given the data size, it might be necessary - or you can try skipping to step 3. Be aware that this processing will take many hours to run. My tests with compressed inputs have been running for 12 hours, have not completed, and still have the potential to fail (for exceeding memory/time resources).
  3. Use the tool Fastq Groomer, select those fastqcssanger or fastqcssanger.gz datasets, and set the option Input FASTQ quality scores type to Color Space Sanger. Leave the remaining options at default. The result will be data converted to and assigned to either the fastqsanger or fastqsanger.gz format/datatype. Both can be used as inputs with most, but not all, tools (some older tools still require uncompressed inputs, example: Tophat won't accept compressed fastq data when in a dataset collection).
  4. Uncompressed fastq data can always be converted back to compressed to save space. It is the combined functions of uncompressing, doing the data change from color to base space, then compressing again -- all with the Fastq Groomer tool in one step -- that may present resource problems. Doing the manipulations as separate steps has a greater chance of success. But you can try both. Sticking with compressed data has advantages (it will save quota space), although you can always permanently delete (purge) those intermediate datasets later to get the used space back.
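For anyone curious what "color to base space" means for the sequence line itself, here is a minimal Python sketch of the standard SOLiD di-base decoding. It is an illustration only, not the Fastq Groomer's implementation, and the function name is made up. Each color describes the transition from the previous base to the next one, starting from the primer base. Quality score handling is not covered here.

```python
BASES = "ACGT"  # index 0-3; a color is the XOR of two adjacent base indexes


def color_to_base_space(cs_read):
    """Decode a color-space read (primer base + colors) into bases.

    The primer base is used only as the starting point and is not
    included in the output. Once a color is unreadable (for example "."),
    the remaining bases cannot be determined and are reported as N.
    """
    primer, colors = cs_read[0].upper(), cs_read[1:]
    if primer not in BASES:
        raise ValueError("read must start with a primer base A, C, G or T")
    previous = BASES.index(primer)
    decoded = []
    for position, color in enumerate(colors):
        if color not in "0123":
            decoded.append("N" * (len(colors) - position))
            break
        previous ^= int(color)  # color 0 keeps the base, 1-3 change it
        decoded.append(BASES[previous])
    return "".join(decoded)


print(color_to_base_space("T0123"))  # prints TGAT
```

Because every base depends on the one before it, a single miscalled color changes all of the bases after it. That is one reason staying in color-space is preferable when the tools allow it.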

Thanks for your patience during testing. I wanted to make sure this would actually still work, and with your exact given data.

Jen, Galaxy team

ADD COMMENTlink modified 12 weeks ago • written 12 weeks ago by Jennifer Hillman Jackson25k

@Jennifer Hillman Jackson Thanks a lot for your detailed explanation. I have gone through steps 1 to 3. However, when I try to use HISAT2, it does not recognise the output. Can you please tell me how I should do it? Thanks a lot

ADD REPLYlink written 11 weeks ago by Mo40

Hi -

Step 1 was not executed correctly. It is easy to mix these up due to the similarity in the datatype names.

The problem is with the datatype assigned to datasets 5 & 6. Those are now uncompressed but are assigned the datatype fastqsanger instead of fastqcssanger. This datatype mix-up confuses the Fastq Groomer tool, so it didn't really do the transformation in datasets 7 & 8, instead just passed the data through, re-detected the datatype, resulting in output given the datatype fastqcssanger (which is correct, but not accepted by tools as input).

To fix the data from where you are now, change the datatype for dataset 5 & 6 to fastqcssanger then run the data through the Fastq Groomer tool again the same way as you did before. The output will then be transformed to fastqsanger, given that datatype, and tools will accept the data as input.

Next time, for color-space data, assign the datatype fastqcssanger.gz right after importing from EBI SRA (step 1 above). It looks like you assigned fastqsanger.gz, which is what should be done for base-space data (with no Groomer step needed). I captured the incorrect color-space datatype detection in the master EBI SRA issue ticket here: https://github.com/galaxyproject/galaxy/issues/6334#issuecomment-419205445 (includes my most recent test history for this use case)
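The four datatype names used in this thread differ only in the "cs" part and the ".gz" suffix, so the choice can be written down as a small rule. The Python sketch below is just a memory aid with hypothetical function names, not anything Galaxy runs. It assumes the quality scores are Sanger (+33) scaled, as is typical for data imported from this source, and that you already know whether the reads are color-space.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip file


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def datatype_for(path, color_space):
    """Pick the datatype name to assign: 'cs' for color-space, '.gz' if compressed."""
    name = "fastqcssanger" if color_space else "fastqsanger"
    return name + ".gz" if is_gzipped(path) else name
```

For a compressed SOLiD download, `datatype_for(path, color_space=True)` gives fastqcssanger.gz, which is the assignment described in step 1.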

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Jennifer Hillman Jackson25k

@Jennifer Hillman Jackson I started again from the very first step. When I use Fastq Groomer and select only the uncompressed datasets, it gives an error, but when I select everything, it starts running. Is it supposed to be like that?

ADD REPLYlink written 11 weeks ago by Mo40

Did you compare what you did with the test history I shared using one of your accessions? It follows the same steps plus a few others (I deleted any data/tools that won't work).

ADD REPLYlink written 11 weeks ago by Jennifer Hillman Jackson25k

@Jennifer Hillman Jackson I followed exactly what you told me above. When I used Groomer, it did not work with the uncompressed files alone, so I had to select 4 datasets (my 2 uncompressed ones and the 2 whose datatypes I changed). There are 4 because the data is paired-end. Afterwards, it gave me one file, which again I cannot use with HISAT2! It does not recognise it.

I don't know which history you mean. You once gave me a history that only showed that you uploaded the files. Can you please direct me to the history again?

ADD REPLYlink written 11 weeks ago by Mo40