Question: BWA for SOLiD
4.6 years ago, lrutter (United States) wrote:

Hello:

I am receiving an error when trying to run the "Manipulate FASTQ" tool on my .fastq file. My original .fastq file (SRR452355b.fastq) was in SOLiD color space.

My main goal is to convert that .fastq file to a SAM file. I thought the best way to do this was with "BWA for SOLiD" using my reference transcriptome sequence (mrna.fa). 

However, it was noted that before I could use BWA, I would have to perform two things to prepare the .fastq file:

1) Run FASTQ Groomer on it
2) Run "Manipulate FASTQ" on it with the following settings:
- Manipulate reads on Sequence Content, choosing Change Adapter Base, and having the text box empty.
- Manipulate reads on Sequence Content, doing a String Translate from "01234." to "ACGTN".

So, I completed the first step (specifically, I Groomed the .fastq file with input FASTQ quality scores at Color Space Sanger).

Then, I used the FASTQ Groomer file as input to "Manipulate FASTQ", and received the error currently reported.

############################
Traceback (most recent call last):
  File "/galaxy/main/migrated_tools/toolshed.g2.bx.psu.edu/repos/devteam/fastq_manipulation/5d1e9e13e8db/fastq_manipulation/fastq_manipulation.py", line 37, in <module>
    main()
  File "/galaxy/main/migrated_tools/toolshed.g2.bx.psu.edu/repos/devteam/fastq_manipulation/5d1e9e13e8db/fastq_manipulation/fastq_manipulation.py", line 25, in main
    new_read = fastq_manipulator.match_and_manipulate_read( fastq_read )
  File "/galaxy/main/jobdir/006/898/6898374/tmpAqYK6k", line 17, in match_and_manipulate_read
    new_read = manipulate_read( fastq_read )
  File "/galaxy/main/jobdir/006/898/6898374/tmpAqYK6k", line 10, in manipulate_read
    new_read.sequence = new_read.sequence.translate( maketrans( binascii.unhexlify( "2230313233342e22" ), binascii.unhexlify( "22414347544e22" ) ) )
ValueError: maketrans arguments must have same length
############################
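
For anyone reading the traceback above: the two hex strings can be decoded to see exactly what the tool passed to `maketrans`. The sketch below is only an illustration, not the tool's own code. It uses Python 3's `str.maketrans` (the Galaxy tool in the traceback appears to be Python 2, so the exact error wording differs), and the variable names are made up for this example.

```python
import binascii

# Hex strings copied from the traceback; decoding shows what the tool received.
source_chars = binascii.unhexlify("2230313233342e22").decode("ascii")
target_chars = binascii.unhexlify("22414347544e22").decode("ascii")

print(source_chars, len(source_chars))
print(target_chars, len(target_chars))

# A translation table maps characters one-to-one,
# so both strings must have the same length.
error_message = None
try:
    str.maketrans(source_chars, target_chars)
except ValueError as err:
    error_message = str(err)
print(error_message)

# Dropping the "4" gives five characters on each side, which is accepted.
table = str.maketrans("0123.", "ACGTN")
```

The "from" string has one more character than the "to" string because of the extra "4", which is what triggers the length error.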

I tried to repeat the same process, only now instead of doing:

- Manipulate reads on Sequence Content, doing a String Translate from "01234." to "ACGTN"

I did:

- Manipulate reads on Sequence Content, doing a String Translate from "0123." to "ACGTN". (I took out the 4).

 

Even though this goes against what is advised on BioStar page for the BWA SOLiD, I did successfully generate a manipulated FASTQ dataset.

 

However, when I input that manipulated FASTQ dataset into BWA SOLiD, it does not list it as an option for the "FASTQ file (Nucleotide-space recoded from color-space)". So, I think I must have done something wrong with the manipulated FASTQ file (In fact, I notice that it has many "N" values, so I think my String Translate may be the problem).

 

As a side note: The BWA SOLiD page does not give me an error. Only when I select "Select a reference from history", and input my transcriptome file does it give me an error (in addition to not allowing me an option to input the FASTQ file).

 

How should I begin to troubleshoot this?

 

Many thanks...

 

 

modified 4.6 years ago • written 4.6 years ago by lrutter

Jen:

Thanks so much for your help! I did attempt to try the option you suggested on my original .fastq file.

All I have in my history right now is my original .fastq file and my mrna.fa transcriptome reference file.

When I opened "NGS: QC and Manipulation" --> "AB-SOLID DATA" --> "Convert SOLiD output to fastq", I received a message that "history does not include a dataset of the required format/build" for both the reads and the qualities. I am not sure what steps I should take to generate these two required input files.

My .fastq file is of the format:

 

@SRR452355.1 1_12_954_F3 length=50
T02..213110.3210.332.2000.3013133201212103322210030
+SRR452355.1 1_12_954_F3 length=50
!);!!):=::2!;.:2!;/3!2/%0!1;$&)18)4&#1*,$7(%63+4'54
@SRR452355.11 1_14_989_F3 length=50
T300231203003223.3002100331013123232021110301113231

 

but the program wants file input of the type:

Reads:

>1831_573_1004_F3
T00030133312212111300011021310132222
>1831_573_1567_F3
T03330322230322112131010221102122113

Quality scores:

>1831_573_1004_F3
4 29 34 34 32 32 24 24 20 17 10 34 29 20 34 13 30 34 22 24 11 28 19 17 34 17 24 17 25 34 7 24 14 12 22
>1831_573_1567_F3
8 26 31 31 16 22 30 31 28 29 22 30 30 31 32 23 30 28 28 31 19 32 30 32 19 8 32 10 13 6 32 10 6 16 11

 

Many thanks again for your help!

written 4.6 years ago by lrutter

Yes, I see - this can go straight into the FASTQ Groomer tool (which I know you have tried).

But next time, the quickest way to get SRR data is directly from the ENA SRA - and import it in .fastqsanger format to Galaxy using the application-specific link. Use "Get Data -> EBI SRA", enter the accession "SRR452355", follow the links to the table with the Galaxy link(s). 

It looks as if you tried this for some of your earlier datasets? It is a good method.

 

written 4.6 years ago by Jennifer Hillman Jackson

Thank you Jen. I will look into the ENA SRA option for my next datasets.

I am a bit confused about your suggestion though. 

Are you saying the order is: 1) Put my .fastq file directly into Groomer, 2) Put the groomed .fastq file into the NGS: QC and Manipulation tool, 3) Put the groomed and manipulated .fastq file into "BWA for SOLiD" to get the SAM file?

Because it seems that step (2) requires both a .fastq file and quality scores?

written 4.6 years ago by lrutter

Since your data is already in .fastq format (combined .fasta & .qual), leave it that way and work with the tools as you are doing; this is correct. Not all SOLiD data enters Galaxy in the format yours did. Often something has been done first to the original files, which is where my initial answer came from.
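
To illustrate what "combined .fasta & .qual" means, here is a rough sketch that splits one FASTQ record into the two pieces the "Convert SOLiD output to fastq" tool expects. It assumes Sanger-style quality encoding (ASCII offset 33), and the function name and the choice to keep only the first word of the header are assumptions made for this example, not how Galaxy does it.

```python
def split_fastq_record(header, sequence, quality, offset=33):
    """Return (reads_entry, quality_entry) for one FASTQ record.

    Hypothetical helper: assumes Sanger-encoded qualities (offset 33).
    """
    # Keep only the first word of the header as the read name.
    name = header.lstrip("@").split()[0]
    # Each quality character encodes one numeric score.
    scores = [ord(ch) - offset for ch in quality]
    reads_entry = ">" + name + "\n" + sequence
    quality_entry = ">" + name + "\n" + " ".join(str(s) for s in scores)
    return reads_entry, quality_entry


reads_entry, quality_entry = split_fastq_record(
    "@SRR452355.1 1_12_954_F3 length=50", "T02.", "!5I"
)
print(reads_entry)
print(quality_entry)
```

Going the other way (joining the two files) is what the converter tool does, which is why it is not needed when the data already arrives as FASTQ.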

Run the FASTQ Groomer tool to remove the leading quality score for the adaptor base (input 'Color Space Sanger', defaults for the rest). Then run Manipulate FASTQ to convert to base-space and remove the adaptor itself, using the instructions on that tool, same as what you are doing: "0123." -> "ACGTN" (the instructions on the BWA tool form with the added "4" are incorrect, we'll fix that).
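
As a rough sketch of what the Manipulate FASTQ step does to each read's sequence line (drop the adaptor base, then translate "0123." to "ACGTN"), something like the following would work outside Galaxy. The function and constant names are invented for this example, and it assumes the adaptor base is always the first character of the sequence.

```python
# One-to-one recoding of color-space symbols, as described in the tool instructions.
COLOR_TO_BASE = str.maketrans("0123.", "ACGTN")


def recode_colorspace(sequence):
    """Drop the leading adaptor base and recode the remaining colors."""
    # Assumption: the first character is the adaptor base (e.g. "T").
    return sequence[1:].translate(COLOR_TO_BASE)


read = "T02..213110.3210.332.2000.3013133201212103322210030"
recoded = recode_colorspace(read)
print(recoded)
print(len(read), len(recoded))
```

Every "." in the color-space read becomes an "N", so a read with many missing color calls will show many "N" values after recoding.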

You should end up with a dataset with a datatype assigned as "fastqcssanger" (first assigned by the groomer and carried through the manipulate step) which will be recognized by the tool "Map with BWA for SOLiD" as input. 

For the other issues: A custom reference genome needs to have the datatype assigned as "fasta" to be recognized by any tool. I see one in your history. This is a file of over 2M mRNA sequences, which may well cause a memory error (exceeding resources) on the public server. My guess is that it will fail at the indexing step. But it is worth a test. If it does fail, running the job on a cloud Galaxy with more dedicated memory is the next step.

http://usegalaxy.org/cloud
http://usegalaxy.org/toolshed (to obtain tools not in the default ami install)

Thanks for sending in the bug report, this helped to see what was going on. Hopefully you have a useable workflow now! - Jen

written 4.6 years ago by Jennifer Hillman Jackson

Jen:

Thank you for your help!

I am doing this under a time constraint and do not mind if my results are not terribly accurate. In fact, my .fastq files represent only a random 1/10 of the lines from the original file, because I wanted to cut down on the time significantly at the expense of quality (This is for a quick project, not publication! :o))

Do you think there is an appropriate way I can reduce the size of the mrna.fa file, so that I do not have to use the cloud resource for this quick project? I think the .fa file is 533.3MB (zipped it is 177.3MB). This is the format of the file:

>AF001540 1
ggcacgaggcaggtctgtctgttctgttggcaagtaaatgcagtactgtt
>AF001541 1
ttcggcacaggnatacttttagaagaaaaaagataaatttaaacctgaaa
agtaggaagcagaagaaaaaagacaagctaggaaacaaaaagctaagggc

 

 

I am not sure which step you mean by "My guess is that it will fail at the indexing step". But, I will go ahead and, as you suggest, try to run "BWA for SOLiD". However, in case it does prove too large, I would love to hear any advice regarding decreasing this file (i.e. to what size is appropriate, and how)?

Thanks so much again for your help, Jen!

written 4.6 years ago by lrutter

I just submitted the "BWA for SOLiD" job. I am not sure how long it is expected to take, but I will update. Thanks again :o)

written 4.6 years ago by lrutter

It looks like it went through the "BWA for SOLiD" and created a SAM file! I have five more of these files, though. We'll see if they all go through. Thanks again for your help, Jen!

written 4.6 years ago by lrutter

Please accept the answer if it helped you solve your question.

written 4.6 years ago by Martin Čech

4.6 years ago, Jennifer Hillman Jackson (United States) wrote:

Hello,

Please try the tool "NGS: QC and Manipulation -> AB-SOLID DATA -> Convert SOLiD output to fastq" as an alternative (better option).

Best, Jen, Galaxy team

written 4.6 years ago by Jennifer Hillman Jackson