Question: Error with "Trim leading or trailing characters" tool
yena.oh (Canada) wrote, 2.6 years ago:

Hi all,

I have been trying to trim my fastq files and am running into problems. I used the "Trim leading or trailing characters" tool to trim the first 10 bp of my reads (from 51 bp to 41 bp per read) with the following settings:

this dataset                                      LV_12_S24_L005_R1_001.fastq
Trim this column only                             0
Trim from the beginning up to this position       11
Remove everything from this position to the end   50
Is input dataset in fastq format?                 Yes
Ignore lines beginning with these characters      (blank)

After I run this, I get an error message saying:

Traceback (most recent call last):
  File "/cvmfs/main.galaxyproject.org/galaxy/tools/filters/trimmer.py", line 111, in <module>
    main()
  File "/cvmfs/main.galaxyproject.org/galaxy/tools/filters/trimmer.py", line 75, in main
    invalid_starts[i] = chr( int( item ) )
ValueError: invalid literal for int() with base 10: '-q'
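
The last frame of the traceback can be reproduced in isolation. This is a minimal sketch that assumes only what the traceback shows: each item is converted to an integer character code, and one item was the string '-q'. How '-q' ended up in that list is not visible in the traceback, so the `items` list below is hypothetical.

```python
# Hypothetical input: the traceback only tells us that one item was '-q'.
items = ["64", "-q"]


def to_chars(items):
    """Turn decimal character codes into characters, like chr(int(item)) in the traceback."""
    return [chr(int(item)) for item in items]


try:
    to_chars(items)
except ValueError as err:
    # int() cannot parse '-q', so the conversion fails before chr() is reached.
    error_message = str(err)
    print(error_message)
```

Valid codes convert cleanly (for example "64" and "43" give "@" and "+"), which suggests the failure comes from what was passed in rather than from the trimming itself.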

I have previously used the same tool to trim sequences for my previous experiments, and had no problem. Strangely, when I try to repeat exactly the same trimming process using the same parameters and same fastq file, I cannot get the trim to work.

Any help would be appreciated,

Yena

written 2.6 years ago by yena.oh (modified 2.6 years ago)

Hello Yena,

If you are working at http://usegalaxy.org, would you please send in this error as a bug report? I suspect there is an input problem, but I would like to confirm that and share how to resolve it. Please include a link to this Biostars post in the comments and leave all input/error datasets undeleted.

If working elsewhere, see if you can reproduce on the Main Galaxy server for the best feedback. If you cannot reproduce the error there, that is also informative.

Thanks, Jen, Galaxy team

written 2.6 years ago by Jennifer Hillman Jackson

Update: I was able to reproduce this with an independent test. We are investigating; no need to send in a bug report. More feedback soon. Thanks for reporting the problem! Jen

written 2.6 years ago by Jennifer Hillman Jackson (modified 2.6 years ago)
Jennifer Hillman Jackson (United States) wrote, 2.6 years ago:

Hello,

There is a new problem with the option to specify Fastq input. This ticket captures the details and a work-around for current use: https://github.com/galaxyproject/galaxy/issues/2245

Track fix promotion to Galaxy Main here: https://github.com/jennaj/support-known-issues/wiki

Thanks again for reporting the problem! Jen, Galaxy team

written 2.6 years ago by Jennifer Hillman Jackson

Hi Jen,

Thanks for the feedback. I tried the directions in the "WORK-around" section, which suggested the following parameters:

Is input dataset in fastq format?   **No**  (instead of yes)
Ignore lines beginning with these characters    **@ +**

The trim ran without errors. The problem now lies downstream, with FASTQ Groomer. Because every line beginning with @ or + was set to be ignored, quality score lines that happen to begin with @ or + are left untrimmed, so their length no longer matches the trimmed sequence. This produces the following error (example):

The reported error is: 'Invalid FASTQ file: quality score length (51) does not match sequence length (41)'.
The last valid FASTQ read had an identifier of '@D00124:312:C877CANXX:4:2309:4905:2066 1:N:0:CGATGT'.
The error in your file occurs between lines '321' and '324', which corresponds to byte-offsets '11040' and '11188', and contains the text (148 of 148 bytes shown):

@D00124:312:C877CANXX:4:2309:4795:2099 1:N:0:CGATGT
GCACTTCCTGCTCTGCGATGAGCGGAGAAGCAGCAGCGTCC
+
@BB00EFGGGGGCGGGGGGGGG@D@GGGGGGGGGGFGFGGGGEGGGFGGGG
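
The mismatch above can be illustrated with a small sketch. This is not the Galaxy tool's code; `trim_ignoring_prefixes` is a hypothetical stand-in that skips any line starting with @ or +, and the first 10 bases of the sequence are placeholders because the untrimmed read is not shown in the error. Only the start position (11) is modelled; the end position is left out for simplicity.

```python
def trim_ignoring_prefixes(lines, start, ignore="@+"):
    """Trim each line to line[start - 1:], skipping lines whose first character is in `ignore`."""
    out = []
    for line in lines:
        if line and line[0] in ignore:
            out.append(line)  # skipped: the line keeps its full length
        else:
            out.append(line[start - 1:])
    return out


record = [
    "@D00124:312:C877CANXX:4:2309:4795:2099 1:N:0:CGATGT",
    # Placeholder bases ("N" * 10) stand in for the unknown first 10 bp.
    "N" * 10 + "GCACTTCCTGCTCTGCGATGAGCGGAGAAGCAGCAGCGTCC",
    "+",
    "@BB00EFGGGGGCGGGGGGGGG@D@GGGGGGGGGGFGFGGGGEGGGFGGGG",
]

trimmed = trim_ignoring_prefixes(record, start=11)
# The quality line starts with '@', so it is treated like a header and left alone.
print(len(trimmed[1]), len(trimmed[3]))
```

The sequence is shortened while the quality line is not, which is the length mismatch that FASTQ Groomer reports.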

Any suggestions?

written 2.6 years ago by yena.oh (modified 2.6 years ago)

Oh rats, that is one of the gotchas. Maybe not such a great work-around. I'll remove it from GitHub.

The true alternative, for now, is to use another trimming tool. There are many in the tool group "NGS: QC and manipulation" designed specifically for Fastq data. The Trim tool used was not created for that exact purpose originally.
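
For readers who want to see what "designed specifically for Fastq data" means in practice, here is a rough sketch of a trimmer that decides what to trim by a line's position within the record instead of by its first character. It is not the code of any Galaxy tool; `trim_fastq_records` is a hypothetical helper, and it assumes plain 4-line records (no wrapped sequence or quality lines).

```python
def trim_fastq_records(lines, start):
    """Trim sequence and quality lines of 4-line FASTQ records to line[start - 1:]."""
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Lines 2 and 4 of each record are trimmed together, whatever they start with.
        out += [header, seq[start - 1:], plus, qual[start - 1:]]
    return out


# Made-up record whose quality line starts with '@' and ends with '+'.
record = ["@read1", "ACGTACGTACGTACGT", "+", "@" + "I" * 14 + "+"]
trimmed = trim_fastq_records(record, start=5)
print(trimmed)
```

Because the quality line is identified by its place in the record, a leading @ or + in the quality string no longer matters.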

written 2.6 years ago by Jennifer Hillman Jackson

Thanks for your help Jen,

Just as a follow-up: the other trimming tools (e.g., FASTQ Trimmer or Trim Sequences) do not accept raw .fastq files as input. The files must first be converted to fastqsanger with Groomer.

With respect to this, is there any difference between:

  • Trim -> Groomer -> Tophat

  • Groomer -> Trim -> Tophat

Thanks, Yena

written 2.6 years ago by yena.oh (modified 2.6 years ago)

If the trimming tool used trims based on quality score values, then grooming first is needed to ensure that these are interpreted correctly. The way you were using this tool did not do that - but some of the others do.
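
A small sketch of why the order does not matter for fixed-position trimming. Grooming is reduced here to a plain Phred+64 to Phred+33 offset shift, which is a simplification (a real groomer also validates records and handles other encodings); the function names and the quality string are made up for illustration.

```python
def shift_phred64_to_phred33(qual):
    """Re-encode a quality string from ASCII offset 64 to offset 33 (simplified grooming)."""
    return "".join(chr(ord(c) - 31) for c in qual)


def trim_start(line, start):
    """Keep everything from 1-based position `start` onward."""
    return line[start - 1:]


qual64 = "hhhhffeeddccbbaa``__"  # made-up Phred+64 quality string

trim_then_groom = shift_phred64_to_phred33(trim_start(qual64, 11))
groom_then_trim = trim_start(shift_phred64_to_phred33(qual64), 11)
print(trim_then_groom == groom_then_trim)
```

Position-based trimming never looks at the score values, so it gives the same result before or after re-encoding. A quality-based trimmer compares scores against a threshold, so it would need the encoding settled first.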

Most tools expect .fastqsanger format as input, so converting the quality scores over is a good idea anyway. That said, your data appears to be already in .fastqsanger format. Run FastQC on the dataset first if you want to double check. If it is, there is no need to groom, just assign the more specific fastq type. If there are many files of the same type, the datatype can be assigned upon upload in batch.

https://wiki.galaxyproject.org/Support#FASTQ_Datatype_QA
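
As a rough illustration of the kind of check involved, the sketch below guesses the quality offset from the range of characters in the quality lines. It is a heuristic, not what FastQC or Galaxy actually runs: `guess_quality_offset` is a hypothetical helper, and the cut-offs assume the usual conventions (offset-64 encodings do not use characters below ';', and offset-33 data rarely goes above 'J').

```python
def guess_quality_offset(quality_lines):
    """Return 33, 64, or None (ambiguous) based on the ASCII range of the quality characters."""
    codes = [ord(c) for line in quality_lines for c in line]
    if not codes:
        return None
    if min(codes) < 59:   # below ';' : not possible in offset-64 encodings
        return 33
    if max(codes) > 74:   # above 'J' : unusual for offset-33 data
        return 64
    return None           # every character is valid in both ranges


# Quality line taken from the error message earlier in this thread.
example = ["@BB00EFGGGGGCGGGGGGGGG@D@GGGGGGGGGGFGFGGGGEGGGFGGGG"]
print(guess_quality_offset(example))
```

The '0' characters in that quality line fall below the offset-64 range, which is consistent with the data already being Sanger-encoded. A guess from a single read is weak evidence, so it is better to scan many reads.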

written 2.6 years ago by Jennifer Hillman Jackson

Hi Jen,

I have 12 samples that were run across 2 lanes. To process the resulting 24 fastq files, I ran FASTQ Trimmer -> Concatenate -> Groomer, followed by Tophat. Some of the Tophat output files (mostly the align summary and accepted hits files) show an error with the following message:

An error occurred setting the metadata for this dataset
Set it manually or retry auto-detection

It seems like the error is not specific to any particular tophat output files, as some accepted_hits files do not show the error. This error prevents me from visualizing the reads alignment on IGV or IGB, showing me a webpage with the following error description:

Conflict

There was a conflict when trying to complete your request. 
Error generating display_link: type object 'Bam' has no attribute 'name'

I experienced the same error when I trimmed the sequences with another trim tool, "Trim sequences," and, according to other posts, Cufflinks will not run successfully on these outputs either. Do you have any ideas how to fix this? I have also tried downloading the Tophat outputs manually and opening them, but this did not work.

Thanks,

Yena

written 2.6 years ago by yena.oh (modified 2.6 years ago)

Problem fixed:

Click on "Set it manually or retry auto-detection" or "edit attributes," and "auto-detect."

If auto-detect does not work and instead shows the following error: "This dataset is currently being used as input or output. You cannot change metadata until the jobs have completed or you have canceled them."

Permanently delete any uncompleted analysis (including any among hidden or deleted datasets) that uses this specific dataset. If the dataset is not being used as an input or output, try re-running the analysis.

written 2.6 years ago by yena.oh