FastQC doesn't work after Splitter


4.5 years ago by

Netherlands

Hello,

I am working with paired-end reads in separate files, and as far as I understood, these should be joined before filtering. So I joined them, filtered by quality, and then split them again with Splitter. Now I would like to run FastQC again and perhaps trim the sequences before aligning. But when I launch FastQC on the Splitter output, the jobs become paused (blue, pause sign) and do not proceed.

Could somebody help me out/explain what is wrong?

Best regards,

Monika

rna-seq fastq manipulation • 1.1k views


modified 4.5 years ago by Jennifer Hillman Jackson ♦ 25k • written 4.5 years ago by m.maleszewska • 20

4.5 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hi Monika,

You are working on the Main (usegalaxy.org) public server?

Joining and splitting is not always necessary or even recommended. Often it is better to do the data prep (grooming if needed, trimming, etc.), map, and then filter afterwards for properly paired reads if you are proceeding with a variant workflow. If using an expression workflow, no filtering is necessary (the Tuxedo pipeline will filter for you) - the less done to the data outside of basic QA/QC before mapping with TopHat/TopHat2, or running Cufflinks and downstream tools, the better.
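To illustrate what "filter for properly paired reads" means after mapping, here is a minimal Python sketch that works on SAM text lines. It is not the Galaxy tool itself, only an illustration of the FLAG test such a filter performs; the function names and the example records used to try it out are made up for this post.

```python
# SAM FLAG bits as defined in the SAM specification.
FLAG_PAIRED = 0x1        # read is part of a pair
FLAG_PROPER_PAIR = 0x2   # the aligner marked both mates as properly aligned


def is_properly_paired(sam_line):
    """Return True if an alignment line has both the paired and proper-pair bits set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second column of an alignment line
    wanted = FLAG_PAIRED | FLAG_PROPER_PAIR
    return (flag & wanted) == wanted


def filter_properly_paired(sam_lines):
    """Yield header lines unchanged and keep only properly paired alignments."""
    for line in sam_lines:
        if line.startswith("@") or is_properly_paired(line):
            yield line
```

For example, a record with FLAG 99 (1 + 2 + 32 + 64) is kept, while a record with FLAG 73 (1 + 8 + 64, mate unmapped) is dropped.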

The blue "pause" state indicates that something is wrong with the input files to the FastQC tool, most likely their metadata. If you open the datasets, does a message appear indicating this? Clicking to auto-detect, as prompted, can often repair the problem. This assumes that the datasets are not empty, which can happen when a workflow step produces no output and the data is not checked to ensure that it passed that step (either the tool settings are incomplete/incorrect or the data simply did not pass the criteria set).
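If you download a dataset and want to confirm outside of Galaxy that it is not empty, something like the following Python sketch would do. It assumes plain four-line FASTQ records (no wrapped sequence lines), and the function name is just a suggestion.

```python
import io


def count_fastq_records(handle):
    """Count four-line FASTQ records; raise ValueError on an obviously malformed one."""
    count = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"record {count + 1}: unexpected header or separator line")
        if len(seq) != len(qual):
            raise ValueError(f"record {count + 1}: sequence and quality lengths differ")
        count += 1
    return count


# Usage with an in-memory example; a real file would be opened with open(path).
example = io.StringIO("@read1\nACGT\n+\nIIII\n")
n_records = count_fastq_records(example)
```

A count of zero means the previous step produced no reads, and its settings are worth a second look before re-running FastQC.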

It seems unlikely that the datasets were not assigned a "fastq" or "fasta" datatype after the steps you describe, but a missing or wrong datatype could also cause this problem. FastQC does not require the ".fastqsanger" datatype, but you will want the quality scores scaled to that format (so you might as well assign that datatype) before running the tool on the entire dataset, unless the goal is to determine the quality score type (that is, whether grooming is needed).
This wiki section describes the QC process for quality score determination/assignment:
https://wiki.galaxyproject.org/Support#Dataset_special_cases
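As a rough illustration of how the quality score type can be guessed from the data, the Python sketch below looks at the lowest quality character in a sample of quality lines. This is a common heuristic and not what FastQC or the wiki procedure does internally; the thresholds follow the usual ASCII ranges (Phred+33 starts at 33, Solexa+64 at 59, Phred+64 at 64), and the function name and return labels are invented for this example.

```python
def guess_quality_offset(quality_strings):
    """Guess the quality encoding from the lowest quality character seen."""
    lowest = min((ord(ch) for q in quality_strings for ch in q), default=None)
    if lowest is None:
        return "unknown"       # no quality characters were supplied
    if lowest < 59:
        return "phred33"       # only Sanger / Illumina 1.8+ uses characters this low
    if lowest < 64:
        return "solexa64"      # Solexa scores can go down to ASCII 59
    # Either Phred+64, or Phred+33 data that happens to contain only high scores.
    return "phred64-or-ambiguous"
```

A result of "phred33" suggests the data already matches the fastqsanger scaling. Anything else suggests grooming is worth checking, ideally on a larger sample, since a handful of high-quality reads cannot distinguish the encodings.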

Give the inputs a look and, if you are unable to solve the problem, please share the history. Create a share link and paste it into an email sent to galaxy-bugs@bx.psu.edu. Make sure to either include your Galaxy account's email address or undelete/unhide all datasets in the analysis path. How to share is described here: https://wiki.galaxyproject.org/Learn/Share

Best, Jen, Galaxy team

written 4.5 years ago by Jennifer Hillman Jackson ♦ 25k

4.5 years ago by

m.maleszewska • 20

Netherlands

m.maleszewska • 20 wrote:

Dear Jen,

Thank you, this is very helpful!

I would like to follow the differential expression pipeline (Tuxedo). It is good to learn now that filtering upfront is not necessary; I have not yet proceeded with the alignments. When working with Illumina-derived reads, should I use 'TopHat for Illumina' or 'TopHat2', and does it make much of a difference?

For my 'Paused jobs' issue, I realised later that I had used 100% of my space, and that was why the jobs did not proceed. The display showed only 50%, and it wasn't until I opened Galaxy on another computer that the status changed and I understood what had happened... I'm sorry! But thank you for the explanation; I think it will be useful in case I or somebody else runs across this problem again.

By the way, how can I permanently delete some datasets to free up some space? It seems that regular deleting from the histories does not really remove them.

EDIT: I have found the way to remove them permanently (by including the deleted datasets in the history from the History menu, and then removing them permanently from disk).

Best regards,

Monika

modified 4.5 years ago • written 4.5 years ago by m.maleszewska • 20

4.5 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hi Monika,

Glad that you were able to figure out the problem. Permanently deleting data is the only way to reduce disk usage. You can remove individual datasets or entire histories. The first wiki page explains quotas and jobs, the second the delete process:
https://wiki.galaxyproject.org/Main
https://wiki.galaxyproject.org/Learn/ManagingDatasets#Delete_vs_Delete_Permanently

Clicking to refresh the view is a good idea whenever Galaxy has been open a while or the view doesn't make sense (example: jobs may appear to still be running). "Reload" the browser window (this can vary by browser) or click on the large "Galaxy" text icon in the upper left corner. Refreshing an individual history can be done by clicking on the small "double circled arrow" icon at the top of the history panel.

TopHat2 has more features, but this is your choice. These are the tool authors' notes about the differences in the updated release:
http://tophat.cbcb.umd.edu/index.shtml

Take care, Jen, Galaxy team

modified 4.5 years ago • written 4.5 years ago by Jennifer Hillman Jackson ♦ 25k
