Hi Araujo,
For QA steps, definitely run the tool "FastQC" first on uploaded datasets, as Bjoern suggests. This is good for a few reasons:
- This will help confirm that the quality score scaling is correct (and show whether it needs adjusting) and help you decide how to trim. FastQC and the other tools used to prep data are in the tool group "NGS: QC and manipulation"; if you know a tool's name, you can also use the tool search at the top of the left tool panel.
- Sometimes you will want to run this tool twice. Run it once on the dataset (or on a sample of it, which is often enough and saves processing time and space) to detect whether the quality scores need adjusting, as described here: https://wiki.galaxyproject.org/Support#Dataset_special_cases This is an important first step for any newly uploaded fastq dataset. Then, if you do run the tool "Fastq Groomer" to rescale the quality scores, run "FastQC" again on the entire new dataset for further QC.
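To make the "detect the quality score scaling" step a bit more concrete, here is a rough Python sketch of the kind of check involved. It is not how FastQC or Fastq Groomer are implemented; it is only a heuristic based on the ASCII range of the quality characters. The function name `guess_phred_offset`, the `max_records` sampling limit, and the assumption of plain four-line fastq records (no wrapped sequence lines) are all mine, for illustration only.

```python
import io


def guess_phred_offset(fastq_handle, max_records=10000):
    """Guess the quality offset (33 or 64) from a sample of fastq records.

    Heuristic only: returns 33, 64, or None when the sample is ambiguous.
    Assumes each record is exactly four lines (no wrapped sequences).
    """
    lowest = None
    highest = None
    for i, line in enumerate(fastq_handle):
        if i // 4 >= max_records:
            break
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        codes = [ord(c) for c in line.rstrip("\r\n")]
        if not codes:
            continue
        lowest = min(codes) if lowest is None else min(lowest, min(codes))
        highest = max(codes) if highest is None else max(highest, max(codes))

    if lowest is None:
        return None  # no quality lines seen
    if lowest < 59:
        # Characters below ';' (ASCII 59) do not occur in the +64 encodings.
        return 33
    if highest > 74:
        # Characters above 'J' (ASCII 74) are unusual for raw +33 reads.
        return 64
    return None  # range fits both encodings; inspect more reads


if __name__ == "__main__":
    sample = io.StringIO("@read1\nACGT\n+\nII5#\n")
    print(guess_phred_offset(sample))
```

If a check like this suggests an offset of 64, that is the case where rescaling to Sanger (offset 33) with "Fastq Groomer" is needed before most downstream tools. When the result is ambiguous, sampling more reads usually settles it.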
You may be aware of this, but it seems worth mentioning while on the subject of QA/QC. How much you want to do in terms of trimming or filtering on quality or on mapping result status will depend on the type of downstream analysis you plan to do.
- If proceeding with an expression analysis workflow (Tophat, Cufflinks, etc), then the less you alter the data beyond basic artifact removal, the better in most cases, as you'll map more data and avoid skewing results. The tuxedo pipeline on usegalaxy.org under the tool group "NGS:RNA-Seq" will perform filtering for you (some is built in, the rest is available as tool options on the forms).
- But if you plan on a workflow that involves variant calling, common choices before calling are a bit more QC up front so that you start with the highest quality sequence (more aggressive quality trimming and possibly filtering out low-quality sequences), followed later by filtering for properly mapped, matched paired ends. This is in addition to setting tool form options to screen for statistically significant variants; note that some tools are more sensitive than others by default.
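For anyone curious what the two variant-calling prep steps amount to, here is a minimal Python sketch: trimming low-quality bases from the 3' end of a read, and keeping only alignments flagged as properly paired. The Galaxy tools do this for you; this is not their implementation. The function names, the `min_q=20` threshold, and the offset of 33 (Sanger-scaled qualities, i.e. after grooming) are assumptions chosen for illustration. The flag bits themselves (0x2 for "properly aligned pair", 0x4 for "unmapped") come from the SAM format.

```python
PROPER_PAIR = 0x2  # SAM FLAG bit: each segment properly aligned per the aligner
UNMAPPED = 0x4     # SAM FLAG bit: segment unmapped


def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below min_q.

    min_q=20 is an illustrative threshold, not a recommendation.
    """
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def is_properly_paired(sam_line):
    """Return True if a SAM alignment line is mapped and in a proper pair."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second column of a SAM record
    return bool(flag & PROPER_PAIR) and not (flag & UNMAPPED)


if __name__ == "__main__":
    print(trim_3prime("ACGTAC", "IIII##"))
    line = "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII"
    print(is_properly_paired(line))
```

For an expression workflow you would typically skip or soften both steps, for the reasons given above.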
We (Dan and I) have an updated tutorial for these exact workflows in progress, and GCC this year will include a session on similar content during Training Day (Tom Bair from Univ. of Iowa and I). Those resources will be on usegalaxy.org as a Page (with included workflows, datasets, histories), linked to our Learn & Support wikis, and a short version (linked to the full tutorial) will be placed under 'Tutorials' here on Galaxy Biostar within the next few weeks. But for right now, others from the community and our team have related tutorials available, should these be of interest (to you, or others reading this post). See our Learn wiki resources plus the RNA-seq wiki hub for the links:
https://wiki.galaxyproject.org/Support#Learning_Hub
https://wiki.galaxyproject.org/Support#Tools_on_the_Main_server:_RNA-seq
Thanks! Jen, Galaxy team
Have you run FastQC on your dataset? FastQC will tell you if you need to trim your sequences or if you have contamination.
Ciao,
Bjoern