Question: Inquiry on FastQC Report
5.0 years ago by
Dear Galaxy Officer,

Good day. I am a new user of the Galaxy main server. The tools provided are very user-friendly; thank you for establishing them. I am new to RNA-seq analysis and am still learning bioinformatics. I would like to ask about the FastQC report generated on my data.

For your information:
Samples: Plant (dicotyledon)
Type of data: RNA-seq (Illumina HiSeq 2000 with CASAVA v 1.8.2), paired ends
Adapter sequence: RPI 15 (5’ CAAGCAGAAGACGGCATACGAGATTGACATGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA)
Main purpose of my analysis: Identification of novel transcripts and gene expression studies

I ran FastQC on my raw RNA-seq data, both forward and reverse. I attach the FastQC report in this email.

My questions are:

1) The basic statistics show that my data encoding is Sanger/Illumina 1.9. When I groom my data for downstream analysis in Galaxy, is it correct to choose "Sanger" as the input FASTQ quality score type?

2) Based on the per base sequence quality, the quality scores are above 20.0 for both forward and reverse data. Do I still need to trim my data?

3) The results for "Per base sequence content", "Per base GC content", and "Sequence duplication level" are "fail". What do these three results indicate? What are the solutions for these problems?

4) What do the overrepresented sequences indicate? Do I need to trim off the overrepresented sequences?

5) Based on the K-mer content, how can I analyse and justify whether this is good data or not?

6) In the FastQC report for the reverse data, "Per sequence GC content" does not look good. What does this indicate?

7) How can I identify the adapter sequence in my RNA-seq data, and how can I remove it?

8) After grooming the data, running FastQC, and removing adapters, are there any other pre-processing steps that need to be done before running Bowtie and TopHat?

Many thanks in advance for your kind assistance and support.

Best regards,
Ng Kiaw Kiaw
PhD student
RIKEN Yokohama Campus
Japan
alignment bowtie
4.9 years ago by
United States
Hello,

Your post is very difficult to read with the formatting. The best place to find out more about the FastQC program is through the tool documentation, linked from the tool form but also here:

More below.

1) Yes, if you choose to groom, Sanger is the correct input. Alternatively, you can just assign the datatype .fastqsanger by clicking on the pencil icon. More help is in the screencast "FASTQ Prep - Illumina".

2) No, most likely not. This is a reasonable quality score to use as a baseline.

3) These are quality metrics, and a failure indicates that the data is skewed away from what would be expected in a normal distribution. You could investigate the library preparation methods if this is your own data.

4) Same as above. And yes, trim it if it is a large portion of your data, is repetitive, or causes problems later on. An overrepresented sequence effectively "shortens" the length of the sequence being aligned, even though the read is longer, and this could cause you to pick the wrong length parameters in Tophat.

5) Same as above.

6) Same as above.

7) Locating the methods associated with the preparation of the data is the first place to look. You could also just trim the reads: if the "overrepresented sequence" is localized to where the adapter is most likely to be, trim based on that range.

8) Because quality is not an issue, no quality trimming is necessary. You could, however, filter out short sequences that will never be able to meet the alignment criteria. See the Tophat documentation about how best to tune parameters to match the data based on the length of the reads.

All of this said, most of the time very little needs to be done. Poor reads will simply fall out and not align in the first steps of the pipeline. Trimming and setting Tophat parameters will have the greatest impact.

Take care,
Jen
Galaxy team

--
Jennifer Hillman-Jackson
Galaxy Support and Training
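
For readers who want to see what the adapter trimming and short-read filtering suggested above amount to, here is a minimal Python sketch. It is not how Galaxy or any particular trimming tool implements these steps. The function names, the `min_overlap` and `min_len` defaults, and the exact-match rule are all assumptions made for illustration; dedicated trimmers also tolerate mismatches, which this does not. The adapter string to search for depends on the library preparation and may be the reverse complement of the primer sequence listed in the question.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality string)


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield reads from FASTQ text, assuming strict 4-line records
    with no blank lines (an incomplete final record is ignored)."""
    it = (line.rstrip("\n") for line in lines)
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header, seq, qual


def trim_3prime_adapter(seq: str, qual: str, adapter: str,
                        min_overlap: int = 5) -> Tuple[str, str]:
    """Remove an exact adapter match and everything after it.

    Also removes a partial adapter at the very end of the read when at
    least `min_overlap` bases match (hypothetical default). No mismatches
    are tolerated in this sketch.
    """
    pos = seq.find(adapter)
    if pos == -1:
        # Look for the longest adapter prefix that ends the read.
        longest = min(len(adapter) - 1, len(seq))
        for k in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                pos = len(seq) - k
                break
    if pos == -1:
        return seq, qual  # no adapter found, read unchanged
    return seq[:pos], qual[:pos]


def clean_reads(reads: Iterable[Read], adapter: str,
                min_len: int = 20) -> Iterator[Read]:
    """Trim the adapter from each read, then drop reads shorter than
    `min_len` (hypothetical default; tune it to the aligner settings)."""
    for header, seq, qual in reads:
        seq, qual = trim_3prime_adapter(seq, qual, adapter)
        if len(seq) >= min_len:
            yield header, seq, qual
```

One caveat for paired-end data like that in the question: filtering the forward and reverse files independently can drop a read without its mate and leave the two files out of sync, so pairs need to be kept or discarded together.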
