RNAseq data to be processed in two ways: (i) mapping to de novo Trinity-based transcriptome and (ii) mapping a relatively new genome


13 months ago, Dennis • 10 wrote:

Hello all,

I am new to RNAseq data and am learning the process step by step, so I have a few questions (I highlighted them below so they are easy to find after reading the whole post). I recently received a set of RNAseq data from a sequencing facility. It is paired-end data for 3 treatment groups with 3 biological samples each. The end goal of this experiment is to get an assembled transcriptome of the tissue of interest and to perform transcript quantification.

Firstly, I intend to use two workflows for mapping the reads:

(i) Trinity RNAseq protocol using Galaxy Trinity instance https://galaxy.ncgas-trinity.indiana.edu --> RSEM, EdgeR, etc

(ii) map against a custom reference genome by uploading the genome files (?) and using Bowtie2 to map, and StringTie and CuffDiff to analyze abundance

FYI: for my species of interest, there is a newly released genome with the following files: (1) scaffold and (2) contig genome assemblies; (3) GFF, (4) gene annotation, (5) transcripts and (6) peptides. Which of these files will I need to map to this first genome release?

Do the workflows above sound good to you?

I have started QC analysis and ran FastQC on my original datasets, and it flagged several things:

- duplication levels: fail (I realized this is not a problem from reading about quantitative experiments; the consensus is that duplicated sequences should not be removed for quantitative experiments)
- per base sequence content: all R2 reads failed, all R1 reads gave a warning (are these adapters?)
- per sequence GC content: the GC content of the released genome is ~35%; the plot of my data sets has "small shoulders" towards the middle and a spike around the ~30% mark
- kmer content: also failed in the beginning base positions and sometimes spiked up in the middle of reads
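To look at the per-sequence GC distribution outside FastQC, a minimal sketch along these lines could help. It is an illustration only: the function names `gc_percent` and `gc_histogram` and the bin width are assumptions made for this example, not part of FastQC or Galaxy.

```python
from collections import Counter

def gc_percent(seq):
    """Return GC content as a percentage of called bases, or None if no A/C/G/T."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")  # ignore N and other symbols
    if called == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / called

def gc_histogram(seqs, bin_width=5):
    """Count reads per GC bin; keys are the lower edge of each bin in percent."""
    hist = Counter()
    for seq in seqs:
        gc = gc_percent(seq)
        if gc is not None:
            hist[int(gc // bin_width) * bin_width] += 1
    return dict(sorted(hist.items()))
```

Comparing the resulting bins against the ~35% GC content of the genome gives a rough idea of whether the "shoulders" are a minor or a major fraction of the reads.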

So, I have just run Trimmomatic on these data sets. It produced 4 files - unpaired and paired files from both R1 and R2. Do I need to retain all files for downstream applications like mapping reads and transcript quantification, or can I just stick to the 2 paired reads files? Also, can I delete the original 10Gb files from Galaxy (I have them backed up on FTP and a local hard drive)?
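If only the two paired files are carried forward, it can be worth confirming that they still list the same reads in the same order. The sketch below is one way to check that, assuming well-formed 4-line FASTQ records; `read_ids` and `pairs_in_sync` are hypothetical helper names, and the file paths are whatever the trimmed outputs are called locally.

```python
import gzip
from itertools import zip_longest

def read_ids(path):
    """Yield the read ID of each FASTQ record, without comment or /1, /2 suffix."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:  # first line of each 4-line record is the header
                name = line[1:].split()[0]
                if name.endswith(("/1", "/2")):
                    name = name[:-2]
                yield name

def pairs_in_sync(r1_path, r2_path):
    """Return (matching_pairs, ok); ok is False at the first mismatch or length difference."""
    n = 0
    for id1, id2 in zip_longest(read_ids(r1_path), read_ids(r2_path)):
        if id1 != id2:
            return n, False
        n += 1
    return n, True
```

A result of `(n, True)` means both files hold the same `n` read IDs in the same order, which is what paired-end mappers generally expect.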

I reran the fastQC after Trimmomatic on one of my files and it doesn't seem to have changed the results that much. Is this normal for the parameters that failed?

I've also noticed that some reads start with an N even after Trimmomatic, but always in the first position - was it not supposed to remove N's? Should I just trim the first nucleotide from those data sets?
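To see how common the leading N actually is before deciding whether to trim, a small count like the following may be enough. This is a sketch with hypothetical helper names (`parse_fastq`, `leading_n_stats`, `crop_first_base`), and it assumes well-formed 4-line FASTQ records. Trimmomatic's own HEADCROP step removes a fixed number of bases from the start of every read, which would be the in-tool equivalent of `crop_first_base`.

```python
def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")

def leading_n_stats(records):
    """Return (reads_starting_with_N, total_reads)."""
    total = 0
    leading_n = 0
    for _header, seq, _qual in records:
        total += 1
        if seq.startswith("N"):
            leading_n += 1
    return leading_n, total

def crop_first_base(record):
    """Drop the first base and its quality score from one record."""
    header, seq, qual = record
    return header, seq[1:], qual[1:]
```

If only a small fraction of reads start with N, cropping the first base from every read may cost more data than it saves; that is a judgment call.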

Additionally, for the Trinity workflow, the only available QC tool on their Galaxy instance is the fastQ Quality Trimmer - I used it to trim 5' and 3' ends with a window size of 4, step-size 1, max # to exclude 0, min score >= 20. After putting my datasets through it, the files are not shrinking by much, so I'm assuming the data are good quality, or am I not restricting parameters far enough?
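For intuition about what those settings do, here is a simplified sketch of sliding-window end trimming with window size 4, step size 1, and minimum score 20. It is not the Galaxy tool's actual implementation: it assumes Phred+33 quality encoding, uses the minimum score in the window as the test, and ignores the "max # to exclude" option. `sliding_window_trim` is a hypothetical name.

```python
def phred33(qual):
    """Convert a Phred+33 quality string to a list of integer scores."""
    return [ord(ch) - 33 for ch in qual]

def sliding_window_trim(seq, qual, window=4, step=1, min_score=20):
    """Trim both ends until the window at each end has all scores >= min_score."""
    scores = phred33(qual)
    start, end = 0, len(scores)
    # 5' end: advance while the leading window contains a low score
    while start < end and min(scores[start:min(start + window, end)]) < min_score:
        start += step
    # 3' end: retreat while the trailing window contains a low score
    while start < end and min(scores[max(end - window, start):end]) < min_score:
        end -= step
    return seq[start:end], qual[start:end]
```

With this kind of rule, reads whose ends are already above the threshold pass through unchanged, so files that barely shrink are consistent with good-quality ends rather than with parameters that are too lenient.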

Do I need to process the raw reads in any other way besides the two methods I used above or can I map them now?

Apologies for the long post, just wanted to be as descriptive as possible.

Thank you in advance for your help!

Best regards, Dennis

rna-seq • 1.5k views


13 months ago, Jennifer Hillman Jackson ♦ 25k (United States) wrote:

Hi,

The RNA-seq tutorials here cover the current best-practices for the analysis you describe. Perhaps compare and decide the optimal workflow for your purposes using that information as a guide? https://galaxyproject.org/learn/

Tutorials cover QA/QC of reads (including interpreting and acting on FastQC results), differential expression, and assembly. Start there, I think you'll find many of your questions answered (however, some are judgment calls you will need to make).

For the part about retaining data, I would definitely recommend saving any intermediate datasets (download) in case you need them later, but just the actual inputs need to be in the history for tool or workflow use.

Thanks! Jen, Galaxy team


13 months ago, Dennis • 10 wrote:

Thank you Jennifer!

I have browsed through the tutorials and am making my way through them slowly. I can find examples of "good" and "bad" data sets with FastQC online, but no indication of what to do when certain parameters fail. For example, what does a failure of per base sequence content and kmer content indicate? Adapter contamination? (The adapter content analysis in FastQC, however, does not indicate adapter contamination.)

Thank you, Dennis


More help:

FastQC website https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Base content: this page explains what spikes in the data indicate (often related to Kmers). The majority of issues have to do with the type of data that was sequenced and the lab techniques used. RNA-seq mappers handle several of these biases without a problem (they were designed to work with this specific data). https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/4%20Per%20Base%20Sequence%20Content.html
Kmers: some spikes are expected for RNA-seq data, but these can also indicate overrepresented sequence. If sequence is overrepresented, that can mean that the library construction or sequencing was of lower quality, or that the sequencing was targeted (deep sequencing of a particular feature/region). The web page has many more details: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/4%20Per%20Base%20Sequence%20Content.html

In summary, many of the stats indicate quality problems that can be addressed (with Trimmomatic and other QC tools in that same tool group). Others indicate inherent issues or "features" of the sequence type/method or overall quality that cannot be remedied after the fact through bioinformatics QA/QC, but that you may be able to work around during mapping and other downstream steps (or at least be aware of when interpreting the analysis results). Some might require feedback to the sequencing lab so that protocols can be modified to produce higher-quality data going forward (improved library prep, sequencing protocols). Do what QC you can, try mapping and see what happens, adjust mapping parameters to see if you can get a better result, and the like.

Also be aware that FastQC only reviews the first 200k sequences. For certain sequencing methods, these first sequences can have a different quality profile than later sequences for technical reasons. You might want to slice up your data and run FastQC on a different subset of sequences to compare and get an idea of what the rest of the data looks like.
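One way to slice out a different subset for a second FastQC run is sketched below. It assumes 4-line FASTQ records and plain or gzipped input; `write_fastq_slice` and its parameters are hypothetical names chosen for this example.

```python
import gzip
from itertools import islice

def write_fastq_slice(in_path, out_path, first_record, n_records):
    """Copy n_records records starting at first_record (0-based); return records written."""
    opener = gzip.open if in_path.endswith(".gz") else open
    start = first_record * 4  # each record spans 4 lines
    written_lines = 0
    with opener(in_path, "rt") as src, open(out_path, "w") as dst:
        for line in islice(src, start, start + n_records * 4):
            dst.write(line)
            written_lines += 1
    return written_lines // 4
```

For example, taking 200,000 records starting from the middle of the file and running FastQC on that output gives a second view to compare against the report for the start of the file.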

Hope that helps! There is also much discussion online (apart from our tutorials; different analyses have different QA/QC processes and data requirements, so look at those that address your specific analysis as well as the generalized QC tutorial). For example, a Google search for "kmer fastqc" brings up topics like this one: https://www.biostars.org/p/172860/. When reviewing data scientifically, you'll want the big picture to make your decisions, and reviewing discussions that others have had can provide insight. There isn't a single definitive way to address all data concerns, but starting with advice based on what others are doing will help you learn about the options, what to be concerned about, and what you can ignore or should expect.

Reply written 13 months ago by Jennifer Hillman Jackson ♦ 25k (modified 13 months ago)

Thank you Jen! This is wonderful - I'll walk through the QC the best I can and try mapping. I also realised that I can use Trim Galore to specify adapter sequences to be removed - I looked it up and my sequencing facility provided me with their sequence.

Last question - what is the difference between TruSeq3 (Pair-ended, for Mi-Seq and Hi-Seq) and TruSeq3 (additional sets) (Pair-ended, for Mi-Seq and Hi-Seq) options in Trimmomatic during ILLUMINACLIP?

Reply written 13 months ago by Dennis • 10

Glad that helped. Please see here for Trimmomatic options:

In short, a different set of built-in adapter data is used to do the match and trim (I am pretty sure the latter set includes everything in the first plus more).

Reply written 13 months ago by Jennifer Hillman Jackson ♦ 25k (modified 13 months ago)

Thank you very much Jen!

Reply written 13 months ago by Dennis • 10
