I am new to RNA-seq data and am learning this process step by step, so I have a few questions (I have highlighted them below so they are easy to spot after reading the whole post). I have a set of RNA-seq data that I recently received from a sequencing facility: paired-end data for 3 treatment groups with 3 biological replicates each. The end goal of this experiment is to get an assembled transcriptome of the tissue of interest and to perform transcript quantification.
Firstly, I intend to use two workflows for mapping the reads:
(i) the Trinity RNA-seq protocol on the Galaxy Trinity instance (https://galaxy.ncgas-trinity.indiana.edu) --> RSEM, edgeR, etc.
(ii) mapping against a custom reference genome by uploading the genome files (?) and using Bowtie2 to map, then StringTie and Cuffdiff to analyze abundance
FYI: for my species of interest there is a newly released genome with the following files: (1) scaffold and (2) contig genome assemblies, (3) a GFF file, (4) gene annotations, (5) transcripts, and (6) peptides. Which of these files will I need in order to map against this first genome release?
Do the workflows above sound good to you?
I have started QC analysis and ran FastQC on my original datasets, and it flagged several things:
- Duplication levels = fail (after reading about quantitative experiments I realize this isn't a problem; the consensus is that duplicated sequences should not be removed in quantitative experiments)
- Per base sequence content: all R2 reads failed, all R1 reads got a warning (are these adapters?)
- Per sequence GC content: the GC content of the released genome is ~35%; the plot of my datasets has "small shoulders" towards the middle and a spike around the ~30% mark
- Kmer content: also failed at the starting base positions and sometimes spiked in the middle of reads
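To make sure I'm reading the per-sequence GC plot correctly, here is a tiny sketch I put together of how GC content per read is computed (purely illustrative, my own code, not FastQC's; skipping Ns is my own assumption):

```python
def gc_content(seq: str) -> float:
    """GC fraction of a single read; ambiguous bases (N) are ignored."""
    seq = seq.upper()
    counted = [b for b in seq if b in "ACGT"]
    if not counted:
        return 0.0
    return sum(1 for b in counted if b in "GC") / len(counted)

# Two toy reads; values near 0.35 would match the released genome's ~35% GC.
print(round(gc_content("ATGCATTTAA"), 2))  # 0.2
print(round(gc_content("GGGCCCATAT"), 2))  # 0.6
```

My understanding is that FastQC plots the distribution of this per-read value across the whole file, which is why contamination or adapters can show up as extra peaks or "shoulders".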
So I have just run Trimmomatic on these datasets. It produced 4 files: unpaired and paired files for both R1 and R2. Do I need to retain all four for downstream applications like mapping and transcript quantification, or can I just stick to the 2 paired-reads files? Also, can I delete the original 10 GB files from Galaxy (I have them backed up on FTP and on a local hard drive)?
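My understanding of why the two "paired" output files belong together is that they must still list mates in the same order; here's the sanity check I'd run on the read IDs (my own sketch, assuming old-style /1 and /2 ID suffixes):

```python
def mates_in_sync(ids_r1, ids_r2):
    """True if the paired R1/R2 files list the same mates in the same order
    (read IDs assumed to end in /1 and /2, old Illumina style)."""
    if len(ids_r1) != len(ids_r2):
        return False
    return all(a.rsplit("/", 1)[0] == b.rsplit("/", 1)[0]
               for a, b in zip(ids_r1, ids_r2))

print(mates_in_sync(["@read1/1", "@read2/1"], ["@read1/2", "@read2/2"]))  # True
```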
I reran FastQC on one of the files after Trimmomatic, and the results don't seem to have changed much. Is this normal for the parameters that failed?
I've also noticed that some reads still start with an N even after Trimmomatic, but always in the first position only. Wasn't it supposed to remove Ns? Should I just trim the first nucleotide from those datasets?
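If trimming that first base myself is the way to go, I imagine it would look something like this (a minimal sketch of my own, dropping position 1 only when it is an N, and trimming the quality string to match):

```python
def drop_leading_n(seq: str, qual: str):
    """Remove the first base (and its quality score) only if it is an N."""
    if seq.startswith("N"):
        return seq[1:], qual[1:]
    return seq, qual

print(drop_leading_n("NATGCC", "!IIIII"))  # ('ATGCC', 'IIIII')
```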
Additionally, for the Trinity workflow the only QC tool available on their Galaxy instance is the FASTQ Quality Trimmer. I used it to trim the 5' and 3' ends with a window size of 4, step size 1, max # to exclude 0, and min score >= 20. After putting my datasets through it the files are not shrinking by much, so I'm assuming the data are of good quality, or are my parameters not strict enough?
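For reference, this is what I understand the window-size-4, step-size-1, min-score-20 settings to mean (my own sketch of a sliding-window trimmer, not the tool's actual code): the read is cut at the start of the first window whose mean quality falls below the threshold.

```python
def sliding_window_trim(quals, window=4, min_mean=20):
    """Scan 5'->3' one position at a time (step size 1); truncate the read at
    the start of the first window whose mean quality is below min_mean."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < min_mean:
            return quals[:i]
    return quals

# A read whose quality collapses halfway through gets cut there.
print(sliding_window_trim([30, 30, 30, 30, 10, 10, 10, 10]))  # [30, 30, 30]
```

If that's right, then files barely shrinking would just mean very few windows ever dip below a mean of 20.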
Do I need to process the raw reads in any other way besides the two methods above, or can I map them now?
Apologies for the long post, just wanted to be as descriptive as possible.
Thank you in advance for your help!
Best regards, Dennis