Hi, I have RNA-seq data and I am interested in whole gene expression results but also transcript expression, exon usage and long non coding RNA. The RNA-seq protocol is not optimized for circular and fusion RNA but I also want to look at that just in case I would find something giving me some ideas for future experiments. Below I am giving all my workflow for trimming and mapping in detail.
My problem is that I get errors trying to assemble mapped reads using StringTie. I use the same gencode annotation file that I included for mapping with STAR.
I used the following options in StringTie:
Use GFF file to guide assembly yes
Reference annotation to use for guiding the assembly process 11: gencode.v27.primary_assembly.annotation.gff3
Perform abundance estimation only of input transcripts False
Output additional files for use in... deseq2
Average read length 99
Whether to cluster genes that overlap with different gene IDs False
Options advanced
Disable trimming of predicted transcripts based on coverage False
Increase sensitivity False
Name prefix for output transcripts STRG
Minimum isoform fraction 0.15
Minimum assembled transcript length 35
Minimum anchor length for junctions 10
Minimum junction coverage 5
Minimum bundle reads per bp coverage to consider for assembly 2
Gap between read mappings triggering a new bundle 50
Fraction of bundle allowed to be covered by multi-hit reads 0.95
Do not assemble any transcripts on these reference sequence(s) Empty.
Additional gene abundance estimation output file True
Disable multi-mapping correction False
Job Resource Parameters no
However, I keep getting the follwoing error after certain minutes: Fatal error: Exit code 1 () Traceback (most recent call last): File "/galaxy/main/deps/_conda/envs/__stringtie@1.3.3/bin/prepDE.py", line 85, in <module> t_id=RE_TRANSCRIPT_ID.search(v[len(v)-1]).group(1) AttributeError: 'NoneType' object has no attribute 'group'
Can you tell me, what the problem might be or any suggestions what I may have to change?
**Thank you very much!!!
Best regards Stefanie**
More detail on trimming and mapping:
Trimming was performed by removing 5' and 3' ends, if quality was lower than 10 and also reads smaller than 20bp have been removed.
I mapped my RNA-seq data with STAR using gencode primary assembly reference sequence and annotation with the following options:
Single-end or paired-end reads single
RNA-Seq FASTQ/FASTA file 7: Trimmomatic on CON2.fastq
Custom or built-in reference genome history
Select a reference genome 8: GRCh38.primary_assembly.genome.fa
Gene model (gff3,gtf) file for splice junctions 9: gencode.v27.primary_assembly.annotation.gff3
Length of the genomic sequence around annotated junctions 99
Count number of reads per gene True
Would you like to set output parameters (formatting and filtering)? yes
Extra SAM attributes to include All
Include strand field flag XS Yes -- and reads with inconsistent and/or non-canonical introns are filtered out
Filter alignments containing non-canonical junctions Remove alignments with unannotated non-canonical junctions
Would you like to set additional output parameters (formatting and filtering)? yes
Would you like unmapped reads included in the SAM? False
Would you like all alignments with the best score labeled primary? False
MAPQ value for unique mappers 255
Would you like to keep only reads that contain junctions that passed filtering? False
Score range below the maximum score for multimapping alignments 1
Maximum number of alignments to output a read's alignment results, plus 1 10
Maximum number of mismatches to output an alignment, plus 1 10
Maximum ratio of mismatches to mapped length 0.2
Maximum ratio of mismatches to read length 1.0
Minimum alignment score 0
Minimum alignment score, normalized to read length 0.7
Minimum number of matched bases 18
Minimum number of matched bases, normalized to read length 0.7
Other parameters (seed, alignment, and chimeric alignment) star_fusion
Job Resource Parameters no
All mapped reads were in sorted bam files (sorted by coordinates). The mapping statistics gives me a mapping of almost 100%??? Are my mapping options not stringent enough? Here are the mapping results as detailed in star log file:
Number of input reads | 60296901
Average input read length | 100
UNIQUE READS:
Uniquely mapped reads number | 53290337
Uniquely mapped reads % | 88.38%
Average mapped length | 100.37
Number of splices: Total | 15721797
Number of splices: Annotated (sjdb) | 15589504
Number of splices: GT/AG | 15590195
Number of splices: GC/AG | 111954
Number of splices: AT/AC | 12448
Number of splices: Non-canonical | 7200
Mismatch rate per base, % | 0.25%
Deletion rate per base | 0.01%
Deletion average length | 1.83
Insertion rate per base | 0.01%
Insertion average length | 1.34
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 5825990
% of reads mapped to multiple loci | 9.66%
Number of reads mapped to too many loci | 22883
% of reads mapped to too many loci | 0.04%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 1.89%
% of reads unmapped: other | 0.03%
CHIMERIC READS:
Number of chimeric reads | 533492
% of chimeric reads | 0.88%