Low properly paired rate after TopHat

Question: Low properly paired rate after TopHat

3.8 years ago by

hsuehyuanc • 0

United States

hsuehyuanc • 0 wrote:

Hi all,

I have a very low 436924 + 0 properly paired (0.73%:-nan%)

I think it is because my paired data was changed after I trimed the low quality reads.

In the past, I use Galaxy interlacer and de-interlacer function to pair the data again.

I wonder how to do this using command lines.

Thank you,

I attached the output after running flagstat:

samtools flagstat accepted_hits.bam

60237435 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
60237435 + 0 mapped (100.00%:-nan%)
59744051 + 0 paired in sequencing
31769966 + 0 read1
27974085 + 0 read2
436924 + 0 properly paired (0.73%:-nan%)
47258258 + 0 with itself and mate mapped
12485793 + 0 singletons (20.90%:-nan%)
47220958 + 0 with mate mapped to a different chr
47220958 + 0 with mate mapped to a different chr (mapQ>=5)

rna-seq • 2.9k views

ADD COMMENT • link •

modified 3.8 years ago • written 3.8 years ago by hsuehyuanc • 0

3.8 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

I am assuming that you are working on the public Main Galaxy server at http://usegalaxy.org. Although most of this advice could apply when running within Galaxy anywhere (or even command-line).

The order of the sequence data is not a factor with this tool. And to let you know, if you use Tophat2, the first output dataset will provide additional statistics directly from the tool. Perhaps try this if you are using "Tophat for Illumina" currently?

Performing too much QA/QC can impact paired rates when there is not enough of the sequence data left that meets the mapping criteria set on the Tophat/Tophat2 tool form. Double check the parameters against the sequences content (length, quality). In particular, I am wondering if the option "Minimum length of read segments:" is set too high. This is found under "full parameters". The value needs to be no less that one half of the total length of the reads in order to map without bias. This is true for Tophat and Tophat2 (it is not clear which you are using). The other parameters can be reviewed in the tool's 3rd party documentation - the link is on the tool form.

Other items to consider:

1. Are you certain that the sequences have correctly scaled quality scores and the appropriate "datatype" assigned? This generally presents as lower mapping rates and pairs. You want the data in "fastqsanger" format. This wiki explains how to check, rescale if necessary, and set the metadata:
http://wiki.galaxyproject.org/Support#FASTQ_Datatype_QA

2. Perhaps try a run with data that has less aggressive trimming (and other manipulations) done prior to mapping. Just trimming at a sequence quality of 20 is good place to start and run a test mapping job. The goal with this tool set is to get as much of the data mapped as possible. Less manipulation is generally better.

Best, Jen, Galaxy team

ADD COMMENT • link written 3.8 years ago by Jennifer Hillman Jackson ♦ 25k

Thank you Jen,

Now I understand the low properly paired rate is not due to trim (I only remove low complexity, quality <20, read length <30).

my first first TopHat file prereads.info) is :

left_min_read_len=30
left_max_read_len=101
left_reads_in =37621733
left_reads_out=37617949
right_min_read_len=30
right_max_read_len=101
right_reads_in =37621733
right_reads_out=37609423
unpaired_min_read_len=30
unpaired_max_read_len=101
unpaired_reads_in =598371
unpaired_reads_out=598313

My RNA sample is Apple.

I ran several times of TopHat with different options, the low properly paired rate is always a problem.

I provide my command and output summary below, Could you tell whats wrong by the information?

1. Use default options

tophat Malus_x_domestica.v1.0.contigs.fa HC0W1_R1_qc.fastq HC0W1_R2_qc.fastq

Align summary

Left reads:
Input : 37621733
Mapped : 32815374 (87.2% of input)
of these: 12760216 (38.9%) have multiple alignments (65212 have >20)
Right reads:
Input : 37621733
Mapped : 29061975 (77.2% of input)
of these: 11851890 (40.8%) have multiple alignments (65223 have >20)
Unpaired reads:
Input : 598371
Mapped : 510248 (85.3% of input)
of these: 124732 (24.4%) have multiple alignments (0 have >20)
82.3% overall read mapping rate.

Aligned pairs: 25354952
of these: 10935386 (43.1%) have multiple alignments
12675502 (50.0%) are discordant alignments
33.7% concordant pair alignment rate.

Flagstat

112828042 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
112828042 + 0 mapped (100.00%:-nan%)
112115232 + 0 paired in sequencing
58659254 + 0 read1
53455978 + 0 read2
1087302 + 0 properly paired (0.97%:-nan%)
96492240 + 0 with itself and mate mapped
15622992 + 0 singletons (13.93%:-nan%)
96435938 + 0 with mate mapped to a different chr
28804504 + 0 with mate mapped to a different chr (mapQ>=5)

2. With some general options

tophat -a 5 -m 1 -i 40 -I 5000 -p 4 -g 1 --microexon-search --coverage-search Malus_x_domestica.v1.0.contigs.fa HC0W1_R1_qc.fastq HC0W1_R2_qc.fastq

Align summary

Left reads:
Input : 37621733
Mapped : 31765262 (84.4% of input)
of these: 2663386 ( 8.4%) have multiple alignments (3141938 have >1)
Right reads:
Input : 37621733
Mapped : 27970159 (74.3% of input)
of these: 2663386 ( 9.5%) have multiple alignments (2920135 have >1)
Unpaired reads:
Input : 598371
Mapped : 493311 (82.4% of input)
79.4% overall read mapping rate.

Aligned pairs: 23622294
of these: 2663386 (11.3%) have multiple alignments
19558253 (82.8%) are discordant alignments
10.8% concordant pair alignment rate.

Flagstat

60228732 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
60228732 + 0 mapped (100.00%:-nan%)
59735421 + 0 paired in sequencing
31765262 + 0 read1
27970159 + 0 read2
436702 + 0 properly paired (0.73%:-nan%)
47244588 + 0 with itself and mate mapped
12490833 + 0 singletons (20.91%:-nan%)
47207272 + 0 with mate mapped to a different chr
47207272 + 0 with mate mapped to a different chr (mapQ>=5)

Thank you,

ADD REPLY • link written 3.8 years ago by hsuehyuanc • 0

Hello, It could be that your reference genome is overly fragmented. Meaning, the contigs are not long enough to capture full gene bounds, so ends of pairs are not aligning to the same "chromosome/contig".

Also, I didn't see any modification to the parameter ""Minimum length of read segments:":

--segment-length                  Each read is cut up into segments, each at least this long. These segments are mapped independently. The default is 25.

If your sequences are 30, then this should be set to 15. Be aware that this change will cause the job to consume more memory.

Best, Jen, Galaxy team

ADD REPLY • link written 3.8 years ago by Jennifer Hillman Jackson ♦ 25k

Hi Jen,

I found that using interlacer/ de-interlacer can solve my problem.

I still do not know how to write a command line for interlacer, but I use galaxy platform for this purpose.

These are flagstat outputs.

No trimmed pair-end reads
83951898 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
83951898 + 0 mapped (100.00%:-nan%)
83951898 + 0 paired in sequencing
39375726 + 0 read1
44576172 + 0 read2
64554028 + 0 properly paired (76.89%:-nan%)
71047040 + 0 with itself and mate mapped
12904858 + 0 singletons (15.37%:-nan%)
4776704 + 0 with mate mapped to a different chr
666928 + 0 with mate mapped to a different chr (mapQ>=5)

Trimmed paired-end reads

116823170 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
116823170 + 0 mapped (100.00%:-nan%)
116823170 + 0 paired in sequencing
61256423 + 0 read1
55566747 + 0 read2
2264624 + 0 properly paired (1.94%:-nan%)
100599138 + 0 with itself and mate mapped
16224032 + 0 singletons (13.89%:-nan%)
100467510 + 0 with mate mapped to a different chr
26523106 + 0 with mate mapped to a different chr (mapQ>=5)

After interlacer/de-interlacer of the trimmed reads

80413866 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
80413866 + 0 mapped (100.00%:-nan%)
80413866 + 0 paired in sequencing
43115217 + 0 read1
37298649 + 0 read2
62323310 + 0 properly paired (77.50%:-nan%)
66791410 + 0 with itself and mate mapped
13622456 + 0 singletons (16.94%:-nan%)
3117332 + 0 with mate mapped to a different chr
633558 + 0 with mate mapped to a different chr (mapQ>=5)

Could you provide me method to write command lines to run interlacer? what module should I load?

I have searched this for a long time and cannot find answers.

Thank you,

ADD REPLY • link written 3.8 years ago by hsuehyuanc • 0

Hum. The numbers do not appear to be the same datasets, but I don't know those details.

To learn the command-line of most tools, you can look on the "Info" page found by clicking on the "i" icon for a dataset produced by the tool. Most are there - but if it is not, run the job in a way that causes the job to fail, then send the bug report to yourself. The command-line is almost always present. You can also look in the server log for the error, but this can be easier.

Thanks, Jen, Galaxy team

ADD REPLY • link written 3.8 years ago by Jennifer Hillman Jackson ♦ 25k

Hi Jen,

I found the closest info to command line is, is this the right information I need? How to use it?

Galaxy Tool ID:

toolshed.g2.bx.psu.edu/repos/devteam/fastq_paired_end_interlacer/fastq_paired_end_interlacer/1.1

ADD REPLY • link written 3.8 years ago by hsuehyuanc • 0

The command is further down on the info page, the section is called "Job Command-Line". For this tool, it will look something like this, but with the paths/dataset names specific to your system:

python /path_to_tool/tool.py 'path_to_first_input_dataset' 'sanger' 'path_to_second_input_dataset' 'sanger' 'path_to_first_output_dataset' 'path_to_second_output_dataset'

The general pattern for all tools is this:

python tool input1 parameter(s) output1

Hopefully this helps, Jen, Galaxy team

ADD REPLY • link written 3.8 years ago by Jennifer Hillman Jackson ♦ 25k

Similar posts • Search »