Question: Custom genome for Galaxy - Ensembl reference genome
Xiefan Fang wrote:

I want to perform a DEXSeq analysis of alternative splicing, which first requires mapping the RNA-seq data to a reference genome (the zebrafish genome in my case). I want to use Galaxy to do the TopHat2 mapping with the zebrafish genome downloaded from ftp://ftp.ensembl.org/pub/release-75/fasta/danio_rerio/dna/ . There are about 80 small files in the Ensembl folder. I downloaded them, concatenated the files in Linux, uploaded the result to Galaxy as a fasta file, and ran TopHat2 for mapping. However, an error occurred:

Warning: Encountered reference sequence with only gaps
Error: Reference sequence has more than 2^32-1 characters!  Please divide the
reference into batches or chunks of about 3.6 billion characters or less each
and index each independently.
Error: Encountered internal Bowtie 2 exception (#1)
Command: bowtie2-build /galaxy/run/prod/database/files/097/dataset_97644.dat genome 
Deleting "genome.3.bt2" file written during aborted indexing attempt.
Deleting "genome.4.bt2" file written during aborted indexing attempt.

[2014-06-05 10:28:33] Beginning TopHat run (v2.0.1)
-----------------------------------------------
[2014-06-05 10:28:33] Checking for Bowtie
Traceback (most recent call last):
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 3901, in <module>
    sys.exit(main())
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 3706, in main
    check_bowtie(params)
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 1381, in check_bowtie
    bowtie_version = get_bowtie_version(params.bowtie2)
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 1264, in get_bowtie_version
    bowtie_version = [int(x) for x in ver_numbers[:3]] + [int(ver_numbers[3][4:])]
IndexError: list index out of range

 

What can I do? I prefer to use the Ensembl genome assembly because I need to use the Ensembl transcriptome for annotation later. Thank you, and I look forward to your answers!

Jennifer Hillman Jackson wrote:

Hello,

If you are pulling all the files from this directory, then you are merging multiple versions of the genome into a single file. Pick just one. The README file there, and the others in the directories above it, describe the data contents. A "toplevel" file is almost certainly what you want - most likely the masked version, which does not include all of the unassembled fragments.
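For example, a minimal command-line sketch - the exact file name here is an assumption based on the usual Ensembl naming for the Zv9 assembly in release 75, so adjust it to whatever the FTP directory actually lists:

# Download one masked toplevel file instead of concatenating the whole directory
wget ftp://ftp.ensembl.org/pub/release-75/fasta/danio_rerio/dna/Danio_rerio.Zv9.75.dna_rm.toplevel.fa.gz
gunzip Danio_rerio.Zv9.75.dna_rm.toplevel.fa.gz

# Sanity check: the total base count should be roughly 1.4 billion for Zv9,
# well under the 2^32-1 limit that bowtie2-build reported in the error above
awk '!/^>/ {n += length($0)} END {print n}' Danio_rerio.Zv9.75.dna_rm.toplevel.fa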

Once you have the fasta, check its content (a very good idea): does it contain the same sequence identifiers as the reference annotation you plan to use? If so, you have the right file. Do the identifiers differ slightly? Then modify them before using the file in Galaxy. They must be an exact match across all inputs for the Tuxedo pipeline to work correctly.
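A quick way to compare them is sketched below; the file names genome.fa and annotation.gtf are placeholders, so substitute your own datasets:

# Sequence identifiers in the fasta
grep '^>' genome.fa | sed 's/^>//; s/ .*//' | sort -u > fasta_ids.txt
# Sequence names used in the annotation
grep -v '^#' annotation.gtf | cut -f1 | sort -u > gtf_ids.txt
# Anything printed here is present in one file but not the other
comm -3 fasta_ids.txt gtf_ids.txt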

Hopefully this helps, Jen, Galaxy team

 

Xiefan Fang replied:

Thank you for your answer! I merged all the toplevel files using the cat command and used the merged file for the TopHat2 analysis, but an error occurred (please see below). Do you know what the reason might be? Many thanks!

 

-----------------------------------------------------------------------------

job id: 55870

tool id: tophat2

-----------------------------------------------------------------------------

job command line:

bowtie2-build "/galaxy/run/prod/database/files/098/dataset_98567.dat" genome ; ln -s "/galaxy/run/prod/database/files/098/dataset_98567.dat" genome.fa ; tophat2 --num-threads 4 -r 300 --mate-std-dev=20 genome /galaxy/run/prod/database/files/097/dataset_97704.dat /galaxy/run/prod/database/files/097/dataset_97705.dat

-----------------------------------------------------------------------------

job stderr:

Fatal error: Tool execution failed

Warning: Encountered reference sequence with only gaps
[the same warning repeated 100 times]

 

[2014-06-17 05:23:53] Beginning TopHat run (v2.0.1)
-----------------------------------------------
[2014-06-17 05:23:53] Checking for Bowtie
Traceback (most recent call last):
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 3901, in <module>
    sys.exit(main())
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 3706, in main
    check_bowtie(params)
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 1381, in check_bowtie
    bowtie_version = get_bowtie_version(params.bowtie2)
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 1264, in get_bowtie_version
    bowtie_version = [int(x) for x in ver_numbers[:3]] + [int(ver_numbers[3][4:])]
IndexError: list index out of range

 

-----------------------------------------------------------------------------

job stdout:

Settings:
  Output files: "genome.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  /galaxy/run/prod/database/files/098/dataset_98567.dat
Reading reference sizes
Time reading reference sizes: 00:00:54
Time to join reference sequences: 00:00:43
bmax according to bmaxDivN setting: 865326026
Using parameters --bmax 648994520 --dcv 1024
[block-by-block suffix-array construction output for the forward index trimmed: 7 blocks sorted and written]
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 1096062216
fchr[G]: 1730811367
fchr[T]: 2365767559
fchr[$]: 3461304104
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 1191375537 bytes to primary EBWT file: genome.1.bt2
Wrote 865326032 bytes to secondary EBWT file: genome.2.bt2
Headers:
  len: 3461304104
  bwtLen: 3461304105
  sz: 865326026
  bwtSz: 865326027
  lineRate: 6
  offRate: 4
  offMask: 0xfffffff0
  ftabChars: 10
  eftabLen: 20
  eftabSz: 80
  ftabLen: 1048577
  ftabSz: 4194308
  offsLen: 216331507
  offsSz: 865326028
  lineSz: 64
  sideSz: 64
  sideBwtSz: 48
  sideBwtLen: 192
  numSides: 18027626
  numLines: 18027626
  ebwtTotLen: 1153768064
  ebwtTotSz: 1153768064
  color: 0
  reverse: 0
Total time for call to driver() for forward index: 03:31:40
[block-by-block suffix-array construction output for the mirror (reverse) index trimmed: 9 blocks sorted and written]
Wrote 1191375537 bytes to primary EBWT file: genome.rev.1.bt2
Wrote 865326032 bytes to secondary EBWT file: genome.rev.2.bt2
Headers: same values as the forward index, except reverse: 1
Total time for backward call to driver() for mirror index: 03:38:37

Jennifer Hillman Jackson replied:

The warnings indicate that the dataset contains a sequence made up of nothing but "N"s. Whether or not this is true can be checked, and any such sequence removed (and reported back to the source if one is found). Plenty of genomes used and indexed here lead off with long runs of "N" content without producing this warning, so leading gaps alone are not the issue.

But I would also confirm proper "fasta" format before moving forward, since improper formatting could be triggering the error. In particular, make sure there are no extra lines or spaces and that the sequence lines are wrapped. Here is some help:
https://wiki.galaxyproject.org/Support#Error_from_tools
https://wiki.galaxyproject.org/Learn/Datatypes#Fasta
https://wiki.galaxyproject.org/Support#Custom_reference_genome
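
As a rough first pass (sketch only - it assumes the merged fasta is named genome.fa), you can flag any record with no A/C/G/T bases at all and count blank lines:

# Print the header of any record that contains no A/C/G/T bases (i.e. only gaps/Ns)
awk '/^>/ {if (name != "" && !hasbase) print name; name=$0; hasbase=0; next}
     /[ACGTacgt]/ {hasbase=1}
     END {if (name != "" && !hasbase) print name}' genome.fa

# Count blank lines, which should be zero in a clean fasta
grep -c '^$' genome.fa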

Jen, Galaxy team
 

Xiefan Fang wrote:

duplicate with comment
