Question: Custom genome for Galaxy - Ensembl reference genome
Xiefan Fang wrote:

I want to perform a DEXSeq analysis of alternative splicing, which first requires mapping the RNA-seq data to a reference genome (the zebrafish genome in my case). I want to use Galaxy to do the TopHat2 mapping with the zebrafish genome downloaded from ftp://ftp.ensembl.org/pub/release-75/fasta/danio_rerio/dna/ . There are about 80 small files in the Ensembl folder. I downloaded them, concatenated the files in Linux, uploaded the result to Galaxy as a fasta file, and ran TopHat2 for mapping. However, an error occurred:

Warning: Encountered reference sequence with only gaps
Error: Reference sequence has more than 2^32-1 characters!  Please divide the
reference into batches or chunks of about 3.6 billion characters or less each
and index each independently.
Error: Encountered internal Bowtie 2 exception (#1)
Command: bowtie2-build /galaxy/run/prod/database/files/097/dataset_97644.dat genome 
Deleting "genome.3.bt2" file written during aborted indexing attempt.
Deleting "genome.4.bt2" file written during aborted indexing attempt.

[2014-06-05 10:28:33] Beginning TopHat run (v2.0.1)
-----------------------------------------------
[2014-06-05 10:28:33] Checking for Bowtie
Traceback (most recent call last):
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 3901, in <module>
    sys.exit(main())
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 3706, in main
    check_bowtie(params)
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 1381, in check_bowtie
    bowtie_version = get_bowtie_version(params.bowtie2)
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 1264, in get_bowtie_version
    bowtie_version = [int(x) for x in ver_numbers[:3]] + [int(ver_numbers[3][4:])]
IndexError: list index out of range

 

What can I do? I prefer to use the Ensembl genome assembly because I need to use the Ensembl transcriptome for annotation later. Thank you, and I look forward to your answers!

Jennifer Hillman Jackson wrote:

Hello,

If you are pulling all the files from this directory, then you are merging multiple versions of the genome into a single file. Pick just one. The README file there, and the others in the directories above it, describe the data contents. A "toplevel" file is almost certainly what you want - most likely the masked version, which does not include all of the unassembled fragments.
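For example, a minimal command-line sketch - the exact file name here is an assumption based on the usual Ensembl naming for the Zv9 assembly in release 75, so adjust it to whatever the FTP directory actually lists:

# Download one masked toplevel file instead of concatenating the whole directory
wget ftp://ftp.ensembl.org/pub/release-75/fasta/danio_rerio/dna/Danio_rerio.Zv9.75.dna_rm.toplevel.fa.gz
gunzip Danio_rerio.Zv9.75.dna_rm.toplevel.fa.gz

# Sanity check: the total base count should be roughly 1.4 billion for Zv9,
# well under the 2^32-1 limit that bowtie2-build reported in the error above
awk '!/^>/ {n += length($0)} END {print n}' Danio_rerio.Zv9.75.dna_rm.toplevel.fa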

Once you have the fasta, check its content (a very good idea): does it contain the same sequence identifiers as the reference annotation you plan to use? If so, you have the right file. Do the identifiers differ slightly? Then modify them before using the file in Galaxy. They must be an exact match across all inputs for the Tuxedo pipeline to work correctly.
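A quick way to compare them is sketched below; the file names genome.fa and annotation.gtf are placeholders, so substitute your own datasets:

# Sequence identifiers in the fasta
grep '^>' genome.fa | sed 's/^>//; s/ .*//' | sort -u > fasta_ids.txt
# Sequence names used in the annotation
grep -v '^#' annotation.gtf | cut -f1 | sort -u > gtf_ids.txt
# Anything printed here is present in one file but not the other
comm -3 fasta_ids.txt gtf_ids.txt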

Hopefully this helps, Jen, Galaxy team

 

Xiefan Fang replied:

Thank you for your answer! I merged all the toplevel files using the cat command and used the merged file for the TopHat2 analysis, but an error occurred (please see below). Do you know what the reason might be? Many thanks!

 

-----------------------------------------------------------------------------

job id: 55870

tool id: tophat2

-----------------------------------------------------------------------------

job command line:

bowtie2-build "/galaxy/run/prod/database/files/098/dataset_98567.dat" genome ; ln -s "/galaxy/run/prod/database/files/098/dataset_98567.dat" genome.fa ; tophat2 --num-threads 4 -r 300 --mate-std-dev=20 genome /galaxy/run/prod/database/files/097/dataset_97704.dat /galaxy/run/prod/database/files/097/dataset_97705.dat

-----------------------------------------------------------------------------

job stderr:

Fatal error: Tool execution failed

Warning: Encountered reference sequence with only gaps
[the same warning repeated 100 times]

 

[2014-06-17 05:23:53] Beginning TopHat run (v2.0.1)
-----------------------------------------------
[2014-06-17 05:23:53] Checking for Bowtie
Traceback (most recent call last):
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 3901, in <module>
    sys.exit(main())
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 3706, in main
    check_bowtie(params)
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 1381, in check_bowtie
    bowtie_version = get_bowtie_version(params.bowtie2)
  File "/apps/tuxedo/tophat/2.0.1/bin/tophat2", line 1264, in get_bowtie_version
    bowtie_version = [int(x) for x in ver_numbers[:3]] + [int(ver_numbers[3][4:])]
IndexError: list index out of range

 

-----------------------------------------------------------------------------

job stdout:

Settings:
  Output files: "genome.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  /galaxy/run/prod/database/files/098/dataset_98567.dat
Reading reference sizes
Time reading reference sizes: 00:00:54
Time to join reference sequences: 00:00:43
bmax according to bmaxDivN setting: 865326026
Using parameters --bmax 648994520 --dcv 1024
[block-by-block suffix-array construction output for the forward index trimmed: 7 blocks sorted and written]
Exited Ebwt loop
fchr[A]: 0
fchr[C]: 1096062216
fchr[G]: 1730811367
fchr[T]: 2365767559
fchr[$]: 3461304104
Exiting Ebwt::buildToDisk()
Returning from initFromVector
Wrote 1191375537 bytes to primary EBWT file: genome.1.bt2
Wrote 865326032 bytes to secondary EBWT file: genome.2.bt2
Headers:
  len: 3461304104
  bwtLen: 3461304105
  sz: 865326026
  bwtSz: 865326027
  lineRate: 6
  offRate: 4
  offMask: 0xfffffff0
  ftabChars: 10
  eftabLen: 20
  eftabSz: 80
  ftabLen: 1048577
  ftabSz: 4194308
  offsLen: 216331507
  offsSz: 865326028
  lineSz: 64
  sideSz: 64
  sideBwtSz: 48
  sideBwtLen: 192
  numSides: 18027626
  numLines: 18027626
  ebwtTotLen: 1153768064
  ebwtTotSz: 1153768064
  color: 0
  reverse: 0
Total time for call to driver() for forward index: 03:31:40
[block-by-block suffix-array construction output for the mirror (reverse) index trimmed: 9 blocks sorted and written]
Wrote 1191375537 bytes to primary EBWT file: genome.rev.1.bt2
Wrote 865326032 bytes to secondary EBWT file: genome.rev.2.bt2
Headers: same values as the forward index, except reverse: 1
Total time for backward call to driver() for mirror index: 03:38:37

Jennifer Hillman Jackson replied:

The warnings indicate that the dataset contains a sequence made up of nothing but "N"s. Whether or not this is true can be checked, and any such sequence removed (and reported back to the source if one is found). Plenty of genomes used and indexed here lead off with long runs of "N" content without producing this warning, so leading gaps alone are not the issue.

But I would also confirm proper "fasta" format before moving forward, since improper formatting could be triggering the error. In particular, make sure there are no extra lines or spaces and that the sequence lines are wrapped. Here is some help:
https://wiki.galaxyproject.org/Support#Error_from_tools
https://wiki.galaxyproject.org/Learn/Datatypes#Fasta
https://wiki.galaxyproject.org/Support#Custom_reference_genome
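
As a rough first pass (sketch only - it assumes the merged fasta is named genome.fa), you can flag any record with no A/C/G/T bases at all and count blank lines:

# Print the header of any record that contains no A/C/G/T bases (i.e. only gaps/Ns)
awk '/^>/ {if (name != "" && !hasbase) print name; name=$0; hasbase=0; next}
     /[ACGTacgt]/ {hasbase=1}
     END {if (name != "" && !hasbase) print name}' genome.fa

# Count blank lines, which should be zero in a clean fasta
grep -c '^$' genome.fa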

Jen, Galaxy team
 

Xiefan Fang wrote:

duplicate with comment
