Question: Galaxy interval format - what should be provided as CHROM#?
2.1 years ago
netnauke wrote:

Dear colleagues,

I have a .txt file with >100 lines of the following format (1st column - sequence ID, 2nd - start coordinate, 3rd - end coordinate, 4th - strand; everything separated by TABs):

PA14sr_076    2867353    2867490    +

I am trying to fetch the sequences corresponding to these coordinates from the full genome sequence. I figured out that I could convert my file into Interval format for this. However, I do not understand what I should use as CHROM# in this case. As this is a bacterial species, it has only one chromosome anyway. And when I check the full genome sequence in GenBank, it has no identifier that looks like a chromosome name.

When I run "Extract genomic DNA" without providing CHROM#, I get an empty output. Could someone help me, please? Thanks in advance...

2.1 years ago
United States
Jennifer Hillman Jackson wrote:


The attribute for chrom is a sequence identifier from one of the sequences in the reference genome (one or more chromosomes, or multiple scaffold/contigs, or some combination). The start and end are positions on a specific sequence contained within the reference genome with strand used as a modifier.

Examine the reference genome to understand the identifiers used. Then double check that the coordinates are based on those sequences. My guess is that the example region you shared uses the name of a genetic region as the chrom identifier (not the reference genome's actual chromosome name). However, the start/end appear to be genomic coordinates.
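If the reference genome is available as a FASTA file, one quick way to see the identifiers is to list the header lines. The sketch below is only an illustration: it assumes the identifier is the first whitespace-delimited token after ">" (a common convention), and the function name and example headers are made up, not taken from any real genome.

```python
def fasta_ids(lines):
    """Yield the identifier of each FASTA record.

    Assumption: the identifier is the first whitespace-delimited
    token after '>' on a header line.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


# Hypothetical FASTA content, used only to show the behavior.
example = [
    ">example_chromosome hypothetical description text",
    "ACGTACGT",
    ">example_plasmid",
    "GGCC",
]

print(list(fasta_ids(example)))
```

For a real file, pass an open file handle instead of the list, e.g. `fasta_ids(open("genome.fasta"))`, where the file name is a placeholder. Whatever identifier is printed is what should go in the chrom column.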

More about reference genomes

More about interval and bed format:

BED format is a better choice for this particular tool. Pad the columns that have no known content (name and score) with default values. Tools in the Text Manipulation group can be used for this, or the data can be formatted prior to upload.
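For formatting the data before upload, a minimal sketch of the conversion to 6-column BED (chrom, start, end, name, score, strand) could look like this. Several things here are assumptions: "my_reference_id" is a placeholder for the real identifier from the reference genome, the original first column is kept as the BED name, "0" is used as the default score, and the input coordinates are treated as 1-based inclusive. BED start coordinates are 0-based, so the start is shifted by one in that case; set `one_based=False` if the coordinates are already 0-based.

```python
def to_bed6(line, chrom, one_based=True):
    """Convert 'name  start  end  strand' into one BED6 line.

    chrom     -- sequence identifier from the reference genome
    one_based -- assumption about the input coordinates; if True,
                 start is shifted by -1 (BED starts are 0-based)
    """
    name, start, end, strand = line.split()
    start = int(start)
    end = int(end)
    if one_based:
        start -= 1
    # "0" is a default score, since no real score is known.
    return "\t".join([chrom, str(start), str(end), name, "0", strand])


# The row from the question; "my_reference_id" is a placeholder.
row = "PA14sr_076\t2867353\t2867490\t+"
print(to_bed6(row, "my_reference_id"))
```

Applying the function to every line of the file and writing the results out gives a file that can be uploaded with the datatype set to bed.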

Thanks, Jen, Galaxy team

2.1 years ago
netnauke wrote:

Thanks a lot for the help, everything works now.

