Question: Fasta Files from FTP sites
gkuffel22 (United States) wrote, 3.6 years ago:

Hi,

I have used FTP to download the mouse genome from NCBI, Ensembl, and UCSC. When I open the genome.fa file for a closer look, it seems to contain only a series of "N" characters and no nucleotides ("ACTG"). When I try to use these FASTA files in Galaxy as my custom reference genome, the tools throw errors. Does anyone know why these files appear to lack sequence data? Thanks for your help.

Jennifer Hillman Jackson (United States) wrote, 3.6 years ago:

Hello,

The genomes may contain runs of N (representing unsequenced gaps such as heterochromatin, AFAIK; at UCSC each mouse chromosome begins with a long run of Ns, roughly the first 3 Mb in mm10). However, the rest of the sequence should be fine (it is just soft-masked, if you picked that version). You can compare against the exact version used on Galaxy Main (http://usegalaxy.org) by accessing our rsync server:
http://wiki.galaxyproject.org/Admin/UseGalaxyRsync

As another example, to show which version of the genome is used on Main, this was the source for mm10 (we use similar versions for all UCSC-sourced genomes):

http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/

chromFa.tar.gz - The assembly sequence in one file per chromosome.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period
    of 12 or less) are shown in lower case; non-repeating sequence is
    shown in upper case.
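
To make the soft-masking convention described above concrete, here is a minimal Python sketch. The function name `masking_summary` and the short fragment are made up for illustration; real chromosome files are far larger and would normally be read line by line.

```python
def masking_summary(sequence):
    """Count soft-masked (lower case), unmasked (upper case) and N bases."""
    counts = {"soft_masked": 0, "unmasked": 0, "n": 0}
    for base in sequence:
        if base in "Nn":
            counts["n"] += 1            # gap / unknown base
        elif base.islower():
            counts["soft_masked"] += 1  # repeat, shown in lower case
        else:
            counts["unmasked"] += 1     # non-repeating, shown in upper case
    return counts


# Hypothetical fragment: a gap, a lower-case repeat, then unique sequence.
fragment = "NNNNacgtacgtACGTTGCA"
print(masking_summary(fragment))

# Upper-casing drops the mask but leaves the bases themselves unchanged.
print(fragment.upper())
```

Soft-masking only changes the letter case, so the lower-case stretches are still real sequence, not missing data.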

The "mm10.2bit" file has the same content, with all chromosomes in one file. Sometimes this is easier to download and work with. Use the UCSC utility "twoBitToFa" to convert it to FASTA (the utility is available from UCSC's downloads on the same web site: http://hgdownload.soe.ucsc.edu/admin/exe/).
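
As a rough sketch of the conversion step, the snippet below wraps the basic `twoBitToFa input.2bit output.fa` invocation in Python. It assumes the `twoBitToFa` binary has already been downloaded and placed on your PATH; the helper names and the file names in the usage note are illustrative only.

```python
import shutil
import subprocess


def twobit_to_fasta_command(twobit_path, fasta_path, exe="twoBitToFa"):
    """Build the argument list for: twoBitToFa input.2bit output.fa"""
    return [exe, twobit_path, fasta_path]


def convert(twobit_path, fasta_path):
    """Run the conversion, failing early if the UCSC utility is not installed."""
    if shutil.which("twoBitToFa") is None:
        raise FileNotFoundError("twoBitToFa not found on PATH")
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(twobit_to_fasta_command(twobit_path, fasta_path), check=True)
```

With the binary installed, `convert("mm10.2bit", "mm10.fa")` should write all chromosomes to a single FASTA file.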

Thanks, Jen, Galaxy team

Daniel Blankenberg (United States) wrote, 3.6 years ago:

Are you sure that the entire files are just strings of Ns? Often, the ends of chromosomes will be filled with these characters.
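
One way to check this without scrolling through the file is to tally N content per sequence. The following is a minimal sketch, not a Galaxy tool: `n_report` is a hypothetical helper that streams FASTA lines and reports, for each record, the total length, the number of Ns, and the length of the leading run of Ns.

```python
def n_report(lines):
    """Yield (name, length, n_count, leading_n) for each FASTA record."""
    name = None
    length = n_count = leading_n = 0
    in_lead = True
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length, n_count, leading_n
            name = line[1:].strip()
            length = n_count = leading_n = 0
            in_lead = True
            continue
        if name is None:
            continue  # ignore anything before the first header
        upper = line.upper()  # treat soft-masked 'n' the same as 'N'
        length += len(upper)
        n_count += upper.count("N")
        if in_lead:
            stripped = upper.lstrip("N")
            leading_n += len(upper) - len(stripped)
            in_lead = not stripped  # still in the leading run only if the line was all N
    if name is not None:
        yield name, length, n_count, leading_n


# Made-up record: 12 leading Ns, then sequence with a short internal gap.
demo = [">chrDemo", "NNNNNNNN", "NNNNacgt", "ACGTNNAC"]
for record in n_report(demo):
    print(record)
```

For a real download, pass an open file handle instead of the demo list, for example `with open("genome.fa") as handle: ...`. A record whose `n_count` equals its `length` really is all Ns; a large `leading_n` with a much larger `length` just means the sequence starts with a gap.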

What downstream errors are you getting from Galaxy tools?


gkuffel22 replied, 3.6 years ago:

No, I guess I am not sure. I have just never seen a FASTA file filled with these characters; I usually work with bacterial genomes and have never come across that. The downstream errors I am getting are from running TopHat. Here is the error: "Couldn't build bowtie index with err = 1". I know that TopHat first has Bowtie build an index from the reference FASTA file, so I figured the string of Ns was causing the issue.