Question: Get Flanks skipped invalid lines
d.angra50 wrote:

Hello Galaxy Experts

I have a question about the "Get flanks" tool under "Operate on genomic intervals", which I am using to get the genomic sequences flanking all of my SNPs. The tool reports that it skipped a number of invalid lines. I have checked every line in the files I am fetching flanks from, and none appears to be missing genomic coordinates or to contain anything else that could cause an error. My data has 770,000 SNPs, and the tool skipped 232217 lines, which is a substantial number of SNPs. I want to identify these lines. Is there any way to do that? Thinking about it further, I realise there could be insufficient coverage in these regions, so I definitely need to know which lines they are and, if possible, get flanks for these SNPs.

I am in the middle of writing a full paper for which this information is very important, so I would appreciate an early reply.

D

get flanks • 858 views
ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by d.angra50
Jennifer Hillman Jackson25k wrote:

Hello,

Try using "Cut" to restrict the data to just the input the tool needs. Adjust the datatype and column assignments after (pencil icon -> Edit Attributes) as needed. Then re-run to see if that fixes the problem.

Some help from the tool form is quoted below; see the full section in the UI for input examples. Name and strand can also be included.

Note: Every line should contain at least 3 columns: Chromosome number, Start and Stop co-ordinates. If any of these columns is missing or if start and stop co-ordinates are not numerical, the tool may encounter exceptions and such lines are skipped as invalid. The number of invalid skipped lines is documented in the resulting history item as a "Data issue".
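To list the offending lines outside Galaxy, the rule quoted above can be approximated with a short script. This is only a sketch of that rule, not the Get Flanks tool's own code. It assumes a tab-separated file with chromosome, start and end in the first three columns (adjust the hypothetical `chrom_col`, `start_col` and `end_col` parameters otherwise), and it assumes that blank lines and lines starting with `#` are headers rather than data.

```python
def find_invalid_lines(lines, chrom_col=0, start_col=1, end_col=2):
    """Return (line_number, text, reason) for each line that breaks the quoted rule.

    Assumptions (not taken from the tool itself): tab-separated input,
    0-based column indexes, blank lines and '#' lines are ignored.
    """
    problems = []
    needed = max(chrom_col, start_col, end_col) + 1
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\r\n")
        if not text.strip() or text.startswith("#"):
            continue  # assumed to be header or padding, not data
        fields = text.split("\t")
        if len(fields) < needed:
            problems.append((number, text, "fewer columns than expected"))
            continue
        if not fields[chrom_col].strip():
            problems.append((number, text, "empty chromosome column"))
            continue
        try:
            int(fields[start_col])
            int(fields[end_col])
        except ValueError:
            problems.append((number, text, "start or end is not an integer"))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python find_invalid.py my_intervals.tsv
    with open(sys.argv[1]) as handle:
        for number, text, reason in find_invalid_lines(handle):
            print(f"{number}\t{reason}\t{text}")
```

If the count this reports differs a lot from the number Galaxy shows, the column assignments in Edit Attributes are a likely place to look next.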

If this doesn't solve the problem, using "Group" or "Datamash" can help to check the content. Other tools in the Text Manipulation tool group can also be used to detect problems. I tend to use "Compute" to compare data on the same line but in different columns.

Hopefully you are able to sort out the problem. Jen, Galaxy team

ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by Jennifer Hillman Jackson25k
d.angra50 wrote:

Hi Jen, thank you for the reply. I have tried these things, and my data look fine with respect to datatype, columns, etc. The solution you suggested might work for data in a single file, but my data are in two different files: 1) the SNPs, in pgSNP format, and 2) the Get Flanks output, in interval format. I need to find the lines from the pgSNP file that are skipped by the "Get flanks" tool, so I want to compare the two files. Sorry if I have misunderstood anything.

D

ADD COMMENTlink written 2.7 years ago by d.angra50

With this input, you could try the Genome Diversity: Flanking Sequence tool with the output set to fasta. (I assumed you were originally using the Operate on Genomic Intervals: Get Flanks tool.) The fasta headers are the same as the first column in the pg_snp dataset, so both can be parsed out and compared ("Compare two datasets").
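For anyone who prefers to do that comparison on downloaded files, here is a minimal sketch of the same idea in Python. It assumes, as described above, that each fasta header (the text after `>`) matches the first column of the SNP table exactly. The function name `missing_ids` is hypothetical, and the table is assumed to be tab-separated with `#` header lines.

```python
def missing_ids(snp_lines, fasta_lines):
    """Return first-column values from the SNP table that have no fasta header.

    Assumes the fasta header text (after '>') equals the first column of
    the SNP table, and that '#' lines in the table are headers.
    """
    fasta_ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fasta_ids.add(line[1:].strip())

    missing = []
    for line in snp_lines:
        text = line.rstrip("\r\n")
        if not text or text.startswith("#"):
            continue  # skip blank and header lines
        key = text.split("\t")[0]
        if key not in fasta_ids:
            missing.append(key)
    return missing


if __name__ == "__main__":
    import sys

    # Usage: python missing_ids.py snps.tsv flanks.fasta
    with open(sys.argv[1]) as snps, open(sys.argv[2]) as fasta:
        for key in missing_ids(snps, fasta):
            print(key)
```

The identifiers it prints are the SNPs for which no flanking sequence was returned, in the order they appear in the table.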

Comparing/Joining on overlapping intervals probably won't produce the type of output that you want.

That said, finding out what is wrong with the input (lines that are skipped due to some format issue) seems worth examining, but you decide.

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Jennifer Hillman Jackson25k

Hi Jen

I had tried this as well, but I think it requires "choose species" to be set. Could you please specify which "species" category I should use for this exercise? I hope to find a solution soon. D

ADD REPLYlink written 2.7 years ago by d.angra50

Before running "Flanking Sequence", convert the pg_snp data to gd_snp using "Make File : Build a gd_snp or gd_genotype file". This way the tool will be able to interpret the columns and the species (the "database" name, as a dbkey).

Note that these files have the target species (dbkey) in the header lines. https://usegalaxy.org/static/formatHelp.html

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Jennifer Hillman Jackson25k