Get Flanks skipped invalid lines

2.7 years ago by

United Kingdom

d.angra • 50 wrote:

Hello Galaxy Experts

I have a query about the "Get flanks" tool under "Operate on genomic intervals", which I am using to get the genomic sequences flanking all of my SNPs. The tool reports that it skipped a number of invalid lines. I have checked all the lines in the files from which I fetched flanks around the SNPs, and none appears to be missing genomic coordinates or to contain anything else that could cause an error. My data has 770,000 SNPs, and the tool skipped 232217 lines, which is a substantial number of SNPs. I want to identify these lines. Is there any way I can do that? Thinking about this further, I realise there could be insufficient coverage in these regions. I definitely need to know which lines these are and, if possible, get flanks for these SNPs.

I am in the middle of writing a full paper for which this information is very important, so I would appreciate an early reply.

modified 2.7 years ago • written 2.7 years ago by d.angra • 50

2.7 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello,

Try using "Cut" to restrict the data to just the input the tool needs. Adjust the datatype and column assignments after (pencil icon -> Edit Attributes) as needed. Then re-run to see if that fixes the problem.

Some help from the tool form is quoted below, but see the full section in the UI for input examples. Name and strand can be included.

Note: Every line should contain at least 3 columns: Chromosome number, Start and Stop co-ordinates. If any of these columns is missing or if start and stop co-ordinates are not numerical, the tool may encounter exceptions and such lines are skipped as invalid. The number of invalid skipped lines is documented in the resulting history item as a "Data issue".

If this doesn't solve the problem, using "Group" or "Datamash" can help to check the content. Other tools in the Text Manipulation tool group can also be used to detect problems. I tend to use "Compute" to compare data on the same line but in different columns.
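The rule in the quoted note can also be checked outside Galaxy on a downloaded copy of the input. The following is only a minimal sketch, not the tool's own code: the function name `find_invalid_lines` and the 0-based column positions (chromosome, start, end in the first three columns) are assumptions, and it tests only the two conditions the note states (at least 3 columns, numerical start and stop).

```python
def find_invalid_lines(lines, chrom_col=0, start_col=1, end_col=2):
    """Return (line_number, reason, text) for lines failing the checks in the note.

    Column positions are 0-based and hypothetical; adjust them to match the
    column assignments shown under Edit Attributes for the dataset.
    """
    needed = max(chrom_col, start_col, end_col) + 1
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\r\n")
        # Assumption: blank lines and '#' header lines are not data lines.
        if not text.strip() or text.startswith("#"):
            continue
        fields = text.split("\t")
        if len(fields) < needed:
            problems.append((number, "fewer than %d columns" % needed, text))
            continue
        if not fields[chrom_col].strip():
            problems.append((number, "empty chromosome field", text))
            continue
        try:
            int(fields[start_col])
            int(fields[end_col])
        except ValueError:
            problems.append((number, "start or end is not an integer", text))
    return problems


# Small in-memory example; for a real file, pass an open file handle instead.
sample = [
    "chr1\t100\t101\tA/G",
    "chr1\t200",
    "chr2\tabc\t301",
    "# comment",
    "chr3\t400\t401",
]
for number, reason, text in find_invalid_lines(sample):
    print(number, reason, repr(text))
```

If this reports nothing for the real input, the skipped lines are being rejected for a reason other than the two conditions in the note, which would be worth knowing in itself.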

Hopefully you are able to sort out the problem. Jen, Galaxy team

modified 2.7 years ago • written 2.7 years ago by Jennifer Hillman Jackson ♦ 25k

2.7 years ago by

d.angra • 50

United Kingdom

d.angra • 50 wrote:

Hi Jen, thank you for the reply. I have tried these things, and my data looks fine with respect to datatype, columns, etc. The solution you suggested would probably work for data in a single file, but my data are in two different files: 1) my SNPs, in pgSnp format, and 2) the Get Flanks output, in interval format. I need to find the lines from the pgSnp file that were skipped by the "Get flanks" tool, so I want to compare the two files. Sorry if I have misunderstood anything.

written 2.7 years ago by d.angra • 50

With this input, you could try the Genome Diversity: Flanking Sequence tool with the output set to fasta. (I assumed you were originally using the Operate on Genomic Intervals: Get Flanks tool.) The fasta headers are the same as the first column in the pg_snp dataset, so both can be parsed out and compared ("Compare two datasets").
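The same comparison can be sketched outside Galaxy on downloaded copies of both datasets. This is a rough illustration only: the function names are hypothetical, and it assumes, as described above, that the first tab-separated column of the SNP dataset holds the same identifier as the first token of each fasta header. The real header layout should be checked before relying on it.

```python
def fasta_ids(fasta_lines):
    """Collect the first whitespace-delimited token of every '>' header line."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.add(parts[0])
    return ids


def rows_missing_from_fasta(snp_lines, fasta_lines, id_col=0):
    """Return SNP rows whose identifier (column id_col, 0-based) has no fasta record."""
    seen = fasta_ids(fasta_lines)
    missing = []
    for line in snp_lines:
        text = line.rstrip("\r\n")
        # Assumption: blank lines and '#' header lines are not data rows.
        if not text or text.startswith("#"):
            continue
        fields = text.split("\t")
        if len(fields) <= id_col or fields[id_col] not in seen:
            missing.append(text)
    return missing


# Made-up example data, for illustration only.
snp_rows = ["snp1\t10\t11", "snp2\t20\t21", "snp3\t30\t31"]
fasta = [">snp1", "ACGT", ">snp3 extra text", "GGCC"]
for row in rows_missing_from_fasta(snp_rows, fasta):
    print(row)
```

The rows printed are the SNPs for which no flanking sequence was produced, which is the list being asked for.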

Comparing/Joining on overlapping intervals probably won't produce the type of output that you want.

That said, finding out what is wrong with the input (lines that are skipped due to some format issue) seems worth examining, but you decide.

modified 2.7 years ago • written 2.7 years ago by Jennifer Hillman Jackson ♦ 25k

Hi Jen

I had tried this as well, but I think it requires "choose species" as a mandatory setting. Could you please specify which "species" category I should use for this exercise? I hope to find a solution soon. D

written 2.7 years ago by d.angra • 50

Before running "Flanking Sequence", convert the pg_snp data to gd_snp using "Make File : Build a gd_snp or gd_genotype file". This way the tool will interpret the columns and species ("database" name as a dbkey).

Note that these files have the target species (dbkey) in the header lines. https://usegalaxy.org/static/formatHelp.html

modified 2.7 years ago • written 2.7 years ago by Jennifer Hillman Jackson ♦ 25k
