I'm working with two files that contain nucleotide count data from the 'variant annotator' tool. I filtered the two files so that they only contain genomic positions that appear in both files, so there shouldn't be any lines unique to just one sample. I then merged the two files (roughly 40 million lines each) using the join tool, which for some reason produced a file with ~43 million lines, but I figure that's just each tool estimating the number of lines slightly differently?
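In case it matters: I haven't actually ruled out that a position occurs more than once within a single file. Here's a rough Python sketch of the check I have in mind (the tab delimiter and the chromosome/position column numbers are guesses on my part, not something I've confirmed from the tool's output format):

    from collections import Counter

    def duplicate_positions(path, chrom_col=0, pos_col=1):
        # Counts (chromosome, position) keys that occur more than once.
        # Keeps every key in memory, so for ~40M lines this is only a
        # rough sanity check, not something I'd run casually.
        counts = Counter()
        with open(path) as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                counts[(fields[chrom_col], fields[pos_col])] += 1
        return {key: n for key, n in counts.items() if n > 1}

    # e.g. duplicate_positions("sample1_filtered.tabular")  # file name is made up

If that came back non-empty for either file, I suppose the join could legitimately end up with more lines than either input.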
Then I used the compute tool to add together two of the columns in the joined file. This is where I've run into a problem: the compute tool only outputs ~12 million lines. On top of that, the tool tells me this:

    kept 100.00% of 35827546 lines. Skipped 22883409 invalid lines starting at line #12944138: "__NONE__ 4_group1 1005265 0 1 0 0 1 0 C . 0.0 4_group11005265 __NONE__ 4_group1 1005265 0 1 0 0 1 0 C . 0.0 4_group11005

So it apparently kept 35.8 million lines and discarded another 22.9 million, which comes to roughly 58.7 million lines, even though the original file only contains ~43 million. The output doesn't jibe with what the tool is telling me, OR with the file that it originated from.
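For what it's worth, the computation itself is trivial; per line it's just something like this (columns 5 and 6 here are placeholders, not the actual columns from my history):

    def add_two_columns(line, col_a=4, col_b=5):
        # col_a / col_b are placeholder indices, not my real workflow settings
        fields = line.rstrip("\n").split("\t")
        return line.rstrip("\n") + "\t" + str(int(fields[col_a]) + int(fields[col_b]))

which is why I expected exactly one output line per input line.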
Any idea what's going on?
Suzanne
PS. I also can't figure out why it discarded such a large chunk of lines. Every line should be in the same format, so the tool should be able to compute on every one of them, and I should get exactly as many lines out as I put in.
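If it helps, this is the kind of check I was planning to run on the joined file to see whether some lines really are in a different format (the file name and the tab delimiter are assumptions on my part):

    from collections import Counter

    field_counts = Counter()
    with open("joined.tabular") as handle:  # file name is a guess
        for line in handle:
            field_counts[len(line.rstrip("\n").split("\t"))] += 1
    print(field_counts)  # how many lines have each number of fields

If every line really were in the same format, I'd expect a single entry here.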