I'm working with two files that contain nucleotide count data from the 'variant annotator' tool. I filtered the two files so that they only contain genomic positions that appear in both files, so there shouldn't be any lines unique to just one sample. I then merged the two files (roughly 40 million lines each) using the join tool, which for some reason produced a file with ~43 million lines, but I figure that's just each tool estimating the number of lines slightly differently?
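In case it matters: I haven't actually ruled out that a position occurs more than once within a single file. Here's a rough Python sketch of the check I have in mind (the tab delimiter and the chromosome/position column numbers are guesses on my part, not something I've confirmed from the tool's output format):

    from collections import Counter

    def duplicate_positions(path, chrom_col=0, pos_col=1):
        # Counts (chromosome, position) keys that occur more than once.
        # Keeps every key in memory, so for ~40M lines this is only a
        # rough sanity check, not something I'd run casually.
        counts = Counter()
        with open(path) as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                counts[(fields[chrom_col], fields[pos_col])] += 1
        return {key: n for key, n in counts.items() if n > 1}

    # e.g. duplicate_positions("sample1_filtered.tabular")  # file name is made up

If that came back non-empty for either file, I suppose the join could legitimately end up with more lines than either input.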
Then I used the compute tool to add together two of the columns in the joined file. This is where I've run into a problem: the compute tool only outputs ~12 million lines. On top of that, the tool tells me this:

    kept 100.00% of 35827546 lines. Skipped 22883409 invalid lines starting at line #12944138: "__NONE__ 4_group1 1005265 0 1 0 0 1 0 C . 0.0 4_group11005265 __NONE__ 4_group1 1005265 0 1 0 0 1 0 C . 0.0 4_group11005

So it apparently kept 35.8 million lines and discarded another 22.9 million, which comes to roughly 58.7 million lines, even though the original file only contains ~43 million. The output doesn't jibe with what the tool is telling me, OR with the file that it originated from.
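For what it's worth, the computation itself is trivial; per line it's just something like this (columns 5 and 6 here are placeholders, not the actual columns from my history):

    def add_two_columns(line, col_a=4, col_b=5):
        # col_a / col_b are placeholder indices, not my real workflow settings
        fields = line.rstrip("\n").split("\t")
        return line.rstrip("\n") + "\t" + str(int(fields[col_a]) + int(fields[col_b]))

which is why I expected exactly one output line per input line.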
Any idea what's going on?
Suzanne
PS. I also can't figure out why it discarded such a large chunk of lines. Every line should be in the same format, so the tool should be able to compute on every one of them, and I should get exactly as many lines out as I put in.
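If it helps, this is the kind of check I was planning to run on the joined file to see whether some lines really are in a different format (the file name and the tab delimiter are assumptions on my part):

    from collections import Counter

    field_counts = Counter()
    with open("joined.tabular") as handle:  # file name is a guess
        for line in handle:
            field_counts[len(line.rstrip("\n").split("\t"))] += 1
    print(field_counts)  # how many lines have each number of fields

If every line really were in the same format, I'd expect a single entry here.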