Question: Subtract
6.6 years ago by
Xianrong Wong90 wrote:
Hello, I am using the Subtract (whole dataset) tool. I converted my FASTQ file to tabular format with 2 columns: 1. identifier and 2. sequence. I then used "Select lines that match an expression" to pull a few lines from this initial tabular file, and I am trying to get a final dataset that does not contain those selected reads. To do this, I subtract the dataset of selected lines from the initial dataset. The tool works when I run the workflow on a relatively small file (1/50 the size of a whole sequencing experiment) but repeatedly fails when I input the full FASTQ file. Any idea why this is so? Jose
6.6 years ago by
United States
Jennifer Hillman Jackson25k wrote:
Hello,

Using the 'Subtract' tool between FASTQ datasets can be memory intensive, since it literally involves sorting and then comparing each character between the two files. This is likely not necessary. I have seen queries such as yours run successfully on even very large datasets by eliminating the Subtract step and instead using 'Select' with 'NOT Matching' on the original dataset.

Example:

current dataflow:
1 - original file A
2 - select positive match expression 'X' to create file B
3 - subtract file B from file A to create file C

better:
1 - original file A
2 - select negative match expression 'X' to create file C

If this failure is on the public main Galaxy server and you do not wish to change your query, then moving to a cloud instance and experimenting with larger memory options is one suggestion.

Hopefully this helps,

Jen
Galaxy team

--
Jennifer Jackson
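To illustrate why the single 'Select' step scales better, here is a minimal Python sketch of the two dataflows. It is not Galaxy's actual implementation: the function names, the sample rows, and the pattern `TTTT` are all made up for illustration. The point is only that the negative match can stream one line at a time, while a select-then-subtract approach has to hold the data to compare the two files.

```python
import re
from typing import Iterable, Iterator, List


def select_not_matching(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Hypothetical one-pass filter: yield only lines that do NOT match."""
    regex = re.compile(pattern)
    for line in lines:
        # Each line is tested and then discarded, so memory use stays flat.
        if not regex.search(line):
            yield line


def select_then_subtract(lines: Iterable[str], pattern: str) -> List[str]:
    """Hypothetical two-step version: build file B, then remove it from file A."""
    regex = re.compile(pattern)
    file_a = list(lines)  # whole dataset held in memory
    file_b = {row for row in file_a if regex.search(row)}  # selected lines
    return [row for row in file_a if row not in file_b]  # A minus B


# Made-up tabular rows: identifier <TAB> sequence
rows = [
    "read1\tACGTACGT",
    "read2\tTTTTGGGG",
    "read3\tACGTTTTT",
    "read4\tGGGGCCCC",
]

kept_streaming = list(select_not_matching(rows, r"TTTT"))
kept_subtract = select_then_subtract(rows, r"TTTT")
```

Both approaches keep the same rows here (`read1` and `read4`), but only the first one avoids materializing and comparing two datasets.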
