Question: splitting and rejoining datasets
mm4184 wrote, 17 months ago:

I am working with large FASTQ files which are too big to use on Galaxy (about 120 GB; even after purging everything else I still don't have enough space to run a quality trim on the data). Does anyone know of programs to split my dataset into pieces, which I could recombine after batch quality trimming and mapping to a reference? I have seen file-merge tools, but not file-split tools.

Thanks.

Guy Reeves (Germany) wrote, 17 months ago:

I assume you are looking for a tool outside Galaxy.

I guess you will be able to use

head -n 1500000 your_file.fastq > first_chunk.fastq

to split the FASTQ files into smaller pieces (I guess that as long as the number of lines is a multiple of four, as 1500000 is, you should be OK, since each FASTQ record spans four lines). I guess if you take the same number of lines from each file in a FASTQ pair you should be OK too.
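
If you would rather cut the whole file into pieces in one pass, GNU split can do the same job; a minimal sketch, assuming paired files named reads_R1.fastq and reads_R2.fastq (placeholder names) and GNU coreutils:

split -l 1500000 -d --additional-suffix=.fastq reads_R1.fastq R1_chunk_   # 1500000 lines = 375000 reads per piece
split -l 1500000 -d --additional-suffix=.fastq reads_R2.fastq R2_chunk_   # same line count keeps the mates in sync

ls R1_chunk_*.fastq | wc -l   # sanity check: both mates should give the same number of pieces
ls R2_chunk_*.fastq | wc -l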

Then you can import each piece into Galaxy, map each one, and then merge the BAMs. Might that work?
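
If the final merge ends up happening outside Galaxy instead, samtools can recombine the per-piece alignments; a rough sketch, assuming the mapped pieces are named chunk_00.bam, chunk_01.bam, and so on (hypothetical names):

samtools merge merged.bam chunk_*.bam             # combine the per-piece BAMs into one
samtools sort -o merged.sorted.bam merged.bam     # coordinate-sort the combined file
samtools index merged.sorted.bam                  # index it for downstream tools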

Jennifer Hillman Jackson (United States) wrote, 17 months ago:

Hello,

To do this within Galaxy, see the tools in the Text Manipulation tool group (line-command equivalents are sketched just after the list):

  • Select first lines of a dataset (same as head line-command)
  • Select last lines of a dataset (same as tail line-command)
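
For reference, the line-command equivalents of those two tools, assuming a hypothetical file reads.fastq and a 1500000-line cut-off:

head -n 1500000 reads.fastq > first_lines.fastq   # first 1500000 lines of the dataset
tail -n 1500000 reads.fastq > last_lines.fastq    # last 1500000 lines of the dataset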

Thanks, Jen, Galaxy team

Guy Reeves replied, 17 months ago:

Hi Jen, of course the tools you mention are the way to go if the 130 GB file could be uploaded to usegalaxy.org. I think the plan is to slice the file outside Galaxy in the hope that the final merged BAM comes in under the 100 GB limit. I guess it could work. Cheers, Guy


Jennifer Hillman Jackson replied, 17 months ago:

Thanks, Guy, I missed the file-size part. So Unix is definitely the way to do this.

But a small warning: when using public servers, processing extremely large datasets presents problems when running tools (memory or walltime failures), produces results that take up too much account quota space, and the like. This applies at http://usegalaxy.org and at every other public Galaxy server I can think of. Even if the trimming works in batches, mapping then presents the next hurdle. A CloudMan Galaxy using AWS resources, or a local setup with sufficient resources, would be the most practical way to get data this large processed.

I know you've seen/know about this, but maybe mm4184 hasn't yet (or others reading this post): https://galaxyproject.org/choices/
