Hello,
Try this:
- Extract the fastq sequences from the SAM file using NGS: Picard > SamToFastq
- Convert the fastq to a tabular dataset using Convert Formats > FASTQ to Tabular converter
- Filter out just the fields you want to retain (sequence identifier plus sequence?) using Text Manipulation > Cut
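If you would rather script this outside Galaxy, the three steps above can be roughly sketched in Python. This is only an illustration, not what the Galaxy tools run internally. The function name `sam_to_tabular` is made up for this example. It assumes an uncompressed SAM text file, skips secondary and supplementary alignments, and reverse-complements reads flagged as reverse-strand so the sequences come out in their original orientation.

```python
# Translation table used to complement bases before reversing a read.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_to_tabular(sam_lines):
    """Yield (read_name, sequence) pairs from an iterable of SAM text lines.

    Hypothetical helper for illustration; paired-end mates are not
    given /1 and /2 suffixes here.
    """
    for line in sam_lines:
        # Skip blank lines and header lines (which start with "@").
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        # 0x100 = secondary alignment, 0x800 = supplementary alignment.
        if flag & 0x900:
            continue
        # 0x10 = read is stored reverse-complemented in the SAM record.
        if flag & 0x10:
            seq = seq.translate(_COMPLEMENT)[::-1]
        yield name, seq
```

To get a tab-separated file, write each pair out as `name + "\t" + seq + "\n"` while reading the SAM file line by line.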
**Optional additional steps to remove any duplicates:
- Convert the tabular data to fasta using Convert Formats > Tabular-to-FASTA
- Collapse duplicate reads using NGS: QC and manipulation > Collapse sequences
- Convert fasta back to tabular using Convert Formats > FASTA-to-Tabular
** There are other tools that will find "unique lines" in tabular datasets, but I'm not sure they will work well on such a large dataset with long values in the fields (the sequences). You could try, though. An error there would not be a bug; it would mean the data is too large/complex to process that way, in which case use the method above instead.
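As a rough idea of what collapsing duplicates does, here is a small Python sketch. It is not the Galaxy tool itself, and `collapse_sequences` is a made-up name. It assumes the rows are (identifier, sequence) pairs that fit in memory, and it keeps each distinct sequence once along with how many times it was seen.

```python
from collections import Counter


def collapse_sequences(rows):
    """Return [(sequence, count), ...] sorted by count, highest first.

    `rows` is an iterable of (identifier, sequence) pairs; identifiers
    are dropped because identical reads are merged into one entry.
    """
    counts = Counter(seq for _name, seq in rows)
    # most_common() orders by descending count.
    return counts.most_common()
```

For a very large dataset this needs enough memory to hold every distinct sequence, which is the same kind of limit that could trip up a generic "unique lines" tool.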
Any plain text file that has tabs separating columns can be imported into Excel. The limitation is the maximum number of rows Excel accepts per worksheet: 1,048,576 in Excel 2007 and later (.xlsx), or 65,536 in the older .xls format. Give the file the extension .txt, either during download from Galaxy or afterwards, so that Excel will recognize the file.
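Before opening the file, you can check whether it will fit. A minimal sketch, assuming the 1,048,576-row worksheet limit of current Excel versions (`fits_in_excel` is a made-up helper name):

```python
EXCEL_MAX_ROWS = 1_048_576  # worksheet row limit in Excel 2007 and later


def fits_in_excel(lines, max_rows=EXCEL_MAX_ROWS):
    """Return True if the number of lines does not exceed max_rows.

    `lines` can be any iterable of lines, such as an open file handle.
    """
    n_rows = sum(1 for _ in lines)
    return n_rows <= max_rows
```

For example, `fits_in_excel(open("reads.txt"))` would check a downloaded file, where `reads.txt` is just a placeholder name.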
Hope that helps! Jen, Galaxy team