I have WXS fastq files from an Illumina HiSeq 4000 paired-end run. I uploaded them through FTP as fastqillumina. They are each about 24 GB. Reads look fine using FastQ Summary Statistics. I aligned to hg19 using BWA for Illumina and got a SAM file that is 62 GB. Then I took the SAM file and tried to run SAMTOOLS SAM to BAM. This ran for a few hours and the output BAM file is 1.8 KB (KILObytes, as in tiny). Please let me know where I went wrong with this workflow... Any help would be greatly appreciated. Thank you very much.
Hello,
Two areas to correct/adjust:
1. Database and Sorting
Does the input dataset have the correct reference genome assigned as the "database"? Samtools requires this as well as sorted input.
Fix: Assign the correct "database". If you used a Custom Reference genome for alignment, then create a Custom Build from that to assign. Sort the input dataset (here, the SAM file from BWA) before converting it.
How to change datatype: https://wiki.galaxyproject.org/Support#Tool_doesn.27t_recognize_dataset
How to create a Custom Build (other Custom Genome formatting rules are on the same wiki page): https://wiki.galaxyproject.org/Learn/CustomGenomes
Sorting tips: https://github.com/jennaj/support-prior-qa/wiki/Sort-your-inputs
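If you have a copy of the SAM file outside Galaxy, a quick look at its header can show which reference names and sort order it declares. This is only a rough sketch: summarize_sam_header is a hypothetical helper, not part of Galaxy or samtools, and it assumes a plain-text SAM file whose header uses the standard @HD (SO tag) and @SQ (SN tag) lines. A header can be missing or out of date, so treat the result as a hint.

```python
import io


def summarize_sam_header(handle):
    """Return (sort_order, reference_names) declared in a SAM header.

    Hypothetical helper for illustration only. sort_order is None when
    the header has no @HD line or no SO tag.
    """
    sort_order = None
    references = []
    for line in handle:
        if not line.startswith("@"):
            break  # the header ends at the first alignment record
        fields = line.rstrip("\n").split("\t")
        # Header tags look like "SN:chr1"; split on the first colon only.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if fields[0] == "@HD":
            sort_order = tags.get("SO")
        elif fields[0] == "@SQ":
            references.append(tags.get("SN"))
    return sort_order, references


# Tiny made-up SAM snippet standing in for a real file.
example = io.StringIO(
    "@HD\tVN:1.0\tSO:unsorted\n"
    "@SQ\tSN:chr1\tLN:249250621\n"
    "@SQ\tSN:chr2\tLN:243199373\n"
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
)
sort_order, references = summarize_sam_header(example)
print(sort_order, references)  # unsorted ['chr1', 'chr2']
```

For a real file, pass an open file handle instead of the StringIO. If the reference names do not match the genome assigned as the "database" (for example "1" instead of "chr1"), or the sort order is not "coordinate", that points at the first area above.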
2. Datatype: Fastqsanger
Tools require .fastqsanger formatted sequence/quality scores. I suspect your data is already in this format and the assignment of .fastqillumina is causing problems. Prior Q&A and bug reports with this type of result (low hits) are often due to the wrong sequence datatype as input - in content or by datatype assignment.
Fix: Double-check the format first, then either run FASTQ Groomer or assign the correct datatype, whichever the content calls for. Don't just change the assigned datatype without checking, or more unexpected results can occur. This is how: https://wiki.galaxyproject.org/Support#FASTQ_Datatype_QA
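To double-check the quality encoding yourself, one common heuristic is to look at the range of characters on the quality lines. The sketch below is an assumption-laden illustration, not how Galaxy decides: guess_quality_encoding is a hypothetical helper, it assumes strict 4-line FASTQ records (no wrapped sequence lines), and the cutoffs (ASCII below 59 means Phred+33, above 74 with nothing below 59 suggests Phred+64) are rules of thumb that can be fooled by unusual data.

```python
import io


def guess_quality_encoding(handle, max_records=10000):
    """Guess the quality offset from the first max_records FASTQ records.

    Hypothetical helper for illustration only. Returns "phred33"
    (fastqsanger-style), "phred64" (older fastqillumina-style),
    or "ambiguous".
    """
    lo, hi = 255, 0
    for i, line in enumerate(handle):
        if i // 4 >= max_records:
            break
        if i % 4 == 3:  # fourth line of each record holds the qualities
            codes = [ord(c) for c in line.rstrip("\n")]
            if codes:
                lo = min(lo, min(codes))
                hi = max(hi, max(codes))
    if lo < 59:
        return "phred33"  # characters this low cannot be Phred+64
    if hi > 74:
        return "phred64"  # likely, given no low characters were seen
    return "ambiguous"


# Two made-up single-record examples.
print(guess_quality_encoding(io.StringIO("@r1\nACGT\n+\nII#5\n")))  # phred33
print(guess_quality_encoding(io.StringIO("@r1\nACGT\n+\nhhfB\n")))  # phred64
```

A "phred33" result would suggest the data is already fastqsanger content, so the datatype assignment is what needs correcting. A "phred64" result would suggest running FASTQ Groomer.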
Thanks, Jen, Galaxy team