Question: pipeline for DNA-seq analysis
rgambhir wrote:

Thanks for all your help. I finally got the data uploaded to Galaxy. As suggested, there was a problem with uploading my fastq.gz files; now everything looks fine. I would like to start analyzing my data. I looked at the FastQC reports and everything looks good. I have DNA sequences derived from plasma samples of cancer patients (cell-free DNA). I am interested in aligning these sequences to a recent human genome build to detect mutations, insertions/deletions, or copy number variations. Please advise on the next step before I go into my data analysis.

Jennifer Hillman Jackson wrote:

Hi Ratish,

Glad that you were able to upload your data. For data prep, starting with FastQC is a great choice. From there, just make sure that the data has the quality scores scaled correctly and that the datatype labels are correct. Help for that is in the Galaxy wiki here: section 2.10.1
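
As a rough illustration of what "scaled correctly" means, the sketch below guesses the quality offset of a FASTQ file from the range of its quality characters. It is only a heuristic, not what Galaxy or FastQC does internally, and the function names (guess_offset, quality_lines) are made up for this example. It assumes plain four-line FASTQ records with no wrapped sequence lines.

```python
import gzip


def quality_lines(path, max_reads=10000):
    """Yield quality strings from a FASTQ file (assumes 4 lines per record)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_reads:
                break
            if i % 4 == 3:  # fourth line of each record holds the qualities
                yield line.rstrip("\n")


def guess_offset(qualities):
    """Return 33, 64, or None (ambiguous) from the ASCII range of the qualities.

    Heuristic: characters below ';' (ASCII 59) cannot occur in the +64
    encodings, and characters above 'J' (ASCII 74) are unusual for +33.
    """
    lo, hi = None, None
    for qual in qualities:
        for ch in qual:
            code = ord(ch)
            lo = code if lo is None else min(lo, code)
            hi = code if hi is None else max(hi, code)
    if lo is None:
        return None  # no quality characters seen
    if lo < 59:
        return 33    # Sanger / Illumina 1.8+ style
    if hi > 74:
        return 64    # older Illumina / Solexa style
    return None      # range fits both; cannot tell


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print(guess_offset(quality_lines(sys.argv[1])))
```

An offset of 33 corresponds to the Sanger scaling that Galaxy labels as the fastqsanger datatype. A result of None means the sampled reads did not settle the question, so check how the data were produced.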

For the analysis, if you are using GATK and the data are human, make sure that you align against the 1000 Genomes version of the human genome (hg_g1k_b37). This will allow you to use the indexes already in place. If you are using other tools, then hg19 and hg38 are also choices.

In short, decide which target genome to use (human or other) based on what is available for the other inputs you plan to use (reference annotation datasets, such as dbSNP and others). The availability of these can vary by genome and genome build. All inputs must be based on the exact same genome build. Once you know the inputs, then map. If you wait to look for downstream inputs until after mapping, you may find that what is available (or the best choice) is not a match for the build you selected for mapping, which means starting over - that is never fun.
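
One way to spot a build mismatch before mapping is to compare the sequence names and lengths of the reference with those declared in an annotation file. The sketch below does this for a FASTA index (.fai) and the "##contig" lines of a VCF header. The function names are hypothetical, the parsing is deliberately simple (it does not handle quoted header values that contain commas), and it only works when the VCF header actually lists its contigs.

```python
import re

CONTIG_RE = re.compile(r"^##contig=<(.*)>\s*$")


def contigs_from_fai(lines):
    """Map sequence name -> length from the lines of a .fai index."""
    contigs = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            contigs[fields[0]] = int(fields[1])
    return contigs


def contigs_from_vcf_header(lines):
    """Map contig ID -> length (or None) from the ##contig lines of a VCF header."""
    contigs = {}
    for line in lines:
        if not line.startswith("##"):
            break  # end of the meta-information lines
        match = CONTIG_RE.match(line)
        if match:
            pairs = (kv.split("=", 1) for kv in match.group(1).split(",") if "=" in kv)
            fields = dict(pairs)
            if "ID" in fields:
                length = fields.get("length")
                contigs[fields["ID"]] = int(length) if length else None
    return contigs


def build_mismatches(reference, other):
    """List contigs in `other` that are missing from, or differ in length to, `reference`."""
    problems = []
    for name, length in other.items():
        if name not in reference:
            problems.append(f"{name}: not in reference")
        elif length is not None and length != reference[name]:
            problems.append(f"{name}: length {length} != reference {reference[name]}")
    return problems
```

An empty list from build_mismatches is a good sign but not proof that the builds match. A non-empty list, for example names with a "chr" prefix on one side and without it on the other, is a strong hint that the two inputs come from different builds or naming conventions.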

Good luck with your project,

Jen, Galaxy team

