I would like help splitting a large vcf file containing WGS data into smaller files. Using Galaxy filters, I split it into smaller files by chromosome, but now they are tabular, not vcf files. Could you let me know how to either convert tabular back to vcf, or how to split the original vcf without changing file format?
I guess your problem comes from the header lines, which are present in VCF and start with a #. If your filter does not keep them, then the result is no longer VCF, but just general tabular data.
Possible solutions:
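1) Use the Select lines that match an expression tool with a regular expression that matches lines starting with either a # or one of your chromosome names followed by arbitrary characters, e.g., ^#|^chrI.+ (note that a name like chrI is also a prefix of chrII and chrIII, so the pattern may need to require the column separator right after the name).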
2) Break your problem into simpler subtasks: filter your dataset once with a filter that keeps only the header lines (based on them starting with a #), then concatenate each of your tabular single-chromosome datasets onto this header-only dataset (header first), thereby regenerating valid VCF format.
3) Most direct (but less instructive): try to use the MiModD VCF Filter tool with appropriate Region Filters.
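Outside Galaxy, the idea behind options 1 and 2 (keep the # header lines and put them in front of each chromosome's records) can be sketched in a few lines of Python. This is only a minimal illustration, assuming an uncompressed, tab-separated VCF with the chromosome name in the first column; the function names and the output file naming are made up for the example. It also keeps all records in memory, so for a large WGS file a streaming variant would likely be a better fit.

```python
from collections import defaultdict


def split_vcf_by_chrom(lines):
    """Group VCF lines by chromosome, repeating the header in every group.

    `lines` is any iterable of text lines (e.g. an open file handle).
    Returns a dict mapping chromosome name -> list of lines (header + records).
    """
    header = []
    records = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            # Meta-information (##...) and the #CHROM column line
            header.append(line)
        elif line.strip():
            # First tab-separated field of a record is the chromosome name
            chrom = line.split("\t", 1)[0]
            records[chrom].append(line)
    return {chrom: header + recs for chrom, recs in records.items()}


def write_split_vcf(in_path, out_prefix):
    """Write one VCF per chromosome, named <out_prefix>.<chrom>.vcf (hypothetical naming)."""
    with open(in_path) as handle:
        per_chrom = split_vcf_by_chrom(handle)
    for chrom, vcf_lines in per_chrom.items():
        with open(f"{out_prefix}.{chrom}.vcf", "w") as out:
            out.writelines(vcf_lines)
```

Calling write_split_vcf("input.vcf", "split") would then produce files such as split.chrI.vcf, each starting with the complete original header.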