Using Segments Of Sequences As A Reference Genome

Question: Using Segments Of Sequences As A Reference Genome - Bowtie For Illumina

Dear all,

My problem probably has a very simple solution, and given my limited knowledge of bioinformatics I am probably making a mistake somewhere in the workflow.

In this experiment we used a MiSeq to sequence the amplicons of a multiplex PCR. We introduced an in-house barcode to our PCR products via an adaptor. The MiSeq data was demultiplexed for the Illumina barcodes by our service provider, using the MiSeq Reporter on-instrument software, and I am trying to run the rest of the process on the Galaxy web portal without any command-line programming.

The data for R1 and R2 was imported, and after quality trimming I used the Barcode Splitter to demultiplex the amplicons. (I did not use FASTQ Groomer, as MiSeq data is supposed to be in Sanger FASTQ rather than Illumina FASTQ format.) The sequence trimmer was then used to trim the barcode and adaptor sequences. The results were re-uploaded and designated as FASTQ for alignment.

As for the reference genome: because our amplicons come from different sequences, we put the individual FASTA sequences in one file, each with a different FASTA identifier. When this file was supplied as the reference genome and mapping was performed using Bowtie for Illumina, the mapping ran without errors. I could also filter the alignment file using the SAM filters. But I cannot do any further downstream visualizations, not even SAM to BAM conversion. I suspect this is due to an error in the way the reference genome was put together, but I cannot figure it out.

I think it would work if I strung the sequences together as one, but converting the results back for interpretation would then become an issue. I would be extremely grateful if you could help me with this.

Thank you,
Kind regards,
Veranja

Veranja Liyanapathirana
Graduate Student (Microbiology)

alignment bowtie • 2.7k views

modified 5.7 years ago • written 5.7 years ago by Veranja Liyanapathirana • 70

Veranja Liyanapathirana • 70 wrote:

Dear Galaxy team/users,

I am sorry to spam the thread again, but I still could not figure out what is wrong with my workflow and need some help. As mentioned earlier, I use MiSeq reads, demultiplex them for an in-house barcode using the Barcode Splitter, re-upload them, and map them against a reference that consists of multiple short reference sequences. The workflow goes well up to this stage, and conversion from SAM to BAM after filtering the SAM files also works, but I cannot use the GATK Depth of Coverage tool to get the alignment data, nor can I create pileups. An error comes up in all instances. I would really appreciate any input on this.

Thanks a lot,
Veranja Liyanapathirana
Graduate Student (Microbiology)
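For readers with a similar multi-sequence reference, a few formatting checks can be run on the FASTA file before uploading it. The sketch below is only an illustration: the function name `check_reference_fasta` is hypothetical, and the checks (no blank lines, identifiers without trailing description text, unique identifiers, no empty records) are assumptions about what commonly trips up downstream tools, not a diagnosis of the error reported in this thread.

```python
from collections import Counter


def check_reference_fasta(text):
    """Return a list of human-readable problems found in multi-FASTA text."""
    problems = []
    names = []      # every identifier, in file order (duplicates kept)
    lengths = {}    # identifier -> total number of sequence characters
    current = None  # identifier of the record being read
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            problems.append(f"line {number}: blank line")
            continue
        if line.startswith(">"):
            header = line[1:]
            fields = header.split()
            if not fields:
                problems.append(f"line {number}: empty identifier")
                current = None
                continue
            current = fields[0]
            if header != current:
                problems.append(
                    f"line {number}: identifier '{current}' has extra text or whitespace"
                )
            names.append(current)
            lengths.setdefault(current, 0)
        elif current is None:
            problems.append(f"line {number}: sequence data without a valid '>' header")
        else:
            lengths[current] += len(line.strip())
    for name, count in Counter(names).items():
        if count > 1:
            problems.append(f"identifier '{name}' appears {count} times")
    for name, length in lengths.items():
        if length == 0:
            problems.append(f"identifier '{name}' has no sequence")
    return problems


# Made-up example: a blank line, a header with description text, a repeated name.
example = ">amp1\nACGT\n\n>amp2 gene X\nGGCC\n>amp1\nTTAA\n"
for problem in check_reference_fasta(example):
    print(problem)
```

An empty result only means that none of these particular checks fired; it does not guarantee that a given tool will accept the file.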

written 5.7 years ago by Veranja Liyanapathirana • 70

Dear all,

I was using the Barcode Splitter on MiSeq paired-end reads to split some in-house barcodes. However, I am not sure I did it correctly, as the number of reads allocated to each barcode does not tally with the results obtained by our service provider with one of their in-house script-based methods. I need to make sure that read 1 and read 2 are split into the same group, and to drop the read pairs where this criterion is not met. I am not sure how to go about doing this. Would using FASTQ Joiner on the two reads and then splitting work?

Thank you,
Kind regards,
Veranja

written 5.6 years ago by Veranja Liyanapathirana • 70

Hi Veranja,

I am going to try to address all questions in one go since they are all in the same thread. Next time though, it would be best to send new questions as a brand new question, not as a reply with just the subject line changed. This helps us greatly with tracking, and helps other users when searching prior posts.

In the first email you seemed to have some trouble with the format of your custom reference genome, but later in the second email this seems to be resolved, at least as far as format is concerned (SAM->BAM conversion is possible using this genome, in Galaxy?). I am going to point you to our help for custom reference genomes, and if you click through to the main page there is a table with detailed format troubleshooting help. But I will tell you first that I do not believe this is going to be helpful for your overall goals, if I am understanding correctly. Here is the link: http://wiki.galaxyproject.org/Support#Custom_reference_genome

Your reference genome sounds as if it is not really a reference genome but instead more of a collection of short read sequences? If the number of sequences is very large, and the sequences are very short, you will likely run into memory or related indexing problems with many tools. There really isn't an easy way around this. You could try taking the analysis to a cloud version of Galaxy and scaling up the memory to see if that helps. You also might try breaking the job up into smaller jobs - you mentioned that the data is from multiple genomes - perhaps split by genome. But you will have to test this - I don't know the actual profile of your data. I can let you know that using purely a short read dataset, in particular one that has redundancy, will be problematic, likely no matter what is attempted. Some assembly or other strategy is likely required to move forward.

Galaxy CloudMan: http://usegalaxy.org/cloud

For the last question, different tools can be expected to vary a bit in their results since they use different methods. If you want to compare datasets, using identifiers would be a good way. Convert the files to tabular, cut out the identifiers, compare these to find differences, then adjust the tabular files as needed, and convert back to fastq/fasta. Tools to do these sorts of functions are in the tool groups "Text Manipulation", "FASTA manipulation", "Filter and Sort", "Join, Subtract and Group", and "NGS: QC and manipulation". I know that seems like a lot of places to look - but use the tool search at the top of the tool panel and search by data type or tool name to make finding these easier, for example "Cut" or "Join" or "Tabular" - these tools have the names you would probably expect them to have, and tool help is directly on each form.

Our 101 tutorial would also be a good introduction for an overview: https://main.g2.bx.psu.edu/u/aun1/p/galaxy101

Hopefully this gives you some helpful information to work with,

Jen
Galaxy team

--
Jennifer Hillman-Jackson
Galaxy Support and Training
http://galaxyproject.org
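The identifier comparison described in this reply (reduce each dataset to its read identifiers, then compare) can also be sketched in a few lines of Python outside Galaxy. This is a minimal illustration, not a Galaxy tool: the function names `fastq_ids` and `compare_ids` are hypothetical, and the code assumes plain four-line FASTQ records whose identifiers may carry a trailing `/1` or `/2` or a space-separated comment.

```python
def fastq_ids(text):
    """Collect read identifiers from four-line FASTQ records."""
    lines = text.splitlines()
    ids = set()
    for i in range(0, len(lines) - 3, 4):
        header = lines[i]
        fields = header[1:].split()
        if not header.startswith("@") or not fields:
            raise ValueError(f"record at line {i + 1} has no '@identifier' header")
        # Keep the part before the first space and drop a trailing /1 or /2,
        # so that both mates of a pair reduce to the same name.
        name = fields[0]
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        ids.add(name)
    return ids


def compare_ids(text_a, text_b):
    """Report which identifiers are unique to each dataset and which are shared."""
    a, b = fastq_ids(text_a), fastq_ids(text_b)
    return {"only_a": a - b, "only_b": b - a, "shared": a & b}


# Made-up records for illustration only.
galaxy_split = "@r1/1\nACGT\n+\nIIII\n@r2/1\nACGT\n+\nIIII\n"
provider_split = "@r2 1:N:0\nACGT\n+\nIIII\n@r3\nACGT\n+\nIIII\n"
print(compare_ids(galaxy_split, provider_split))
```

Running this once per barcode group would show which reads the two demultiplexing methods assigned differently, which is usually more informative than comparing read counts alone.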

written 5.6 years ago by Jennifer Hillman Jackson ♦ 25k

Dear Dr. Jennifer,

Thank you very much for the reply, and I am sorry about the way the thread was handled. Your input is very helpful. However, I have one key question which probably did not come across well in the third question, so I will try to re-word it.

I am using Illumina paired-end data and need to demultiplex some in-house barcodes. I would like to know how best to use the Barcode Splitter for this. I can think of two ways:

1. Use it on read 1 and read 2 separately.
2. Join read 1 and read 2 with FASTQ Joiner, use the Barcode Splitter on the joined data, and then use FASTQ Splitter before mapping.

In method 1, I need a way to exclude reads that are not sorted into the same split group for read 1 and read 2, and to discard them from subsequent analysis. As far as I can tell, the way to do this is again to use FASTQ Joiner, so that reads without the same identifier in R1 and R2 are discarded. Are there any other ways to do this? Also, when using the barcode files for read 1 and read 2, is there a need to change the "orientation" (i.e. complement) of the barcode in the barcode file for read 2?

In method 2, would the Barcode Splitter look at both R1 and R2 when splitting the reads, or just at R1?

Thanks a lot,
Kind regards,
Veranja
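The first approach raised in this post (assign each read of a pair to a barcode group separately, then keep only the pairs whose two reads land in the same group) can be sketched as follows. This is not how the Galaxy Barcode Splitter works internally. It is a minimal illustration under stated assumptions: both reads carry the in-house barcode at their start, matching is exact, and the function names, sample names and sequences are all hypothetical. Because the thread leaves the read 2 orientation question open, the sketch takes a separate barcode table for each read rather than assuming one.

```python
def assign_barcode(sequence, barcodes):
    """Return the sample whose barcode starts the read, or None if none match."""
    for sample, barcode in barcodes.items():
        if sequence.startswith(barcode):
            return sample
    return None


def split_pairs(pairs, barcodes_r1, barcodes_r2):
    """Group read pairs by sample, keeping only concordant pairs.

    pairs: iterable of (pair_name, read1_sequence, read2_sequence)
    Returns (groups, discarded): groups maps sample -> list of pair names,
    discarded lists pairs that were unassigned or assigned inconsistently.
    """
    groups = {}
    discarded = []
    for name, read1, read2 in pairs:
        sample1 = assign_barcode(read1, barcodes_r1)
        sample2 = assign_barcode(read2, barcodes_r2)
        if sample1 is not None and sample1 == sample2:
            groups.setdefault(sample1, []).append(name)
        else:
            discarded.append(name)
    return groups, discarded


# Made-up barcodes and reads; the same table is used for both reads here
# purely for brevity.
barcodes = {"sample1": "ACGT", "sample2": "TTAG"}
pairs = [
    ("p1", "ACGTAAAA", "ACGTCCCC"),  # both reads match sample1
    ("p2", "ACGTAAAA", "TTAGCCCC"),  # reads disagree
    ("p3", "GGGGAAAA", "TTAGCCCC"),  # read 1 matches nothing
    ("p4", "TTAGAA", "TTAGCC"),      # both reads match sample2
]
groups, discarded = split_pairs(pairs, barcodes, barcodes)
print(groups)
print(discarded)
```

Counting the discarded pairs separately makes it easier to see whether a mismatch with another demultiplexing method comes from discordant pairs or from unassigned reads.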

written 5.6 years ago by Veranja Liyanapathirana • 70
