Metagenomics

6.7 years ago by

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hi Vincent, Scott,

Filtering raw hits is an important part of a metagenomics analysis pipeline. Please see the methods described in the published metagenomics analysis paper associated with this tool set:

Kosakovsky Pond S, Wadhawan S, Chiaromonte F, Ananda G, Chung W, Taylor J, and Nekrutenko A. "Windshield splatter analysis with the Galaxy metagenomic pipeline". Genome Research. 2009 Nov; 19(11):2144-53.
http://www.ncbi.nlm.nih.gov/pubmed/19819906

Live supplemental data that can be imported and experimented with is available on the public instance, including raw data, working histories, and a tutorial that demonstrates step-by-step the exact methods used in the publication:

http://main.g2.bx.psu.edu/u/aun1/p/windshield-splatter
http://main.g2.bx.psu.edu/library -> see "Windshield splatter"

Not all tools are available on the public main server, but a local or cloud instance can be used with wrapped tools from the Distribution or Tool Shed, as necessary. For example, BLAST is not available on the public instance, but is included in the distribution for use in local or cloud instances. http://getgalaxy.org

Hopefully you will both find this helpful,

Jen
Galaxy project

--
Jennifer Jackson
http://usegalaxy.org
http://galaxyproject.org/wiki/Support

written 6.7 years ago by Jennifer Hillman Jackson ♦ 25k

A small warning about the current cloud BLAST+ configuration. To use the metagenomic tools properly with the BLAST+ Galaxy tool, make sure to export the results as BLAST XML; you will then need a script to parse out the read ID and the Hit_def (to use as the hit ID). It appears that the 'Hit_def' field contains the correct key to the taxonomy database. Specifically, the Hit_def field is in the format #_#, where the 'gi' ID is the first #. The tabular output (normal and extended) does not contain this information.

I noticed this after attempting to use the tabular data with a trimmed col[1] (supposed to be the hit seqID): my results always came back as a ranked list of the most sequenced genomes in nt, basically keying in randomly.

j

written 6.7 years ago by John Major • 40
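The kind of parsing script mentioned in the post above might look roughly like the following. This is only a sketch under stated assumptions: it uses the element names of the standard NCBI BLAST XML report (`Iteration`, `Iteration_query-def`, `Hit`, `Hit_def`), it takes the first whitespace-separated token of the query definition as the read ID, and it assumes that `Hit_def` really has the `#_#` layout described in the post, which depends on how the database was built. The function name `read_hit_pairs` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def read_hit_pairs(xml_source):
    """Yield (read_id, gi) for every hit in a BLAST XML report.

    xml_source is a file name or a binary file object.
    Assumption: Hit_def looks like '<gi>_<something>', as reported in the
    thread; other database builds may use a different layout.
    """
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "Iteration":
            continue
        # Assumption: the read ID is the first token of the query definition.
        tokens = (elem.findtext("Iteration_query-def") or "").split()
        read_id = tokens[0] if tokens else ""
        for hit in elem.iter("Hit"):
            hit_def = hit.findtext("Hit_def") or ""
            gi = hit_def.split("_", 1)[0]
            yield read_id, gi
        # Free the memory held by this query's hits before moving on.
        elem.clear()
```

Each pair can then be written out as a two-column tab-separated line, for example `print(read_id, gi, sep="\t")`, to produce a table keyed on the gi.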

Dear GALAXY and Jennifer,

Although the windshield analysis papers were good starters, they do not address conserved sequence purging or how to get at this information. If anyone has an automated approach, I'd be interested. [Discard sequences from BLAST that have more than 4 hits >99%]

Scott

Scott Tighe
Advanced Genome Technology Lab
Vermont Cancer Center at the University of Vermont
149 Beaumont Avenue Health Science Research Bd RM 305
Burlington Vermont USA 05405
lab 802-656-AGTC (2482)
cell 802-999-6666

written 6.7 years ago by Scott Tighe • 200

Hi Scott,

There isn't a specific tool to do this filtering in one step, but tools similar to those used in the Windshield analysis can be used again.

Starting with "Parse blast XML output" results (this tool is on the Galaxy main server), calculate percent coverage (of the query) and percent identity using "Text Manipulation -> Compute" on the output. Then, once you have the query, percent identity, and percent coverage, the data can be filtered any way that you would like using tools in "Text Manipulation", "Filter and Sort", and "Join, Subtract and Group".

You will likely want to start with a "Filter and Sort -> Select" step to subset the data to only those alignments that you consider part of your conserved criteria (for example: >99% identity and >90% coverage). On that result, count the occurrences of each query identifier using "Join, Subtract and Group -> Group". Next, use "Select" again to isolate only those identifiers with the frequency (4?) that you choose as part of your conserved criteria. This result will be your list of identifiers for conserved sequences.

As a final step, remove all hits associated with these conserved sequences from the original BLAST output. Using the tool "Join, Subtract and Group -> Compare two Datasets", set dataset 1 to be the original BLAST output and dataset 2 to be the list of conserved sequences (from the above processing). The columns for both will be sequence identifiers, and the option will be "To find:" -> "Non Matching rows of 1st dataset".

There are likely other ways to do this same procedure, and any process that you work out could be put into a workflow for later use. Hopefully this process works for you or leads you to one that does for your particular analysis. The tools in these groups can be combined in many ways to produce unique manipulations.

Best wishes for your project,

Jen
Galaxy team

written 6.7 years ago by Jennifer Hillman Jackson ♦ 25k
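For anyone who prefers to prototype the same logic outside Galaxy, the select / group / select / subtract steps described in the post above can be sketched in a few lines of Python. This is not a Galaxy tool and not the method from the paper; the `Hit` record, the function name `drop_conserved`, and the default thresholds (>99% identity, >90% coverage, more than 4 such hits) are assumptions taken from the examples in this thread.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One parsed BLAST alignment (hypothetical record layout)."""
    query: str
    subject: str
    pct_identity: float
    pct_coverage: float


def drop_conserved(hits: Iterable[Hit],
                   min_identity: float = 99.0,
                   min_coverage: float = 90.0,
                   max_strong_hits: int = 4) -> List[Hit]:
    """Remove every hit of a query that has too many near-perfect hits."""
    hits = list(hits)
    # "Select" + "Group": count near-perfect alignments per query.
    strong = Counter(
        h.query for h in hits
        if h.pct_identity > min_identity and h.pct_coverage > min_coverage
    )
    # "Select" again: queries whose count exceeds the chosen frequency.
    conserved = {q for q, n in strong.items() if n > max_strong_hits}
    # "Compare two Datasets": keep rows whose query is not in that list.
    return [h for h in hits if h.query not in conserved]
```

Note that all hits of a flagged query are dropped, including its weaker ones, which mirrors the final "Non Matching rows of 1st dataset" step.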

Hi John,

Can you expand on that with a specific example (ideally on the galaxy-dev list, CC'd, since BLAST+ isn't even available on the public Galaxy)? Also, which version of BLAST+ are you using? I recall some changes to the tabular output IDs prior to 2.2.25 (which is what the wrappers were tested on; I've not tried 2.2.26 yet).

Thanks,

Peter

written 6.7 years ago by Peter Cock • 1.4k

Hello,

I previously asked whether I could retrieve more information from "Fetching Taxonomic Representation", because my summarized taxonomy contains results for just about every organism imaginable. I therefore need to find out the percentage match for each of these results. Currently, the Megablast results give alignment information, but "Fetch taxonomic representation" gives none, and it provides nothing that can be used to match its output back to the Megablast results. I appreciate the previous emails, but the comments and references do not address this problem.

Thanks,
Vincent

________________________________________

Hello,

I am a relatively new user on Galaxy and I have a question regarding "Fetching Taxonomic Information". It is great that I can retrieve all of the hits for each sequence, but I cannot seem to find an option that also reports how accurate a match each hit is to the given taxon, for instance as a percentage match. I can access this information in the original file and retrieve it programmatically, but it would be nice if it came in one package so that I can avoid false hits that have a low percentage match. Can you please provide me with instructions on how best to retrieve this information (hopefully in a single file)?

Thank you,
Vincent

___________________________________________________________
The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using "reply all" in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list: http://lists.bx.psu.edu/listinfo/galaxy-dev To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/

written 6.7 years ago by Montoya, Vincent • 20

6.7 years ago by

Jennifer Hillman Jackson ♦ 25k

United States

Jennifer Hillman Jackson ♦ 25k wrote:

Hello Vincent,

For "percentage match", there are two interpretations:

1) what percent of your total data matches a particular taxonomic group
2) what percent (coverage/identity) of a query sequence matches a target "hit", leading to a taxonomic assignment

For #1: The other tools in the group "Metagenomic analyses" all accept "Fetch taxonomic representation" output as input, and produce various summary, graphical, and statistical information. Please give these tools a try. On these tool forms, the "Fetch taxonomic representation" tool is referred to as "Taxonomy manipulation->Fetch Taxonomic Ranks", as I noted in my prior email. This is legacy naming related to the prior publication, and we apologize if this caused confusion. This should probably be updated; I will bring it up with the team.

For #2: I sent instructions to Scott Tighe this morning with one example of how to use individual tools to select, sort, group, and filter data.
http://lists.bx.psu.edu/pipermail/galaxy-user/2012-March/004349.html

While the details for your analysis may differ, the basic tool set will probably be the same for your project. Filtering data by alignment quality prior to "Fetch Taxonomic Representation" was also part of the Metagenomics example in the publication we shared. The idea is to start with the parsed BLAST output, generate statistics, filter and group data based on those results, then go forward with taxonomic assignments. There is no automated tool that does this process in a single step.

Hopefully this helps to clear up the tool set,

Best,

Jen
Galaxy team

written 6.7 years ago by Jennifer Hillman Jackson ♦ 25k
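To make the two readings of "percentage match" from the answer above concrete, here is a small illustrative sketch. The function names and input layouts are hypothetical, and the formulas are the usual definitions (identical positions over alignment length, aligned query span over query length) rather than anything confirmed as the exact calculation used by the Galaxy tools.

```python
from collections import Counter


def percent_by_taxon(read_to_taxon):
    """Interpretation 1: share of assigned reads falling in each taxon.

    read_to_taxon maps a read ID to the taxon name it was assigned to.
    """
    counts = Counter(read_to_taxon.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


def percent_identity(identical, align_len):
    """Interpretation 2a: identical positions as a percent of the alignment."""
    return 100.0 * identical / align_len


def percent_coverage(q_start, q_end, query_len):
    """Interpretation 2b: percent of the query spanned by the alignment.

    Coordinates are 1-based and inclusive, as in BLAST reports.
    """
    return 100.0 * (abs(q_end - q_start) + 1) / query_len
```

For example, a 100 bp read aligned over positions 1 to 90 with 89 identical bases would have 90% coverage under these definitions, and about 98.9% identity if the alignment is 90 columns long.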
