Question: How to filter FASTQ by quality score and length using Galaxy tools on the command line?
hlyates (United States) wrote, 3.7 years ago:

Problem

I am stuck on how to run the Galaxy FASTQ filter tool from the command line instead of through the web interface. I want to use the following filtering parameters:

  • Input parameter: value
  • Minimum size: 40
  • Maximum size: 0
  • Minimum Quality: 30
  • Maximum Quality: 0
  • Maximum number of bases allowed outside of quality range: 5

Attempt

  • Obtained fastq_filter from the devteam Tool Shed, found here
  • Took a look at the source code of fastq_filter.py; this is where I got stuck. Please see below:

    #Read command line arguments
    input_filename = sys.argv[1]
    script_filename = sys.argv[2]
    output_filename = sys.argv[3]
    additional_files_path = sys.argv[4]
    input_type = sys.argv[5] or 'sanger'

What the heck are script_filename and additional_files_path? Nowhere in the tool form on usegalaxy.org do I see inputs that correspond to these values. I would appreciate any assistance.

 

Bjoern Gruening (Germany) wrote, 3.7 years ago:

Hi,

Have a look at the corresponding XML file next to this script. It shows how Galaxy calls the script, and from that you can work out how to call it yourself. In particular, script_filename is the script that does the filtering: a Galaxy configfile constructed dynamically from the user input. additional_files_path is a Galaxy-specific attribute of composite datatypes: the directory that stores all files that do not belong to the main file, usually an HTML file.

Cheers,

Bjoern

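To make the role of script_filename concrete, here is a minimal sketch of how a driver could load such a filter script and obtain the function it defines. This is not Galaxy's actual code. The helper name load_filter and the one-line demo filter are hypothetical, and the only thing assumed about the real script is that it defines a function called fastq_read_pass_filter.

```python
import tempfile
from pathlib import Path


def load_filter(script_path):
    """Execute a filter script and return the fastq_read_pass_filter it defines."""
    namespace = {}
    source = Path(script_path).read_text()
    # Run the script in its own namespace so its definitions can be looked up.
    exec(compile(source, str(script_path), "exec"), namespace)
    return namespace["fastq_read_pass_filter"]


def demo():
    """Write a hypothetical one-line filter script, load it, and try it."""
    source = (
        "def fastq_read_pass_filter(fastq_read):\n"
        "    return len(fastq_read) >= 40\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "filter_script.py"
        script.write_text(source)
        keep = load_filter(script)
    # Plain strings stand in for read objects here; only len() is used.
    return keep("A" * 40), keep("A" * 39)
```

The point of the sketch is only that the second command-line argument is an ordinary Python file, so it can be written by hand when Galaxy is not there to generate it.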

Thank you for your reply. The XML confirms what I read in the source code, namely that the script has to be called with a line like this:

fastq_filter.py $input_file $fastq_filter_file $output_file $output_file.files_path '${input_file.extension[len( 'fastq' ):]}'

I am still stuck: I don't know how to create $fastq_filter_file and $output_file.files_path. The XML contains the following logic for the filter script. Is this Python, or something else? How can I use it to create $fastq_filter_file? It doesn't make sense to me, because all it does is return Boolean values.

def fastq_read_pass_filter( fastq_read ):
    def mean( score_list ):
        return float( sum( score_list ) ) / float( len( score_list ) )
    if len( fastq_read ) < $min_size:
        return False
    if $max_size > 0 and len( fastq_read ) > $max_size:
        return False
    num_deviates = $max_num_deviants
    qual_scores = fastq_read.get_decimal_quality_scores()
    for qual_score in qual_scores:
        if qual_score < $min_quality or ( $max_quality > 0 and qual_score > $max_quality ):
            if num_deviates == 0:
                return False
            else:
                num_deviates -= 1
#if not $paired_end:
    qual_scores_split = [ qual_scores ]
#else:
    qual_scores_split = [ qual_scores[ 0:int( len( qual_scores ) / 2 ) ], qual_scores[ int( len( qual_scores ) / 2 ): ] ]
#end if
#for $fastq_filter in $fastq_filters:
    for split_scores in qual_scores_split:
        left_column_offset = $fastq_filter[ 'offset_type' ][ 'left_column_offset' ]
        right_column_offset = $fastq_filter[ 'offset_type' ][ 'right_column_offset' ]
#if $fastq_filter[ 'offset_type' ]['base_offset_type'] == 'offsets_percent':
        left_column_offset = int( round( float( left_column_offset ) / 100.0 * float( len( split_scores ) ) ) )
        right_column_offset = int( round( float( right_column_offset ) / 100.0 * float( len( split_scores ) ) ) )
#end if
        if right_column_offset > 0:
            split_scores = split_scores[ left_column_offset:-right_column_offset]
        else:
            split_scores = split_scores[ left_column_offset:]
        if split_scores: ##if a read doesn't have enough columns, it passes by default
            if not ( ${fastq_filter[ 'score_operation' ]}( split_scores ) $fastq_filter[ 'score_comparison' ] $fastq_filter[ 'score' ]  ):
                return False
#end for
    return True
 

 

Reply written 3.7 years ago by hlyates
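The block quoted from the XML is a Cheetah template rather than plain Python: Galaxy substitutes the $name placeholders with the form values, and the lines starting with #if, #for, and #end are template directives that decide which Python lines get emitted. Below is a minimal sketch of what the rendered script might look like for the parameters in the question (minimum size 40, minimum quality 30, both maxima 0, at most 5 bases outside the quality range). It assumes single-end reads and no additional fastq_filters, so the #for loop emits nothing. The upper-case constants and the FakeRead class are illustrative names, not part of Galaxy; FakeRead only mimics the two operations the template performs on a read.

```python
# Values taken from the parameter list in the question.
MIN_SIZE = 40
MAX_SIZE = 0          # 0 disables the upper length limit
MIN_QUALITY = 30
MAX_QUALITY = 0       # 0 disables the upper quality limit
MAX_NUM_DEVIANTS = 5  # bases allowed outside the quality range


def fastq_read_pass_filter(fastq_read):
    """Return True if the read should be kept, False if it should be dropped."""
    if len(fastq_read) < MIN_SIZE:
        return False
    if MAX_SIZE > 0 and len(fastq_read) > MAX_SIZE:
        return False
    num_deviates = MAX_NUM_DEVIANTS
    for qual_score in fastq_read.get_decimal_quality_scores():
        too_low = qual_score < MIN_QUALITY
        too_high = MAX_QUALITY > 0 and qual_score > MAX_QUALITY
        if too_low or too_high:
            if num_deviates == 0:
                return False
            num_deviates -= 1
    return True


class FakeRead:
    """Hypothetical stand-in for Galaxy's read object, used only to try the filter."""

    def __init__(self, scores):
        self._scores = list(scores)

    def __len__(self):
        return len(self._scores)

    def get_decimal_quality_scores(self):
        return self._scores
```

Saving the constants and the function (without FakeRead) to a file and passing that file as the second argument to fastq_filter.py should then play the role of $fastq_filter_file, assuming the script simply executes that file and calls fastq_read_pass_filter on every read.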

Can I ask why you want to use this tool outside of Galaxy? If you want to run it from the command line, you can also use the Galaxy API to run it.

Reply written 3.7 years ago by Bjoern Gruening

My supervisor asked me to do it. I have already been able to groom and trim. I have the Galaxy source files downloaded, and the scripts I wrote reference Galaxy through PYTHONPATH. What makes fastq_filter hard to use from the command line is that it takes another script as an argument but doesn't provide an easy way to create that script.

Using the API seems like overkill, and when I searched for "fastq_filter galaxy api", Google only turned up the Bitbucket source code for fastq_filter.py. This is the last Galaxy resource I need to use before I can turn to other tools to continue my analysis.

Basically, I want to be able to run fastq_filter.py from the command line and then run it in parallel on a bunch of input files. I hope this explains better what I am trying to do.

Reply written 3.7 years ago by hlyates
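For running the tool over many files, one option is a small driver that builds the five-argument command line shown in the source above and starts one process per input. This is a sketch, assuming a hand-written filter script already exists. The function names, the output naming scheme, the interpreter name python, and the worker count are all assumptions to adapt, and Galaxy's script may need a different Python version than the one running the driver.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(fastq, filter_script, out_dir, input_type="sanger",
                  tool="fastq_filter.py", interpreter="python"):
    """Build the argument list for one input, in the order the script reads sys.argv."""
    fastq = Path(fastq)
    out_dir = Path(out_dir)
    output = out_dir / (fastq.stem + ".filtered.fastq")
    # Directory passed as additional_files_path. It is not created here;
    # check whether fastq_filter.py expects to create it itself.
    files_path = out_dir / (fastq.stem + "_files")
    return [interpreter, tool, str(fastq), str(filter_script),
            str(output), str(files_path), input_type]


def run_all(fastqs, filter_script, out_dir, workers=4):
    """Run one fastq_filter.py process per input file and return the exit codes."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    commands = [build_command(f, filter_script, out_dir) for f in fastqs]
    # Threads are enough here because each one only waits on a child process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
```

A call such as run_all(sorted(Path("reads").glob("*.fastq")), "filter_script.py", "filtered") would then process every FASTQ file in a hypothetical reads directory, with PYTHONPATH still pointing at the Galaxy sources as described above.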