How to filter FASTQ by quality score (QS) and length using Galaxy tools on the command line?


3.7 years ago by

United States

hlyates • 0 wrote:

Problem

I am stuck on how to run the Galaxy filtering tools from the command line rather than through the web interface. I want to use the following filtering parameters:

Input parameter: value
Minimum size: 40
Maximum size: 0
Minimum Quality: 30
Maximum Quality: 0
Maximum number of bases allowed outside of quality range: 5

Attempt

I obtained fastq_filter from the devteam Tool Shed repository.
I took a look at the source code, fastq_filter.py, and this is where I got stuck. Please see below:

    #Read command line arguments
    input_filename = sys.argv[1]
    script_filename = sys.argv[2]
    output_filename = sys.argv[3]
    additional_files_path = sys.argv[4]
    input_type = sys.argv[5] or 'sanger'

What are script_filename and additional_files_path? Nothing in the input form on usegalaxy.org seems to correspond to these values. I would appreciate any assistance.

commandline qs fastq galaxy filtering • 1.3k views


modified 3.7 years ago by Bjoern Gruening ♦ 5.1k • written 3.7 years ago by hlyates • 0

3.7 years ago by

Bjoern Gruening ♦ 5.1k

Germany

Bjoern Gruening ♦ 5.1k wrote:

Hi,

have a look at the corresponding XML file next to this script. It tells you how Galaxy calls the script, and from that you can work out how to call it yourself. In particular: script_filename is the script that does the filtering. It is a Galaxy configfile that is constructed dynamically from the user input. additional_files_path is a Galaxy-specific attribute of composite datatypes: it is the location where all files that do not belong to the main file are stored, usually an HTML file.

Cheers,

Bjoern

written 3.7 years ago by Bjoern Gruening ♦ 5.1k

Thank you for your reply. The XML confirms what I read in the source code, namely that the script has to be called with a line like this:

    fastq_filter.py $input_file $fastq_filter_file $output_file $output_file.files_path '${input_file.extension[len( 'fastq' ):]}'

I am still stuck: I don't know how to create $fastq_filter_file and $output_file.files_path. The XML contains the following logic for $fastq_filter_file. Is this Python or something else? How can I use it to create $fastq_filter_file? It doesn't make sense to me, because all it does is return Boolean values.

    def fastq_read_pass_filter( fastq_read ):
        def mean( score_list ):
            return float( sum( score_list ) ) / float( len( score_list ) )
        if len( fastq_read ) < $min_size:
            return False
        if $max_size > 0 and len( fastq_read ) > $max_size:
            return False
        num_deviates = $max_num_deviants
        qual_scores = fastq_read.get_decimal_quality_scores()
        for qual_score in qual_scores:
            if qual_score < $min_quality or ( $max_quality > 0 and qual_score > $max_quality ):
                if num_deviates == 0:
                    return False
                else:
                    num_deviates -= 1
    #if not $paired_end:
        qual_scores_split = [ qual_scores ]
    #else:
        qual_scores_split = [ qual_scores[ 0:int( len( qual_scores ) / 2 ) ], qual_scores[ int( len( qual_scores ) / 2 ): ] ]
    #end if
    #for $fastq_filter in $fastq_filters:
        for split_scores in qual_scores_split:
            left_column_offset = $fastq_filter[ 'offset_type' ][ 'left_column_offset' ]
            right_column_offset = $fastq_filter[ 'offset_type' ][ 'right_column_offset' ]
    #if $fastq_filter[ 'offset_type' ]['base_offset_type'] == 'offsets_percent':
            left_column_offset = int( round( float( left_column_offset ) / 100.0 * float( len( split_scores ) ) ) )
            right_column_offset = int( round( float( right_column_offset ) / 100.0 * float( len( split_scores ) ) ) )
    #end if
            if right_column_offset > 0:
                split_scores = split_scores[ left_column_offset:-right_column_offset]
            else:
                split_scores = split_scores[ left_column_offset:]
            if split_scores: ##if a read doesn't have enough columns, it passes by default
                if not ( ${fastq_filter[ 'score_operation' ]}( split_scores ) $fastq_filter[ 'score_comparison' ] $fastq_filter[ 'score' ] ):
                    return False
    #end for
        return True
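
For reference, the block above looks like a Cheetah template, the templating language Galaxy uses for tool configfiles. Lines starting with # are template directives and each $name is replaced with a form value, so the file Galaxy passes as $fastq_filter_file should be an ordinary Python file that defines fastq_read_pass_filter. Below is a minimal sketch of producing such a file by hand. It assumes single-end reads and no aggregate filters (an empty $fastq_filters list), so the #if and #for sections drop out. The names render_filter_script, load_filter and FakeRead are hypothetical helpers for illustration, not part of Galaxy, and how fastq_filter.py actually loads the file should be checked against its source.

```python
# Hand-rendered version of the template for the simple case:
# single-end reads, no aggregate filters (assumption, see text above).
FILTER_TEMPLATE = '''\
def fastq_read_pass_filter( fastq_read ):
    if len( fastq_read ) < {min_size}:
        return False
    if {max_size} > 0 and len( fastq_read ) > {max_size}:
        return False
    num_deviates = {max_num_deviants}
    qual_scores = fastq_read.get_decimal_quality_scores()
    for qual_score in qual_scores:
        if qual_score < {min_quality} or ( {max_quality} > 0 and qual_score > {max_quality} ):
            if num_deviates == 0:
                return False
            else:
                num_deviates -= 1
    return True
'''


def render_filter_script(min_size, max_size, min_quality, max_quality, max_num_deviants):
    """Return Python source text with the form values substituted in (hypothetical helper)."""
    return FILTER_TEMPLATE.format(
        min_size=int(min_size),
        max_size=int(max_size),
        min_quality=int(min_quality),
        max_quality=int(max_quality),
        max_num_deviants=int(max_num_deviants),
    )


class FakeRead:
    """Stand-in for Galaxy's FASTQ read object, for local testing only."""

    def __init__(self, scores):
        self.scores = list(scores)

    def __len__(self):
        return len(self.scores)

    def get_decimal_quality_scores(self):
        return self.scores


def load_filter(script_text):
    """Execute the rendered source and return the filter function it defines."""
    namespace = {}
    exec(script_text, namespace)
    return namespace["fastq_read_pass_filter"]


if __name__ == "__main__":
    # Parameters from the question: min size 40, min quality 30,
    # at most 5 bases outside the quality range; 0 disables a maximum.
    print(render_filter_script(40, 0, 30, 0, 5))
```

Writing the returned text to a file and passing that file's path as the second argument of fastq_filter.py would then stand in for $fastq_filter_file. A maximum of 0 disables the corresponding upper bound, as the `> 0` checks in the template show.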

written 3.7 years ago by hlyates • 0

Can I simply ask why you want to use this tool outside of Galaxy? If you want to run it from the command line, you can also use the Galaxy API to run this tool.

written 3.7 years ago by Bjoern Gruening ♦ 5.1k

My supervisor asked me to do it. I was able to groom and trim already. I have the Galaxy source files downloaded, and the scripts I wrote reference Galaxy through PYTHONPATH. What makes fastq_filter hard to use from the command line is that it takes another script as an argument but doesn't provide an easy way to create that script.

Using the API seems like overkill, and when I searched for "fastq_filter galaxy api", Dr. Google only turned up the Bitbucket source code for fastq_filter.py. This is the last Galaxy resource I need before I can turn to other tools to continue my analysis.

Basically, I want to be able to run fastq_filter.py from the command line and then run it in parallel on a bunch of input files I have. I hope this better explains what I am doing.
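
To make the goal concrete, here is a rough sketch of the kind of parallel driver I have in mind. It follows the argument order read from sys.argv in fastq_filter.py (input file, filter script, output file, additional files path, input type). The function names build_command and run_all, the output naming scheme and the default tool path are my own assumptions, and it presumes the Galaxy libraries are importable via PYTHONPATH by whichever interpreter is used.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_command(input_path, script_path, output_dir, input_type="sanger",
                  tool_path="fastq_filter.py", python=sys.executable):
    """Build one fastq_filter.py command line (hypothetical helper).

    Argument order mirrors sys.argv[1..5] in fastq_filter.py.
    Output names are an assumed convention, not something Galaxy defines.
    """
    input_path = Path(input_path)
    output_dir = Path(output_dir)
    output_path = output_dir / (input_path.stem + ".filtered.fastq")
    files_path = output_dir / (input_path.stem + "_files")
    return [python, str(tool_path), str(input_path), str(script_path),
            str(output_path), str(files_path), input_type]


def run_all(commands, max_workers=4):
    """Run the commands concurrently and return their exit codes in input order."""
    def run_one(command):
        return subprocess.run(command, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Usage: python run_filters.py FILTER_SCRIPT OUTPUT_DIR INPUT.fastq [INPUT.fastq ...]
    if len(sys.argv) < 4:
        sys.exit("usage: run_filters.py FILTER_SCRIPT OUTPUT_DIR INPUT.fastq [...]")
    filter_script, out_dir = sys.argv[1], sys.argv[2]
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    jobs = [build_command(path, filter_script, out_dir) for path in sys.argv[3:]]
    print(run_all(jobs))
```

Threads are enough here because each worker only waits on an external process. The same filter script file can be shared by all jobs, since it depends only on the filtering parameters.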

written 3.7 years ago by hlyates • 0
