Question: Use multiple inputs in same script
hortowe wrote (2.7 years ago):

Hi all,

I have a batch of 170 samples (represented by 170 comma-separated files) and am attempting to have them all as inputs into a single R script that will produce 1 output using data from each sample. When I run this script outside of galaxy, the input is a directory of the files. I feel that the multiple datasets option is what I should be using, but I can't seem to figure out its structure. Any help is appreciated!

Here is an example of what the script would be doing outside of galaxy:

# Get files
arguments <- commandArgs(trailingOnly=TRUE);
directory <- arguments[1]
files <- list.files(directory)

# Read in 1st file to use to make matrix
temp.file <- paste(directory, files[1], sep="")
temp <- read.csv(temp.file)

# Get rows from file and columns from the number of files in the directory
num.rows <- length(temp[[1]])
num.cols <- length(files)

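# Create and populate matrix
matrix.to.populate <- matrix(nrow = num.rows, ncol = num.cols)
for (i in seq_along(files)) {
    # Take the 5th column of each file as column i of the matrix
    matrix.to.populate[, i] <- read.csv(paste(directory, files[i], sep=""))[[5]]
}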

After the matrix is populated, I do a few calculations using the data.

I originally tried to use a dataset collection, but that wanted to run the script 170 times, once for each sample. I'm currently trying to use the multiple datasets input option (I found the post "How to select multiples files as inputs to parse them simulaneously ?"), but can't get that to work. My XML file has the following command and input:

<command>R --vanilla --file=my.script.R --args $inputs $output </command>
<param name="inputs" type="data" multiple="True" />

I tried to change my original R script to accommodate this input, but have not been successful. The galaxy version of my R script looks like this:

inputs <- commandArgs(trailingOnly=TRUE)[1]
output <- commandArgs(trailingOnly=TRUE)[2]

# Get rows and columns for matrix
temp <- read.csv(inputs[1])
num.rows <- length(temp[[1]])
num.cols <- length(inputs)

The num.rows works, but the num.cols returns 1 instead of 170. It looks like only the first file in the multiple datasets field is being used. In my galaxy script,

read.csv(inputs[1])

is the same as

temp.file <- paste(directory, files[1], sep="")
temp <- read.csv(temp.file)

Trying to see what inputs looks like:

write.table(inputs, quote=FALSE, row.names=FALSE, col.names=FALSE)

returns

"~/galaxy/database/files/000/dataset_433.da"

I have difficulty following your R code, but a few points:

In the Galaxy code: why "read.csv(inputs[1])" and not "read.csv(inputs)", since 'inputs' is a path?

"num.cols <- length(inputs)" will give you the length of the vector 'inputs'. 'inputs' is just the path and therefore a vector of length 1.

(comment written 2.6 years ago by Hotz, Hans-Rudolf)
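To illustrate the point about length, here is a minimal sketch. It assumes, as the answer further down reports, that Galaxy hands the script all selected datasets as one comma-separated string. The paths are made up for the example.

```r
# Hypothetical value of the single command-line argument; the paths are made up.
inputs <- "/tmp/dataset_1.dat,/tmp/dataset_2.dat,/tmp/dataset_3.dat"

# One string is a character vector of length 1, however many paths it contains.
n.before <- length(inputs)

# Splitting on the commas yields one element per dataset.
files <- strsplit(inputs, ",", fixed = TRUE)[[1]]
n.after <- length(files)
```

Here n.before is 1 and n.after is 3, which would explain why num.cols came out as 1.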

As far as the R code goes:

  1. set a directory path as the argument

  2. Create a list of all of the files in that directory (csv's with 260 rows and 5 columns)

  3. Read in the first file

  4. Get the number of rows for the new matrix (260) from first file

  5. Get the number of columns for the new matrix (1 column for each file in the directory)

  6. Create an empty matrix with specified rows and columns

  7. Populate the matrix with values from the files

I'm essentially taking the 5th column from each of the csv's and then cbinding them together so that the final result will be 1 column of 260 values for each sample...hope that explained it better.
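The seven steps above can be sketched as one function. This is only an illustration of the described workflow, not the original script: the name combine_directory and the column argument are hypothetical, and it assumes every CSV in the directory has the same number of rows.

```r
# Sketch of steps 1-7. `directory` is a folder of CSV files with identical
# row counts; `column` defaults to 5 because the post uses the 5th column.
combine_directory <- function(directory, column = 5) {
    # Step 2: list every file, keeping the directory in each path
    files <- list.files(directory, full.names = TRUE)

    # Steps 3-6: the first file sets the row count, the file count sets the columns
    first <- read.csv(files[1])
    result <- matrix(nrow = nrow(first), ncol = length(files))

    # Step 7: copy the chosen column of each file into its own matrix column
    for (i in seq_along(files)) {
        result[, i] <- read.csv(files[i])[[column]]
    }
    result
}
```

Using full.names = TRUE avoids pasting the directory and file name together by hand, which only works when the directory argument ends in a slash.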

As far as the Galaxy code goes, I was just experimenting with how to access all of the files in the multiple inputs. I can only ever access the first file. Both "read.csv(inputs[1])" and "read.csv(inputs)" actually give me the same result: the first file in the multiple dataset argument (5 columns of 260 rows).

I'm trying to find some way to refer to all of the files in the multiple inputs (i.e. if I have 170 samples, there will be 170 columns; if I have 140 samples, there will be 140 columns, etc.).

(comment written 2.6 years ago by hortowe, modified 2.6 years ago)
Answer: hortowe wrote (2.6 years ago):

Figured out a solution that at least works for my specific case.

xml file:

<command>
R --vanilla --file=/path/to/file.R
--args $output $inputs
</command>

<inputs>
<param name="inputs" type="data" multiple="True"/>
</inputs>

<outputs>
<data name="output" format="tabular"/>
</outputs>

The inputs come in as a character string composed of all of the paths to each of the datasets selected in the multiple datasets input option on the galaxy interface, separated by commas. Split that on the commas and you get a list, and can reference each file by list[[1]][i]

Rscript:

output <- commandArgs(trailingOnly=TRUE)[1]
inputs <- commandArgs(trailingOnly=TRUE)[2]

# Separate multiple input files into a list of individual files
files <- strsplit(inputs, ',')

# Read in first file
temp.file <- files[[1]][1]
temp <- read.csv(temp.file)

# Make empty matrix
num.rows <- length(temp[[1]])
num.cols <- length(files[[1]])
empty.matrix <- matrix(nrow=num.rows, ncol=num.cols)

# Populate matrix by iterating over each file
for (i in 1:length(files[[1]])) {
    curr.file <- files[[1]][i]
    curr.data <- read.csv(curr.file)
    empty.matrix[,i] <- curr.data$column.I.want
}

# Write output
write.table(empty.matrix, file=output)
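
A more compact variant of the same idea, as a sketch only. The helper name combine_inputs is hypothetical, and it assumes every input file has the same number of rows (more than one) and that no dataset path contains a comma.

```r
# Hypothetical helper: `inputs` is the comma-separated string of dataset paths,
# `column` is the name or index of the column to pull from each file.
combine_inputs <- function(inputs, column) {
    files <- strsplit(inputs, ",", fixed = TRUE)[[1]]
    # sapply builds a matrix with one column per file when all files
    # return vectors of the same length
    sapply(files, function(f) read.csv(f)[[column]], USE.NAMES = FALSE)
}

# Possible use inside the Galaxy script (argument order as in the XML above):
#   args <- commandArgs(trailingOnly = TRUE)
#   write.table(combine_inputs(args[2], "column.I.want"), file = args[1])
```

This avoids pre-allocating the matrix and indexing files[[1]] repeatedly, at the cost of relying on sapply to simplify the result into a matrix.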