Question: Creating A Galaxy Tool In R - "You Must Not Use 8-Bit Bytestrings"
Dan Tenenbaum • 20 wrote:
Hello,
I'm a Galaxy newbie and am running into several issues trying to adapt an
R script to be a Galaxy tool.
I'm looking at the XY plotting tool for guidance
(tools/plot/xy_plot.xml), but I decided not to embed my script in the XML
and instead keep it in a separate script file. That way I can still
run it from the command line and make sure it works as I make
incremental changes (so my script starts with args <-
commandArgs(TRUE)). Also, if it works on the command line but fails in
Galaxy, that suggests to me that the problem is in my Galaxy configuration.
First, I tried using the r_wrapper.sh script that comes with the XY
plotting tool, but it threw away my arguments:
An error occurred running this job: ARGUMENT '/Users/dtenenba/dev/galaxy-dist/database/files/000/dataset_4.dat' __ignored__
ARGUMENT '/Users/dtenenba/dev/galaxy-dist/database/files/000/dataset_3.dat' __ignored__
ARGUMENT 'Fly' __ignored__
ARGUMENT 'Tagwise' __ignored__
etc.
So then I tried just switching to Rscript:
Rscript RNASeq.R $countsTsv $designTsv
"$organism" $dispersion $minimumCountsPerMillion
$minimumSamplesPerTranscript $out_file1 $out_file2
(My script produces as output a csv file and a pdf file. The final two
arguments I'm passing are the names of those files.)
But then I get an error that Rscript can't be found.
So I wrote a little wrapper script, Rscript_wrapper.sh:
#!/bin/sh
Rscript $*
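A side note on this wrapper: unquoted $* is re-split on whitespace by the shell, so a quoted value such as "$organism" would be broken into several arguments if it ever contained a space, whereas "$@" forwards each argument unchanged. A minimal sketch of the difference follows; show_args, forward_unquoted and forward_quoted are hypothetical helper names used only for illustration, and "Homo sapiens" is a made-up value.

```shell
#!/bin/sh
# Hypothetical helper: print each received argument in brackets,
# one per line, so that word splitting becomes visible.
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# Unquoted $* : the shell re-splits every argument on whitespace.
forward_unquoted() {
    show_args $*
}

# Quoted "$@" : each original argument is forwarded as one word.
forward_quoted() {
    show_args "$@"
}

# Made-up example values; the first argument contains a space.
forward_unquoted "Homo sapiens" Tagwise   # three bracketed lines
forward_quoted "Homo sapiens" Tagwise     # two bracketed lines
```

Under that reading, the body of the wrapper would become Rscript "$@". With the single-word values used here (Fly, Tagwise) both forms should behave the same.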
And called that:
Rscript_wrapper.sh RNASeq.R $countsTsv
$designTsv "$organism" $dispersion $minimumCountsPerMillion
$minimumSamplesPerTranscript $out_file1 $out_file2
Then I got an error that RNASeq.R could not be found.
So then I added the absolute path to my R script to the <command> tag.
This seemed to work (that is, it got me further, to the next error),
but I'm not sure why I had to do this. In all the other tools I'm
looking at, the directory of the script to run does not have to be
specified; I assumed that the command would run in the appropriate
directory.
So now I've specified the full path to my R script:
Rscript_wrapper.sh
/Users/dtenenba/dev/galaxy-dist/tools/bioc/RNASeq.R $countsTsv
$designTsv "$organism" $dispersion $minimumCountsPerMillion
$minimumSamplesPerTranscript $out_file1 $out_file2
And I get the following long error, which includes all of the output
of my R script:
Traceback (most recent call last):
  File "/Users/dtenenba/dev/galaxy-dist/lib/galaxy/jobs/runners/local.py", line 133, in run_job
    job_wrapper.finish( stdout, stderr )
  File "/Users/dtenenba/dev/galaxy-dist/lib/galaxy/jobs/__init__.py", line 725, in finish
    self.sa_session.flush()
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/scoping.py", line 127, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/session.py", line 1356, in flush
    self._flush(objects)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/session.py", line 1434, in _flush
    flush_context.execute()
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py", line 261, in execute
    UOWExecutor().execute(self, tasks)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py", line 753, in execute
    self.execute_save_steps(trans, task)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py", line 768, in execute_save_steps
    self.save_objects(trans, task)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py", line 759, in save_objects
    task.mapper._save_obj(task.polymorphic_tosave_objects, trans)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/mapper.py", line 1413, in _save_obj
    c = connection.execute(statement.values(value_params), params)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py", line 824, in execute
    return Connection.executors[c](self, object, multiparams, params)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py", line 874, in _execute_clauseelement
    return self.__execute_context(context)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py", line 896, in __execute_context
    self._cursor_execute(context.cursor, context.statement, context.parameters[0], context=context)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py", line 950, in _cursor_execute
    self._handle_dbapi_exception(e, statement, parameters, cursor, context)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py", line 931, in _handle_dbapi_exception
    raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
ProgrammingError: (ProgrammingError) You must not use 8-bit
bytestrings unless you use a text_factory that can interpret 8-bit
bytestrings (like text_factory = str). It is highly recommended that
you instead just switch your application to Unicode strings. u'UPDATE
job SET update_time=?, stdout=?, stderr=? WHERE job.id = ?'
['2012-04-24 18:55:45.791417', '', 'BiocInstaller version 1.5.7,
?biocLite for help\nWarning message:\nNAs introduced by coercion
\nLoading required package: methods\nLoading required package:
limma\nLoading required package: BiasedUrn\nLoading required package:
geneLenDataBase\nLoading required package: org.Dm.eg.db\nLoading
required package: AnnotationDbi\nLoading required package:
BiocGenerics\n\nAttaching package:
\xe2\x80\x98BiocGenerics\xe2\x80\x99\n\nThe following object(s) are
masked from \xe2\x80\x98package:stats\xe2\x80\x99:\n\n xtabs\n\nThe
following object(s) are masked from
\xe2\x80\x98package:base\xe2\x80\x99:\n\n anyDuplicated, cbind,
colnames, duplicated, eval, Filter, Find,\n get, intersect, lapply,
Map, mapply, mget, order, paste, pmax,\n pmax.int, pmin, pmin.int,
Position, rbind, Reduce, rep.int,\n rownames, sapply, setdiff,
table, tapply, union, unique\n\nLoading required package:
Biobase\nWelcome to Bioconductor\n\n Vignettes contain introductory
material; view with\n \'browseVignettes()\'. To cite Bioconductor,
see\n \'citation("Biobase")\', and for packages
\'citation("pkgname")\'.\n\nLoading required package:
DBI\n\nCalculating library sizes from column totals.\nError in
matrix(u, nrow = nrows, byrow = TRUE) : \n negative extents to
matrix\nCalls: plotMDS.DGEList ... equalizeLibSizes -> splitIntoGroups
-> lapply -> FUN -> matrix\nExecution halted\n', 15]
Note that if I run my script from the command line:
./Rscript_wrapper.sh RNASeq.R
/Users/dtenenba/dev/galaxy-dist/database/files/000/dataset_4.dat
/Users/dtenenba/dev/galaxy-dist/database/files/000/dataset_3.dat Fly 1
1 Tagwise MDSPlot.pdf outputs.csv
It works fine and does not produce a warning about "NAs introduced by
coercion", nor does it fail with the "Error in matrix" above.
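One thing that may be worth checking is whether both invocations pass the arguments in the same positions. The command template above lists $dispersion directly after "$organism" (and the r_wrapper.sh output shows 'Fly' followed by 'Tagwise'), while the command-line call passes Fly 1 1 Tagwise. A minimal sketch for making the positions visible, assuming a hypothetical helper log_args that could be called from the wrapper script; the file names in the demo call are made up:

```shell
#!/bin/sh
# Hypothetical debugging helper: print every argument together with
# its position, one per line, on stdout.
log_args() {
    i=1
    for arg in "$@"; do
        printf 'arg %d: %s\n' "$i" "$arg"
        i=$((i + 1))
    done
}

# Demo with made-up file names, using the order of the command template:
# counts, design, organism, dispersion, two thresholds, two outputs.
log_args counts.tsv design.tsv Fly Tagwise 1 1 out.csv out.pdf
```

Redirecting this output to a file from inside the wrapper, once under Galaxy and once from the command line, would show whether a non-numeric value such as Tagwise lands where the R script expects a number. That could be one explanation for "NAs introduced by coercion", though this is a guess rather than a confirmed diagnosis.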
So, can anyone tell me what is going wrong here? Why does R behave
differently in Galaxy than it does on the command line? (I'm using the
same instance of R on the same machine for both my Galaxy and command-line
efforts.) Is this 8-bit bytestring error a red herring? Can I filter
it so that Galaxy is happy?
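On the filtering idea: the stderr text in the traceback contains the bytes \xe2\x80\x98 and \xe2\x80\x99, which are the UTF-8 encodings of curly quotes, and those are presumably the "8-bit bytestrings" the database layer objects to. A rough sketch of a filter that drops every byte outside plain ASCII; ascii_only is a hypothetical helper name, and hooking it into the wrapper is an assumption, not something Galaxy provides:

```shell
#!/bin/sh
# Hypothetical filter: keep only tab, newline, carriage return and
# printable ASCII; delete every other byte. LC_ALL=C makes tr work
# on single bytes rather than multibyte characters.
ascii_only() {
    LC_ALL=C tr -cd '\11\12\15\40-\176'
}

# Demo: the octal escapes are the UTF-8 bytes of the curly quotes
# seen in the traceback (\xe2\x80\x98 and \xe2\x80\x99).
printf 'Attaching package: \342\200\230BiocGenerics\342\200\231\n' | ascii_only
```

In the wrapper this might look like Rscript "$@" 2>&1 | ascii_only. That merges stderr into stdout and hides Rscript's exit status, so it is only a starting point for experimenting.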
Finally, one other curiosity. Every time I hit "Execute" in Galaxy to
run my tool, it is run twice: two jobs are created (each of which fails in
the same way). Why is this?
My R script:
My XML file:
I can share more data (such as sample input files) if necessary.
Thanks for your help.
Dan