Question: 6.9 MB file being truncated by 49 bytes
4.5 years ago by anna.ceguerra, Australia
anna.ceguerra wrote:

Hi,

We have configured our instance of Galaxy to accept a certain type of binary file. However, when we upload the file and then download it again, the downloaded file is 49 bytes shorter than the original.

Has anyone else had this type of issue before?

Thanks and regards,

Anna.

galaxy • 1.8k views
modified 4.5 years ago by Dannon Baker • written 4.5 years ago by anna.ceguerra

Can you check the file size once uploaded to establish if the change happens on upload, download (or worse, do both damage the binary file)?

modified 4.5 years ago • written 4.5 years ago by Peter Cock

How do I check the uploaded file size?

written 4.5 years ago by anna.ceguerra

Just wanted to bump this. Has anyone had this problem? It is stopping us from using our instance of Galaxy.

written 4.5 years ago by anna.ceguerra
4.5 years ago by Dannon Baker, United States
Dannon Baker wrote:

So, the short answer here is that the 'is_binary' check is failing to detect this particular file as binary, because the first 100 characters all happen to be printable (see lib/galaxy/util/__init__.py: is_binary() and lib/galaxy/datatypes/checkers.py: check_binary if you're really interested). Since the upload method fails to detect it as binary (or as a datatype with to_posix_lines set to False), the line endings are automatically converted.

I'm not sure why this is even being attempted with the extension being manually set, but there may be a good reason I'm not aware of, so I'll look into it -- I don't see this having changed recently.  A really hacky fix for right now that results in the file being correctly detected is bumping up the temp.read(100) in check_binary to include more sample bytes from the file.
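To illustrate why the sample size matters, here is a minimal sketch of a "printable sample" heuristic. It is not Galaxy's actual is_binary() or check_binary code; the function name looks_binary, the sample_size parameter and the example payload are all made up for illustration.

```python
import string

# Bytes this illustrative heuristic treats as text (NOT Galaxy's real check).
PRINTABLE = frozenset(string.printable.encode("ascii"))


def looks_binary(data, sample_size=100):
    """Return True if any byte in the first sample_size bytes is non-printable."""
    return any(byte not in PRINTABLE for byte in data[:sample_size])


# Hypothetical file: 100 printable bytes followed by raw binary data.
payload = b"A" * 100 + b"\x00\x0d\x0a\xff"

print(looks_binary(payload))                    # False: the sample ends too early
print(looks_binary(payload, sample_size=1024))  # True: a larger sample sees the 0x00
```

A larger sample only makes a miss less likely. A file whose printable prefix is longer than the sample would still be treated as text.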

 

Edit* This should be resolved in the following commit that'll be in the next release:

https://bitbucket.org/galaxy/galaxy-central/commits/8b6e1ffaa053f87c944290f3a84e5f73633cd901

modified 4.5 years ago • written 4.5 years ago by Dannon Baker

Thanks very much. We'll patch our test server and let you know the result.

written 4.5 years ago by anna.ceguerra

Great!  If this was helpful, please do remember to mark the answer as accepted so it bubbles to the top and is easier for others to find in the future.

written 4.5 years ago by Dannon Baker

Hi,

I can't seem to access this site anymore. Is there an alternative URL?

Thanks.

written 4.5 years ago by anna.ceguerra

Never mind, bitbucket was down temporarily. It works, thanks!

written 4.5 years ago by anna.ceguerra
4.5 years ago by anna.ceguerra, Australia
anna.ceguerra wrote:

How do I check the uploaded file size?

written 4.5 years ago by anna.ceguerra

The easiest way to check the uploaded filesize is going to be to look at the dataset's information box as an admin user.  That will give you the full path to the dataset, at which point you can use whatever utility you're comfortable with (ls).  Once you're able to do that, please do test the things Peter asked for above.
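If ls alone is not conclusive, a small script can compare the size and checksum of the local original with the dataset file on the Galaxy server. This is only a sketch: the helper name size_and_sha256 is invented, and the two temporary files stand in for the real paths (your local file and the full path shown in the dataset's information box).

```python
import hashlib
import os
import tempfile


def size_and_sha256(path):
    """Return (size in bytes, SHA-256 hex digest) for the file at path."""
    with open(path, "rb") as handle:  # binary mode: no newline translation
        digest = hashlib.sha256(handle.read()).hexdigest()
    return os.path.getsize(path), digest


# Stand-in files so the sketch runs anywhere; replace with the real paths.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.pos")
    stored = os.path.join(tmp, "dataset.dat")
    with open(original, "wb") as f:
        f.write(b"\x01\r\n\x02")
    with open(stored, "wb") as f:
        f.write(b"\x01\n\x02")

    orig_size, orig_hash = size_and_sha256(original)
    stored_size, stored_hash = size_and_sha256(stored)

print("bytes lost on upload:", orig_size - stored_size)
print("contents identical:", orig_hash == stored_hash)
```

Running the same comparison on the downloaded copy shows whether the download step changes the file as well.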

written 4.5 years ago by Dannon Baker

Thanks very much for the info. It is truncating during upload only, not during download.

written 4.5 years ago by anna.ceguerra

Can you post how exactly you've configured your Galaxy to accept the binary file, and perhaps post a sample file?

written 4.5 years ago by Dannon Baker

I've uploaded the datatypes_conf.xml and cvl.py files into a repository called 'apm_datatypes'. The type of file being truncated is the pos file. Where can I post my sample file?

<?xml version="1.0"?>
<datatypes>
  <datatype_files>
    <datatype_file name="cvl.py"/>
  </datatype_files>
  <registration>
    <datatype extension="pos" type="galaxy.datatypes.binary:Binary" mimetype="application/octet-stream" display_in_upload="true" subclass="True"/>
    <datatype extension="ft" type="galaxy.datatypes.binary:Binary" mimetype="application/octet-stream" display_in_upload="true" subclass="True"/>
    <datatype extension="rng" type="galaxy.datatypes.cvl:Rng" mimetype="text/plain" display_in_upload="True"/>
    <datatype extension="xml" type="galaxy.datatypes.cvl:Xml" mimetype="text/plain" display_in_upload="True"/>
  </registration>
  <sniffers>
    <sniffer type="galaxy.datatypes.cvl:Pos"/>
    <sniffer type="galaxy.datatypes.cvl:Ft"/>
    <sniffer type="galaxy.datatypes.cvl:Xml"/>
  </sniffers>
</datatypes>

 

written 4.5 years ago by anna.ceguerra

We’ve got a couple of small datasets that I’m tearing apart to analyse this problem (I’m Anna’s colleague). The data sets are big endian single-precision floats.

I read the two data files (before uploading and after uploading) into Matlab as unsigned 8-bit integers, hereafter referred to as bytes.

Bytes #327, #2268, #3179 and #4767 have the value “13” in the original file but are read as “10” after uploading, i.e. the byte 00001101 turns into 00001010. This continues at locations which seem random but are not: throughout the first 55126 bytes, wherever there is a “13” it gets changed to “10”.

This misreading continues until byte #55127 at which point in the original file there is a “13” followed by “10” at byte #55128. Things get really crazy when these are both replaced by one byte reading “10”. All subsequent bytes are shifted back one in order, rendering the subsequent reading of single precision numbers useless.
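The same comparison can be reproduced outside Matlab. The sketch below uses a tiny made-up byte sequence rather than the real data, and the function name differing_positions is invented for illustration. It lists the 1-based positions where two byte strings disagree.

```python
def differing_positions(original, uploaded, limit=10):
    """Return up to `limit` tuples (1-based position, original byte, uploaded byte)."""
    diffs = []
    for pos, (a, b) in enumerate(zip(original, uploaded), start=1):
        if a != b:
            diffs.append((pos, a, b))
            if len(diffs) == limit:
                break
    return diffs


# Made-up miniature of the pattern: a lone 13, then a 13 followed by a 10.
original = bytes([1, 13, 2, 13, 10, 3])
uploaded = bytes([1, 10, 2, 10, 3])  # lone 13 -> 10, and 13 10 -> single 10

print(differing_positions(original, uploaded))
# [(2, 13, 10), (4, 13, 10), (5, 10, 3)]
print("bytes lost:", len(original) - len(uploaded))
# bytes lost: 1
```

Every position after the collapsed pair shows up as a mismatch, because the uploaded bytes have shifted back by one.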

I hope Galaxy is just being superstitious. Seriously though, this is an extremely odd problem which needs to be resolved for others to upload their data. I can’t think where to start on this one.

written 4.5 years ago by leigh.stephenson

If you don't have a place to host one, feel free to email a small sample file directly to galaxy-bugs (or me directly -- dannon.baker@gmail.com).

Do you have a link to the apm_datatypes repository referred to here?

written 4.5 years ago by Dannon Baker
4.5 years ago by d.benson, Australia
d.benson wrote:

Hi Folks

I was asked by a colleague to take a look at this.

Has anyone considered that it could be related to the carriage return character (or end of record) for Windows and Unix?

On Windows a line ending is the two bytes 0D 0A (CR LF) in hex.

On Unix it is just the single byte 0A (LF).

If this was UTF-16 you could pad in a few extra zeros (000D 000A and 000A in big endian). That could explain why a 0D0A ends up as a 0A, if there is a translation going on between Windows and Unix file formats.
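A minimal sketch of such a translation, applied to raw bytes, shows both effects reported in this thread: a lone 0D becomes 0A, and each 0D 0A pair collapses to a single 0A. The function name to_posix_line_endings is made up here, and this is an assumption about the kind of conversion involved, not Galaxy's actual upload code.

```python
def to_posix_line_endings(data):
    """Illustrative text-style normalisation: CR LF -> LF, then lone CR -> LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Made-up binary content containing one lone CR and one CR LF pair.
original = bytes([0x41, 0x0D, 0x42, 0x0D, 0x0A, 0x43])
converted = to_posix_line_endings(original)

print(converted.hex(" "))  # 41 0a 42 0a 43
print("bytes lost:", len(original) - len(converted))

# Only the CR LF pairs shorten the file; lone CRs are changed in place.
assert len(original) - len(converted) == original.count(b"\r\n")
```

If this is what happens on upload, a file that comes back 49 bytes short would have contained 49 such 0D 0A pairs. That is an inference from the sketch, not something checked against the actual file.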

Kind regards

Derek

written 4.5 years ago by d.benson

Thanks Derek, that makes a lot of sense. But should it be doing this for binary files? Or have I configured Galaxy incorrectly?

written 4.5 years ago by anna.ceguerra

I've posted a separate top-level answer, but no, it shouldn't be doing this for binary files. I don't think there's an issue with your configuration; the problem is with the way Galaxy detects binary files.

written 4.5 years ago by Dannon Baker
4.5 years ago by d.benson, Australia
d.benson wrote:

Hi Anna

Given that you have specified:

<datatype extension="pos" type="galaxy.datatypes.binary:Binary" mimetype="application/octet-stream" display_in_upload="true" subclass="True"/>

I would think that Galaxy should treat this as a binary file to be saved and not modify it.

When you upload data, are you able to select your data type in the Upload File / File Format pull-down, or do you leave it at auto-detect? Auto-detect may not be sufficient.

Derek

written 4.5 years ago by d.benson

The 'pos' file type is specified. Auto-detect doesn't work for us.

written 4.5 years ago by anna.ceguerra