Hello Galaxy Developers.
We have come across an issue that is quite significant to our users.
The Preview mode chunks are duplicating rows when viewing large datasets.
We see duplicate lines in the preview that are not in the dataset when it is downloaded or used as input to other tools. They seem to appear because the chunks used to load large data into the dataset preview overlap by one line (the last line in one chunk is the first line in the next).
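To illustrate the kind of off-by-one overlap described above, here is a minimal sketch in Python. It is not Galaxy's actual preview code; the function names and the chunk size are hypothetical and only show how re-reading the last line of each chunk produces duplicate rows in the assembled display.

```python
def chunk_with_overlap(lines, chunk_size):
    """Hypothetical buggy chunking: each new chunk starts on the
    last line of the previous chunk, so that line appears twice."""
    chunks = []
    start = 0
    while start < len(lines):
        end = start + chunk_size
        chunks.append(lines[start:end])
        if end >= len(lines):
            break
        start = end - 1  # off by one: re-reads the previous chunk's last line
    return chunks


def chunk_without_overlap(lines, chunk_size):
    """Chunking where each chunk starts right after the previous one ends."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def flatten(chunks):
    """Join chunks back into one list, as a preview would display them."""
    return [line for chunk in chunks for line in chunk]


lines = [f"row{i}" for i in range(10)]
buggy = flatten(chunk_with_overlap(lines, 4))
fixed = flatten(chunk_without_overlap(lines, 4))
```

With ten input rows and a chunk size of four, `buggy` holds twelve rows (`row3` and `row6` each appear twice), while `fixed` matches the input exactly.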
I have an example file that reliably reproduces this issue if you require it. We have tried this on our own instances and on usegalaxy and found the same issue.
I'd appreciate any information you may have on this issue. Kind regards, Jo
It appears that the size of large files increases when they are uploaded to Galaxy. Do you also see a difference in file size compared with the original upload?
Hello,
Would you please share a link to the history that contains the example where display is overlapping/problematic? Send to galaxy-bugs@list.galaxyproject.org, from the same email used for your account, include the dataset number and a link to this post, and please be sure that the data is in an active state (not deleted) or it cannot be fully viewed/tested.
Thanks and we will investigate. Jen, Galaxy team
Thank you. I have sent this information to you.
Many thanks. Jo
Hello Jennifer,
I was wondering if you could confirm that you received the mail I sent containing the details you requested?
Many thanks Jo
I'm having trouble finding the email in the galaxy-bugs internal list. Could you resend it or send it to me directly? I'd like to take a look at this one.
I didn't look specifically at the size of the file, but rather at the number of lines. My file in Galaxy was correct (the downloaded copy was identical to the source); however, the display showed more lines than the metadata stated were present.
Joanna, the number of lines in the preview is an estimate. Counting lines in very large files is expensive, so Galaxy estimates the number of lines. For small files it should be accurate.
Hi Bjoern, We found that the line count shown was consistent with our input files, so this estimate was fine. The problem is that some lines are duplicated in the display, so copying and pasting the file from the preview display does not produce a reliable copy of the data. It displays more lines than are truly there.
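As a rough illustration of how a line count can be estimated without reading a whole file, here is a sketch in Python. It is an assumption about the general technique, not Galaxy's implementation; `estimate_line_count` and `sample_bytes` are hypothetical names. It counts newlines in a sample from the start of the file and scales by the file size.

```python
import os


def estimate_line_count(path, sample_bytes=1024 * 1024):
    """Estimate the number of lines by sampling the start of the file.

    Hypothetical helper: exact when the whole file fits in the sample,
    otherwise scaled from the average line length seen in the sample.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        sample = handle.read(sample_bytes)
    if not sample:
        return 0
    newlines = sample.count(b"\n")
    if len(sample) >= size:
        # Whole file was read: count a final unterminated line as well.
        return newlines + (0 if sample.endswith(b"\n") else 1)
    if newlines == 0:
        # No line break in the sample; assume at least one long line.
        return 1
    average_line_length = len(sample) / newlines
    return round(size / average_line_length)
```

An estimate like this is only as good as the assumption that the sampled lines are representative of the rest of the file, which is why it would be exact for small files and approximate for large ones.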
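One way to pin down which lines the display adds is to compare the text pasted from the preview with the downloaded file. The sketch below is a hypothetical check in Python, not a Galaxy tool; it assumes the preview is the original file with some extra lines inserted and nothing removed or reordered.

```python
def find_extra_lines(original, preview):
    """Return (preview_index, line) for each preview line that has no
    counterpart in the original, walking both lists in order."""
    extras = []
    position = 0  # next line of the original still waiting to be matched
    for index, line in enumerate(preview):
        if position < len(original) and line == original[position]:
            position += 1
        else:
            extras.append((index, line))
    return extras


# Made-up example data: the preview repeats "row3" and "row6".
original = ["row0", "row1", "row2", "row3", "row4",
            "row5", "row6", "row7", "row8", "row9"]
preview = ["row0", "row1", "row2", "row3", "row3", "row4",
           "row5", "row6", "row6", "row7", "row8", "row9"]
extras = find_extra_lines(original, preview)
```

For this example, `extras` is `[(4, "row3"), (8, "row6")]`, which shows the repeated rows and where they fall in the pasted text. If the extra lines turn up at regular intervals, that would be consistent with chunks overlapping at their boundaries.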