

by jmuhlich 2308 days ago

Hi Andrew! While looking into similar requirements, I came across Feather and fst. Both let you efficiently slice into compressed on-disk DataFrames. Feather already supports Python, but fst is just for C++ and R at the moment.

https://blog.rstudio.com/2016/03/29/feather/

http://www.fstpackage.org/ https://github.com/fstpackage/fstlib

2 comments

dalke 2308 days ago

Thanks Jeremy. I looked at Feather, and the underlying Arrow format, but couldn't figure out how to implement a row-id-label-to-row-index dictionary. The only "dictionary" I saw was for categorical data.

I had the same issue with fstlib - how do I handle id lookups?

Any pointers?


jmuhlich 2307 days ago

Sorry, I thought Feather supported random row access but it turns out it only supports random column access.

For fst, I only played with the R interface, which would be called like this to retrieve row 12345 from the "fingerprint" column:

  read_fst("library.fst", columns="fingerprint", from=12345, to=12345)
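That call needs a row number, so an id lookup has to come from somewhere else. One possible approach, sketched here in Python and not something fst or Feather provides, is a separate id-to-row mapping kept alongside the file. The function name and the example ids are made up for illustration; the row numbers are 1-based to match the from/to arguments above.

```python
def build_row_index(ids):
    """Map each record id to its 1-based row number (hypothetical sidecar index)."""
    index = {}
    for row, record_id in enumerate(ids, start=1):
        if record_id in index:
            # A lookup table only works if ids are unique.
            raise ValueError(f"duplicate id: {record_id!r}")
        index[record_id] = row
    return index


# Made-up ids, listed in the same order as the rows were written to the file.
ids = ["id_a", "id_b", "id_c"]
row_index = build_row_index(ids)

# The resulting number would be passed as both from= and to= in read_fst().
row = row_index["id_b"]
print(row)
```

The mapping would have to be stored and loaded separately (for example as its own file), which is the part neither format handles for you.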

However, fst didn't offer a raw/binary column datatype last I checked, which is frustrating. It has chr (string), but R strings can't contain embedded NUL bytes, so that was a dead end for efficiently storing binary fingerprints from R. I didn't check whether the underlying fstlib structures accept NULs in string columns.
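One workaround, shown here only as a sketch in Python and not as anything fst itself offers, is to encode each fingerprint as text that can never contain a NUL byte. Base64 is used below; the helper names and the sample bytes are assumptions for illustration.

```python
import base64


def encode_fingerprint(fp: bytes) -> str:
    """Encode raw fingerprint bytes as ASCII text that is safe for a string column."""
    return base64.b64encode(fp).decode("ascii")


def decode_fingerprint(text: str) -> bytes:
    """Recover the original fingerprint bytes from the encoded text."""
    return base64.b64decode(text)


# Made-up fingerprint containing NUL bytes, which a plain R string could not hold.
fp = bytes([0x00, 0xFF, 0x10, 0x00])
encoded = encode_fingerprint(fp)
restored = decode_fingerprint(encoded)
```

The cost is size: Base64 text is about a third larger than the raw bytes before compression, so this trades storage efficiency for compatibility.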


dalke 2303 days ago

Thanks for confirming that. Looks like I'll need to look elsewhere for the next chemfp fingerprint file format.


jdnier 2308 days ago

Latest info on Feather and related pyarrow: http://arrow.apache.org/docs/python/ipc.html#feather-format
