Hacker News
by jmuhlich 2308 days ago
Hi Andrew! Looking into similar requirements I came across Feather and fst. They both basically let you efficiently slice into compressed on-disk DataFrames. Feather already supports Python, but fst is just for C++ and R at the moment.

https://blog.rstudio.com/2016/03/29/feather/

http://www.fstpackage.org/ https://github.com/fstpackage/fstlib

2 comments

Thanks Jeremy. I looked at Feather, and the underlying Arrow format, but couldn't figure out how to implement a row-id-label-to-row-index dictionary. The only "dictionary" I saw was for categorical data.

I had the same issue with fstlib - how do I handle id lookups?

Any pointers?
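
To make the question concrete, here is a minimal Python sketch of what I mean by a row-id-label-to-row-index dictionary. It does not use Feather, Arrow, or fstlib; it assumes fixed-width fingerprints written back to back, and the names `FP_BYTES`, `write_records`, `read_by_id`, and the ids are all made up for illustration.

```python
import io

FP_BYTES = 4  # hypothetical fixed fingerprint width, in bytes


def write_records(out, records):
    """Write fixed-width fingerprints back to back.

    Returns a dict mapping each record id to its row index.
    """
    id_to_row = {}
    for row, (rec_id, fp) in enumerate(records):
        if len(fp) != FP_BYTES:
            raise ValueError(f"fingerprint for {rec_id!r} is not {FP_BYTES} bytes")
        id_to_row[rec_id] = row
        out.write(fp)
    return id_to_row


def read_by_id(f, id_to_row, rec_id):
    """Look up the row index for an id, then seek directly to that row."""
    row = id_to_row[rec_id]
    f.seek(row * FP_BYTES)
    return f.read(FP_BYTES)


# Made-up ids and fingerprints, for illustration only.
records = [
    ("id_a", b"\x00\x01\x02\x03"),
    ("id_b", b"\xff\x00\xff\x00"),
    ("id_c", b"\x10\x20\x30\x40"),
]

buf = io.BytesIO()  # stands in for an on-disk file
index = write_records(buf, records)
fp = read_by_id(buf, index, "id_b")
```

The open question is where a mapping like `index` would live inside the file format itself, rather than in a separate sidecar structure.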

Sorry, I thought Feather supported random row access but it turns out it only supports random column access.

For fst, I only played with the R interface, which would be called like this to retrieve row 12345 from the "fingerprint" column:

  read_fst("library.fst", columns="fingerprint", from=12345, to=12345)

However, fst didn't offer a raw/binary column datatype last I checked, which is frustrating. It has chr (string) but R can't have embedded NUL bytes in strings, so that was a dead end for efficient storage of binary fingerprints for R. I didn't check if the underlying fstlib structures accept NULs in string columns.
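
To illustrate the storage cost, here is a small Python sketch of the obvious workaround for a string-only column: text-encoding the bytes so no NUL can appear. This is only an assumption about how one might work around the limitation, not something fst provides and not something I tried; the fingerprint value is made up.

```python
import base64

# Made-up 6-byte fingerprint containing NUL bytes.
fp = bytes([0x00, 0x9F, 0x00, 0xFF, 0x10, 0x00])

# Hex encoding: two text characters per byte, so storage doubles.
hex_text = fp.hex()

# Base64 encoding: four text characters per three bytes.
b64_text = base64.b64encode(fp).decode("ascii")

# Both encodings are NUL-free and decode back to the original bytes.
hex_roundtrip = bytes.fromhex(hex_text)
b64_roundtrip = base64.b64decode(b64_text)

print(len(fp), len(hex_text), len(b64_text))
```

Either encoding would fit in a chr column, but the size overhead and the encode/decode step on every access are what make it inefficient.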

Thanks for confirming that. Looks like I'll need to look elsewhere for the next chemfp fingerprint file format.

Latest info on Feather and related pyarrow: http://arrow.apache.org/docs/python/ipc.html#feather-format