Hacker News
by dalke 2309 days ago
Thanks Jeremy. I looked at Feather, and the underlying Arrow format, but couldn't figure out how to implement a row-id-label-to-row-index dictionary. The only "dictionary" I saw was for categorical data.

I had the same issue with fstlib - how do I handle id lookups?

Any pointers?
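
To make the question concrete, here is a minimal sketch of the kind of lookup I mean, written against the Python standard library only. It is not a Feather, Arrow, or fst feature. The names (`build_id_index`, `read_row`), the ids, and the 4-byte fingerprints are all made up for illustration, and it assumes every record has the same fixed width.

```python
import io

def build_id_index(ids):
    """Map each row-id label to its 0-based row index; reject duplicate labels."""
    index = {}
    for row, label in enumerate(ids):
        if label in index:
            raise ValueError(f"duplicate id {label!r}")
        index[label] = row
    return index

def read_row(f, row, record_size):
    """Fetch one fixed-width record by row index from a seekable binary file."""
    f.seek(row * record_size)
    data = f.read(record_size)
    if len(data) != record_size:
        raise IndexError(f"row {row} out of range")
    return data

# Hypothetical data: three ids and three 4-byte fingerprints.
# Python bytes objects can hold embedded NUL bytes without trouble.
ids = ["CHEMBL1", "CHEMBL2", "CHEMBL3"]
fps = [b"\x00\x01\x02\x03", b"\xff\x00\x00\x10", b"\x0a\x0b\x00\x0d"]

store = io.BytesIO(b"".join(fps))   # stands in for a file opened with "rb"
id_to_row = build_id_index(ids)     # the label-to-row-index dictionary

fp = read_row(store, id_to_row["CHEMBL2"], record_size=4)
```

The dictionary here lives in memory and has to be rebuilt or loaded separately. What I can't see is how to store the equivalent inside the file format itself, so that a lookup by id doesn't require reading the whole id column first.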

1 comments

Sorry, I thought Feather supported random row access but it turns out it only supports random column access.

For fst, I only played with the R interface, which would be called like this to retrieve row 12345 from the "fingerprint" column:

  read_fst("library.fst", columns="fingerprint", from=12345, to=12345)

However, fst didn't offer a raw/binary column datatype last I checked, which is frustrating. It has chr (string), but R can't have embedded NUL bytes in strings, so that was a dead end for efficient storage of binary fingerprints in R. I didn't check whether the underlying fstlib structures accept NULs in string columns.

Thanks for confirming that. Looks like I'll need to look elsewhere for the next chemfp fingerprint file format.