Hi Andrew! Looking into similar requirements I came across Feather and fst. They both basically let you efficiently slice into compressed on-disk DataFrames. Feather already supports Python, but fst is just for C++ and R at the moment.
Thanks Jeremy. I looked at Feather, and the underlying Arrow format, but couldn't figure out how to implement a row-id-label-to-row-index dictionary. The only "dictionary" I saw was for categorical data.
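To be concrete about what I mean by a row-id-label-to-row-index dictionary, here is a minimal sketch in plain Python. The ids, the fingerprint bytes, and the `lookup` helper are all made up for illustration; nothing here is Feather, Arrow, or fst API. It only shows the lookup I'd like the on-disk format to support.

```python
# Hypothetical data: three row labels and their binary fingerprints.
# Note the embedded NUL bytes, which is what makes string columns awkward.
ids = ["id_a", "id_b", "id_c"]
fingerprints = [b"\x00\x1f", b"\xa0\x00", b"\x07\x07"]

# The mapping in question: row-id label -> row index.
row_index = {label: i for i, label in enumerate(ids)}


def lookup(label):
    """Return the fingerprint for a row label (hypothetical helper)."""
    return fingerprints[row_index[label]]


print(row_index["id_b"])
print(lookup("id_b"))
```

In memory this is trivial; the open question is how to get the same label-to-index step without loading every id from disk first.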
I had the same issue with fstlib - how do I handle id lookups?
Also, last I checked, fst didn't offer a raw/binary column datatype, which is frustrating. It has chr (string), but R strings can't contain embedded NUL bytes, so that was a dead end for efficiently storing binary fingerprints in R. I didn't check whether the underlying fstlib structures accept NULs in string columns.
Any pointers?