by natpat 1866 days ago
This is super interesting. I've recently been working on a similar concept: we have a reasonable amount of data (in the terabytes) that's fairly static and that I need to search fairly infrequently (but sometimes in bulk). The solution we came up with was a small, hot, in-memory index that points to the location of the data in a file on S3. Random access to a file on S3 is pretty fast, and running on an EC2 instance means latency to S3 is almost nil. Cheap, fast and effective.
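
A minimal sketch of that idea, under stated assumptions: a plain dict stands in for the trie, the documents are JSON, and an in-memory byte string stands in for the S3 object so the code runs without credentials. The names `build_blob_and_index`, `range_header`, `lookup` and `read_from_memory` are hypothetical and not taken from the setup described above.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical index layout: document id -> (byte offset, byte length) in one large object.
Index = Dict[str, Tuple[int, int]]


def build_blob_and_index(docs: Dict[str, dict]) -> Tuple[bytes, Index]:
    """Concatenate JSON documents and record where each one starts and how long it is."""
    blob = bytearray()
    index: Index = {}
    for doc_id, doc in docs.items():
        payload = json.dumps(doc).encode("utf-8")
        index[doc_id] = (len(blob), len(payload))
        blob += payload
    return bytes(blob), index


def range_header(offset: int, length: int) -> str:
    """Build an HTTP Range header value; the end position is inclusive."""
    return f"bytes={offset}-{offset + length - 1}"


def lookup(doc_id: str, index: Index, read_range: Callable[[int, int], bytes]) -> dict:
    """Resolve the id in memory, then fetch only that document's bytes."""
    offset, length = index[doc_id]
    return json.loads(read_range(offset, length))


# Stand-in for the remote object (assumption: the real data would live in S3).
BLOB, INDEX = build_blob_and_index({"a1": {"title": "first"}, "b2": {"title": "second"}})


def read_from_memory(offset: int, length: int) -> bytes:
    return BLOB[offset:offset + length]


print(lookup("b2", INDEX, read_from_memory))
```

Against S3, `read_range` would instead issue a ranged GET using the value from `range_header`, for example through the `Range` parameter of boto3's `get_object`.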

We're using some custom Python code to build a Marisa Trie as our index. I was wondering if there were alternatives to this setup?

7 comments

You could look at AWS Athena, especially if you only query infrequently and can wait a minute on the search results. There are some data layout patterns in your S3 bucket that you can use to optimize the search. Then you have true pay-per-use querying and don't even have to run any EC2 nodes or code yourself.
> that I need to search fairly infrequently (but sometimes in bulk).

What do you mean by search? Full-text search? Do you need to run custom code on the original data?

> A solution we came up with was a small, hot, in-memory index, that points to the location of the data in a file on S3.

Yes, it's like keeping the block index of an SSTable (in RocksDB) in memory. The next step is to have a local cache on the EC2 node. And the step after that is to have a "distributed" cache across your EC2 nodes, so you don't query S3 for a chunk if it's present on any of your other nodes.
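
The lookup order described here (local cache, then other nodes, then S3) can be sketched as a read-through chain. This is only an illustration under assumptions: `LruCache` is a small in-process stand-in for a local disk cache, and `ask_peers` and `fetch_origin` are hypothetical callables for the peer nodes and the origin store.

```python
from collections import OrderedDict
from typing import Callable, Optional

Fetch = Callable[[str], Optional[bytes]]


class LruCache:
    """Tiny in-process LRU standing in for a local disk cache."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry


def read_chunk(key: str, local: LruCache, ask_peers: Fetch, fetch_origin: Fetch) -> bytes:
    """Try the local cache, then peer nodes, then the origin store."""
    data = local.get(key)
    if data is not None:
        return data
    data = ask_peers(key)
    if data is None:
        data = fetch_origin(key)
    if data is None:
        raise KeyError(key)
    local.put(key, data)  # keep a copy so the next read stays local
    return data
```

Replication and cache invalidation, the hard parts of a real distributed cache, are deliberately left out.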

Come to think of it, I searched and didn't find a "distributed disk cache with optional replication" that can be used in front of S3 or whatever dataset. You can use nginx/varnish as a reverse-proxy but it doesn't have "distributed". There is Alluxio, but it's single-master.

> Come to think of it, I searched and didn't find a "distributed disk cache with optional replication" that can be used in front of S3 or whatever dataset. You can use nginx/varnish as a reverse-proxy but it doesn't have "distributed". There is Alluxio, but it's single-master.

If you think more about this, it would be like a distributed key-value store that supports both disk and memory access. You could write one using some open-source Raft libraries, or a possible candidate is TiKV from PingCAP.

> If you think more about this, it would be like a distributed key-value store that supports both disk and memory access. You could write one using some open-source Raft libraries, or a possible candidate is TiKV from PingCAP.

My whole point was not building it ;)

There's also https://github.com/NVIDIA/aistore

> What do you mean by search?

Search maybe is too strong a word - "lookup" is probably more correct. I have a couple of identifiers for each document, from which I want to retrieve the full doc.

I'm not sure what you mean by running custom code on the data. I usually do some kind of transformation afterwards.

I didn't find anything either, which is why I was wondering if I was searching for the wrong thing.

How big is each document? If documents are big, keep each of them as a separate file and store the ids in a database. If documents are small, then you want something like https://github.com/rockset/rocksdb-cloud as a building block.
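
For the "big documents" case, a minimal sketch of the id database, assuming SQLite and assuming each document has several identifiers that all map to one object key. The table name `doc_ids` and the functions `open_catalog`, `register` and `find_object_key` are hypothetical.

```python
import sqlite3
from typing import Iterable, Optional


def open_catalog(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the catalog that maps identifiers to object keys."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_ids ("
        " identifier TEXT PRIMARY KEY,"
        " object_key TEXT NOT NULL)"
    )
    return conn


def register(conn: sqlite3.Connection, object_key: str, identifiers: Iterable[str]) -> None:
    """Point every identifier of a document at the file that stores it."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO doc_ids (identifier, object_key) VALUES (?, ?)",
            [(ident, object_key) for ident in identifiers],
        )


def find_object_key(conn: sqlite3.Connection, identifier: str) -> Optional[str]:
    """Return the object key for an identifier, or None if it is unknown."""
    row = conn.execute(
        "SELECT object_key FROM doc_ids WHERE identifier = ?", (identifier,)
    ).fetchone()
    return row[0] if row else None
```

The returned key is then used to fetch the whole file from the object store.
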
Combining data at rest with some slim index structure, coupled with a common access method (like HTTP), was the idea behind a tool I once wrote, a key-value store for JSON: https://github.com/miku/microblob

I first thought of building a custom index structure, but found that I did not need everything in memory all the time. Using an embedded leveldb works just fine.

There might be a much better alternative, but it really depends on the nature of your key.

Because the crux of S3 is the latency, you can also decide to encode the docs in blocks and retrieve more data than is actually needed.

For this demo, the index from DocID to offset in S3 takes 1.2 bytes per doc. For a log corpus, we end up with 0.2 bytes per doc.
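
One way to picture the block approach is a sparse index that stores only the first DocID and byte offset of each block, then fetches the whole block and scans it. This is a sketch under assumptions (numeric DocIDs sorted ascending, newline-delimited JSON inside a block); it is not the encoding used in the demo, and it makes no attempt to reach the per-doc index sizes quoted above. All function names are hypothetical.

```python
import bisect
import json
from typing import Callable, List, Tuple


def build_blocks(
    docs: List[Tuple[int, dict]], docs_per_block: int
) -> Tuple[bytes, List[int], List[int]]:
    """Pack documents (sorted by id) into blocks and index only the block starts.

    Returns the blob, the first doc id of each block, and the block offsets
    (with one extra trailing entry marking the end of the blob).
    """
    blob = bytearray()
    first_ids: List[int] = []
    offsets: List[int] = []
    for start in range(0, len(docs), docs_per_block):
        block = docs[start:start + docs_per_block]
        first_ids.append(block[0][0])
        offsets.append(len(blob))
        for doc_id, doc in block:  # one JSON object per line inside a block
            blob += json.dumps({"id": doc_id, "doc": doc}).encode("utf-8") + b"\n"
    offsets.append(len(blob))
    return bytes(blob), first_ids, offsets


def lookup(
    doc_id: int,
    first_ids: List[int],
    offsets: List[int],
    read_range: Callable[[int, int], bytes],
) -> dict:
    """Find the block that could hold doc_id, fetch it whole, then scan it."""
    block_no = bisect.bisect_right(first_ids, doc_id) - 1
    if block_no < 0:
        raise KeyError(doc_id)
    start, end = offsets[block_no], offsets[block_no + 1]
    for line in read_range(start, end - start).splitlines():
        record = json.loads(line)
        if record["id"] == doc_id:
            return record["doc"]
    raise KeyError(doc_id)
```

The trade-off is one index entry per block instead of one per document, at the cost of reading and scanning a few extra documents on every lookup.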

You might want to check out Snowflake for something like this, it makes searching pretty easy, especially as it seems your data is semi-static? We use it pretty extensively at work and it's great.

For your use case it'll be very cheap if you don't access it constantly (you can probably get away with the extra small instances, for which you are billed per minute).

Not affiliated in any way, just a suggestion.

Also check out Dremio with parquet files stored on S3
This is the kind of thing I value in Rails. Active Storage [1] has been around for a few years and it solves all of this. All the metadata you care about is in the database: content type, file size, image dimensions, creation date, storage path.

[1] https://guides.rubyonrails.org/active_storage_overview.html