| HN Mirror

by ddorian43 1866 days ago

> that I need to search fairly infrequently (but sometimes in bulk).

What do you mean by search? Full-text search? Do you need to run custom code on the original data?

> A solution we came up with was a small, hot, in-memory index that points to the location of the data in a file on S3.

Yes, it's like keeping the block index of an SSTable (in RocksDB) in memory. The next step is to have a local cache on the EC2 node. The step after that is to have a "distributed" cache across your EC2 nodes, so you don't query S3 for a chunk if it's present on any of your other nodes.
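A rough sketch of that layering, in case it helps: an in-memory index maps each document ID to an (object key, offset, length) triple, and a read checks a local cache before falling back to a ranged read against the object store. Every name here (`Location`, `TieredReader`, `fetch_range`, the `chunk-0001` key) is made up for illustration; `fetch_range` stands in for whatever performs the ranged read on S3, and a plain dict stands in for the on-disk cache.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Location:
    key: str      # object key in the bucket (hypothetical layout)
    offset: int   # byte offset of the document inside that object
    length: int   # document length in bytes


class TieredReader:
    """Look up in the in-memory index, then local cache, then remote store."""

    def __init__(self, index: Dict[str, Location],
                 fetch_range: Callable[[str, int, int], bytes]):
        self.index = index                 # small, hot, kept in memory
        self.fetch_range = fetch_range     # assumed to do a ranged read on S3
        self.cache: Dict[str, bytes] = {}  # stand-in for a local disk cache
        self.remote_reads = 0              # counts trips to the remote store

    def get(self, doc_id: str) -> bytes:
        if doc_id in self.cache:
            return self.cache[doc_id]
        loc = self.index[doc_id]           # raises KeyError for unknown IDs
        data = self.fetch_range(loc.key, loc.offset, loc.length)
        self.remote_reads += 1
        self.cache[doc_id] = data
        return data


# Fake "bucket" so the sketch runs without any network access.
BUCKET = {"chunk-0001": b"hello worldsecond doc"}


def fake_fetch(key: str, offset: int, length: int) -> bytes:
    return BUCKET[key][offset:offset + length]


reader = TieredReader(
    {"a": Location("chunk-0001", 0, 11), "b": Location("chunk-0001", 11, 10)},
    fake_fetch,
)
```

With boto3, `fetch_range` would presumably wrap `get_object` with a `Range` value such as `bytes=0-10`. The cache here never evicts anything, which a real one would need to do.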

Come to think of it, I searched and didn't find a "distributed disk cache with optional replication" that can be used in front of S3 or any other dataset. You can use nginx/Varnish as a reverse proxy, but that doesn't give you the "distributed" part. There is Alluxio, but it's single-master.
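For what the "distributed" part with optional replication might look like at its simplest, here is a sketch using rendezvous (highest-random-weight) hashing to decide which node should hold a given chunk. This is my own choice of technique for illustration, not something any of the tools above are known to do, and the node names and `owners` helper are hypothetical.

```python
import hashlib
from typing import List


def owners(chunk_key: str, nodes: List[str], replicas: int = 1) -> List[str]:
    """Rank nodes by hash(node, chunk) and return the top `replicas` nodes."""
    def score(node: str) -> int:
        digest = hashlib.sha256(f"{node}|{chunk_key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(nodes, key=score, reverse=True)[:replicas]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical EC2 hostnames
primary = owners("chunk-0001", nodes)[0]
with_replica = owners("chunk-0001", nodes, replicas=2)
```

Every node computes the same answer without coordination, so a node that misses locally knows which peer to ask before going to S3. Removing a node only moves the chunks that node was holding.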

hungnv 1866 days ago

> Come to think of it, I searched and didn't find a "distributed disk cache with optional replication" that can be used in front of S3 or any other dataset. You can use nginx/Varnish as a reverse proxy, but that doesn't give you the "distributed" part. There is Alluxio, but it's single-master.

If you think about this some more, it amounts to a distributed key-value store that supports both disk and memory access. You could write one using an open-source Raft library, or a possible candidate is TiKV from PingCAP.

ddorian43 1866 days ago

> If you think about this some more, it amounts to a distributed key-value store that supports both disk and memory access. You could write one using an open-source Raft library, or a possible candidate is TiKV from PingCAP.

My whole point was not building it ;)

There's also https://github.com/NVIDIA/aistore

natpat 1866 days ago

> What do you mean by search?

"Search" may be too strong a word; "lookup" is probably more accurate. I have a couple of identifiers for each document, and I want to retrieve the full doc from any of them.

I'm not sure what you mean by running custom code on the data. I usually do some kind of transformation afterwards.

I didn't find anything either, which is why I was wondering if I was searching for the wrong thing.

ddorian43 1866 days ago

How big is each document? If the documents are big, keep each one as a separate file and store the IDs in a database. If they are small, you want something like https://github.com/rockset/rocksdb-cloud as a building block.
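For the big-document case, a minimal sketch of "store the IDs in a database": one table mapping each identifier to the file (object key) that holds the full document, so several identifiers can point at the same file. SQLite is used only so the example runs anywhere; the table name, the `register`/`locate` helpers, and the sample identifiers are all made up.

```python
import sqlite3

# In-memory database for the sketch; a real setup would use a file or a server DB.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE doc_ids ("
    " identifier TEXT PRIMARY KEY,"
    " object_key TEXT NOT NULL)"
)


def register(object_key, identifiers):
    """Point every identifier of a document at the file that stores it."""
    with db:  # commits on success, rolls back on error
        db.executemany(
            "INSERT INTO doc_ids (identifier, object_key) VALUES (?, ?)",
            [(ident, object_key) for ident in identifiers],
        )


def locate(identifier):
    """Return the object key for an identifier, or None if it is unknown."""
    row = db.execute(
        "SELECT object_key FROM doc_ids WHERE identifier = ?", (identifier,)
    ).fetchone()
    return row[0] if row else None


register("docs/0001.json", ["isbn:123", "internal:42"])
```

A lookup is then one indexed query followed by one GET of the whole file, with no offsets to track.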
