by ddorian43
1866 days ago
> that I need to search fairly infrequently (but sometimes in bulk).

What do you mean by search? Full-text search? Do you need to run custom code on the original data?

> A solution we came up with was a small, hot, in memory index, that points to the location of the data in a file on S3.

Yes, it's like keeping the block index of an SSTable (in RocksDB) in memory. The next step is to have a local cache on the EC2 node. The step after that is a "distributed" cache across your EC2 nodes, so you don't query S3 for a chunk if it's present on any of your other nodes.

Come to think of it, I searched and didn't find a "distributed disk cache with optional replication" that can be used in front of S3 or any other dataset. You can use nginx/Varnish as a reverse proxy, but that setup isn't distributed. There is Alluxio, but it's single-master.
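To make the first two steps concrete, here is a minimal sketch of an in-memory index that maps a record key to a byte range inside a stored object, with a small LRU cache in front of the fetch. Everything named here (`Location`, `IndexedBlobStore`, `fetch_range`) is hypothetical and only for illustration. The fetch function is injected, so in practice it could wrap a ranged S3 GET (for example boto3's `get_object` with a `Range` argument), but that wiring is an assumption and not shown.

```python
from collections import OrderedDict
from typing import Callable, Dict, NamedTuple, Optional


class Location(NamedTuple):
    """Where a record lives: which object, and which byte range inside it."""
    object_key: str
    offset: int
    length: int


class IndexedBlobStore:
    """Hypothetical sketch: hot in-memory index plus a local LRU cache."""

    def __init__(self, fetch_range: Callable[[str, int, int], bytes],
                 cache_entries: int = 1024) -> None:
        # fetch_range(object_key, offset, length) -> bytes is an assumption;
        # a real one might issue a ranged GET against S3.
        self._fetch_range = fetch_range
        self._index: Dict[str, Location] = {}
        self._cache: "OrderedDict[Location, bytes]" = OrderedDict()
        self._cache_entries = cache_entries

    def add(self, record_key: str, location: Location) -> None:
        """Register where a record is stored (index only, no data copied)."""
        self._index[record_key] = location

    def get(self, record_key: str) -> Optional[bytes]:
        """Return the record's bytes, or None if the key is not indexed."""
        loc = self._index.get(record_key)
        if loc is None:
            return None
        if loc in self._cache:
            # Cache hit: mark as most recently used and skip the remote fetch.
            self._cache.move_to_end(loc)
            return self._cache[loc]
        data = self._fetch_range(loc.object_key, loc.offset, loc.length)
        self._cache[loc] = data
        if len(self._cache) > self._cache_entries:
            # Evict the least recently used entry.
            self._cache.popitem(last=False)
        return data
```

The cache here is in memory only to keep the sketch short; a disk-backed cache would follow the same lookup order (index, then local cache, then S3).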
|
If you think about this more, it ends up looking like a distributed key-value store that supports both disk and memory access. You could write one using one of the open-source Raft libraries, or a possible candidate is TiKV from PingCAP.
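As a rough illustration of the "distributed" part only, the sketch below picks which cache nodes should hold a given chunk using rendezvous (highest-random-weight) hashing. This is a swapped-in technique for key placement, not Raft and not how TiKV works; it does not handle consensus, membership changes, or replication of writes. The node names and the `owners` function are hypothetical.

```python
import hashlib
from typing import List, Sequence


def _score(node: str, chunk_id: str) -> int:
    """Deterministic pseudo-random weight for a (node, chunk) pair."""
    digest = hashlib.sha256(f"{node}|{chunk_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owners(chunk_id: str, nodes: Sequence[str], replicas: int = 1) -> List[str]:
    """Return the `replicas` nodes with the highest score for this chunk.

    Every node computes the same answer from the same node list, so any
    node can work out which peer to ask before falling back to S3.
    """
    ranked = sorted(nodes, key=lambda n: _score(n, chunk_id), reverse=True)
    return ranked[:replicas]
```

A useful property of this scheme is that when a node disappears, only the chunks it owned move, and each one moves to the next node in its own ranking.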