| I thought about that a while back: how would I build an index on top of Durable Objects? For sorted indexes like a b-tree in a database, I think you would partition into objects by value, so (extremely naive example) values starting with a-m would be in one object, and n-z in a second. You'd end up needing a metadata object to track the partitions, and some reasonably complicated ability to grow and split partitions as you add more data, but this is a relatively mature and well-researched problem space; databases already do this for their indexes. For full text search, particularly if you want to combine terms, you might have to partition by document, though. So you'd have N durable objects which comprise the full text "index", each would contain 1/N of the documents you're indexing, and you'd build a full text index in each of those. If you searched for docs containing the words "elasticsearch" and "please", you would have to fan out to all the partitions and then aggregate the responses. You could go the other way and partition by value again, but that makes ANDs (for example) more challenging, since those would have to happen at response aggregation time in some way. You'd do the stemming at index time and at search time, like Solr does. I have no idea what the documents per partition would be; it would probably depend on the size of the documents, the number of documents, and how often you'll be searching them, since each durable object is single-threaded. Adding right truncation or left+right truncation will blow up the index size, so that would probably drive up the partition count. You might be better off doing trigrams or something like them at that point, but I'm not as familiar with those. This is where optimizing would be hard. I don't think you can get from Durable Objects the kind of detailed CPU/disk IO stats you really need to optimize this kind of search engine data structure. |
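To make the partition-by-document idea a bit more concrete, here is a rough sketch in plain TypeScript. It does not use any Cloudflare APIs: each `Partition` is just an in-memory stand-in for one Durable Object, and `Coordinator`, `partitionFor`, `searchAll`, and the toy `stem` function are all hypothetical names made up for illustration. The hash routing and the "strip a trailing s" stemmer are assumptions, not how any real engine does it.

```typescript
// Toy stemmer (assumption): lowercase and strip a trailing "s".
// A real system would use a proper stemmer, applied at index AND search time.
function stem(word: string): string {
  const w = word.toLowerCase();
  return w.length > 3 && w.endsWith("s") ? w.slice(0, -1) : w;
}

function tokenize(text: string): string[] {
  return (text.toLowerCase().match(/[a-z0-9]+/g) ?? []).map(stem);
}

// Stand-in for one Durable Object holding 1/N of the documents.
class Partition {
  // term -> ids of the documents in this partition that contain it
  private postings = new Map<string, Set<string>>();

  index(docId: string, text: string): void {
    for (const term of tokenize(text)) {
      let ids = this.postings.get(term);
      if (!ids) {
        ids = new Set<string>();
        this.postings.set(term, ids);
      }
      ids.add(docId);
    }
  }

  // AND query. It can be answered locally because each document
  // lives entirely inside one partition.
  async searchAll(terms: string[]): Promise<string[]> {
    if (terms.length === 0) return [];
    const lists = terms.map((t) => this.postings.get(stem(t)) ?? new Set<string>());
    lists.sort((a, b) => a.size - b.size); // start from the rarest term
    const [smallest, ...rest] = lists;
    return Array.from(smallest).filter((id) => rest.every((s) => s.has(id)));
  }
}

// Simple string hash used to route a document to a partition (assumption).
function partitionFor(docId: string, n: number): number {
  let h = 0;
  for (let i = 0; i < docId.length; i++) {
    h = (h * 31 + docId.charCodeAt(i)) >>> 0;
  }
  return h % n;
}

class Coordinator {
  private partitions: Partition[];

  constructor(n: number) {
    this.partitions = Array.from({ length: n }, () => new Partition());
  }

  index(docId: string, text: string): void {
    this.partitions[partitionFor(docId, this.partitions.length)].index(docId, text);
  }

  // Fan out to every partition, then aggregate the responses.
  async search(terms: string[]): Promise<string[]> {
    const perPartition = await Promise.all(
      this.partitions.map((p) => p.searchAll(terms)),
    );
    return ([] as string[]).concat(...perPartition).sort();
  }
}

const demo = new Coordinator(4);
demo.index("doc-1", "Please add Elasticsearch support");
demo.index("doc-2", "Elasticsearch is fast");
demo.index("doc-3", "Elasticsearch workers, please");
demo.search(["elasticsearch", "please"]).then((hits) => console.log(hits)); // logs doc-1 and doc-3
```

The point of the sketch is that the AND never crosses partitions; the coordinator only concatenates results. If you partitioned by term instead, `search` would have to intersect posting lists coming from different objects, which is the aggregation-time work mentioned above.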
You’d also need to manage the Lucene segments or the Solr/Elasticsearch shard metadata in Workers KV. You’d need a pool of Workers acting as coordination nodes, another pool acting as data nodes (shards), and a non-Workers pool creating and uploading Lucene segments to R2.
It shouldn’t be that hard to do, actually. Cloudflare would need to give customers more granular knobs to fine-tune R2 replication, so that the data is colocated with the Worker execution locations and reads are really fast.