by skafoi
1773 days ago
That sounds mostly like deduplication, which is often used in marketing contexts. There are indeed some good solutions out there, but in our experience they have difficulty handling huge amounts of data (>1 billion data sets), and they are often batch-based, so your data is always outdated, whereas we constantly add new data in near real time.
Batch mode naturally had orders of magnitude higher throughput. We did have a real-time single-record mode, which was pretty fast as long as the stream of incoming single records didn't saturate the worker array's capacity (this is the difference from serverless: the worker array was limited by whatever was statically configured at the moment, since adding or removing nodes wasn't an instant, on-the-fly operation).
A couple of years later I worked at another company on a similar, though somewhat simpler, project while it was in the middle of a total rewrite for performance reasons (the old version was really slow). That rewrite failed spectacularly for a lot of reasons. So, yes, performance is a pretty noticeable factor in this domain.