by jandrewrogers 1507 days ago
There is open source software that can do online indexing across space/time/entity and cross-domain entity resolution in real time on petabytes of telemetry data per day? The data never stops flowing and it isn't trivially partitionable in an analytically useful way -- graph-like joins figure prominently. It isn't a problem you can solve simply by throwing hardware at it.

Discerning customers have paid me to characterize the quality of this type of data from well-known brokers, and in my experience it is all varying degrees of rubbish from which almost no generalizable insights are possible. Even with data that is much higher quality than what you can get from app SDKs and adtech, ground-truthing experiments show that it requires very sophisticated analytics to build something resembling a generalizable model from which insights can be reliably extracted.

I know there is a lot of money in selling data like this, but it is essentially a scam, promising insights that aren't really derivable from the data provided. The data quality has also become much worse over time, for a variety of reasons, which is arguably a bigger limitation these days than the lack of a suitable platform.