HN Mirror


by jandrewrogers 1513 days ago

People severely underestimate the velocity and volume of data this would imply if you actually did it, never mind having to run analytics processes at the same scale alongside it. We are talking about bespoke state-of-the-art data infrastructure platforms. You can't support anything like this with open source software, not that it stops people from trying. Glorified data brokers are usually not bastions of world-class software engineering, and if they were, they wouldn't be in such a low-margin business. Most of these companies are just recycling the same low-quality and stale data sets.

The question I always ask when evaluating companies making these types of claims is "what hard computer science problems did you solve to make this possible?" If you've actually done what is claimed above, it will be an interesting list. In practice, this question usually elicits confusion.

There are legions of dubious companies making claims like this, which you can safely ignore. Their data quality is so poor that they would have difficulty violating most people's privacy even if they wanted to. The couple of orgs with the technical expertise to actually pull it off competently don't talk about it.

1 comments

throwaway-blaze 1513 days ago

Actually, off-the-shelf open source and commercial data processing and analysis software can be made to do this stuff, if you're willing to spend $$ on AWS or similar infrastructure. This kind of analysis is easily partitionable, and while you're right that it would be hard to do this for all 3bn phones all the time, it's relatively straightforward to identify hundreds or thousands of candidate devices and then do the needed analysis for just those devices across a huge data set.

(Built and sold a data broker. It's not a low-margin biz btw: we had >85% gross margins because of how cheap the raw source data is, and we were doing double-digit millions $$ in revenue.)

jandrewrogers 1513 days ago

There is open source software that can do online indexing across space/time/entity and cross-domain entity resolution in real time on petabytes of telemetry data per day? The data never stops flowing and it isn't trivially partitionable in an analytically useful way -- graph-like joins figure prominently. It isn't a problem you can solve simply by throwing hardware at it.

Discerning customers have paid me to characterize the quality of this type of data from well-known brokers, and in my experience it is all varying degrees of rubbish from which almost no generalizable insights are possible. Even with data that is much higher quality than what you can get from app SDKs and adtech, ground-truthing experiments show that it requires very sophisticated analytics to build something resembling a generalizable model from which insights can be reliably extracted.

I know there is a lot of money in selling data like this, but it is essentially a scam, promising insights that aren't really derivable from the data provided. The data quality has also become much worse over time, for a variety of reasons, which is arguably a bigger limitation these days than the lack of a suitable platform.