Hacker News new | ask | show | jobs
by kevin_nisbet 2535 days ago
I don't think it's necessarily impossible to calculate. Using probabilistic data structures arranged in a clever way, it's likely possible to calculate with some degree of accuracy.

I haven't thought this through, but take all the queries as they're made and create a bloom filter for every hour of searches. Depending when this process was started, an analytics group could then take a day of unique searches, and run them against this probabilistic history, and get a reasonable estimation with low error. Although the people who work on this sort of thing probably know it far better than I.

The real question though might be assuming the 15% is right, do we care about those 15%, are they typo's that don't merge, are they semantically different, are they bots search for dates or hashes, etc.