
by bzbz 907 days ago

For anyone who's wondering, their estimation method works like so:

1. Assume a range of values

2. Assume a fair probability function for sampling over the range of values

The estimated size is the hit rate (the fraction of sampled values that turn out to be valid) multiplied by the total size of the range.
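The two steps above can be sketched in a few lines of Python. This is only a toy illustration, not the article's actual code: the function name `estimate_population`, the 1,000,000-value ID space, and the 50,000 "valid" IDs are all made-up stand-ins, and a real check would be a lookup against the site rather than a set membership test.

```python
import random

def estimate_population(is_valid, space_size, num_samples, rng):
    """Estimate the number of valid IDs by sampling the ID space uniformly."""
    hits = 0
    for _ in range(num_samples):
        candidate = rng.randrange(space_size)  # uniform draw over the whole range
        if is_valid(candidate):
            hits += 1
    hit_rate = hits / num_samples
    return hit_rate * space_size

# Hypothetical toy space: 1,000,000 possible IDs, 50,000 of them valid.
rng = random.Random(0)
space_size = 1_000_000
valid_ids = set(rng.sample(range(space_size), 50_000))

estimate = estimate_population(valid_ids.__contains__, space_size, 100_000, rng)
print(round(estimate))  # should land close to the true count of 50,000
```

The more samples you draw, the tighter the estimate gets; the size of the ID space itself only enters as the final multiplier.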


Keyframe 907 days ago

I only skimmed the article, but if so, that's a lot of assumptions.

1. So let's say the assumed range of values is correct (10 characters of a specific range + 1). That would represent one big circle of possible area where videos might be.

2. The distribution of identifiers (valid videos) is everything. If YouTube applied some constraints (or skewing) to IDs that we don't know about, then the actual existing video IDs might be a small(er) circle within that bigger circle of possibilities and not equally dispersed throughout, or there might be clumping or whatever... So you'd need to sample the space by throwing darts in a way that reveals a silhouette of their skew, or shows whether it's random-ish, using, I don't know, let's say a Poisson distribution.

Only then one could estimate the size. So is this what they're doing?

Also, has anyone bothered to, you know, ask YouTube?


cbolton 907 days ago

No, the distribution doesn't matter at all. I've given an extreme example here: https://news.ycombinator.com/item?id=38742735
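To see why, here is a small simulation. It is a sketch of one extreme case and not necessarily the example in the linked comment; the names and sizes are invented for illustration. The valid IDs are either scattered at random or packed into a single contiguous block, and in both cases the darts are thrown uniformly over the whole space.

```python
import random

def estimate_count(is_valid, space_size, num_samples, rng):
    """Hit rate under uniform sampling, scaled up to the full ID space."""
    hits = sum(is_valid(rng.randrange(space_size)) for _ in range(num_samples))
    return hits / num_samples * space_size

# Hypothetical toy numbers, chosen only for illustration.
space_size = 1_000_000
true_count = 50_000
rng = random.Random(1)

# Case A: valid IDs scattered at random across the space.
scattered = set(rng.sample(range(space_size), true_count))
# Case B: valid IDs clumped into one contiguous block at the start of the space.
clumped = range(true_count)

est_scattered = estimate_count(lambda x: x in scattered, space_size, 100_000, rng)
est_clumped = estimate_count(lambda x: x in clumped, space_size, 100_000, rng)
print(round(est_scattered), round(est_clumped))  # both should be near 50,000
```

Each uniform dart hits a valid ID with probability `true_count / space_size` no matter where the valid IDs sit, so what has to be uniform is the sampling, not the IDs.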


Keyframe 906 days ago

I see what you did there. So basically the overlapped proportion (or hit proportion) would be the number of overlapping hits divided by the number of samples run, and the estimated total would be this proportion multiplied by the total space of possibilities. That would work.


remus 907 days ago

Video IDs are generated by hashing a secret identifier, so they should be uniformly distributed.
