by vdm 5058 days ago

Dup of Archify?

https://www.archify.com/

> 40% of searches online are people simply looking for what they have already seen before.

Citation link needed.

1 comments

vinnyglennon 5058 days ago

Citation link: http://cond.org/sigir07.pdf [PDF]

Information Re-Retrieval: Repeat Queries in Yahoo’s Logs

Abstract: "This paper explores repeat search behavior through the analysis of a one-year Web query log of 114 anonymous users and a separate controlled survey of an additional 119 volunteers. Our study demonstrates that as many as 40% of all queries are re-finding queries. Re-finding appears to be an important behavior for search engines to explicitly support, and we explore how this can be done."

lifeisstillgood 5058 days ago

Wow, do 240 people even count as a sample? At Yahoo and Google log sizes, it's probably the error from cosmic rays in the data center.

freshhawk 5057 days ago

If they selected them in a properly random way and had an effect close to 40% then yes, that probably does count as a sample.

lifeisstillgood 5057 days ago

As someone who signed up for Coursera stats 101, err... why 40%?

freshhawk 5057 days ago

I am absolutely making some assumptions here, but because 40% is a large effect, you don't need as many samples to be confident.

The other way of looking at it is that maybe it's actually 35% or 45% but either way, that's still interesting, even with a rougher approximation of the actual "answer". If, for some reason, you needed to know if it was 40% or 40.01% because that mattered to you then you would absolutely be annoyed at the small sample size.

If the finding was 2%, then we would care about an uncertainty of +/- 5%, since the finding is dwarfed by the margin of error. That's a smaller effect size, so you would need more samples to separate reality from the noise.

I am, by the way, pulling all of these numbers out of my ass. Your stats 101 class will teach you the formulas to calculate the actual error bars at work here, as well as the assumptions you need to make about the distribution of the data to use those formulas.
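
For anyone who wants rough numbers instead, here is a minimal sketch of the usual stats 101 calculation. It makes strong assumptions that the paper may not satisfy: it treats the sample as 240 independent yes/no observations drawn at random (the paper's 40% is a share of queries, and queries from the same user are correlated), and it uses the normal approximation for a proportion. The function names `margin_of_error` and `required_n` are made up for illustration.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation (Wald) interval for a proportion.

    p: observed proportion, n: sample size, z: critical value (1.96 ~ 95%).
    Assumes n independent observations from a simple random sample.
    """
    return z * math.sqrt(p * (1 - p) / n)


def required_n(p, margin, z=1.96):
    """Smallest n whose normal-approximation margin is at most `margin`."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


if __name__ == "__main__":
    n = 240  # the sample size quoted in the thread, taken as an assumption
    for p in (0.40, 0.02):
        moe = margin_of_error(p, n)
        # Report the margin in percentage points and relative to the estimate.
        print(f"p={p:.2f}, n={n}: +/- {moe * 100:.1f} points "
              f"({moe / p:.0%} of the estimate)")
```

Under those assumptions, a 40% finding from 240 observations comes with a margin of about 6 percentage points, so "somewhere between the mid 30s and mid 40s" is a fair reading. A 2% finding from the same sample has a smaller absolute margin (under 2 points), but that margin is nearly as large as the estimate itself, which is why the small effect needs far more samples to pin down.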
