by dredmorbius 1089 days ago

So, some further thoughts on your methodology:

- It's comprehensive. That's ... admirable, but not necessarily efficient in data analysis. There's a lot to be said for both random sampling and inference.

- You might get more mileage by looking at the top-n stories of a given day. I'd suggest 3--5 items. There's a considerable fall-off in activity from storypos 1 to storypos 30 (1st to 30th items on the front page archive), which is one of the dimensions I've looked at.

- The thought that's occurred to me over the past few days is that this seems like a natural area in which LLM / GPT techniques might be used to classify posts given training data.

- Tuple and ngram analysis can also turn up interesting patterns. Here it's useful to have a base corpus from which universal tendencies can be inferred, and to look for statistically improbable terms: those which stand out in the HN subject corpus relative to the universal corpus (terms and phrases which HN finds significant), as well as changing trends over time within the HN corpus.

- Day-of-week and month-of-year analysis can also show interesting patterns, and I've looked at a bit of the first. I'd really like to know if there's an HN "September" (on an annual basis).

- I took a look at your data and ... spreadsheets. Maybe I'm old-school, but flatfiles and gawk are really my style.
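
To make the ngram point above a bit more concrete, here is a rough Python sketch of one way to surface "statistically improbable" bigrams: count bigrams in an HN title corpus and in a base corpus, then rank by a smoothed log-ratio of their relative frequencies. This is only an illustration under assumptions, not the method used in either analysis: the function names, the add-one smoothing, the `min_count` cutoff, and the sample titles are all hypothetical.

```python
import math
from collections import Counter


def bigrams(title):
    """Split a title into lowercase words and return adjacent word pairs."""
    words = title.lower().split()
    return list(zip(words, words[1:]))


def overrepresented(subject_titles, base_titles, min_count=2, smoothing=1.0):
    """Rank bigrams by how much more frequent they are in the subject
    corpus than in the base corpus (smoothed log-ratio, highest first).

    min_count and smoothing are illustrative defaults, not tuned values.
    """
    subj = Counter(bg for t in subject_titles for bg in bigrams(t))
    base = Counter(bg for t in base_titles for bg in bigrams(t))
    subj_total = sum(subj.values())
    base_total = sum(base.values())
    vocab = len(set(subj) | set(base))

    scores = {}
    for bg, n in subj.items():
        if n < min_count:
            continue  # skip rare bigrams, whose ratios are mostly noise
        p_subj = (n + smoothing) / (subj_total + smoothing * vocab)
        p_base = (base[bg] + smoothing) / (base_total + smoothing * vocab)
        scores[bg] = math.log(p_subj / p_base)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up sample titles, purely for illustration.
hn_titles = [
    "show hn my rust parser",
    "show hn a rust compiler",
    "ask hn is rust worth it",
    "the rust compiler is fast",
]
base_titles = [
    "the weather is nice",
    "the market is down",
    "it is worth the wait",
    "a parser for the weather",
]

ranked = overrepresented(hn_titles, base_titles)
for bigram, score in ranked:
    print(" ".join(bigram), round(score, 3))
```

The same counts, bucketed by year or month before scoring, would give a crude view of trends over time within the HN corpus. A real run would want a much larger base corpus and probably a proper significance test rather than a raw log-ratio.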