- It's comprehensive. That's ... admirable, but not necessarily efficient in data analysis. There's a lot to be said for both random sampling and inference.
- You might get more mileage by looking at the top-n stories of a given day. I'd suggest 3--5 items. There's a considerable fall-off in activity from storypos 1 to storypos 30 (1st to 30th items on the front page archive), which is one of the dimensions I've looked at.
- The thought that's occurred to me over the past few days is that this seems like a natural area in which LLM / GPT techniques might be used to classify posts given training data.
- Tuple and n-gram analysis can also turn up interesting patterns. Here it's useful to have a base corpus from which universal tendencies can be inferred, and to look at statistically improbable terms, those which occur far more often in the HN subject corpus than in the universal corpus (terms and phrases which HN finds significant), as well as changing trends over time within the HN corpus.
- Day-of-week and month-of-year analysis can also show interesting patterns, and I've looked at a bit of the first. I'd really like to know if there's an HN "September" (on an annual basis).
- I took a look at your data and ... spreadsheets. Maybe I'm old-school, but flatfiles and gawk are really my style.
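A minimal sketch of the base-corpus comparison from the n-gram item above, in Python rather than gawk. It scores each n-gram by the log of its smoothed relative frequency in one set of titles against another. The function names, the tokenizer regex, the add-one smoothing, and the toy titles are all assumptions made for illustration, not part of the analysis described here.

```python
import math
import re
from collections import Counter


def ngrams(text, n):
    """Return the lowercase word n-grams of a title as space-joined strings."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_ngrams(titles, n=1):
    """Count n-grams across an iterable of titles."""
    counts = Counter()
    for title in titles:
        counts.update(ngrams(title, n))
    return counts


def log_ratio(subject, base, smoothing=1.0):
    """Score subject-corpus terms by log2(relative freq in subject / relative freq in base).

    Additive smoothing keeps terms that are absent from the base corpus finite.
    """
    vocab_size = len(set(subject) | set(base))
    subject_total = sum(subject.values()) + smoothing * vocab_size
    base_total = sum(base.values()) + smoothing * vocab_size
    scores = {}
    for term, count in subject.items():
        p_subject = (count + smoothing) / subject_total
        p_base = (base.get(term, 0) + smoothing) / base_total
        scores[term] = math.log2(p_subject / p_base)
    return scores


# Toy data, invented for illustration only.
hn_titles = ["Show HN: a Rust parser", "Rust is fast", "Weather is nice"]
base_titles = ["Weather is nice today", "The weather is bad", "Dinner is ready"]
scores = log_ratio(count_ngrams(hn_titles), count_ngrams(base_titles))
top_terms = sorted(scores, key=scores.get, reverse=True)[:3]
```

Positive scores mark terms over-represented in the subject corpus. The same function could compare one year of HN titles against another to look at trends within the HN corpus.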
The top 100 domains appear at least 138 times each.
149 domains appear >= 100 times.
The top 500 domains appear >= 35x each. (Number 500 is a personal fave, lowtechmagazine.com).
The top 1,000 sites, >= 17x each.
14,676 sites appear more than once.
37,966 sites appear only once.
25% of FP stories come from 31 sites appearing 400+ times each.
50% of FP stories come from 331 sites appearing 51+ times each.
75%: 2,521 sites, 7+ times.
90%: 7,749 sites, 3+ times.
95%: 11,173 sites, 2+ times.
99%: 13,992 sites, 2+ times.
Pick the degree of completeness you want (your 5% "misc" would require classifying slightly more than 11,000 sites).
I'd probably aim for 50--75% coverage.
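A rough sketch of how a coverage table like the one above could be computed, assuming the input is one site name per front-page story. The function name `coverage_table` and the toy site names are hypothetical. The figures above came from a different toolchain.

```python
from collections import Counter


def coverage_table(site_counts, targets=(0.25, 0.50, 0.75, 0.90, 0.95, 0.99)):
    """Return (target, sites_needed, smallest_count) rows.

    Sites are walked from most to least frequent; a row is emitted when the
    running story total first reaches each target share of all stories.
    """
    ranked = sorted(site_counts.values(), reverse=True)
    total = sum(ranked)
    pending = sorted(targets)
    rows = []
    running = 0
    for rank, count in enumerate(ranked, start=1):
        running += count
        # One site can satisfy several targets at once, hence the loop.
        while pending and running >= pending[0] * total:
            rows.append((pending.pop(0), rank, count))
    return rows


# Toy input: one site name per front-page story (hypothetical data).
sites = (
    ["a.example"] * 50
    + ["b.example"] * 26
    + ["c.example"] * 15
    + ["d.example"] * 5
    + ["e.example"] * 4
)
rows = coverage_table(Counter(sites), targets=(0.50, 0.75, 0.90))
```

Each row reads the same way as the lines above: the share of stories covered, the number of sites needed, and the minimum appearance count among those sites.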
OK, while writing this, I've classified about 10,200 (of 52,642) domains. (Most of the first 300 manually, a bunch of the rest based on regexes, e.g., .edu, .gov, blogspot, medium.com, substack.com domains, etc.)
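A minimal sketch of that two-stage approach: a manual lookup first, then regex rules. The patterns follow the examples named above, but which label each pattern maps to is my assumption, and the `RULES` table, the `classify` function, and the `manual` dict are hypothetical names.

```python
import re

# (pattern, label) pairs, tried in order. The label choices are assumptions.
RULES = [
    (re.compile(r"\.edu$"), "academic / science"),
    (re.compile(r"\.gov$"), "government"),
    (re.compile(r"(^|\.)blogspot\.com$"), "blog"),
    (re.compile(r"(^|\.)medium\.com$"), "blog"),
    (re.compile(r"(^|\.)substack\.com$"), "blog"),
]


def classify(domain, manual=None):
    """Return a label for a domain, or None if nothing matches.

    Manual assignments take precedence over the regex rules.
    """
    domain = domain.lower()
    if manual and domain in manual:
        return manual[domain]
    for pattern, label in RULES:
        if pattern.search(domain):
            return label
    return None


# Hypothetical manual entry, for illustration only.
manual = {"news.example": "general news"}
labels = {d: classify(d, manual) for d in
          ["cs.stanford.edu", "foo.blogspot.com", "news.example", "other.example"]}
```

Anchoring the patterns with `(^|\.)` and `$` avoids false matches on look-alike names such as `notmedium.com`. Anything left as `None` stays in the unclassified pile.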
By site:
1 7621 software
2 1710 blog
3 535 academic / science
4 123 government
5 41 general news
6 34 ???
7 31 corporate comm.
8 30 tech news
9 15 general interest
10 10 business news
11 8 law
12 6 technology
13 4 social media
14 3 corporate comm
15 3 general magazine
16 2 general information
17 2 science news
18 2 tech discussion
19 2 video
20 1 business education
21 1 corporate comm.
22 1 corporate commm.
23 1 general discussion
24 1 health news
25 1 images
26 1 law
27 1 legal news
28 1 misc
29 1 n/a
30 1 podcast
31 1 tech blog
32 1 tech law
33 1 tech publications
34 1 technology / security
35 1 translation
36 1 videos
37 1 webcomic
Unclassified: 42442
By story count ...
1 13782 general news
2 13398 software
3 10473 tech news
4 8677 blog
5 7651 academic / science
6 7294 n/a
7 4750 ???
8 4600 business news
9 3546 corporate comm.
10 1504 general magazine
11 1291 general information
12 1162 general interest
13 1132 technology
14 1099 videos
15 1073 social media
16 975 government
17 568 corporate comm
18 559 tech discussion
19 505 tech law
20 251 tech publications
21 171 tech blog
22 170 science news
23 136 business education
24 104 corporate comm.
25 103 video
26 99 corporate commm.
27 96 general discussion
28 80 misc
29 71 technology / security
30 61 law
31 59 webcomic
32 49 translation
33 48 health news
34 47 images
35 46 podcast
36 32 law
37 7 legal news
Unclassified: 93213
Having played with classifying sites for much of the past day, I've assigned a classification to just under 30% of them, which classifies just under 64% of all posts.
The remaining unclassified sites average about 1.7 posts each (there are a few with as many as 20 posts), so there are minimal gains from additional classification.
I'm starting now with running an analysis over the full archive to come up with trends-by-classification over years.
The top-20 classifications (by story) are:
1 64777 36.21% UNCLASSIFIED
2 22481 12.57% blog
3 15106 8.44% general news
4 13769 7.70% tech news
5 12709 7.10% programming
6 8459 4.73% academic / science
7 8200 4.58% corporate comm.
8 7294 4.08% n/a
9 5311 2.97% business news
10 3798 2.12% general interest
11 2151 1.20% social media
12 2048 1.14% software
13 1613 0.90% technology
14 1432 0.80% video
15 1144 0.64% general information (wiki)
16 1006 0.56% government
17 724 0.40% misc documents
18 720 0.40% law
19 702 0.39% tech discussion
20 620 0.35% science news
I've got a total of 60 classifications which ... seems a bit high, and I'm looking at ways of slimming that down. It's also a bit confused, as some are classified by topic ("programming", "networking", "database", "cryptocurrency", "crowdfunding"), some by source ("corporate comm." is any post that originates from an identifiable company communicating as that company), and some by general format ("blog" includes 5,306 sites, and spans a wide range of topics). The distinction between, say, "tech news" and "blog" is somewhat ambiguous, and there are a few blogs which should be classified as "corporate comm.". But in all there's a rough sense of what types of content are being posted, and I'd really like to see the change over time.
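A rough sketch of the trends-by-classification pass, assuming each story is a (date, site) pair and the site labels live in a dict. The record layout, the `yearly_shares` name, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def yearly_shares(stories, site_class, default="UNCLASSIFIED"):
    """Return {year: {classification: share of that year's stories}}.

    stories: iterable of (iso_date, site) pairs, e.g. ("2015-03-01", "example.com").
    site_class: dict mapping site -> classification label.
    """
    counts = defaultdict(Counter)
    for iso_date, site in stories:
        year = int(iso_date[:4])
        counts[year][site_class.get(site, default)] += 1
    shares = {}
    for year, by_label in sorted(counts.items()):
        total = sum(by_label.values())
        shares[year] = {label: n / total for label, n in by_label.items()}
    return shares


# Hypothetical records and labels, for illustration only.
stories = [
    ("2010-01-05", "blog.example"),
    ("2010-06-11", "news.example"),
    ("2020-02-02", "news.example"),
    ("2020-02-03", "news.example"),
    ("2020-09-09", "blog.example"),
    ("2020-12-25", "unknown.example"),
]
site_class = {"blog.example": "blog", "news.example": "general news"}
shares = yearly_shares(stories, site_class)
```

Working in per-year shares rather than raw counts keeps years with different front-page volumes comparable.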
For those interested in the Ongoing Saga of HN Front Page Analytics, I've been posting occasional updates to the above site-based classification (~60% of posts now classified) to the Fediverse: <https://toot.cat/@dredmorbius/tagged/HackerNewsAnalytics>
(It's a bit much to dump massive tables to HN; I'm trying to keep that to a bearable minimum.)