by itunpredictable 1090 days ago
I should mention that I clicked on every single link to see the contents before classifying it, which is part of what made this so tedious.
2 comments

So, some further thoughts on your methodology:

- It's comprehensive. That's ... admirable, but not necessarily efficient in data analysis. There's a lot to be said for both random sampling and inference.

- You might get more mileage by looking at the top-n stories of a given day. I'd suggest 3--5 items. There's a considerable fall-off in activity from storypos 1 to storypos 30 (1st to 30th items on the front page archive), which is one of the dimensions I've looked at.

- The thought that's occurred to me over the past few days is that this seems like a natural area in which LLM / GPT techniques might be used to classify posts given training data.

- Tuple and ngram analysis can also turn up interesting patterns. Here it's useful to have a base corpus from which universal tendencies can be inferred, and to look at statistically improbable terms which occur more often in the HN subject corpus than in the universal corpus (terms and phrases which HN finds significant), as well as changing trends over time within the HN corpus.

- Day-of-week and month-of-year analysis can also show interesting patterns, and I've looked at a bit of the first. I'd really like to know if there's an HN "September" (on an annual basis).

- I took a look at your data and ... spreadsheets. Maybe I'm old-school, but flatfiles and gawk are really my style.
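
A rough sketch of the tuple/ngram comparison suggested in the list above. This is an illustration only, not code anyone in this thread ran: the two tiny title lists and the names `bigrams` and `overrepresented` are made up, and the add-one smoothed ratio is just one simple way to score terms against a base corpus.

```python
from collections import Counter
import re

def bigrams(title):
    """Lowercase a title and return its adjacent word pairs."""
    words = re.findall(r"[a-z0-9']+", title.lower())
    return list(zip(words, words[1:]))

def overrepresented(subject_titles, base_titles, top=5):
    """Rank bigrams by how much more frequent they are in the subject
    corpus than in the base corpus (add-one smoothed frequency ratio)."""
    subj = Counter(bg for t in subject_titles for bg in bigrams(t))
    base = Counter(bg for t in base_titles for bg in bigrams(t))
    subj_total = sum(subj.values())
    base_total = sum(base.values())
    vocab = len(set(subj) | set(base))
    scores = {}
    for bg, n in subj.items():
        p_subj = (n + 1) / (subj_total + vocab)
        p_base = (base[bg] + 1) / (base_total + vocab)
        scores[bg] = p_subj / p_base
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Hypothetical sample data, for illustration only.
hn_titles = [
    "Show HN: A tiny key-value store written in Rust",
    "Written in Rust: a new terminal emulator",
    "Ask HN: What are you working on?",
]
base_titles = [
    "What are you doing this weekend?",
    "A new store opens downtown",
    "Local team wins in overtime",
]

top_terms = overrepresented(hn_titles, base_titles, top=3)
print(top_terms)
```

With a real base corpus the same ratio could be computed per year to look at changing trends within the HN corpus.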

There's a thin line between dedication and mania.

I'd probably manually classify domains by topic.

The top 100 domains appear at least 138 times each.

149 domains appear >= 100 times.

The top 500 domains appear >= 35x each. (Number 500 is a personal fave, lowtechmagazine.com).

The top 1,000 sites, >= 17x each.

14,676 sites appear more than once.

37,966 sites appear only once.

25% of FP stories come from 31 sites appearing 400+ times each.

50% of FP stories come from 331 sites appearing 51+ times each.

75%: 2,521 sites, 7+ times.

90%: 7,749 sites, 3+ times.

95%: 11,173 sites, 2+ times.

99%: 13,992 sites, 2+ times.

Pick the degree of completeness you want (your 5% "misc" would require classifying slightly more than 11,000 sites).

I'd probably aim for 50--75% coverage.
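
For anyone wanting to reproduce that kind of coverage breakdown, here is a minimal sketch of the cumulative calculation. It is not the code used for the figures above; the function name `coverage_table` and the sample counts are invented for illustration, and the input is assumed to be a mapping of site to number of front-page appearances.

```python
def coverage_table(site_counts, targets=(0.25, 0.50, 0.75, 0.90)):
    """For each target share of stories, return a tuple of
    (number of top sites needed, minimum appearances among those sites)."""
    counts = sorted(site_counts.values(), reverse=True)
    total = sum(counts)
    remaining = sorted(targets)
    results = {}
    running = 0
    for rank, n in enumerate(counts, start=1):
        running += n
        # A single site may satisfy several targets at once.
        while remaining and running >= remaining[0] * total:
            results[remaining.pop(0)] = (rank, n)
    return results

# Hypothetical appearance counts, for illustration only.
site_counts = {
    "a.example": 40, "b.example": 20, "c.example": 15, "d.example": 10,
    "e.example": 8, "f.example": 4, "g.example": 2, "h.example": 1,
}

table = coverage_table(site_counts)
for share, (sites, min_count) in table.items():
    print(f"{share:.0%}: {sites} sites, {min_count}+ times")
```

Sorting once and walking the running total keeps this to a single pass, which matters little at 50-odd thousand domains but keeps the logic easy to check by hand.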

OK, while writing this, I've classified about 10,200 (of 52,642) domains. (Most of the first 300 manually, a bunch of the rest based on regexes, e.g., .edu, .gov, blogspot, medium.com, substack.com domains, etc.)
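
A sketch of what the manual-plus-regex approach might look like. The actual rules aren't shown in this comment, so everything here is an assumption: the `MANUAL` and `RULES` tables, the `classify` function, and in particular which label each pattern maps to are guesses made for illustration.

```python
import re

# Hypothetical hand-assigned labels; manual entries take priority.
MANUAL = {
    "nytimes.com": "general news",
    "github.com": "software",
}

# Hypothetical regex rules as (pattern, label); the first match wins.
# The label chosen for each pattern is an assumption, not the author's mapping.
RULES = [
    (re.compile(r"\.edu$"), "academic / science"),
    (re.compile(r"\.gov$"), "government"),
    (re.compile(r"(^|\.)(blogspot\.com|medium\.com|substack\.com)$"), "blog"),
]

def classify(domain):
    """Return a label for the domain, or None if nothing matches."""
    domain = domain.strip().lower()
    if domain in MANUAL:
        return MANUAL[domain]
    for pattern, label in RULES:
        if pattern.search(domain):
            return label
    return None

for d in ["cs.stanford.edu", "example.substack.com", "nytimes.com", "unknown.example"]:
    print(d, "->", classify(d))
```

Anchoring the patterns at the end of the domain avoids matching something like "medium.com.example.net" by accident.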

By site:

     1   7621  software
     2   1710  blog
     3    535  academic / science
     4    123  government
     5     41  general news
     6     34  ???
     7     31  corporate comm.
     8     30  tech news
     9     15  general interest
    10     10  business news
    11      8  law
    12      6  technology
    13      4  social media
    14      3  corporate comm
    15      3  general magazine
    16      2  general information
    17      2  science news
    18      2  tech discussion
    19      2  video
    20      1  business education
    21      1  corporate comm. 
    22      1  corporate commm.
    23      1  general discussion
    24      1  health news
    25      1  images
    26      1  law 
    27      1  legal news
    28      1  misc
    29      1  n/a
    30      1  podcast
    31      1  tech blog
    32      1  tech law
    33      1  tech publications
    34      1  technology / security
    35      1  translation
    36      1  videos
    37      1  webcomic
  
  Unclassified: 42442

By story count ...

     1  13782  general news
     2  13398  software
     3  10473  tech news
     4   8677  blog
     5   7651  academic / science
     6   7294  n/a
     7   4750  ???
     8   4600  business news
     9   3546  corporate comm.
    10   1504  general magazine
    11   1291  general information
    12   1162  general interest
    13   1132  technology
    14   1099  videos
    15   1073  social media
    16    975  government
    17    568  corporate comm
    18    559  tech discussion
    19    505  tech law
    20    251  tech publications
    21    171  tech blog
    22    170  science news
    23    136  business education
    24    104  corporate comm. 
    25    103  video
    26     99  corporate commm.
    27     96  general discussion
    28     80  misc
    29     71  technology / security
    30     61  law 
    31     59  webcomic
    32     49  translation
    33     48  health news
    34     47  images
    35     46  podcast
    36     32  law
    37      7  legal news
  
  Unclassified: 93213

'???' indicates I couldn't (quickly) assess a domain. Examples: 37signals.com, readwriteweb.com, thenextweb.com, archive.org, anandtech.com, avc.com, docs.google.com, righto.com, slideshare.net, infoq.com, hackaday.com, gamasutra.com, marco.org, smashingmagazine.com, highscalability.com, catonmat.net, centernetworks.com, jvns.ca, scribd.com, about.gitlab.com, cloud.google.com, alleyinsider.com, msn.com, firstround.com, axios.com, openculture.com, onstartups.com, ejohn.org, dadgum.com, shkspr.mobi, mixergy.com, geek.com, gmane.org, foundread.com.

Note that I'm classifying by site rather than story, so an NY Times item on, say, quantum computing, would fall under "general news".
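
To make the by-site versus by-story distinction concrete, here is a small sketch of how the two tallies could be produced from the same inputs. It is illustrative only: `tally_by_site_and_story` and the sample data are hypothetical, and the input is assumed to be one domain per front-page story plus a domain-to-label mapping.

```python
from collections import Counter

def tally_by_site_and_story(story_domains, site_class):
    """Return (sites per label, stories per label).

    story_domains: one domain per front-page story (repeats expected).
    site_class: dict of domain -> label; missing domains count as UNCLASSIFIED.
    """
    by_site = Counter()
    by_story = Counter()
    for domain, n_stories in Counter(story_domains).items():
        label = site_class.get(domain, "UNCLASSIFIED")
        by_site[label] += 1           # each domain counted once
        by_story[label] += n_stories  # each story inherits its site's label
    return by_site, by_story

# Hypothetical sample data, for illustration only.
story_domains = [
    "nytimes.com", "nytimes.com", "nytimes.com",
    "github.com", "github.com",
    "example.org",
]
site_class = {"nytimes.com": "general news", "github.com": "software"}

by_site, by_story = tally_by_site_and_story(story_domains, site_class)
print("By site: ", by_site.most_common())
print("By story:", by_story.most_common())
```

Every story from a site gets that site's label regardless of its topic, which is exactly why a quantum computing piece in a general newspaper would land under "general news".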

Also, this is very quick ad hoc code, so there are assuredly errors (and I've already fixed a few in stealth edits to this comment).

Having played with classifying sites for much of the past day, I've assigned a classification to just under 30% of them, which classifies just under 64% of all posts.

The remaining unclassified sites average about 1.7 posts each (there are a few with as many as 20 posts), so there are minimal gains from additional classification.

I'm starting now with running an analysis over the full archive to come up with trends-by-classification over years.
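
One way the trends-by-classification pass might be structured, as a hedged sketch rather than the actual analysis: it assumes the archive can be reduced to (year, domain) pairs, and the name `yearly_shares` and the sample records are invented for the example.

```python
from collections import Counter, defaultdict

def yearly_shares(stories, site_class):
    """stories: iterable of (year, domain) pairs.
    Return {year: {label: percent of that year's stories}}."""
    per_year = defaultdict(Counter)
    for year, domain in stories:
        per_year[year][site_class.get(domain, "UNCLASSIFIED")] += 1
    shares = {}
    for year, tally in sorted(per_year.items()):
        total = sum(tally.values())
        shares[year] = {
            label: round(100 * n / total, 2) for label, n in tally.items()
        }
    return shares

# Hypothetical sample records, for illustration only.
stories = [
    (2010, "blog.example"), (2010, "blog.example"),
    (2010, "nytimes.com"), (2010, "unknown.example"),
    (2020, "nytimes.com"), (2020, "nytimes.com"),
]
site_class = {"blog.example": "blog", "nytimes.com": "general news"}

trend = yearly_shares(stories, site_class)
for year, row in trend.items():
    print(year, row)
```

Reporting shares rather than raw counts keeps years with different front-page volumes comparable.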

The top-20 classifications (by story) are:

     1  64777  36.21%  UNCLASSIFIED
     2  22481  12.57%  blog
     3  15106   8.44%  general news
     4  13769   7.70%  tech news
     5  12709   7.10%  programming
     6   8459   4.73%  academic / science
     7   8200   4.58%  corporate comm.
     8   7294   4.08%  n/a
     9   5311   2.97%  business news
    10   3798   2.12%  general interest
    11   2151   1.20%  social media
    12   2048   1.14%  software
    13   1613   0.90%  technology
    14   1432   0.80%  video
    15   1144   0.64%  general information (wiki)
    16   1006   0.56%  government
    17    724   0.40%  misc documents
    18    720   0.40%  law
    19    702   0.39%  tech discussion
    20    620   0.35%  science news

I've got a total of 60 classifications which ... seems a bit high, and I'm looking at ways of slimming that down. It's also a bit confused, as some are classified by topic ("programming", "networking", "database", "cryptocurrency", "crowdfunding"), some by source ("corporate comm." is any post that originates from an identifiable company communicating as that company), and some by general format ("blog" includes 5,306 sites, and spans a wide range of topics). The distinction between, say, "tech news" and "blog" is somewhat ambiguous, and there are a few blogs which should be classified as "corporate comm.". But in all there's a rough sense of what types of content are being posted, and I'd really like to see the change over time.

For those interested in the Ongoing Saga of HN Front Page Analytics, I've been posting occasional updates to the above site-based classification (~60% of posts now classified) to the Fediverse: <https://toot.cat/@dredmorbius/tagged/HackerNewsAnalytics>

(It's a bit much to dump massive tables to HN; I'm trying to keep that to a bearable minimum.)