by dredmorbius 1089 days ago
Having played with classifying sites for much of the past day, I've assigned a classification to just under 30% of them, which classifies just under 64% of all posts.

The remaining unclassified sites average about 1.7 posts each (a few have as many as 20), so classifying more of them would yield minimal gains.
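
To make the site-versus-post coverage arithmetic concrete, here is a minimal Python sketch. The function name `coverage`, the site names, and the data are all made up for illustration; this is not the script actually used for the figures above.

```python
from collections import Counter

def coverage(post_sites, site_class):
    """Summarize how much of the sites and posts a partial classification covers."""
    posts_per_site = Counter(post_sites)
    classified = [s for s in posts_per_site if s in site_class]
    unclassified = [s for s in posts_per_site if s not in site_class]
    total_posts = sum(posts_per_site.values())
    unclassified_posts = sum(posts_per_site[s] for s in unclassified)
    return {
        "site_share": len(classified) / len(posts_per_site),
        "post_share": (total_posts - unclassified_posts) / total_posts,
        "mean_posts_per_unclassified_site": (
            unclassified_posts / len(unclassified) if unclassified else 0.0
        ),
    }

# Made-up data: one heavily posted classified site, three rarely posted unclassified ones.
post_sites = ["big.example"] * 6 + ["a.example", "b.example", "b.example", "c.example"]
site_class = {"big.example": "blog"}

print(coverage(post_sites, site_class))
```

In this toy input a quarter of the sites carry most of the posts, which is the same kind of skew that lets a minority of classified sites cover a majority of stories.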

I'm now starting to run an analysis over the full archive to come up with trends by classification over the years.
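
One way a per-year breakdown might be computed, as a rough Python sketch. It assumes the archive can be reduced to (year, site) pairs and that the classifications live in a site-to-label map; `yearly_shares`, `SITE_CLASS`, and the sample data are hypothetical names and values, not the real archive format.

```python
from collections import Counter, defaultdict

# Hypothetical site -> classification map; sites missing from it count as UNCLASSIFIED.
SITE_CLASS = {
    "example-blog.com": "blog",
    "example-news.com": "general news",
}

def yearly_shares(posts, site_class):
    """Return {year: {classification: share of that year's posts}}."""
    counts = defaultdict(Counter)
    for year, site in posts:
        counts[year][site_class.get(site, "UNCLASSIFIED")] += 1
    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {label: n / total for label, n in counter.items()}
    return shares

# Made-up (year, site) pairs purely for illustration.
posts = [
    (2010, "example-blog.com"),
    (2010, "example-blog.com"),
    (2010, "example-news.com"),
    (2010, "unknown.example"),
    (2011, "example-news.com"),
    (2011, "unknown.example"),
]

for year, by_label in sorted(yearly_shares(posts, SITE_CLASS).items()):
    for label, share in sorted(by_label.items(), key=lambda kv: -kv[1]):
        print(f"{year}  {share:6.2%}  {label}")
```

Using shares within each year rather than raw counts keeps the comparison meaningful if the total number of posts changes from year to year.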

The top-20 classifications (by story) are:

     1  64777  36.21%  UNCLASSIFIED
     2  22481  12.57%  blog
     3  15106   8.44%  general news
     4  13769   7.70%  tech news
     5  12709   7.10%  programming
     6   8459   4.73%  academic / science
     7   8200   4.58%  corporate comm.
     8   7294   4.08%  n/a
     9   5311   2.97%  business news
    10   3798   2.12%  general interest
    11   2151   1.20%  social media
    12   2048   1.14%  software
    13   1613   0.90%  technology
    14   1432   0.80%  video
    15   1144   0.64%  general information (wiki)
    16   1006   0.56%  government
    17    724   0.40%  misc documents
    18    720   0.40%  law
    19    702   0.39%  tech discussion
    20    620   0.35%  science news
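
A ranked table in this rank / count / percent / label layout could be produced with something like the following Python sketch. It assumes one classification label per story; `top_classifications` and the sample labels are illustrative only and are not the tool that generated the numbers above.

```python
from collections import Counter

def top_classifications(labels, n=20):
    """Return (rank, count, percent, label) rows, most frequent first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [
        (rank, count, 100.0 * count / total, label)
        for rank, (label, count) in enumerate(counts.most_common(n), start=1)
    ]

# Made-up per-story labels, one entry per story, purely for illustration.
labels = ["UNCLASSIFIED"] * 5 + ["blog"] * 3 + ["tech news"] * 2

for rank, count, pct, label in top_classifications(labels, n=3):
    print(f"{rank:6d}  {count:5d}  {pct:5.2f}%  {label}")
```

Percentages are taken against all stories, not just the top `n`, so the listed rows need not sum to 100%.
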
I've got a total of 60 classifications, which ... seems a bit high, and I'm looking at ways of slimming that down. The scheme is also a bit confused, as some sites are classified by topic ("programming", "networking", "database", "cryptocurrency", "crowdfunding"), some by source ("corporate comm." is any post that originates from an identifiable company communicating as that company), and some by general format ("blog" includes 5,306 sites and spans a wide range of topics). The distinction between, say, "tech news" and "blog" is somewhat ambiguous, and there are a few blogs which should be classified as "corporate comm.". But in all there's a rough sense of what types of content are being posted, and I'd really like to see the change over time.