by dredmorbius
1089 days ago
Having spent much of the past day classifying sites, I've assigned a classification to just under 30% of them, which covers just under 64% of all posts. The remaining unclassified sites average about 1.7 posts each (a few have as many as 20), so further classification brings minimal gains. I'm now starting an analysis over the full archive to come up with trends by classification over the years. The top-20 classifications (by story) are:
1 64777 36.21% UNCLASSIFIED
2 22481 12.57% blog
3 15106 8.44% general news
4 13769 7.70% tech news
5 12709 7.10% programming
6 8459 4.73% academic / science
7 8200 4.58% corporate comm.
8 7294 4.08% n/a
9 5311 2.97% business news
10 3798 2.12% general interest
11 2151 1.20% social media
12 2048 1.14% software
13 1613 0.90% technology
14 1432 0.80% video
15 1144 0.64% general information (wiki)
16 1006 0.56% government
17 724 0.40% misc documents
18 720 0.40% law
19 702 0.39% tech discussion
20 620 0.35% science news
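For anyone wanting to reproduce a table like the one above, here is a minimal sketch, not the script I actually ran. It assumes a hypothetical `SITE_CLASS` dict mapping site to classification and a list holding one site name per story; anything unmapped falls into UNCLASSIFIED.

```python
from collections import Counter

# Hypothetical site -> classification map (illustrative entries only).
SITE_CLASS = {
    "example-blog.com": "blog",
    "example-news.com": "general news",
}

def rank_classifications(story_sites, site_class=SITE_CLASS, top=20):
    """Return (rank, stories, percent, classification) rows, largest first."""
    # One count per story; sites missing from the map are UNCLASSIFIED.
    counts = Counter(site_class.get(site, "UNCLASSIFIED") for site in story_sites)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (rank, n, round(100 * n / total, 2), label)
        for rank, (label, n) in enumerate(counts.most_common(top), start=1)
    ]

if __name__ == "__main__":
    sites = ["example-blog.com"] * 2 + ["example-news.com", "other.org"]
    for rank, n, pct, label in rank_classifications(sites):
        print(f"{rank} {n} {pct:.2f}% {label}")
```

Percentages are taken over all stories, not over sites, which is why a minority of classified sites can account for a majority of posts.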
I've got a total of 60 classifications which ... seems a bit high, and I'm looking at ways of slimming that down. The scheme is also a bit confused: some classes are by topic ("programming", "networking", "database", "cryptocurrency", "crowdfunding"), some by source ("corporate comm." is any post that originates from an identifiable company communicating as that company), and some by general format ("blog" includes 5,306 sites and spans a wide range of topics). The distinction between, say, "tech news" and "blog" is somewhat ambiguous, and there are a few blogs which should be classified as "corporate comm.". But in all there's a rough sense of what types of content are being posted, and I'd really like to see the change over time.
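The change-over-time view could be sketched roughly as below. This is an assumption about the shape of the analysis, not the code I'm running: `yearly_shares` is a hypothetical helper that takes (year, site) pairs plus a site-to-classification dict and returns each classification's percentage share of stories per year.

```python
from collections import Counter, defaultdict

def yearly_shares(stories, site_class):
    """stories: iterable of (year, site). Returns {year: {classification: percent}}."""
    per_year = defaultdict(Counter)
    for year, site in stories:
        # Sites missing from the map are counted as UNCLASSIFIED.
        per_year[year][site_class.get(site, "UNCLASSIFIED")] += 1
    shares = {}
    for year, counts in sorted(per_year.items()):
        total = sum(counts.values())
        shares[year] = {label: round(100 * n / total, 2) for label, n in counts.items()}
    return shares

if __name__ == "__main__":
    demo = [(2010, "a.example"), (2010, "b.example"), (2011, "a.example")]
    print(yearly_shares(demo, {"a.example": "blog"}))
```

Using shares rather than raw counts keeps years comparable even as total submission volume changes.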