| HN Mirror



	by davmre 4634 days ago
	They don't mean anything in particular. The actual analysis is done in a high-dimensional space, in which each post is represented by a vector of the form [0,0,1,0,...,0,1,0]. The length of the vector is the total number of distinct words used across all blog posts (maybe something like 30,000), and each entry is either 0 or 1 depending on whether the corresponding word occurs in that post. All the distances and cluster centers are computed in this 30,000-dimensional space; the two-dimensional visualization is just for intuition. If you're wondering how the author came up with the two-dimensional representation, the article doesn't say, but he likely used something like Principal Component Analysis (http://en.wikipedia.org/wiki/Principal_component_analysis). This is a standard technique for dimensionality reduction: it finds the "best" two-dimensional linear projection of the original 30,000-dimensional points, where "best" means the projection keeps as much of the data's variance as possible. In practice this tends to preserve distances roughly, so that points that were nearby in the original space are still relatively close in the low-dimensional representation.
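
	To make the two steps above concrete, here is a minimal sketch in plain Python. It is not the article author's code: the four toy posts, the function names (binary_vectors, top_component, pca_2d, euclidean), and the use of power iteration to find the principal components are all assumptions made for illustration. A real analysis over ~30,000 words would normally use a linear algebra library instead.

```python
import math

# Hypothetical toy posts standing in for the blog posts in the article.
posts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell as markets closed",
    "markets rose as stocks rallied",
]

def binary_vectors(docs):
    """Return (vocab, vectors): one 0/1 entry per distinct word across all docs."""
    vocab = sorted({w for d in docs for w in d.split()})
    vectors = []
    for d in docs:
        words = set(d.split())
        vectors.append([1.0 if w in words else 0.0 for w in vocab])
    return vocab, vectors

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_component(cov, iters=200):
    """Leading eigenvector of a symmetric PSD matrix, found by power iteration."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # fixed, deterministic start vector
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left in any direction
            break
        v = [x / norm for x in w]
    return v

def pca_2d(vectors):
    """Project the vectors onto their first two principal components."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(2):
        pc = top_component(cov)
        # Variance captured along pc (Rayleigh quotient; pc has unit length).
        lam = sum(pc[i] * cov[i][j] * pc[j] for i in range(d) for j in range(d))
        components.append(pc)
        # Deflate: remove that direction so the next pass finds the next one.
        cov = [[cov[i][j] - lam * pc[i] * pc[j] for j in range(d)]
               for i in range(d)]
    return [[sum(r[j] * pc[j] for j in range(d)) for pc in components]
            for r in centered]

if __name__ == "__main__":
    vocab, vecs = binary_vectors(posts)
    print(len(vocab), "distinct words")
    for post, point in zip(posts, pca_2d(vecs)):
        print(post, "->", [round(c, 3) for c in point])
```

	The clustering itself would still run on the full 0/1 vectors returned by binary_vectors; the output of pca_2d is only used for plotting. The signs and orientation of the two plotted axes are arbitrary, which is another way of seeing that the axes don't mean anything in particular.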