| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by adammil 4634 days ago
	It is nice to read about this in plain language. But, can someone explain what the X and Y axis are meant to represent in the graph?

5 comments

davmre 4634 days ago

They don't mean anything in particular. The actual analysis is being done in a high-dimensional space, in which each post is represented by a high-dimensional vector of the form [0,0,1,0,...., 0,1,0]. The length of the vector is the total number of distinct words used across all blog posts (maybe something like 30,000), and each entry is either 0 or 1 depending on whether the corresponding word occurs in this post. All the distances and cluster centers are actually being computed in this 30000-dimensional space; the two-dimensional visualization is just for intuition.

If you're wondering how the author came up with the two-dimensional representation, the article doesn't say, but it's likely he used something like Principal Component Analysis (http://en.wikipedia.org/wiki/Principal_component_analysis). This is a standard technique for dimensionality reduction, meaning that it finds the "best" two-dimensional representation of the original 30,000-dimensional points, where "best" in this case means something like "preserves distances", so that points that were nearby in the original space are still relatively close in the low-dimensional representation.

link

benofsky 4634 days ago

Sure, author here.

The point of the graphs are as an example more than anything — in any real world solution it's not trivial to plot your documents (as representing documents with thousands of words in 1d, 2d or 3d is not trivial).

But if _you could_ represent your documents in 2d, you could plot them like I do here. And, the X-axis would represent 1 feature of your documents, and the Y-axis another.

Tl;dr don't focus on the axes — focus on the idea of being able to compare the relative distances between documents.

link

hughlomas 4634 days ago

The illustration has 2 dimensions, x and y.

If we have x, y, z, i, j, and n then we'd have six dimensions.

If each word represents a dimension, any blog title with that word gets a 1 on that word dimension, or a 0 if does not.

You can then calculate the distance between these points considering all of the dimensions, although clearly it's going to be a little limited on binary inputs, since the square of the differences is always going to be 1 or 0. Like for 4 dimensions, sqrt( ( 0 - 1 )^2 + ( 0 - 0 )^2 + ( 0 - 0 )^2 + ( 1 - 0 )^2 ), which just always gets simplified to something like sqrt( 1 + 0 + 0 + 1 ), which is a little boring.

The binary values cause your data points to all be stacked directly on top of each other, which leads me to believe that using binary inputs is a less than ideal application for k-means. Just look at it in the 2d case, where you have either [0,1], [0,0], [1,0], or [1,1] for each data point. Not very hard to determine the clustering there... basically just doing an overly complex boolean expression.

link

bunderbunder 4634 days ago

They don't really represent anything.

Think of those graphs as being something like the rubber sheet analogy that's often used to explain general relativity - it's not a great analogy in all respects, but humans can't really visualize 4 dimensions and it captures the gist of the idea well enough. Machine learning models have a similar problem: In reality the data points in k-means clustering exist in some very high-dimensinal space that's impossible to draw. So for the sake of visual aids it's common to draw 2D plots where the only thing they're meant to suggest is that points that appear close together in the visual aid would also be fairly close to each other in the actual problem space.

link

nilkn 4634 days ago

The graphs are just illustrations. The actual graphs would be high-dimensional (one dimension or axis for each word in the total word set), and each axis would have only two ticks on it: 0 and 1. A post is mapped to the binary vector in this space whose coordinates are determined by the presence or absence of words in the post.

link