| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by vintermann 1000 days ago
	Ideally, you'd take all the documents in each language, and compress them in turn with the unclassified text, to see which compresses it better. But this won't work very well with gzip, since it compresses based on a 32KB sliding window. You might as well truncate the training data for each class to the last 32KB (more or less). So to get any performance at all out of a gzip-based classifier, you need to combine a ton of individually quite bad predictors with some sort of ensemble method. (The linked code demonstrates a way of aggregating them which does not work at all).