| HN Mirror

	by pronoiac 507 days ago
	I remember tools that worked with the bzip2-compressed Wikipedia dumps and built indexes to allow decent random access. Once you know where the compressed blocks are and which Wikipedia entries each one contains, you can start decompressing from a given block (something like 900k each) rather than from the beginning of the file. Compressing roughly a megabyte at a time, rather than a page at a time, is a pretty solid win for compressibility.
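
	A rough sketch of the idea, not any particular tool's implementation. To keep it short it swaps in a simpler technique: each group of entries is written as its own independent bzip2 stream, so every chunk starts on a byte boundary, instead of locating raw block boundaries inside one stream. The names `build_archive` and `lookup`, the tab-separated record layout, and the tiny `per_chunk` value are all made up for illustration.

```python
import bz2
import io


def build_archive(entries, per_chunk=2):
    """Compress entries in groups and return (archive_bytes, index).

    The index maps each title to the (offset, length) of the compressed
    chunk that holds it. Hypothetical layout, for illustration only.
    """
    archive = io.BytesIO()
    index = {}
    items = list(entries.items())
    for i in range(0, len(items), per_chunk):
        group = items[i:i + per_chunk]
        # One record per line: title, tab, text. Assumes neither contains
        # tabs or newlines, which is fine for a toy example.
        raw = "".join(f"{title}\t{text}\n" for title, text in group).encode("utf-8")
        compressed = bz2.compress(raw)  # one independent bzip2 stream
        offset = archive.tell()
        archive.write(compressed)
        for title, _ in group:
            index[title] = (offset, len(compressed))
    return archive.getvalue(), index


def lookup(archive, index, title):
    """Decompress only the chunk that contains `title` and return its text."""
    offset, length = index[title]
    f = io.BytesIO(archive)
    f.seek(offset)  # jump straight to the chunk, skipping everything before it
    raw = bz2.decompress(f.read(length))
    for line in raw.decode("utf-8").splitlines():
        key, _, text = line.partition("\t")
        if key == title:
            return text
    raise KeyError(title)


entries = {"A": "alpha", "B": "beta", "C": "gamma"}
archive, index = build_archive(entries)
print(lookup(archive, index, "C"))
```

	The tradeoff is the chunk size: bigger chunks compress better, but each lookup has to decompress more data it doesn't need. The concatenated output is still a valid multi-stream bzip2 file, so ordinary tools can decompress the whole thing front to back.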