by dredmorbius 2432 days ago

Different classes of data compress differently.

For complex reasons, human language (spoken and written) is about 50% redundant, across a wide range of independent languages.

Tabular data can be vastly more compressible, and I'd routinely see 90% or better compression across a range of datasets (mostly business, financial, and healthcare data). Data of highly random events might be somewhat less so.
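A rough way to see the difference for yourself, as a minimal sketch: the table below is made up for illustration (hypothetical columns and values, not one of the datasets mentioned above), and zlib stands in for whatever compressor you'd actually use.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)

# Hypothetical tabular data: a few columns drawing on a small set of repeating values.
rows = [
    f"2019-01-{day:02d},{region},{product},{qty}\n"
    for day in range(1, 29)
    for region in ("north", "south", "east", "west")
    for product in ("widget", "gadget", "gizmo")
    for qty in (1, 5, 10)
]
tabular = "".join(rows).encode("utf-8")

# Random bytes of the same length stand in for data from highly random events.
random_bytes = os.urandom(len(tabular))

tabular_ratio = compression_ratio(tabular)
random_ratio = compression_ratio(random_bytes)
print(f"tabular: {tabular_ratio:.2%} of original size")
print(f"random:  {random_ratio:.2%} of original size")
```

The synthetic table is far more regular than real business data, so treat the exact figures as illustrative only; the point is the gap between the two.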

Image, audio, and video data that has already been through a codec is highly compressed. When you're working with raw (WAV, TIFF, BMP, RAW) datatypes, there's a huge opportunity for compression, but mp3, ogg, mp4, png, gif, jpg, etc., are already pretty highly compressed. There's a distinction between lossy (jpg, mp3) and lossless (png, ALAC) formats. You get smaller files with lossy formats, but you're actually losing some of the original data, whilst lossless codecs allow full reconstruction of the original source image, audio, or video.
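To make the lossless/lossy distinction concrete, here's a small sketch. The "signal" is a made-up noisy sine wave of 8-bit samples, zlib plays the lossless codec, and dropping the low 4 bits of each sample is only a crude stand-in for what a real lossy codec does (real codecs are far smarter about what they discard).

```python
import math
import random
import zlib

# Hypothetical raw signal: a noisy sine wave stored as 8-bit samples.
rng = random.Random(0)
raw = bytes(
    int(round(128 + 100 * math.sin(2 * math.pi * i / 200))) + rng.randint(-3, 3)
    for i in range(20000)
)

# Lossless: decompressing gives back exactly the original bytes.
packed = zlib.compress(raw, 9)
restored = zlib.decompress(packed)

# Compressing already-compressed bytes gains little or nothing.
repacked = zlib.compress(packed, 9)

# Crude lossy step (an assumption for illustration): discard the low 4 bits.
quantized = bytes(b & 0xF0 for b in raw)
lossy_packed = zlib.compress(quantized, 9)

print("raw bytes:           ", len(raw))
print("lossless compressed: ", len(packed))
print("compressed twice:    ", len(repacked))
print("lossy + compressed:  ", len(lossy_packed))
print("lossless round trip exact:", restored == raw)
print("lossy data equals raw:    ", quantized == raw)
```

The lossy version comes out smaller, but the discarded bits are gone for good: nothing in the quantized data lets you recover the original samples.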

Your comment about simulated universes gets to a key philosophical question about information, truth, and models. Generally, any representation we have of the universe is at best an abstraction of it, and hence a small, lossy model.

This needn't necessarily be the case:

https://en.wikipedia.org/wiki/On_Exactitude_in_Science