by sillywabbit 106 days ago
As the other poster said, the rabbit hole continues with suffix arrays (https://en.wikipedia.org/wiki/Suffix_array#Space_efficiency), and then compressed suffix arrays (https://en.wikipedia.org/wiki/Compressed_suffix_array).
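
For anyone who hasn't met suffix arrays before, here is a minimal sketch of the idea: store only the sorted start positions of all suffixes, then binary-search them for a pattern. The function names are my own, and the construction is the naive sort-based one (fine for toy inputs, far too slow for a genome), not one of the linear-time algorithms the Wikipedia article covers.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes, in lexicographic order.

    Naive construction: each comparison may copy a whole suffix, so this
    is only meant to illustrate the data structure on small strings.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return every position where pattern occurs, using binary search on sa."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "GATTACA"
    sa = build_suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "A"))
```

The point of the structure is that the index is just one integer per character of the text, which is where the space savings over a suffix tree come from.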

Also explained by the creator of this: https://www.abahgat.com/project/suffix-tree/

> the human genome can be encoded as a 3GB string constructed out of an alphabet of four characters

> As of 2019, a suffix tree indexing the human genome using state of the art algorithms can easily occupy tens of gigabytes.
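
To put rough numbers on those two quotes, here is a back-of-the-envelope sketch. The 3 billion base count comes from the quoted 3GB figure (one byte per base). The 10 to 20 bytes per character for a suffix tree is an assumption on my part, a commonly cited ballpark rather than a number from the linked article, and the function names are made up for illustration.

```python
GENOME_BASES = 3_000_000_000  # ~3 billion bases, i.e. 3 GB at one byte per base
GB = 1_000_000_000            # decimal gigabyte


def packed_text_bytes(n_bases: int) -> int:
    # A four-letter alphabet needs only 2 bits per base.
    return n_bases * 2 // 8


def suffix_array_bytes(n_bases: int, index_bytes: int = 4) -> int:
    # One index per suffix; 3e9 < 2**32, so a 32-bit index is enough.
    return n_bases * index_bytes


def suffix_tree_bytes(n_bases: int, bytes_per_char: int) -> int:
    # ASSUMPTION: 10-20 bytes per character is a rough ballpark only.
    return n_bases * bytes_per_char


if __name__ == "__main__":
    print("packed text  :", packed_text_bytes(GENOME_BASES) / GB, "GB")
    print("suffix array :", suffix_array_bytes(GENOME_BASES) / GB, "GB")
    print("suffix tree  :",
          suffix_tree_bytes(GENOME_BASES, 10) / GB, "to",
          suffix_tree_bytes(GENOME_BASES, 20) / GB, "GB")
```

Under those assumptions the plain suffix array is already several times the size of the text itself, which is the motivation for the compressed variants.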