by dwattttt
275 days ago
The FASTA format stores nucleotides as text. Compression makes this tractable at genome sizes, but it's by no means perfect. Depending on what you need to represent, you can get a 4x reduction in data size with no compression at all, just by encoding each of G, A, T and C in 2 bits rather than 8. Compression on top of that "should" produce the same compressed size as compressing the original text (after all, the "information" being compressed is the same), except that real compressors aren't perfect. Newlines are one example: they're "information" in the text format that isn't relevant to the sequence, but the compression scheme doesn't know that.
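To make the 2-bit idea concrete, here's a rough Python sketch. The names (`CODE`, `pack`, `unpack`, `strip_fasta`) and the particular bit assignment are my own choices for illustration, not part of any standard, and it assumes the sequence contains only A, C, G and T (no N or other ambiguity codes).

```python
# Hypothetical 2-bit code; which bit pattern maps to which base is arbitrary.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def strip_fasta(text: str) -> str:
    """Drop header lines (starting with '>') and newlines, keeping only bases."""
    lines = (line.strip() for line in text.splitlines())
    return "".join(line for line in lines if not line.startswith(">")).upper()


def pack(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]  # raises KeyError on anything else
        # Left-align a short final chunk so unpack can always read top bits first.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Inverse of pack; length is needed because the last byte may be padded."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])
```

Note that `strip_fasta` throws away the newlines and header before packing, which is exactly the "irrelevant information" a general-purpose compressor would otherwise have to spend bits on. The sequence length has to be stored separately, since the final byte can hold padding.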