by jefftk
275 days ago
The FASTA format looks like:

  > title
  bases with optional newlines
  > title
  bases with optional newlines
  ...

The author is talking about removing the non-semantic optional newlines (hard wrapping), not all the newlines in the file.

It makes a lot of sense that this would work: bacteria have many subsequences in common, but if you insert non-semantic newlines at effectively random offsets then compression tools will not be able to use the repetition effectively.
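
To make the effect concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the records are synthetic (random bases with a shared 2000-base subsequence placed at a different offset in each record), the 60-column wrap width is arbitrary, and zlib stands in for whatever compressor is actually used. The helper names (`unwrap_fasta`, `make_records`, `to_fasta`) are hypothetical.

```python
import random
import zlib


def unwrap_fasta(text: str) -> str:
    """Join each record's sequence lines into one line; keep title lines."""
    out = []
    seq = []
    for line in text.splitlines():
        if line.startswith(">"):
            if seq:
                out.append("".join(seq))
                seq = []
            out.append(line)
        else:
            seq.append(line.strip())
    if seq:
        out.append("".join(seq))
    return "\n".join(out) + "\n"


def make_records(n=20, shared_len=2000, seed=0):
    """Synthetic records: a random prefix of varying length plus a shared tail."""
    rng = random.Random(seed)
    shared = "".join(rng.choice("ACGT") for _ in range(shared_len))
    records = []
    for i in range(n):
        # Varying prefix lengths shift where the wrap newlines fall
        # inside the shared subsequence.
        prefix = "".join(rng.choice("ACGT") for _ in range(7 * i + 3))
        records.append((f"> genome {i}", prefix + shared))
    return records


def to_fasta(records, width=None):
    """Serialize records, hard-wrapping sequences at `width` if given."""
    lines = []
    for title, bases in records:
        lines.append(title)
        if width is None:
            lines.append(bases)
        else:
            lines.extend(bases[j:j + width] for j in range(0, len(bases), width))
    return "\n".join(lines) + "\n"


records = make_records()
wrapped = to_fasta(records, width=60)
unwrapped = to_fasta(records)

wrapped_size = len(zlib.compress(wrapped.encode(), 9))
unwrapped_size = len(zlib.compress(unwrapped.encode(), 9))
print(f"wrapped:   {wrapped_size} bytes compressed")
print(f"unwrapped: {unwrapped_size} bytes compressed")
```

In the wrapped version the shared subsequence is interrupted by a newline every 60 bases, at a different phase in each record, so the compressor can only find short matches. Unwrapped, the same subsequence is byte-for-byte identical across records and long matches are available.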
Or, for an even simpler example:
becomes, on disk, something like which is hard to compress, while is just and then, if you want, you can reflow the text when it's time to render to the screen.