by cjfields1 4209 days ago
The FASTA parsers in the various Bio* projects all support additional characters; if one doesn't, that would definitely be a bug.

That said, 'support' depends entirely on what the FASTA is and where it comes from (assembly, sequence database, alignment tool, etc.), which is something the parser/grammar can't define up front for such a simple format.

Much of the problem comes when validating the sequence for a specific alphabet or symbol set. If the alphabet isn't explicitly defined (again impossible to determine from the format without guessing) then you can certainly run into problems.
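To make that concrete, here is a minimal Python sketch of validating against an alphabet the caller supplies explicitly. The names `DNA`, `PROTEIN` and `invalid_symbols` are made up for illustration and are not taken from any Bio* library, and the alphabets are deliberately small rather than the full IUPAC sets.

```python
# Illustrative alphabets only (assumption: real tools use larger IUPAC sets).
DNA = set("ACGTN")
PROTEIN = set("ACDEFGHIKLMNPQRSTVWY")


def invalid_symbols(seq, alphabet, extra=""):
    """Return the sorted characters of seq that are outside alphabet + extra."""
    allowed = set(alphabet) | set(extra)
    return sorted(set(seq.upper()) - allowed)


# "ACGT" passes under both alphabets, so the format alone cannot tell
# you which one was intended.
print(invalid_symbols("ACGT", DNA))
print(invalid_symbols("ACGT", PROTEIN))

# A stop symbol is only valid if the caller says so.
print(invalid_symbols("MKV*", PROTEIN))
print(invalid_symbols("MKV*", PROTEIN, extra="*"))
```

The point of passing the alphabet in is that the guess is moved out of the parser and into the hands of whoever knows where the file came from.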

This is also an issue when FASTA is used for both regular sequence data and alignments: for example, the length of a sequence would have to take into account gap characters emitted by various tools ('-', '?', '.') and stop symbols in protein sequences ('*').
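A rough Python sketch of the length problem, assuming the gap and stop characters mentioned above are the only ones in play. `GAP_CHARS`, `STOP_CHARS` and `ungapped_length` are hypothetical names, and whether a stop should count towards the length is a choice left to the caller here.

```python
# Assumed symbol sets; different tools emit different gap characters.
GAP_CHARS = "-.?"
STOP_CHARS = "*"


def ungapped_length(seq, gaps=GAP_CHARS, stops=STOP_CHARS):
    """Count the characters of seq that are neither gap nor stop symbols."""
    ignored = set(gaps) | set(stops)
    return sum(1 for ch in seq if ch not in ignored)


aligned = "MK-V..LT*"
print(len(aligned))                       # raw string length: 9
print(ungapped_length(aligned))           # residues only: 5
print(ungapped_length(aligned, stops="")) # counting the stop: 6
```

The same record gives three different "lengths" depending on what the consumer decides the symbols mean, which is why the parser can't settle it on its own.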

These are tools that have been in widespread use for ~15 years, so if you have run into problems you should be more explicit about what they are so they can be addressed.

1 comment

BioPerl has probably got a lot better over the years. I haven't touched it in a long time (5+ years), so my views may be very out of date. I don't really want to get into reviewing Bio* libraries generally :)

There is a general problem though, I think, and other people have experienced the same thing. I've been trying to work out why, and I think it's down to multiple factors. By the time you've managed to get a given piece working reliably with BioPerl, you could have written it yourself, and it would be faster and use less memory. BioPerl has the problem of trying to solve every case, whereas normally I'm working with a limited subset of possibilities.

So why? I think it's a combination of dodgy, manually hacked inputs (e.g. PDB files!), the learning curve, poor backwards compatibility that prevents upgrading, and, ah, I don't know. Every time it's seemed like a good idea, and yet every time I've ended up abandoning BioPerl. Maybe it's too integrated with itself as a library of functions?

Take parsing a FASTA file. By the time I've read the BioPerl documentation, I could already have written that one-line split statement (because, e.g., I know in advance that the sequence string is always on one line). It's hard to overcome that laziness and make the commitment to learn the library and become fluent in it.
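For illustration, this is roughly the kind of throwaway parser meant here, sketched in Python rather than Perl. It only works under the stated assumption that every record is exactly one header line followed by one sequence line; `parse_simple_fasta` is a made-up name, and the sketch fails loudly rather than guessing when the assumption doesn't hold.

```python
def parse_simple_fasta(text):
    """Parse FASTA where each record is one '>' header line plus one sequence line.

    Returns a dict mapping the first word of each header to its sequence.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected header/sequence pairs; is a sequence wrapped?")
    records = {}
    for header, seq in zip(lines[0::2], lines[1::2]):
        name = header[1:].split()
        if not header.startswith(">") or not name:
            raise ValueError(f"bad header line: {header!r}")
        records[name[0]] = seq
    return records


example = ">seq1 first record\nACGT\n>seq2\nMKV*\n"
print(parse_simple_fasta(example))
```

It does nothing about alphabets, gaps, wrapped lines or duplicate IDs, which is exactly the trade-off: it is quick to write because it covers one narrow case, and it breaks as soon as the input stops fitting that case.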