by cjfields1 4209 days ago
Actually, in the Bio* langs that would be referred to as the 'description'. @dalke is referring to the very rarely-used ';' I believe, which I don't recall any modern-day FASTA parser supporting.

Yes. You can see remnants of an old discussion of the point on the Wikipedia Talk page for "FASTA".

The problem comes down to an ambiguity in Pearson's original FASTA distribution, from http://faculty.virginia.edu/wrpearson/fasta/fasta3/ . In my copy of FASTA (fasta-35.1.5), fasta20.doc contains the following:

    0 Pearson/FASTA (>SEQID - comment/sequence)
       ...
    Standard library files.  These are the same as plain sequence
    files, each sequence is preceded by a comment line with a '>'
    in the first column.
       ...
    I have included several sample test files, *.AA.  The first
    line may begin with a '>'  or ';' followed by a comment.  The
    text after ';' in other lines will  be  ignored.   Spaces  and
    tabs  (and anything else that  is  not  an amino-acid code) are
    ignored.
    ...
    This is often referred to as "FASTA" or "Pearson" format. You
    can build your own library by concatenating several sequence
    files.  Just be sure that each sequence is preceded by a line
    beginning with a '>' with a sequence name.

For reference, this is the content of h10_human.aa:

    >H10_HUMAN | 90538   | HISTONE H1' (H1.0) (H1(0)).
    TENSTSAPAAKPKRAKASKKSTDHPKYSDMIVAAIQAEKNRAGSSRQSIQKYIKSHYKVGENADSQIKLSIKRLV
    TTGVLKQTKGVGASGSFRLAKSDEPKKSVAFKKTKKEIKKVATPKKASKPKKAASKAPTKKPKATPVKKAKKKLA
    ATPKKAKKPKTVKAKPVKASKPKKAKPVKPKAKSSAKRAGKKK

None of the '.aa' files have an example of a line starting with ';'.

(Also, there are '.seq' records with DNA in them, so the comment about ignoring non-amino-acid codes is only referring to protein FASTA files.)

Obviously the text after '>' is important, while the ignorable text after a ';' is not ... unless it's the first line of the file. If so, what names should be used to distinguish between the two? The code calls the first line a title, for example.

Some people implemented ';' as a generic comment field, to be ignored. (See that Talk page for a couple of examples; "read.fasta in the seqinR package and by the function readFASTA in the Biostrings package") Most others did not.
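
To make the two behaviors concrete, here is a minimal sketch of a line-based parser with a switch for ';' lines. The function name `parse_fasta` and the `allow_semicolon` flag are made up for illustration and are not taken from any of the packages mentioned. It also simplifies the Pearson rules: a ';' on the first line is not treated as a title, and a ';' in the middle of a line is not handled at all.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_fasta(
    lines: Iterable[str], allow_semicolon: bool = False
) -> Iterator[Tuple[str, str]]:
    """Yield (title, sequence) pairs from FASTA-formatted lines.

    Hypothetical helper for illustration only.
    allow_semicolon=True  -> lines starting with ';' are skipped as comments.
    allow_semicolon=False -> lines starting with ';' raise ValueError.
    """
    title: Optional[str] = None
    chunks: List[str] = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith(";"):
            if not allow_semicolon:
                raise ValueError(f"line {lineno}: ';' lines are not supported")
            continue  # generic comment, ignored
        if line.startswith(">"):
            # A new '>' line closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:].strip()
            chunks = []
        elif line.strip():
            if title is None:
                raise ValueError(f"line {lineno}: sequence data before any '>' line")
            # Drop spaces and tabs inside the sequence line.
            chunks.append("".join(line.split()))
    # Emit the last record.
    if title is not None:
        yield title, "".join(chunks)
```

With `allow_semicolon=True` the ';' lines silently vanish; with the default they stop the parse, which is the stricter behavior some people prefer for files of unknown origin.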

After Pearson came NCBI. They describe a backwards compatible subset of the original FASTA at http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml . It calls the '>' line a "description line (defline)". The NCBI toolkit further breaks up the record into "sequence, description, and identifiers" (see http://www.ncbi.nlm.nih.gov/books/NBK21097/ ).

BioPerl and Biopython, and I assume the other Bio* languages, follow NCBI's lead and use the same, or similar, names.
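
As a rough illustration of that naming, the sketch below splits a '>' line into an identifier and a description at the first run of whitespace. That split rule is an assumption chosen for simplicity, and `split_defline` is a hypothetical helper, not the NCBI toolkit's or any Bio* library's actual API.

```python
from typing import Tuple


def split_defline(defline: str) -> Tuple[str, str]:
    """Split a FASTA '>' line into (identifier, description).

    Assumption: the identifier is the first whitespace-delimited token
    and everything after it is the description.
    """
    text = defline[1:] if defline.startswith(">") else defline
    parts = text.strip().split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


# The defline from h10_human.aa quoted above.
ident, desc = split_defline(">H10_HUMAN | 90538   | HISTONE H1' (H1.0) (H1(0)).")
print(ident)  # H10_HUMAN
print(desc)   # | 90538   | HISTONE H1' (H1.0) (H1(0)).
```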

I am a member of the NCBI FASTA camp, not the Pearson FASTA camp, so when I see the term "comment" I think it unambiguously refers to text after a ';' line in a Pearson file. I can see how someone from a different intellectual heritage would call the description a "comment", but as most of the world uses NCBI FASTA and not Pearson FASTA, I think it's a bit confusing to do so.

Oh I've seen them used in the wild... I'd rather have a parser throw an exception than have that data go through though lol
Curious. Where have you run into them (so I can run the other way :)
Same here! I did a pretty extensive search about 8 years ago for real-world examples, and no one could point to any, outside of test code to make sure that a given tool could handle these mythical files should they appear.
Hah, I wouldn't worry: it was a hand-prepared sequence file, for a known structure, that I was emailed. The comment line had some interesting PDB data encoded :)