
Yes. You can see remnants of an old discussion of the point on the Wikipedia Talk page for "FASTA".

The problem comes down to an ambiguity in Pearson's original FASTA distribution, from http://faculty.virginia.edu/wrpearson/fasta/fasta3/ . In my copy of FASTA (fasta-35.1.5), fasta20.doc contains the following:

    0 Pearson/FASTA (>SEQID - comment/sequence)
       ...
    Standard library files.  These are the same as plain sequence
    files, each sequence is preceded by a comment line with a '>'
    in the first column.
       ...
    I have included several sample test files, *.AA.  The first
    line may begin with a '>'  or ';' followed by a comment.  The
    text after ';' in other lines will  be  ignored.   Spaces  and
    tabs  (and anything else that  is  not  an amino-acid code) are
    ignored.
    ...
    This is often referred to as "FASTA" or "Pearson" format. You
    can build your own library by concatenating several sequence
    files.  Just be sure that each sequence is preceded by a line
    beginning with a '>' with a sequence name.

For reference, this is the content of h10_human.aa:

    >H10_HUMAN | 90538   | HISTONE H1' (H1.0) (H1(0)).
    TENSTSAPAAKPKRAKASKKSTDHPKYSDMIVAAIQAEKNRAGSSRQSIQKYIKSHYKVGENADSQIKLSIKRLV
    TTGVLKQTKGVGASGSFRLAKSDEPKKSVAFKKTKKEIKKVATPKKASKPKKAASKAPTKKPKATPVKKAKKKLA
    ATPKKAKKPKTVKAKPVKASKPKKAKPVKPKAKSSAKRAGKKK

None of the '.aa' files have an example of a line starting with ';'.

(Also, there are '.seq' records with DNA in them, so the comment about ignoring non-amino-acid codes is only referring to protein FASTA files.)

Obviously the text after '>' is important, while the ignorable text after a ';' is not ... unless it's the first line of the file. If so, what name should be used to distinguish one from the other? The code calls the first line a title, for example.

Some people implemented ';' as a generic comment field, to be ignored. (See that Talk page for a couple of examples: "read.fasta in the seqinR package and by the function readFASTA in the Biostrings package".) Most others did not.
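To make the two readings concrete, here is a minimal sketch in Python. The function name `parse_fasta` and the flag `pearson_semicolons` are my own inventions for illustration, not taken from FASTA, seqinR, or Biostrings. With the flag on, it follows my reading of the fasta20.doc text quoted above: a leading ';' line can stand in for the title, and text after ';' elsewhere is dropped. With the flag off, it simply rejects any ';', which is one strict way to handle the NCBI-style subset.

```python
def parse_fasta(text, pearson_semicolons=False):
    """Return a list of (title, sequence) pairs from FASTA-formatted text.

    pearson_semicolons is a hypothetical flag for this sketch:
      True  -> a ';' line before any record is used as the title, and
               text after ';' on later lines is ignored.
      False -> any ';' raises ValueError.
    """
    records = []
    title = None
    residues = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new '>' line closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(residues)))
            title, residues = line[1:].strip(), []
            continue
        if ";" in line:
            if not pearson_semicolons:
                raise ValueError(f"line {lineno}: ';' not allowed in this mode")
            if title is None and line.startswith(";"):
                # Leading ';' line treated as the record title.
                title = line[1:].strip()
                continue
            # Otherwise drop everything from the ';' onward.
            line = line.split(";", 1)[0]
        if title is None:
            raise ValueError(f"line {lineno}: sequence data before any title")
        # Remove spaces and tabs inside the sequence line.
        residues.append("".join(line.split()))
    if title is not None:
        records.append((title, "".join(residues)))
    return records
```

Unlike the original documentation, this sketch only strips whitespace from sequence lines; it does not try to discard every character that is not an amino-acid code.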

After Pearson came NCBI. They describe a backwards-compatible subset of the original FASTA at http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml . That page calls the '>' line a "description line (defline)". The NCBI toolkit further breaks up the record into "sequence, description, and identifiers" (see http://www.ncbi.nlm.nih.gov/books/NBK21097/ ).
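As a rough illustration of that breakdown, the sketch below splits a defline at the first run of whitespace into an identifier and a description. The helper name `split_defline` is hypothetical, and the split-at-first-whitespace rule is a common simplification that I am assuming here; the NCBI toolkit's own identifier parsing is more involved and is not reproduced.

```python
def split_defline(line):
    """Split a '>' line into (identifier, description).

    Assumption for this sketch: the identifier is everything up to the
    first whitespace, and the description is whatever follows.
    """
    if not line.startswith(">"):
        raise ValueError("a defline must start with '>'")
    parts = line[1:].split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1].strip() if len(parts) > 1 else ""
    return identifier, description
```

Applied to the first line of h10_human.aa shown earlier, this gives "H10_HUMAN" as the identifier and the remainder of the line, starting at the first "|", as the description.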

BioPerl and Biopython, and I assume the other Bio* projects, follow NCBI's lead and use the same, or similar, names.

I am a member of the NCBI FASTA camp, not the Pearson FASTA camp, so when I see the term "comment" I think it unambiguously refers to text after a ';' in a Pearson file. I can see how someone from a different intellectual heritage would call the description a "comment", but as most of the world uses NCBI FASTA and not Pearson FASTA, I think it's a bit confusing to do so.