Actually, in the Bio* langs that would be referred to as the 'description'. @dalke is referring to the very rarely-used ';' I believe, which I don't recall any modern-day FASTA parser supporting.
Yes. You can see remnants of an old discussion of the point on the Wikipedia Talk page for "FASTA".
The problem comes down to an ambiguity in Pearson's original FASTA distribution, from http://faculty.virginia.edu/wrpearson/fasta/fasta3/ . In my copy of FASTA (fasta-35.1.5), fasta20.doc contains the following:
0 Pearson/FASTA (>SEQID - comment/sequence)
...
Standard library files. These are the same as plain sequence
files, each sequence is preceded by a comment line with a '>'
in the first column.
...
I have included several sample test files, *.AA. The first
line may begin with a '>' or ';' followed by a comment. The
text after ';' in other lines will be ignored. Spaces and
tabs (and anything else that is not an amino-acid code) are
ignored.
...
This is often referred to as "FASTA" or "Pearson" format. You
can build your own library by concatenating several sequence
files. Just be sure that each sequence is preceded by a line
beginning with a '>' with a sequence name.
For reference, this is the content of h10_human.aa:
None of the '.aa' files have an example of a line starting with ';'.
(Also, there are '.seq' records with DNA in them, so the comment about ignoring non-amino-acid codes is only referring to protein FASTA files.)
Obviously the text after '>' is important, while the ignorable text after a ';' is not ... unless it's the first line of the file. If so, what name should be used to distinguish between one and another? The code calls the first line a title, for example.
Some people implemented ';' as a generic comment field, to be ignored. (See that Talk page for a couple of examples; "read.fasta in the seqinR package and by the function readFASTA in the Biostrings package") Most others did not.
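To make the two readings concrete, here is a minimal sketch of a reader that follows the fasta20.doc wording quoted above: a first line starting with '>' or ';' is a title, and on any other line the text after ';' is dropped. The function name `parse_pearson_fasta` and the (title, sequence) return shape are my own assumptions for illustration; this is not code from the FASTA distribution or from any of the parsers mentioned.

```python
def parse_pearson_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-like text.

    Assumed rules, taken from the fasta20.doc excerpt:
      * the first non-blank line may start with '>' or ';' and is a title
      * any later line starting with '>' starts a new record
      * on every other line, text after ';' is ignored
    """
    records = []             # finished (title, sequence) pairs
    title = None             # title of the record being read, if any
    chunks = []              # sequence pieces of the record being read
    seen_first_line = False

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        is_first = not seen_first_line
        seen_first_line = True

        if line.startswith(">") or (is_first and line.startswith(";")):
            # Close the previous record (or untitled leading sequence).
            if title is not None or chunks:
                records.append((title or "", "".join(chunks)))
            title = line[1:].strip()
            chunks = []
            continue

        # Not a title line: drop the ';' comment, then remove whitespace.
        line = line.split(";", 1)[0]
        chunks.append("".join(line.split()))

    if title is not None or chunks:
        records.append((title or "", "".join(chunks)))
    return records


example = "; title line\nACDE ; ignored\n;whole line ignored\nFGH\n>second record\nKLM\n"
print(parse_pearson_fasta(example))
```

A parser from the other camp would simply not special-case ';' at all, so the first line here would be read as sequence data or rejected.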
Same here! I did a pretty extensive search about 8 years ago for real-world examples, and no one could point to any, outside of test code to make sure that a given tool could handle these mythical files should they appear.
Hah, I wouldn't worry. It was a hand-prepared sequence file, emailed to me, holding the sequence for a known structure. The comment line had some interesting PDB data encoded :)
After Pearson came NCBI. They describe a backwards-compatible subset of the original FASTA at http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml . It calls the '>' line a "description line (defline)". The NCBI toolkit further breaks up the record into "sequence, description, and identifiers" (see http://www.ncbi.nlm.nih.gov/books/NBK21097/ ).
BioPerl and Biopython, and I assume the other Bio* languages, follow NCBI's lead and use the same, or similar, names.
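As a rough illustration of the identifier/description naming, here is a sketch that splits a defline at the first run of whitespace. The helper name `split_defline` and the "first word is the identifier" rule are assumptions for the example; the NCBI toolkit parses identifiers in more detail, and the individual Bio* libraries may differ in what they store under each name.

```python
def split_defline(defline):
    """Split a '>' line into (identifier, description).

    Assumption: the identifier is the first whitespace-delimited word
    after '>', and the description is whatever follows it.
    """
    text = defline[1:] if defline.startswith(">") else defline
    parts = text.strip().split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


# Hypothetical defline, not taken from any real database.
print(split_defline(">seq1 hypothetical example protein"))
```

Under this naming there is no field called "comment" at all, which is why the word is free to mean the ';' text.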
I am a member of the NCBI FASTA camp, not the Pearson FASTA camp, so when I see the term "comment" I think it unambiguously refers to the text after a ';' in a Pearson file. I can see how someone from a different intellectual heritage would call the description a "comment", but as most of the world uses NCBI FASTA and not Pearson FASTA, I think it's a bit confusing to do so.