by tstactplsignore 4131 days ago
So, ungapped alignment is just substring searching with degeneracy, which is fine. While that is its own computational problem and may even have the occasional biological application, sequence alignment is defined as a gapped alignment problem (see the Wikipedia page, for example [1], which defines gap insertion as a critical step; all alignment variants on the page are gapped alignments).

I am not sure you understand what gapped alignment is: it is not the alignment of a sequence with known gaps, but an algorithm which determines the best placement of gaps in a query sequence to obtain the highest matching score. This is a very different problem from the one you just described, and is essentially "the hard part" of sequence alignment.

[1] http://en.m.wikipedia.org/wiki/Sequence_alignment
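
To make "the best placement of gaps" concrete, here is a minimal sketch of gapped global alignment using Needleman-Wunsch dynamic programming. The function name, the scoring values (match=1, mismatch=-1, gap=-2), and the linear gap penalty are assumptions chosen for illustration; real tools use substitution matrices and affine gap costs.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap in b
                score[i][j - 1] + gap,      # b[j-1] against a gap in a
            )

    # Walk back from the corner to recover where the gaps were placed.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        both = i > 0 and j > 0
        sub = match if both and a[i - 1] == b[j - 1] else mismatch
        if both and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(score)
print(top)
print(bottom)
```

With these assumed scores, aligning "GATTACA" against "GATACA" gives a score of 4: six matches and one gap that the algorithm has to place itself. The table costs time and memory proportional to the product of the two lengths, which is why this is a different problem from a linear scan.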


See here: https://news.ycombinator.com/item?id=9096064

Turns out they aren't actually after an alignment per se, but rather trying to match a DNA binding motif.

(You're right though about this not being an alignment, rather just a substring search with degeneracy).
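
For reference, "substring search with degeneracy" can be sketched in a few lines. This is a naive illustration, not the method from the article: the function name is hypothetical, and it assumes only the pattern contains IUPAC degenerate codes while the text contains plain A/C/G/T.

```python
# IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def find_degenerate(pattern, text):
    """Return every offset where pattern matches text with no gaps."""
    hits = []
    for i in range(len(text) - len(pattern) + 1):
        # Every pattern position must permit the base found in the text.
        if all(text[i + j] in IUPAC[p] for j, p in enumerate(pattern)):
            hits.append(i)
    return hits


print(find_degenerate("GRN", "AGATTGGC"))  # prints [1, 5]
```

No position is ever skipped or inserted, so the pattern either lines up at an offset or it doesn't; that is the whole difference from the gapped case.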

Hmm, while I understand the problem of gapping is traditionally the hard part, I'm under the impression that the argument you're putting forward is primarily one of semantics.

"In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences." (From the article you linked.)

Gapped sequence alignment is certainly far more robust than ungapped (and more biologically relevant when comparing sequences across organisms, since insertions are a common error), and it is a much harder problem. But as for the definition of "alignment" itself, I don't believe I've misnamed anything here.

If we're going to be overly pedantic about the use of the word "alignment" that's fine, but I'm not sure it's a worthwhile debate to have. A quick search for "ungapped sequence alignment" returns a great many results on Google [1]. So if I am mistaken, I'm certainly not the first (nor do I believe I'll be the last).

Furthermore, there's nothing preventing anybody from using the methods described here to implement an ungapped sequence alignment tool that outperforms tools that only use string comparisons. :)

[1] https://www.google.com/search?q=ungapped%20sequence%20alignm....

You shouldn't dismiss this as an argument over semantics if your understanding of the term differs from researchers' use of the term. If you introduce "a fast tool for XYZ" and researchers understand XYZ to mean A, where you understand it to mean B, then the tool is not useful for researchers to perform what they know as XYZ.

Tools like BLAST are extremely sophisticated and have been under development for decades, and I'm fairly confident they've moved past naive string comparisons by now.

Fair. Though I'm not convinced "ungapped sequence alignment" is particularly confusing to a researcher, considering there are tools and papers that have existed for decades using this description [1][2][3]. Though the algorithm described in my article is extremely focused on raw performance (and relatively naive with scoring), I would still choose to categorize it as primarily a tool that deals with ungapped sequence alignment, specifically supporting IUPAC degenerate nucleotide sequences. Thus, I believe the initial argument is, indeed, overly pedantic.

And to be clear, nowhere am I comparing what I've developed to BLAST. (They have very different applications.)

[1] http://schneider.ncifcrf.gov/paper/malign/

[2] http://www.ncbi.nlm.nih.gov/pubmed/9697204

[3] http://www.ncbi.nlm.nih.gov/pubmed/15130540