by Ultimatt
3417 days ago
Unfortunately, the regex/grammar engine is one of those components that still lacks deep optimisation, at least at the moment. That's probably a much bigger factor than anything algorithmic. Another nice thing to note about tokens is longest token matching and the concept of "proto" tokens and regexes. This lets you choose between similarly defined tokens without backtracking. For example, the grammar I have for biological sequences can simultaneously identify and parse DNA/RNA/Protein without backtracking. Even if a file has a mixture of data, I can instantiate the correct subclasses on the fly whilst parsing! https://github.com/MattOates/BioInfo/blob/master/lib/BioInfo...
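To make the longest-token idea concrete, here is a rough Python sketch of the decision rule only. It is not the BioInfo grammar and not how the Perl 6 engine works internally: the class names, alphabets, and the `classify` / `parse_records` helpers are all made up for illustration, and this version simply tries every candidate pattern in turn rather than deciding in a single pass the way a real longest-token matcher can.

```python
import re
from dataclasses import dataclass


@dataclass
class Sequence:
    residues: str


class DNA(Sequence):
    pass


class RNA(Sequence):
    pass


class Protein(Sequence):
    pass


# Hypothetical alphabets, listed in "declaration order".
# An earlier entry wins when two candidates match the same length.
CANDIDATES = [
    (DNA, re.compile(r"[ACGTN]+")),
    (RNA, re.compile(r"[ACGUN]+")),
    (Protein, re.compile(r"[A-Z*]+")),
]


def classify(line: str) -> Sequence:
    """Pick the candidate whose match at position 0 is longest."""
    best_cls, best_len = None, 0
    for cls, pattern in CANDIDATES:
        match = pattern.match(line)
        # Strict '>' keeps the earlier candidate on a tie.
        if match and match.end() > best_len:
            best_cls, best_len = cls, match.end()
    if best_cls is None or best_len != len(line):
        raise ValueError(f"unrecognised sequence: {line!r}")
    return best_cls(line)


def parse_records(lines):
    """Classify each non-empty line of a mixed file independently."""
    return [classify(line.strip()) for line in lines if line.strip()]


if __name__ == "__main__":
    for record in parse_records(["ACGTTGCA", "ACGUUGCA", "MKVLAAGIV"]):
        print(type(record).__name__, record.residues)
```

Each line ends up as an instance of whichever subclass matched the most characters, so a mixed file yields a mixed list of objects. The tie-break is a simplification: a protein made only of A, C, G and T would be labelled DNA here.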
And yeah, proto regexes are pretty sweet, and they seem to be a natural fit for what you're doing. I'm always surprised by how popular Perl seems to be in biology / life sciences, and projects like BioInfo (and BioPerl, of course) are a great reminder of why that is.