I didn't post it because it's quite big (150M) but readily available from the NCBI Virus portal [1]. I would love to see how well other languages compete both for speed and simplicity.
I couldn't get your 150M file, so I used one of the smaller files I could get by clicking on the first set shown in the table (the FASTA file was only 30KB) and duplicated it until it was around 150MB.
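For anyone who wants to reproduce the "duplicate it until it's around 150 MB" trick, here is a rough Python sketch. The function name `inflate_file` and its parameters are made up for illustration, and this is not necessarily how the file above was built.

```python
from pathlib import Path


def inflate_file(src: str, dst: str, target_bytes: int) -> int:
    """Write whole copies of src into dst until dst holds at least target_bytes.

    Returns the number of bytes written.
    """
    data = Path(src).read_bytes()
    if not data:
        raise ValueError("source file is empty")
    if not data.endswith(b"\n"):
        data += b"\n"  # keep the last sequence line from fusing with the next header
    written = 0
    with open(dst, "wb") as out:
        while written < target_bytes:
            out.write(data)
            written += len(data)
    return written
```

Because only whole copies are written, the GC fraction of the big file is the same as that of the small one, so the printed result doubles as a sanity check.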
So, almost as fast as Nim (the time includes compilation time)?
Here's the Common Lisp code:
(with-open-file (in "nc_045512.2.fasta")
  (loop with gc = 0
        with total = 0
        for line = (read-line in nil)
        while line
        do (unless (or (zerop (length line))      ; skip blank lines
                       (eql (char line 0) #\>))   ; skip FASTA header lines
             (loop for ch across line
                   do (incf total)
                      (when (or (eql ch #\C) (eql ch #\G))
                        (incf gc))))
        finally (format t "~f~%" (/ gc total))))
With a top-level function and some type declarations it could run even faster, I think.
EDIT: Compiling the Lisp code to a FASL and annotating the types brings the total runtime down to 2.0 seconds. Running it from source is only slightly slower, at 2.08 seconds, which shows how fast the SBCL compiler is. Taking 0.7 seconds to compile a few lines of code is crazy; imagine how that grows when your project reaches many thousands of lines.
Excluding compile time, the Lisp code still can't match the speed of Nim, which is really C at runtime. But if you need a scripting language, CL is great (especially when used with the REPL and SLIME).
@brabel - The Nim compiler actually builds a relatively large `system` package every time. (They are also working on speeding up compiles.) So, compile time does not scale as badly as you think. E.g., you might have to grow the "user level" source code by 50..100x to double the compile time.
Also, @benjamin-lee this version of the Nim program is a bit lower level, but probably much faster:
import memfiles as mf
var gc = 0
var total = 0
var f = mf.open("orthocoronavirinae.fasta")
for line in memSlices(f):
  let n = line.size
  let cs = cast[cstring](line.data)
  if n > 0 and cs[0] == '>': # ignore comment lines
    continue
  for i in 0 ..< n:
    let letter = cs[i]
    if letter == 'C' or letter == 'G':
      gc += 1
    total += 1
echo(gc.float / total.float)
mf.close(f) # not really needed; process about to end
Compile with -d:danger and so on, of course. { On a small 30 KB test file I got about a 1.7x speed-up over the version in the blog post. I also could not find the 150 MB file. Multiplying up the tiny 30 KB file like @brabel did, I got only a 1.25x speed-up, down to 0.5 seconds. So it might not be worth the low-levelness, but a real file might tilt more towards the 1.7x end. }
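For cross-checking the number any of these programs print, the same byte-level loop can be sketched in Python. This is not the blog post's `run.py` (which isn't shown in this thread); `gc_fraction` is a hypothetical name, and the sketch assumes an uppercase, line-oriented FASTA file like the ones above.

```python
def gc_fraction(path: str) -> float:
    """Fraction of sequence characters that are C or G, skipping '>' header lines."""
    gc = 0
    total = 0
    with open(path, "rb") as f:  # read bytes to avoid text-decoding overhead
        for line in f:
            if line.startswith(b">"):
                continue  # FASTA header / comment line
            seq = line.rstrip(b"\r\n")  # line endings are not sequence data
            gc += seq.count(b"C") + seq.count(b"G")
            total += len(seq)
    return gc / total if total else 0.0
```

Like the Nim and Lisp versions, it counts every non-header character in the denominator, including any ambiguity codes such as N.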
I'm sorry, I completely forgot that the file I used was from six months ago when I wrote the blog post (and then promptly forgot to publish it). In the last half year, the number of coronavirus sequences has increased dramatically. One thing that you could do to drop the file size down is to filter for only complete and unambiguous sequences, which drops the number down from 1.6 million to ~100k [1].
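If you already have a download and would rather filter locally, the "unambiguous" half of that filter can be approximated with a short Python sketch. The assumptions here are mine: "unambiguous" is taken to mean the sequence contains only A, C, G and T, and the helper names `read_fasta` and `unambiguous_only` are hypothetical. "Complete" is metadata on the portal side and can't be recovered from the sequence alone, so it is not handled.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unambiguous_only(records: Iterable[Record]) -> Iterator[Record]:
    """Keep records whose sequence uses only A, C, G, T (case-insensitive)."""
    allowed = set("ACGT")
    for header, seq in records:
        if seq and set(seq.upper()) <= allowed:
            yield header, seq
```

Passing an open text file to `read_fasta` and wrapping the result in `unambiguous_only` streams the records, so only one sequence at a time is held in memory.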
Alternatively, the exact file I used for the post is available for one week here with MD5 sum 3c33c3c4c2610f650c779291668450c9 [2]. Anyone who wants the file is free to reach out to me directly (email is on site).
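To check a download against that MD5 sum, something like the following Python sketch should do. The function name `md5_of_file` is made up for illustration; the checksum is the one quoted above.

```python
import hashlib

EXPECTED_MD5 = "3c33c3c4c2610f650c779291668450c9"  # sum quoted in the comment above


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Comparing `md5_of_file(path)` to `EXPECTED_MD5` tells you whether you have the same bytes as the file used in the post. Reading in chunks keeps memory use flat even for a file of this size.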
Here's a comparison with Common Lisp:
~/fasta-dna $ time python3 run.py
0.3797277865097147
21.828 secs
~/fasta-dna $ time sbcl --script run.lisp
0.37972778
2.415 secs
~/fasta-dna $ ls -al nc_045512.2.fasta
-rw-r--r-- 1 156095639 2021-09-25 11:15 nc_045512.2.fasta