Hacker News new | ask | show | jobs
by test9753 2445 days ago
Beating Decades of Optimized C with 27 Lines of Ocaml:

  type t = { mutable words: int; mutable chars: int; mutable lines: int; mutable in_word: bool ; mutable finished: bool};;

  let () =
    match Core.Sys.argv with
    | [| prog_name |] -> Core.Printf.eprintf "Usage: %s file1 file2 ...\n" prog_name
    | _ -> (
      let args = Core.Array.slice Core.Sys.argv 1 @@ Core.Array.length Core.Sys.argv in
      let buf_size = 65536 in (* 64 KB -> Caml IO buffer size *)
      let buf = Core.Bytes.create buf_size in
      Core.Array.fold args ~init:() ~f:(fun _ file ->
        Core.In_channel.with_file file ~f:(fun in_ch ->
          let c = { words = 0; chars = 0; lines = 0; in_word = false; finished = false } in
          let set_words () = if c.in_word then c.words <- c.words + 1 in
          while not c.finished do
            let len = Core.In_channel.input in_ch ~buf ~pos:0 ~len:buf_size |> Core.Int.to_int in
            if len > 0 then (
              for i = 0 to (len - 1) do
                match (Core.Caml.Bytes.get buf i) with
                | ' ' | '\t' | '\r' -> (c.chars <- c.chars + 1; set_words (); c.in_word <- false)
                | '\n'              -> (c.chars <- c.chars + 1; set_words (); c.in_word <- false; c.lines <- c.lines + 1)
                | _                 -> (c.chars <- c.chars + 1; c.in_word <- true)
              done
            ) else ( c.finished <- true )
          done;
          set_words ();
          Core.Printf.printf "%s -> lines: %d, words: %d, characters: %d\n" file c.lines c.words c.chars)))
  ;;
Testing:

  $ ls -lh test_file.txt 
  -rw-r--r--  1 user  group   508M Oct 16 12:01 test_file.txt

  $ time wc test_file.txt 
  4863460 54621760 532480000 test_file.txt

  real 0m3.368s
  user 0m3.137s
  sys  0m0.195s

  #Ocaml version:
  $ time ./wcl.native test_file.txt 
  test_file.txt -> lines: 4863460, words: 54621760, characters: 532480000

  real 0m2.480s
  user 0m2.345s
  sys  0m0.123s
7 comments

Oh this is fun! Beating Decades of Optimised C (best of three)

    $ time wc w.txt
     3156098 6312196 380648004 w.txt
    
    real 0m1.358s
    user 0m1.277s
    sys 0m0.066s
... with one line of q (best of three):

    q)\t g:{(sum[1<deltas where x]+sum[not x:max x],0),sum max x 0 1}0x0d0a2009=\:read1 `:w.txt
    783
    q)g
    380648004i
    6312196
    3156098i
783msec: almost twice as fast!
Won't this give an incorrect number of bytes for several consecutive whitespace characters? Why not using # to count the bytes instead?
in k, but with that same bug';

(b:#a;b++/,/a=/:" ";b+#,/a)

Here is one without the bug;

b:#a;c:,/a=/:" ";(b;b++/c@&~c&=':c;b+#,/a)

Now to see if I can check the performance.

We discussed this at length in the shakti mailing list [1]. This version by chrispsn is my favorite:

    {+/["\n"=x],+/[<':~x in "\n\r\t "],#x}
I think it's the one that more clearly express the intent. However, this other one by Attila Vrabecz performs better in the current version:

    {(+/x in"\n\n";+/0>':x in" \n\t\r";#x)}@1:"big.txt"
[1] https://groups.google.com/forum/?utm_medium=email&utm_source...
Nice, thanks for that!
That is quite beautiful. It may also have awoken Cthulhu, but that's a separate issue. :)
The system wc is multibyte friendly, the Ocaml version is not.
While I wouldn't be surpised that OCaml is fast, I'd like to be sure your disk cache was warm in both cases. I'd test it myself, but I don't have OCaml setup on my work computer.
I setup it on my computer, used it to count the lines of 250 files, it is indeed faster than my system's (ubuntu 18.04) wc, but it doesn't compute the total. I don't think this should change the timings much though.

A best of five run of the system's wc takes 0.042s, vs 0.028s for the OCaml version.

Nice.

I was going to say that it is unreadable, after spending a bit trying to understand it, (for someone that knows no ocaml) it actually is not. Only things I'm unsure are the ~ symbol and <-. What do they do?

edit: <- is just mutating assignment. I thought that ~ introduced named arguments but doesn't seem right.

~ does indeed denote named arguments. To call a function named f which has an argument named arg with arg set to the value x, the syntax is:

    f ~arg:x
But what about ~buf here:

  let len = Core.In_channel.input in_ch ~buf ~pos:0 ~len:buf_size
~buf is shorthand for ~buf:buf
Oh, nice.
A naive C version is also faster than the system wc.
Why was system wc implemented in a slower way than a naive C version?
Could somebody write this in Clojure? I'm too noob to do it myself. Will it be an order of magnitude slower?
Many languages will have counted a 1gb file before Clojure has even started up.
> Many languages will have counted a 1gb file before Clojure has even started up.

You mean before the JVM has started up. One could compile Clojure with GraalVM to eliminate startup time.

Not really, the initialization of the JVM is fast(er), but initializing the Clojure runtime took a lot of time. At least in 1.7, not sure if it got better in the meantime.

I could start and run Java programs way faster than just starting the Clojure REPL.