| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by eadmund 237 days ago

> splitting a line into words is a whole project on its own

Is it[1]? My version below accumulates alphabetical characters until it encounters a non-alphabetical one, then increments the count for the accumulated word and resets the accumulator.

    (let (c accumulator (counts (make-hash-table :test #'equal)))
      (handler-case
          (loop
            (setq c (read-char))
            (if (find c "ABCDEFGHIJKLMNOPQRSTUVWXYZ" :test #'char-equal)
                (push (char-downcase c) accumulator)
                (when accumulator
                  (incf (gethash (coerce (reverse accumulator) 'string) counts 0))
                  (setq accumulator nil))))
        (end-of-file ()
          (when accumulator
            (incf (gethash (coerce (reverse accumulator) 'string) counts 0))
            (setq accumulator nil))
          (maphash #'(lambda (word count)
                       (push (list count word) accumulator))
                   counts)
          (format t "~{~&~{~a ~a~}~%~}" (reverse (last (sort accumulator #'< :key #'car)
                                                       (let ((n (second sb-ext:*posix-argv*)))
                                                         (if n (parse-integer n) 100))))))))

It’s not exactly pretty or idiomatic, but its 19 lines appear to get the job done.

1: Well, technically it is, because there is SPLIT-SEQUENCE: https://github.com/sharplispers/split-sequence

1 comments

kragen 237 days ago

Hey, this is great! Thanks!

It does look a lot like what I was thinking would be necessary. About 9 of the 19 lines are concerned with splitting the input into words. Also, I think you have omitted the secondary key sort (alphabetical ascending), although that's only about one more line of code, something like

  #'(lambda (a b)
       (or (< (car a) (car b)) 
           (and (= (car a) (car b)) 
                (string> (cadr a) (cadr b)))))

Because the lines of code are longer, it's about 3× as much code as the verbose Perl version.

In SBCL on my phone it's consistently slower than Perl on my test file (the King James Bible), but only slightly: 2.11 seconds to Perl's 2.05–2.07. It's pretty surprising that they are so close.

link

eadmund 236 days ago

Doh, I missed the secondary sort.

Were I trying to optimise this, I would test to see if a hash table of alphabetical characters is better, or just checking (or (and (char>= c #\A) (char<= c #\Z)) (and (char>= c #\a) (char<= c #\z))). The accumulator would probably be better as an adjustable array with a fill pointer allocated once, filled with VECTOR-PUSH-EXTEND and reset each time. It might be better to use DO, initializing C and declaring its type.

Also worth giving it a shot with (optimize (speed 3) (safety 0)) just to see if it makes a difference.

Yes, definitely more verbose. Perl is good at this sort of task!

link