| HN Mirror

by xashor 2124 days ago

In J (which might be slower than K) with excessive comments for a one-liner:

    echo@> (2{ARGV) -.&([: <;._1 LF, 1!:1) (3{ARGV)
    NB.    2nd arg                          3rd arg
    NB.              g&f execute f for each, then g on both
    NB.                              1!:1 read file
    NB.                          LF, prepend newline
    NB.                 [: <;._1 split based on first char
    NB.             -.  remove right elements from left array
    NB. echo@> echo each line
    exit 0

On two ~1.6MB files with ~15k lines each (identical except for 3 lines) that I had lying around:

    $ time j9 -c ./pseudo_grep.ijs test_b test_a
    …
    real   0m0.064s
    user   0m0.032s
    sys    0m0.017s
    $ time grep -vf test_b test_a
    …
    real   0m5.815s
    user   0m5.234s
    sys    0m0.576s

Note that most of the script is for loading each file into an array of lines. Most of the work is done by -. on the two arrays, which is exactly what you asked for, e.g. 0 1 2 3 4 -. 2 4 is 0 1 3. https://code.jsoftware.com/wiki/Vocabulary/minusdot#dyadic
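
For readers who don't know J, here is a rough Python sketch of what dyadic -. does on two lists. The function name `less` is made up for illustration; the point is that, unlike a plain set difference, the order and duplicates of the left argument are kept.

```python
def less(left, right):
    """Return the items of `left` that do not occur in `right`.

    Order and duplicates of `left` are preserved, which mirrors J's dyadic -.
    (hypothetical helper name, not part of any library).
    """
    exclude = set(right)  # hash lookups keep this roughly linear overall
    return [item for item in left if item not in exclude]


print(less([0, 1, 2, 3, 4], [2, 4]))   # [0, 1, 3]
print(less(["a", "b", "a"], ["b"]))    # ['a', 'a']
```

Items must be hashable for the `set` shortcut to work, which is fine for lines of text.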

DylanDmitri 2124 days ago

In loopless Python:

    set(open('file_b')) - set(open('file_a'))

Slower than J by a factor of 2-3, but still 10x faster than grep:

    real    0m0.128s
    user    0m0.078s
    sys     0m0.063s

This would make a good Rosetta Code prompt.

throwaway_pdp09 2123 days ago

I didn't know you could simply open a file and setify it. Interesting. & neat.

fennecfoxen 2123 days ago

You can setify any iterable. File handles are iterables that return a line at a time. Tada!
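
A small sketch of this, using `io.StringIO` as a stand-in for a real `open(...)` handle (an assumption made so the example runs without files on disk). One thing to watch: each line keeps its trailing newline, so a last line without one will not compare equal to the same text elsewhere.

```python
import io

# io.StringIO stands in for open('some_file'); both yield one line per iteration.
handle = io.StringIO("alpha\nbeta\nalpha\nbeta")

raw = set(handle)  # newlines are kept, so "beta\n" and "beta" are different items
handle.seek(0)     # rewind before iterating a second time
stripped = {line.rstrip("\n") for line in handle}

print(sorted(raw))       # ['alpha\n', 'beta', 'beta\n']
print(sorted(stripped))  # ['alpha', 'beta']
```

Stripping the newline first is probably the safer default when the two files may differ in whether they end with a newline.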

throwaway_pdp09 2123 days ago

> File handles are iterables...

I did not know that. Assumed you had to somehow wrap them first. Very useful, thanks!

qmmmur 2124 days ago

this is cheeky, I like it

dunefox 2123 days ago

This is only fast because it hits C underneath, isn't it?

kbenson 2124 days ago

That grep is not doing the same thing as the code, nor necessarily what the exercise requires. By default, grep treats each line of the pattern file as a regular expression, so it's turning all those entries into individual regular expressions. You want to use fgrep, or the -F flag, to make it treat all the patterns as fixed strings.

In my simple test, that resulted in grep running in 44% of the time it previously required (still slower than Python, though).
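
To make the difference concrete, here is a sketch in Python of the three matching behaviours involved. The function names are hypothetical, and Python's `re` syntax is not identical to grep's default regex dialect, so treat this as an approximation of the semantics rather than a reimplementation of grep. The whole-line variant corresponds to adding -x to -F, and it is the one that matches the set-difference versions above.

```python
import re

patterns = ["a.c", "foo"]
lines = ["abc", "a.c", "xfoox", "bar"]


def keep_regex(lines, patterns):
    # Roughly `grep -v -f`: drop a line if any pattern matches it as a regex.
    compiled = [re.compile(p) for p in patterns]
    return [ln for ln in lines if not any(c.search(ln) for c in compiled)]


def keep_fixed(lines, patterns):
    # Roughly `grep -v -F -f`: drop a line if any pattern occurs as a substring.
    return [ln for ln in lines if not any(p in ln for p in patterns)]


def keep_exact(lines, patterns):
    # Roughly `grep -v -F -x -f`: drop a line only if it equals a pattern.
    exclude = set(patterns)
    return [ln for ln in lines if ln not in exclude]


print(keep_regex(lines, patterns))  # ['bar']
print(keep_fixed(lines, patterns))  # ['abc', 'bar']
print(keep_exact(lines, patterns))  # ['abc', 'xfoox', 'bar']
```

Only the last variant can use a single hash lookup per line; the other two test every pattern against every line, which is one plausible reason for the large timing gap.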

1vuio0pswjnm7 2124 days ago

Apologies for the careless omission. I tested the difference on a larger job; with grep 28s, with fgrep 22s.

kbenson 2123 days ago

I think there's probably a sweet spot in how large the files are compared to the method used, because eventually disk access may dominate the running time. Putting the files on a RAM disk (/dev/shm on some distros) would help.

I tested with files just over 2 MB on a small Digital Ocean VM. Depending on disk speed, and based on the running time, I suspect you ran on files at least an order of magnitude larger. How long did Python take on those? Seeing memory usage from time might be illuminating for these tasks too. Using 4x the on-disk size in memory is fine for a couple-MB file, but less so for a couple-GB file (in which case creating a Bloom filter or trie might be better, but I really have no idea if Python's set functions do that already).
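
As far as CPython goes, `set` is an ordinary hash table, not a Bloom filter or trie, so every distinct line is held in memory. One cheap way to roughly halve that, sketched below under the assumption that only the exclusion file needs to fit in RAM, is to load that file into a set and stream the other one. The name `lines_not_in` is made up for this example, and `io.StringIO` again stands in for real file handles.

```python
import io


def lines_not_in(keep_from, exclude_from):
    """Yield lines of `keep_from` that do not appear in `exclude_from`.

    Only `exclude_from` is materialised in memory; `keep_from` is read
    one line at a time. Output order and duplicates follow `keep_from`.
    """
    exclude = set(exclude_from)
    for line in keep_from:
        if line not in exclude:
            yield line


file_a = io.StringIO("x\ny\nz\n")  # stand-in for open('file_a')
file_b = io.StringIO("y\n")        # stand-in for open('file_b')

for line in lines_not_in(file_a, file_b):
    print(line, end="")
```

This does not help if the exclusion file is itself several GB; at that point sorting both files on disk and merging them is probably a better fit than anything held in memory.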
