Hacker News
by bo1024 4947 days ago
I wonder if the implementation of shuf would handle very large input efficiently? Reservoir sampling wouldn't need to keep the whole input in memory, which could be an advantage. But I don't know how shuf works.
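For reference, here is a minimal sketch of reservoir sampling (the classic "Algorithm R"), to show why it only needs memory proportional to the sample size rather than the input size. This is not how shuf or dimsum is actually implemented; the function name `reservoir_sample` and its parameters are made up for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Pick k items uniformly at random from a stream of unknown length.

    Hypothetical helper for illustration only. At most k items are held
    in memory at any time, no matter how long the input is.
    """
    if rng is None:
        rng = random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i+1 replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Something like `reservoir_sample(sys.stdin, 1)` (after `import sys`) would be a rough analogue of `shuf -n 1`. Note that it still has to read to the end of the input before it can answer, so it would never finish on `yes` either, but its memory use would stay flat.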

Doesn't look like it. I just tried running "yes | shuf -n 1" (using the latest version of GNU coreutils, 8.20) and its memory consumption increased steadily until I killed it.

It seems like this would be a really useful improvement, and I'm surprised that it doesn't already seem to have been requested on the coreutils issue tracker.

Did you try "yes | dimsum -n 1"?

In my hands, `top` shows resident memory increasing steadily too...

It is perhaps more instructive to compare the output from, for example,

seq 1 1000000 | valgrind --time-unit=B --pages-as-heap=yes --trace-children=yes --tool=massif --massif-out-file=massif.dimsum.100000.out.%p dimsum -n 1

with

seq 1 1000000 | valgrind --time-unit=B --pages-as-heap=yes --trace-children=yes --tool=massif --massif-out-file=massif.shuf.100000.out.%p shuf -n 1

In my hands, shuf is faster and uses less memory for this task.

How about you?

Sigh, memory leak. It's fixed on GitHub. When camilo is around I'll get him to update the gem.
Thanks, looking forward to the patch.
Try a `gem update`. Memory performance should be much better now, but I'm still curious about speed.
Agreed, it would be.

So, let's pursue this: http://lists.gnu.org/archive/html/coreutils/2012-11/msg00079...