Hacker News new | ask | show | jobs
by nullc 4946 days ago
uhhh. You mean like shuf -n NNNN ?
2 comments

I wonder if the implementation of shuf would handle very large input efficiently? Reservoir sampling wouldn't need to keep the whole input in memory, which could be an advantage. But I don't know how shuf works.
Doesn't look like it. I just tried running "yes | shuf -n 1" (using the latest version of GNU coreutils, 8.20) and its memory consumption increased steadily until I killed it.

It seems like this would be a really useful improvement, and I'm surprised that it doesn't already seem to have been requested on the coreutils issue tracker.

did you try "yes | dimsum -n 1"?

In my hands, `top` shows resident memory increasing steadily too....

It is perhaps more instructive to compare output from, for example

seq 1 1000000 | valgrind --time-unit=B --pages-as-heap=yes --trace-children=yes --tool=massif --massif-out-file=massif.dimsum.100000.out.%p dimsum -n 1

with

seq 1 1000000 | valgrind --time-unit=B --pages-as-heap=yes --trace-children=yes --tool=massif --massif-out-file=massif.shuf.100000.out.%p shuf -n 1

in my hands, shuf is faster and uses less memory for this task.

How about you?

sigh, memory leak. It's fixed in github. When camilo is around I'll get him to update the gem
thanks - looking forward to the patch
try a `gem update`. Memory performance should be much better now but I'm still curious about speed
Agreed it would be.

So, let's pursue this: http://lists.gnu.org/archive/html/coreutils/2012-11/msg00079...

or "sort -R"...
sort -R isn't actually a random permutation or shuffle like one would ordinarily expect from a 'random' output: if you read the man page very carefully, you'll notice it only speaks of 'random hash' which means that 2 lines which are identical (quite possible and common) will be hashed to the same value and placed consecutively in the output.

(This bit me once. Filed a bug about maybe clarifying the man page, but nope, apparently we're all supposed to recognize instantly the implication of a random hash.)