| HN Mirror



	by bo1024 4995 days ago
	I wonder if the implementation of shuf would handle very large input efficiently? Reservoir sampling wouldn't need to keep the whole input in memory, which could be an advantage. But I don't know how shuf works.
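To make the memory argument concrete, here is a minimal sketch of reservoir sampling (the classic "Algorithm R"). It is not taken from shuf or dimsum; the function name `reservoir_sample` and its parameters are made up for illustration. The point is only that it holds at most `k` items at any time, however long the input is.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    items: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Pick k items uniformly at random from a stream of unknown length.

    Hypothetical helper for illustration; memory use is O(k), not O(n).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Calling `reservoir_sample(sys.stdin, 1)` would roughly play the role of `shuf -n 1`. It still has to read the whole input, so it never finishes on an endless stream like `yes`, but its memory use stays flat.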


teraflop 4995 days ago

Doesn't look like it. I just tried running "yes | shuf -n 1" (using the latest version of GNU coreutils, 8.20) and its memory consumption increased steadily until I killed it.

It seems like this would be a really useful improvement, and I'm surprised that it doesn't already seem to have been requested on the coreutils issue tracker.


malcook 4994 days ago

Did you try "yes | dimsum -n 1"?

When I run it, `top` shows resident memory increasing steadily there too.

It is perhaps more instructive to compare output from, for example

seq 1 1000000 | valgrind --time-unit=B --pages-as-heap=yes --trace-children=yes --tool=massif --massif-out-file=massif.dimsum.100000.out.%p dimsum -n 1

with

seq 1 1000000 | valgrind --time-unit=B --pages-as-heap=yes --trace-children=yes --tool=massif --massif-out-file=massif.shuf.100000.out.%p shuf -n 1

In my tests, shuf is faster and uses less memory than dimsum for this task.

How about you?


snoble 4994 days ago

Sigh, a memory leak. It's fixed on GitHub. When camilo is around I'll get him to update the gem.


malcook 4994 days ago

Thanks, looking forward to the patch.


snoble 4994 days ago

Try a `gem update`. Memory performance should be much better now, but I'm still curious about speed.


malcook 4994 days ago

Agreed it would be.

So, let's pursue this: http://lists.gnu.org/archive/html/coreutils/2012-11/msg00079...
