Much better memory usage and (for us at least) better concurrency. We could only run 4 Unicorn workers on a single dyno, but with Puma we run 16 threads with ease.
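Have you tried running 3 or even 4 Puma workers (each with 8-16 threads) on a dyno? That way you get concurrency on CPU-bound requests in addition to IO-bound concurrency (assuming MRI, where threads in a single process cannot run Ruby code in parallel, so only separate worker processes can use multiple cores).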
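
For reference, a setup like that might look roughly like the following in `config/puma.rb`. This is only a sketch: the `WEB_CONCURRENCY` and `RAILS_MAX_THREADS` variable names are a common convention rather than anything from this thread, and the defaults of 3 workers and 8-16 threads are just the numbers suggested above, not measured values.

```ruby
# config/puma.rb (illustrative sketch, not a tuned configuration)

# Number of forked worker processes. Each one is a separate MRI process,
# so CPU-bound requests can run in parallel across workers.
# WEB_CONCURRENCY is an assumed env var name; 3 is the value suggested above.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 3))

# Min and max threads per worker. Threads mostly help with IO-bound
# requests, since MRI runs Ruby code on one thread at a time per process.
# RAILS_MAX_THREADS is an assumed env var name.
max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 16))
threads 8, max_threads

# Load the app before forking so workers can share memory via copy-on-write.
preload_app!
```

Each worker is a full process, so memory use grows with the worker count; whether 3 or 4 fit on a dyno depends on the app. Each worker also needs its own database connection pool sized for its thread count.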