by burntsushi
3332 days ago
> Peeling off the front 8 or 16 elements should be no harder than peeling off the front 1, though naively you might have to handle each partial case from 1 to 15.

I think I under-specified my requirement. :-) By "16 bytes at a time," I mean, "run a single CPU instruction on those 16 bytes." But yeah, I get your drift. I can see how it might be theoretically possible. I suppose the key gains might be in how much confidence a programmer can have that their code compiles down to the right set of instructions.
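
To make "16 bytes at a time" concrete, here is a rough sketch of a byte search that spells out the instructions instead of hoping the optimizer finds them. Everything named here (`find_byte`, `find_in_block`) is hypothetical and not taken from any particular library. It assumes x86_64, where SSE2 is part of the baseline, and falls back to a scalar loop elsewhere. Strictly speaking it is a load, a compare and a movemask per block, not one instruction, and it sidesteps peeling the front by using unaligned loads, so only the 0 to 15 leftover bytes at the end need the scalar path.

```rust
/// Hypothetical helper: search one 16-byte block with explicit SSE2 intrinsics.
#[cfg(target_arch = "x86_64")]
fn find_in_block(block: &[u8], needle: u8) -> Option<usize> {
    use std::arch::x86_64::{
        __m128i, _mm_cmpeq_epi8, _mm_loadu_si128, _mm_movemask_epi8, _mm_set1_epi8,
    };
    // The unaligned load below reads exactly 16 bytes, so insist on that length.
    assert_eq!(block.len(), 16);
    // SAFETY: SSE2 is always available on x86_64, `block` holds 16 readable
    // bytes, and `_mm_loadu_si128` has no alignment requirement.
    let mask = unsafe {
        let data = _mm_loadu_si128(block.as_ptr() as *const __m128i);
        // Compare all 16 lanes against the needle at once.
        let eq = _mm_cmpeq_epi8(data, _mm_set1_epi8(needle as i8));
        // Collapse the 16 lane results into the low 16 bits of an integer.
        _mm_movemask_epi8(eq)
    };
    if mask == 0 {
        None
    } else {
        // Bit i corresponds to byte i, so the lowest set bit is the first match.
        Some(mask.trailing_zeros() as usize)
    }
}

/// Scalar stand-in for targets without the SSE2 path above.
#[cfg(not(target_arch = "x86_64"))]
fn find_in_block(block: &[u8], needle: u8) -> Option<usize> {
    block.iter().position(|&b| b == needle)
}

/// Hypothetical entry point: index of the first `needle` in `haystack`.
pub fn find_byte(haystack: &[u8], needle: u8) -> Option<usize> {
    let mut blocks = haystack.chunks_exact(16);
    let mut offset = 0;
    // Full 16-byte blocks take the vector path.
    for block in blocks.by_ref() {
        if let Some(i) = find_in_block(block, needle) {
            return Some(offset + i);
        }
        offset += 16;
    }
    // The 0..=15 leftover bytes are handled one at a time.
    blocks
        .remainder()
        .iter()
        .position(|&b| b == needle)
        .map(|i| offset + i)
}
```

Calling `find_byte(b"hello", b'l')` takes only the scalar tail path, since the input is shorter than 16 bytes. Writing the intrinsics by hand is one way to get the confidence mentioned above, at the cost of per-architecture code and an `unsafe` block.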
In the long term I hope for a "nice" language (in particular one in which checked correctness is easy) with explicit performance semantics along the lines of http://www.cs.cmu.edu/~rwh/papers/iolambda/short.pdf . But yeah under the current state of the art there is no ideal approach, so there's a tradeoff in practice even though I don't think there should be one in theory.