Hacker News new | ask | show | jobs
by jffry 721 days ago
I think we've done something similar in our kernels, because I've likewise struggled to squeeze more than ~18GH/sec from a rented 4090. I think the design of SHA256 makes it hard to avoid many of the loop iterations. It's possible there are some fun GPU specific instructions to parallelize some of the adds or memory accesses to make the kernel go faster.

If you limit your variable portion to a base16 alphabet like A-P, radix encoding would just be `nibble + 0x41`. Sure you're encoding 4x as many values, but with full branch predictability. You'd have to experimentally determine if that performs any better.

You could also do something theoretically clever with bitmasking the 4 nibbles then adding a constant like 0x4141 or 0x00410041 or something (I'm too tired to think this through) and then you can encode twice as many per op assuming 32-bit bitwise ops are similar in speed to 16-bit bitwise ops.

Anyways this has been a cool challenge. You might also enjoy hunting for amulets - short poems whose sha256 hash contains a substring that's all 8: https://news.ycombinator.com/item?id=26960729

1 comments

SHA256 is designed as such that the maximum amount of data that can be contained within a single block is 440 bits (55 bytes.)

If you carefully organize the nonce at the end and use all 55 bytes, you can pre-hash the first ~20/64 rounds of state and the first several rounds of W generation and just base further iterations off of that static value (this is known as a "midstate optimization.")

> If you limit your variable portion to a base16 alphabet like A-P

The more nonce bits you decide to use, the less you can statically pre-hash.

In FPGA, I am using 64 deep, 8-bit-wide memories to do the alphabet expansion. I am guessing in CUDA you could something similar with `LOP3.LUT`.