Show HN: Speed up model inference on CPU with hand crafted layer implementations

Is the main idea to convert the model implementation from Python into C and then hardcode all possible values? Do you do this yourself in the generator code, or could you let the C preprocessor/compiler handle it with macros? (That might help with compile time and memory use.)
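To make the macro question concrete, here is a rough sketch of what I have in mind. It is not taken from the project: the layer shape, the weights, and the names `IN_DIM`, `OUT_DIM`, `W`, `B` and `dense_relu` are all made up for illustration. The generator would only emit the constants, and the loop bounds are compile-time constants the compiler can see.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layer shape, fixed at compile time. A generator could emit
   just these definitions instead of a fully unrolled function body. */
#define IN_DIM 3
#define OUT_DIM 2

/* Hypothetical weights and biases, baked in as constants. */
static const float W[OUT_DIM][IN_DIM] = {
    {1.0f, 2.0f, 3.0f},
    {0.5f, -1.0f, 0.0f},
};
static const float B[OUT_DIM] = {0.5f, 1.0f};

/* Dense layer followed by ReLU. Because the loop bounds are constants,
   an optimizing compiler is free to unroll or vectorize the loops. */
static void dense_relu(const float in[IN_DIM], float out[OUT_DIM]) {
    assert(in != NULL && out != NULL);
    for (size_t o = 0; o < OUT_DIM; ++o) {
        float acc = B[o];
        for (size_t i = 0; i < IN_DIM; ++i) {
            acc += W[o][i] * in[i];
        }
        out[o] = acc > 0.0f ? acc : 0.0f; /* ReLU */
    }
}
```

Whether this ends up as fast as fully generated straight-line code probably depends on the compiler and flags, so I would treat it as something to benchmark rather than a given.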

"NOTE: Ensure the device you are running on has no form of hardware acceleration like GPU or the results will be skewed"

How much does adding a GPU change the gains you see? I understand that the point of this optimization is CPU-only machines, but it would be interesting to see what effect your optimizations have when running on GPUs as well.