by robertritz
924 days ago
MLX only works for fp16 right now. If it ever works quantized, I will almost certainly move my app over to MLX instead of llama.cpp. My app also uses a very small (30MB) PyTorch model, and shipping it requires an extra 100MB for PyTorch in the app. Very, very stupid. I think it's important to remember that last-mile inference is still pretty bespoke for most things. If we want to see gen AI stuff take off and not have the big cloud providers in charge, this needs to be fixed. Apple is in a good place to solve at least part of the equation.