|
|
|
|
|
by vessenes
612 days ago
|
|
Interesting. This is one of those areas where edge inference needs might be different than data center: to get an 11b quality model in 7b at the cost of 30% more inference time is probably a full yes for anyone doing local inference. And let’s remember that memory bandwidth is a huge factor as well; 30% smaller equals 30% boost in memory based time costs. Anyway I’m interested in trying this out. I wonder if the specific setup might be extra effective for coding tuned models as well - you get one coding transformer and one ‘bad habits/chat/other non coding stuff’ negative transformer. |
|