by vessenes 612 days ago
Interesting. This is one of those areas where edge inference needs might differ from data center needs: getting an 11B-quality model in 7B parameters at the cost of 30% more inference time is probably an easy yes for anyone doing local inference. And let’s remember that memory bandwidth is a huge factor as well; a model that is 30% smaller should spend roughly 30% less time on memory-bound work. Anyway, I’m interested in trying this out.
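
A rough back-of-the-envelope sketch of the memory-bandwidth point, assuming decoding is purely memory-bound (every weight is read once per generated token). The bandwidth figure, the fp16 weight size, and the function name are illustrative assumptions, not measurements; the 30% overhead is the figure quoted above, applied naively on top of the memory-bound time.

```python
def decode_ms_per_token(params_billion, bytes_per_param, bandwidth_gb_s):
    """Lower-bound time per token if all weights are streamed once per token."""
    weight_gb = params_billion * bytes_per_param  # billions of params * bytes each = GB
    return weight_gb / bandwidth_gb_s * 1000.0    # seconds -> milliseconds

BANDWIDTH_GB_S = 100.0   # assumed memory bandwidth of a local machine
BYTES_PER_PARAM = 2.0    # assumed fp16 weights

t_11b = decode_ms_per_token(11, BYTES_PER_PARAM, BANDWIDTH_GB_S)
t_7b = decode_ms_per_token(7, BYTES_PER_PARAM, BANDWIDTH_GB_S)

# Fraction of memory-bound time saved by the smaller model.
saving = 1 - t_7b / t_11b

# Naive assumption: the quoted 30% extra inference time applies to the 7B model.
t_7b_with_overhead = t_7b * 1.30

print(f"11B: {t_11b:.0f} ms/token, 7B: {t_7b:.0f} ms/token")
print(f"memory-bound saving: {saving:.0%}")
print(f"7B with 30% overhead: {t_7b_with_overhead:.0f} ms/token")
```

Under these assumptions the saving depends only on the parameter ratio (1 - 7/11, about 36%), so the assumed bandwidth and precision cancel out; real hardware will mix in compute-bound and KV-cache costs that this ignores.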

I wonder if the specific setup might be extra effective for coding-tuned models as well: you’d get one coding transformer and one ‘bad habits/chat/other non-coding stuff’ negative transformer.