| HN Mirror


by seydor 929 days ago

Looks like they're too busy being awesome. I need a fake video to understand this!

How much memory will this need? I guess it won't run on my 12GB of VRAM.

"moe": {"num_experts_per_tok": 2, "num_experts": 8}
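For anyone wondering what those two config values mean in practice, here is a minimal sketch of top-2-of-8 routing for a single token. It assumes the router emits one score per expert and that the weights are a softmax over only the selected scores, which is one common way to do it and not necessarily what this release does. The function name `route_token` and the example scores are made up for illustration.

```python
import math

def route_token(router_logits, num_experts_per_tok=2):
    """Pick the top-k experts for one token and normalise their mixing weights."""
    # Indices of the k highest router scores (ties keep the lower index first).
    top = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:num_experts_per_tok]
    # Softmax over the selected scores only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Made-up router scores for num_experts = 8.
logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.2]
chosen = route_token(logits)
```

Only the chosen experts run for that token, so per-token compute scales with `num_experts_per_tok` while the weights of all `num_experts` still have to be stored somewhere.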

I bet many people will re-discover bittorrent tonight

2 comments

brucethemoose2 929 days ago

Looks like it will squeeze into 24GB once the llama runtimes work it out.
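A rough back-of-the-envelope for the weights alone, assuming a naive upper bound of 8 x 7B = 56B parameters (no sharing between experts; any shared non-expert layers would bring the real total down) and ignoring KV cache and runtime overhead. The helper name `weight_memory_gb` and the 56B figure are assumptions for illustration, not numbers from the release.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in decimal GB for a given quantization width."""
    # params * bits / 8 gives bytes; the 1e9 factors for "billion" and "GB" cancel.
    return params_billion * bits_per_weight / 8

# Assumed upper bound: 8 experts of 7B each with nothing shared.
ASSUMED_PARAMS_BILLION = 56

estimates = {
    bits: weight_memory_gb(ASSUMED_PARAMS_BILLION, bits)
    for bits in (16, 8, 4, 3)
}

for bits, gb in estimates.items():
    print(f"{bits}-bit: ~{gb:.0f} GB of weights")
```

Under that pessimistic assumption, roughly 3 bits per weight is where the weights drop below 24GB, and the fit gets easier the more parameters turn out to be shared across experts.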

It's also a good candidate for splitting across small GPUs, maybe.

One architecture I can envision is hosting prompt ingestion and the "host" model on the GPU, with the downstream expert model weights on the CPU/IGP. This is actually pretty efficient, as the CPU/IGP is really bad at prompt ingestion but reasonably fast at ~14B token generation.

llama.cpp all but does this already, and I'm sure MLC will implement it as well.
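To make the split concrete, here is a toy placement planner: anything whose name marks it as an expert weight goes to the CPU, everything else (attention, router, the "host" part) goes to the GPU. The tensor names and sizes are hypothetical, and this is only a sketch of the idea, not how llama.cpp or MLC actually assign layers.

```python
def plan_placement(tensors, expert_marker="experts"):
    """Assign each tensor to 'gpu' or 'cpu' and total the bytes per device."""
    placement = {}
    totals = {"gpu": 0, "cpu": 0}
    for name, size_bytes in tensors.items():
        # Expert weights go to the CPU/IGP; shared layers stay on the GPU.
        device = "cpu" if expert_marker in name else "gpu"
        placement[name] = device
        totals[device] += size_bytes
    return placement, totals

# Hypothetical tensor names and sizes (arbitrary units) for one layer.
tensors = {
    "layer0.attention": 100,
    "layer0.router": 1,
    "layer0.experts.0": 300,
    "layer0.experts.1": 300,
}
placement, totals = plan_placement(tensors)
```

The appeal is that the bulk of the bytes sits in the experts, so the GPU only has to hold the small shared part plus whatever it needs for prompt ingestion.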


syntaxing 929 days ago

BitTorrent was the craze when Llama was leaked via torrent. Then Facebook started taking down all the Hugging Face repos, and a bunch of people temporarily switched to torrent releases. Llama 2 changed all this, but it was a fun time.
