
by ydj 11 days ago

80 tps with a 5080 + 3090 combo is wild. I’ve been working with a 4090 and two Tenstorrent p150 cards, and manage only about 30 tps using all three for qwen3.6 27b q8. Guess I've got more optimization to do.

Would like to see the perf of their setup with and without MTP and ngram speculative decoding though, as well as parallel decode performance (once llama.cpp MTP plays well with multiple slots).

Being in California, though, electricity alone makes this non-competitive with just paying a cloud provider.

4 comments

arjie 11 days ago

That’s the cost of using a new hardware provider. A single RTX Pro 6000 Blackwell Max-Q will do better than that and be much more usable. I have 2 running DS4 Flash at 160 tok/s with max num seqs 4.

Very interesting though, these Tenstorrent chips. Might get one to experiment with.


ydj 10 days ago

Yeah that’s definitely the smarter buy if you want to just have models running quickly. But the cost of 2 p150 and a 4090 was <$5000 for me.

The main issue is the immature software, and the somewhat baroque way of writing kernels. Please, buy one and join us.


arjie 10 days ago

Were you able to connect the two p150s using the QSFP-DD cable? They only sell 4x and 8x topologies so I’m curious if that worked for you. Are you able to run them tensor parallel?


ydj 9 days ago

Yeah, I’m doing TP with two cards. The topology is configured based on yaml files, and if you are not using a predefined config you can just create a new config with your topology.

I’m not even using an 800G cable since they are expensive and I don’t think I need the bandwidth, opting for 400G instead. This just needs a config change for the number of Ethernet links it uses internally. (Apparently these cables are just many 200G links put together.)
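To make the link-count arithmetic concrete, here is a minimal sketch. It takes the "200G per link" figure above at face value (it is stated only as "apparently"), and the function name is hypothetical, not part of any Tenstorrent tool or config schema.

```python
def links_for_cable(cable_gbps: int, per_link_gbps: int = 200) -> int:
    """Number of internal Ethernet links a cable would carry.

    Assumes the cable is a simple bundle of equal-speed links, with
    200G per link taken from the comment above (unverified).
    """
    if cable_gbps <= 0 or per_link_gbps <= 0:
        raise ValueError("speeds must be positive")
    if cable_gbps % per_link_gbps != 0:
        raise ValueError("cable speed is not a whole number of links")
    return cable_gbps // per_link_gbps


links_800g = links_for_cable(800)  # full-rate cable
links_400g = links_for_cable(400)  # the cheaper cable used here
```

Under that assumption, dropping from an 800G to a 400G cable halves the number of links the config has to declare.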


arjie 9 days ago

Brilliant, thank you. Maybe I'll get a couple in a bit.


ricardobeat 10 days ago

I get 28 tps for Qwen3.6 27B on a Ryzen AI Max+ 395, with enough spare memory to run another two small models on the side. 60 tps for 35B. Am surprised this is not more common.


manbart 11 days ago

How is the software compatibility with the Tenstorrent cards? Are you stuck using vendor-supplied runtimes/models?

It's surprising how little these things come up given the price they go for.


ydj 10 days ago

The software stack is pretty immature, definitely very DIY. Their officially supported models are pretty old at this point, though there’s community support for gemma4, and support for models with GDN, like qwen3.6, is supposedly very close.

The entire stack (minus some binary blobs in firmware) is open source, so if you have the time and persistence you can get whatever you want done.

A few community members have been working on llama.cpp support, where supported operations are offloaded to the TT cards while unsupported ops run on GPU or CPU. llama.cpp is pretty good at that. The existing kernels could definitely be better, and I’ll try my hand at writing some kernels some time.
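As a rough illustration of the offload-with-fallback idea, here is a minimal sketch of placing each op on the first backend that supports it. This is not llama.cpp's actual scheduler or API; the `Backend` class, the backend names, and the op lists are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    """A device plus the set of ops it has kernels for (illustrative only)."""
    name: str
    supported_ops: set = field(default_factory=set)

    def supports(self, op: str) -> bool:
        return op in self.supported_ops


def assign_ops(graph_ops, backends):
    """Place each op on the first backend, in priority order, that supports it."""
    placement = []
    for op in graph_ops:
        for backend in backends:
            if backend.supports(op):
                placement.append((op, backend.name))
                break
        else:
            # No backend has a kernel for this op.
            raise ValueError(f"no backend supports op {op!r}")
    return placement


# Hypothetical op coverage: the accelerator only has a few kernels,
# the GPU has more, and the CPU acts as the catch-all.
tt = Backend("tt", {"matmul", "silu", "mul"})
gpu = Backend("gpu", {"matmul", "silu", "mul", "softmax", "rope"})
cpu = Backend("cpu", {"matmul", "silu", "mul", "softmax", "rope", "gated_delta_net"})

ops = ["rope", "softmax", "matmul", "silu", "gated_delta_net"]
placement = assign_ops(ops, [tt, gpu, cpu])
```

With this priority order, the matmul-heavy ops land on the accelerator and everything it lacks falls through to GPU, then CPU. A real scheduler would also weigh the cost of copying tensors between devices, which this sketch ignores.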


shepherdjerred 10 days ago

Do you get anything useful out of your 4090 (I have one too)? Local cloud sounds like a fun idea but I just don’t see how it competes against OpenAI/Anthropic.


ydj 10 days ago

I think it’s not really worth it compared to just buying tokens or a coding plan.

My setup has the 4090 handling attention while the TT accelerators handle the MLP. With just a 4090 you can have the CPU handle the MLP layers and use a MoE model, assuming a sufficiently powerful CPU. I tried that setup with minimax 2.5 before, and was able to eke out around 10 to 15 tps (albeit with a 7965WX CPU).
