| HN Mirror



	by jerf 797 days ago
	Behind each request made to OpenAI is a staggering amount of GPU computation. I'd be stunned if the cost of queuing a request were even a hundred-thousandth of the overall cost of serving it. There is no message queue scaling issue here. Message queue scaling issues arise when you are blasting around a lot of messages, each of which takes minimal resources to service, so it's feasible for the queue itself to be the bottleneck. I wouldn't be surprised if a single Raspberry Pi could handle the entire queuing load here, and if it couldn't, it wouldn't be off by a very large factor: the GPU resources needed to service a full RPi's queuing capacity would be staggeringly enormous, I think well beyond what OpenAI actually has.


tsimionescu 797 days ago

Isn't the same true then of an HTTP server? Handling the polling requests is a minute amount of work compared to running the real jobs. And you've addressed the scalability problem, but not the connectivity issues that generally plague long-lived connections on the Internet at large.


jerf 795 days ago

Not always. There are HTTP servers where you are requesting an in-memory value and the work is less than the cost of parsing the HTTP request. There are many HTTP services where the time to fulfill the request is much longer than the parse cost of the request, but that time is not 100% CPU of either the server or any given service, because there are a lot of back-and-forth delays and latency. There are many HTTP services where the work is 100% CPU and orders of magnitude greater than the cost of an HTTP request parse, but still on the order of <1ms, and if such a service were actually a message queue you might still be able to clog it at least somewhat.

This is a very pessimal case, though. You make a tiny HTTP request which is parsed in microseconds at the most, and it invokes somewhere between one and five million microseconds of 100% utilization of an expensive resource. A thousand queuings per second would be fairly easy for an RPi; it could handle that no problem (at least assuming you use a decent language to manage it; a super fancy Python framework that also does a lot of metaprogramming might choke under the load, sure, though some half-carefully written Python still ought to be able to handle this). Meanwhile, those 1000 requests/second would require around 2000 GPUs to dispatch them in real time so we can maintain that 1000 rps. I'm pretty sure you can add an order of magnitude before I'd really start worrying about the RPi as a queuing mechanism, and you're getting to the RPi being able to queue for ~100,000 GPUs without too much strain. I don't know how many GPUs OpenAI has but that's got to be getting pretty close to their order of magnitude. They may have a million; I doubt ten million.

(Of course, I wouldn't actually do this on an RPi; I'm just using that as a performance benchmark.)
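
A rough sketch of the arithmetic behind the GPU estimate above. It assumes each request keeps exactly one GPU fully busy for its whole duration (Little's law: busy servers = arrival rate x service time); the function name and the per-request timings are illustrative, not measured figures.

```python
def gpus_needed(requests_per_second: float, gpu_seconds_per_request: float) -> float:
    """Average number of GPUs kept busy at a steady request rate.

    Assumption: one request occupies one whole GPU for its entire duration.
    """
    return requests_per_second * gpu_seconds_per_request


# "One to five million microseconds" is 1 to 5 seconds of GPU time per request.
for seconds in (1.0, 2.0, 5.0):
    busy = gpus_needed(1000, seconds)
    print(f"1000 rps at {seconds:.0f} s/request keeps about {busy:,.0f} GPUs busy")
```

Under that assumption, the "around 2000 GPUs" figure corresponds to roughly 2 seconds of GPU time per request.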


mike_hearn 797 days ago

Inevitably some users will decide to poll every 60 seconds or whatever, because they have no idea when the work will be completed and because what they really want is "results ASAP but willing to tolerate latency to pay less". And then your servers are doing a ton of TLS negotiation, user authentication, request serving and database lookups, just to answer "not yet".

I think people are getting distracted by the idea of connections being somehow expensive. They aren't, really, compared to polling (unless the poll is genuinely very rare). A stateless request is expensive because you have to go back to your source of truth on every request (probably an expensive and hard-to-scale RDBMS), and you don't control how often the user makes such requests. CPU load is potentially unbounded and users don't pay unless you introduce pay-per-poll micropayments.
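
To put a number on the "not yet" traffic, here is a minimal sketch. It assumes a client polls at a fixed interval, with the first poll one interval after submission; the job duration and user count are made-up illustrative values.

```python
import math


def wasted_polls(job_seconds: float, poll_interval_seconds: float) -> int:
    """Polls answered "not yet" before the first poll that finds the result.

    Assumption: polls happen at t = interval, 2 * interval, ... after submission.
    """
    if poll_interval_seconds <= 0:
        raise ValueError("poll interval must be positive")
    first_successful_poll = math.ceil(job_seconds / poll_interval_seconds)
    return max(first_successful_poll - 1, 0)


# Hypothetical workload: a one-hour job polled every 60 seconds by 10,000 users.
per_user = wasted_polls(job_seconds=3600, poll_interval_seconds=60)
print(f"{per_user} wasted polls per user, {per_user * 10_000:,} in total")
```

Each of those wasted polls pays for the TLS negotiation, authentication and database lookup mentioned above, whereas an idle queue connection does no work while the job is pending.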

Compare that to an MQ design: the overhead is a single TCP connection and a bit of memory to map that connection to an internal queue. Whilst the work sits in the queue or is being processed, nothing is happening and there's no DB load. Overhead is a matter of bytes, and in the event that you run out of RAM you can always kick users off at random and let them exponentially back off and retry (automatically, because the libraries handle this and make it transparent). Or just use swap; after all, latency is not that important.


tsimionescu 796 days ago

Nothing prevents, in principle, a long-lived HTTP connection where the server only replies once the response is available (long polling). However, on the real internet, such long-lived connections just don't work for a large minority of users. There are numerous devices, typically close to the client, which kill "idle" connections. NAT gateways and stateful firewalls are some of the most common culprits.

So, you just can't rely on your customers being able to keep around a long connection.

Not to mention the numerous corporate environments in which it is hard to even open an outgoing connection which is not HTTPS or a handful of other known protocols.


mike_hearn 796 days ago

Well, as I've said several times on this thread, good MQ libraries know how to reopen connections automatically if they break, back off, retry, connect to several endpoints and load balance between them, and so on. All this is an abstraction layer higher than what HTTP provides, so the problems HTTP long polling can have in consumer/mobile use cases aren't necessarily relevant. It's like files vs SQLite.
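
A minimal sketch of the reconnect behaviour described above: round-robin over several endpoints with capped exponential backoff and random jitter. This is not any particular MQ library's API; `connect_with_backoff`, its parameters and the injected `connect` callable are all hypothetical names chosen for illustration.

```python
import random
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    endpoints: Sequence[str],
    connect: Callable[[str], T],
    max_attempts: int = 8,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try endpoints in turn, backing off between failed attempts."""
    if not endpoints:
        raise ValueError("need at least one endpoint")
    last_error: Optional[BaseException] = None
    for attempt in range(max_attempts):
        endpoint = endpoints[attempt % len(endpoints)]  # simple round-robin
        try:
            return connect(endpoint)
        except OSError as exc:  # ConnectionError and socket timeouts subclass OSError
            last_error = exc
            if attempt + 1 < max_attempts:
                # The delay ceiling doubles per attempt, capped at max_delay;
                # a random delay below it keeps clients from retrying in lockstep.
                ceiling = min(max_delay, base_delay * 2 ** attempt)
                sleep(random.uniform(0, ceiling))
    raise ConnectionError(f"gave up after {max_attempts} attempts") from last_error
```

A caller would pass its own `connect` function (for example, one that opens a socket to the broker) and get back whatever connection object that function returns; the caller never sees the individual failures unless every attempt fails.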

As for the general issue of connections, that's true for consumer use cases. B2B workloads have far fewer problems with that, especially when running in the cloud. If your cloud gives you mobile-quality internet then you have a problem, but again, it's a problem a good MQ implementation will fix for you. Consider the "lessons from 500 million tokens" blog post the other day, in which the author mentioned repeatedly that they had to write their own try/catch/retry loops around every OpenAI call because OpenAI's HTTP API was so flaky.

And again, if you are behind a nasty firewall then you might find your connection dying at any moment because OpenAI got classified as a hate speech site or something. The fix is to file the right tickets to get your environment set up correctly.
