by pmontra 625 days ago
This runs on a GeForce GTX 1060. A quick search says it draws 120 W. Maybe that's only the peak power consumption, but it's still a lot. Do commercial products, if there are any, consume that much power?
4 comments

There's a wide range of inference accelerators in commercial use.

For "edge" or embedded applications, the Google Coral Edge TPU is a useful reference point: it is capable of up to 4 trillion operations per second (4 TOPS) at up to 2 watts of power consumption (2 TOPS/W). However, the accelerator is limited to INT8 operations, and it has around 8 MB of memory for model storage.

Meanwhile, a general-purpose or gaming GPU supports a wider range of operations (single-precision and double-precision floating point, integer, etc.).

GeForce GTX 1060, for example: 4.375 TFLOPS (FP32) @ 120 W (https://www.techpowerup.com/gpu-specs/geforce-gtx-1060-6-gb....)

There are also commercially oriented products that are optimized for particular operations and precisions.

Here's a blog post discussing Google's 1st-generation ASIC TPU used in its datacenters: https://cloud.google.com/blog/products/ai-machine-learning/a...

(92 TOPS @ 700 MHz, 40 W)

https://arxiv.org/abs/1704.04760
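
To put the three figures quoted above side by side, here is a rough sketch that divides peak throughput by power. It only uses the numbers from this comment; the dictionary layout and function name are mine, and the comparison is loose because the GPU figure is FP32 while the two TPU figures are INT8.

```python
# Peak throughput and power figures as quoted in this thread.
# Caveat: the GTX 1060 number is FP32 TFLOPS, the TPU numbers are INT8 TOPS,
# so the ratios are a rough guide, not a like-for-like benchmark.
accelerators = {
    "Coral Edge TPU (INT8)": (4.0, 2.0),        # (tera-ops per second, watts)
    "GeForce GTX 1060 (FP32)": (4.375, 120.0),
    "Google TPU v1 (INT8)": (92.0, 40.0),
}


def tera_ops_per_watt(tera_ops: float, watts: float) -> float:
    """Peak tera-operations per second delivered per watt of power."""
    return tera_ops / watts


efficiency = {
    name: tera_ops_per_watt(tera_ops, watts)
    for name, (tera_ops, watts) in accelerators.items()
}

for name, value in efficiency.items():
    print(f"{name}: {value:.3f} tera-ops/s per watt")
```

On these peak numbers the two INT8 accelerators land around 2 tera-ops/s per watt, while the GPU is well under 0.1.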

Sorry, I’m not familiar with TPUs, only GPUs, but how much VRAM do Corals have? YOLO 11x is 56M params, which would still be 56 MB even if it was quantized to int8. Plus you would need some memory for your inputs.
The Coral Edge TPU has approximately 8MB of SRAM for model weights/parameters.

https://coral.ai/docs/accelerator/datasheet/

It does not have VRAM as it is not a graphics card :)

There are examples and instructions for exporting Yolo variants to run on the Edge TPU: https://docs.ultralytics.com/guides/coral-edge-tpu-on-raspbe...
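
The size question above can be checked with a quick back-of-the-envelope sketch. It uses only the two numbers from this exchange (56M parameters, roughly 8 MB of SRAM); the function and constant names are made up for illustration, and it only compares weight size against on-chip capacity, not whether or how a larger model can still be run.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Storage needed for the weights alone, ignoring activations and inputs."""
    return num_params * bits_per_weight // 8


EDGE_TPU_SRAM_BYTES = 8 * 10**6   # "approximately 8MB" quoted above
YOLO11X_PARAMS = 56 * 10**6       # "56M params" quoted above

int8_size = weight_bytes(YOLO11X_PARAMS, 8)
fits_on_chip = int8_size <= EDGE_TPU_SRAM_BYTES
oversize_factor = int8_size / EDGE_TPU_SRAM_BYTES

print(f"INT8 weights: {int8_size / 1e6:.0f} MB")
print(f"Fits in on-chip SRAM: {fits_on_chip}")
print(f"Weights are {oversize_factor:.0f}x the on-chip memory")
```

So the int8 weights of the largest variant are about seven times the on-chip memory, which is why the smaller YOLO variants are the natural candidates here.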

I have something similar. It's not tracking though. Drawing around 10W on a pi, around 7W on a Jetson.
not sure if i'm misunderstanding - you've got a similar GPU to a 1060 hooked up to a pi?
OP is probably using an AI accelerator like this: https://coral.ai/products/accelerator which works great on a Pi and uses very little power. It will do the YOLO part, but you can't really expect it to do the multimodal LLM part, although you could try to run Florence directly on the Pi too.
Coral has a PCIe module which is 1/4 to 1/3 of the price
Not a pi. A Jetson. Still an arm SBC though.
YOLO is quick enough that you can just run it on a CPU, assuming you don’t want to run it at full resolution (no point) and full frame rate (ditto) for multiple streams. When you run it scaled down at 2-3 fps you’ll get several streams per CPU core, no problem. Energy use can be minimized by running a quick motion detection pass first, but that would obviously make the system miss things creeping through the frame pixel by pixel (very unlikely if you ask me).
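
A minimal sketch of the motion pre-pass idea, assuming plain frame differencing on grayscale frames stored as `bytes`. The function names and both thresholds are illustrative guesses, not taken from the project being discussed, and a real pipeline would more likely use an image library for this.

```python
def motion_detected(prev: bytes, curr: bytes,
                    pixel_delta: int = 25,
                    changed_fraction: float = 0.01) -> bool:
    """Return True if enough pixels changed between two grayscale frames."""
    if len(prev) != len(curr):
        raise ValueError("frames must have the same size")
    if not curr:
        return False
    # Count pixels whose brightness moved by more than pixel_delta.
    changed = sum(1 for a, b in zip(prev, curr) if abs(a - b) > pixel_delta)
    return changed / len(curr) >= changed_fraction


def frames_worth_detecting(frames):
    """Yield only the frames that differ enough from the frame before them."""
    prev = None
    for frame in frames:
        # The first frame has nothing to compare against, so it is skipped.
        if prev is not None and motion_detected(prev, frame):
            yield frame  # a real system would pass this frame to the detector
        prev = frame


still = bytes(100)                     # 100 dark pixels
moved = bytes([200] * 10 + [0] * 90)   # 10 pixels became bright
selected = list(frames_worth_detecting([still, still, moved, moved]))
print(len(selected))
```

Because each frame is compared only with the one before it, an object that changes fewer pixels per frame than `changed_fraction` never triggers the detector, which is the "creeping" failure mode mentioned above.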
You can use a Coral USB Accelerator; it doesn't use more than 10 W.
You can see here:

    res = rest(ollama, {
        "model": "llava",
        "prompt": genprompt(box.name),
        "images": [box.export()],
        "stream": False
    })

They are calling the ollama API to run Llava. Llava combines an LLM base model with a vision projector (CLIP or ViT) and is usually around 4-8 GB. Since every generated token needs access to all of the model weights, you would have to send 4-8 GB over USB to the Coral for each token. Even at a generous 10 Gbit/s (1.25 GB/s), that is 8 GB / 1.25 GB/s = 6.4 seconds per token. A 150-token generation (a short paragraph) would take 16 minutes.
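
The arithmetic above, written out as a small sketch. The inputs are the figures from this comment (an 8 GB model, a 10 Gbit/s link, 150 tokens); the function and variable names are just for illustration, and the estimate ignores compute time entirely.

```python
def seconds_per_token(model_bytes: float, link_bytes_per_s: float) -> float:
    """Time to stream every weight across the link once, i.e. for one token."""
    return model_bytes / link_bytes_per_s


MODEL_BYTES = 8e9          # upper end of the 4-8 GB range
LINK_BITS_PER_S = 10e9     # the "generous" 10 Gbit/s
TOKENS = 150               # a short paragraph

link_bytes_per_s = LINK_BITS_PER_S / 8          # 1.25 GB/s
per_token = seconds_per_token(MODEL_BYTES, link_bytes_per_s)
total_minutes = TOKENS * per_token / 60

print(f"{per_token:.1f} s per token, {total_minutes:.0f} min for {TOKENS} tokens")
```

At the lower end of the range (4 GB) the same calculation gives half those times, which is still far too slow for anything interactive.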
Hm yeah sure, I didn't think of the llm part. I don't think it's really useful tbh.
Can confirm. The Coral inference accelerator is quite performant with very low power draw. Once I figured out some passthrough and config issues I was able to run Frigate in an LXC container on Proxmox using Coral USB for inference. It's been working reliably 24/7 for months now.
Yeah. But it’s likely an 8-bit quantised, very small model with few parameters, which translates into poor recall and lots of false positives.

How many parameters does the model you are using with Hailo have? What’s the quantisation, and which model is it actually?

Honestly, I have no idea what you are asking about. It's just dedicated hardware for a YOLO-like object detection model.
They are asking about LLMs. There seems to be some confusion: you are thinking of the object detection model (YOLO), which runs perfectly fine in (near) real time on a Coral or other NPU. The parent is referring to the Llava part, which is a full-fledged language model with a vision projector glued onto it for vision capability. Large language models are generally quantized (converted from full-precision float values to less precise floats or ints, for instance F16, Q8, Q4) because they would otherwise be extremely large and slow and require a ton of RAM. The model has to access all of its weights for every token generated, so if you don't have a gigantic amount of VRAM you would be slowly pushing many tens of gigabytes of model weights through the system bus.
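
To make "quantized" a bit more concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. This is a simplified illustration with made-up weights and function names; it is not the scheme ollama or any particular Q8/Q4 format uses, which I'd assume are more elaborate.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]   # illustrative values only
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest step means each weight is off by at most scale / 2.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight now takes one byte instead of four (F32) or two (F16), at the cost of a small rounding error per weight. Note how the tiny weight 0.003 collapses to zero, which is the kind of precision loss that can hurt accuracy.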
Recall and false positives are classification metrics, which relate to the YOLO part.