Show HN: PhAIL – Real-robot benchmark for AI models

Show HN: PhAIL – Real-robot benchmark for AI models (phail.ai)

21 points by vertix 87 days ago

I built this because I couldn't find honest numbers on how well VLA models [1] actually work on commercial tasks. I come from search ranking at Google where you measure everything, and in robotics nobody seemed to know.

PhAIL runs four models (OpenPI/pi0.5, GR00T, ACT, SmolVLA) on bin-to-bin order picking – one of the most common warehouse operations. Same robot (Franka FR3), same objects, hundreds of blind runs. The operator doesn't know which model is running.
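As an illustration only, a blind protocol like the one described could be scheduled roughly as follows. This is a hypothetical sketch, not PhAIL's actual tooling: the function name `make_blind_schedule`, the run-ID format, and the number of runs per model are all assumptions. Only the four model names come from the post.

```python
import random

# Model names as listed in the post.
MODELS = ["OpenPI/pi0.5", "GR00T", "ACT", "SmolVLA"]


def make_blind_schedule(models, runs_per_model, seed=None):
    """Return (operator_view, key) for a blinded evaluation.

    operator_view: ordered list of opaque run IDs the operator sees.
    key: mapping from run ID to model name, kept hidden until scoring.
    Hypothetical helper; the real PhAIL pipeline may differ.
    """
    rng = random.Random(seed)
    # Each model gets the same number of runs, in shuffled order.
    assignments = [m for m in models for _ in range(runs_per_model)]
    rng.shuffle(assignments)

    operator_view = []
    key = {}
    for index, model in enumerate(assignments):
        run_id = f"run-{index:04d}"  # carries no hint about the model
        operator_view.append(run_id)
        key[run_id] = model
    return operator_view, key


if __name__ == "__main__":
    view, hidden_key = make_blind_schedule(MODELS, runs_per_model=3, seed=0)
    print(view)  # the operator only ever sees these IDs
```

The point of the split is that the operator works from `operator_view` alone, while `key` is consulted only after the runs are scored.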

Best model: 64 UPH. Human teleoperating the same robot: 330. Human by hand: 1,300+.
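To put those three figures side by side, here is a small arithmetic sketch. It assumes UPH means units per hour and treats "1,300+" as a lower bound of 1,300. The helper name `seconds_per_unit` is made up for illustration.

```python
# Throughput figures quoted in the post (UPH, assumed to mean units per hour).
BEST_MODEL_UPH = 64
TELEOP_UPH = 330
HAND_UPH = 1300  # the post says "1,300+", so treat this as a lower bound


def seconds_per_unit(uph: float) -> float:
    """Average time per picked unit implied by a throughput figure."""
    return 3600 / uph


if __name__ == "__main__":
    for label, uph in [
        ("best model", BEST_MODEL_UPH),
        ("human teleoperation", TELEOP_UPH),
        ("human by hand (at least)", HAND_UPH),
    ]:
        print(f"{label}: {uph} UPH, about {seconds_per_unit(uph):.1f} s per unit")

    # Ratios between the tiers.
    print(f"model vs teleop: {BEST_MODEL_UPH / TELEOP_UPH:.1%} of teleop throughput")
    print(f"teleop vs model: {TELEOP_UPH / BEST_MODEL_UPH:.1f}x faster")
    print(f"hand vs model: at least {HAND_UPH / BEST_MODEL_UPH:.1f}x faster")
```

Under these assumptions the best model reaches roughly a fifth of teleoperated throughput and about a twentieth of by-hand throughput.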

Everything is public – every run with synced video and telemetry, the fine-tuning dataset, training scripts. The leaderboard is open for submissions.

Happy to answer questions about methodology, the models, or what we observed.

[1] Vision-Language-Action: https://en.wikipedia.org/wiki/Vision-language-action_model

5 comments

chfritz 87 days ago

This is absolutely awesome. Thanks for sharing! I would love to chat more with you. For context: we make a remote teleoperation solution for robotics. It's mostly used for mobile robots, but we've been getting a lot of inquiries regarding teleoperation for manipulation, so I've been learning more about this, in particular regarding the question of speed. I really appreciate these results!

vertix 87 days ago

Feel free to reach out to me at hi at phail dot ai

apetrovicheva 87 days ago

This is amazing. Loved watching the videos with real-world attempts.

Finally a real benchmark instead of polished teleoperated Twitter videos. It shows the real state of a super important industry, and there’s a lot of work to do.

vladimir_gor 87 days ago

I'm a big fan of benchmarks, and now we finally have one to evaluate models on physical tasks. It will be interesting to see how fast this gap narrows.

akshaisarathy 87 days ago

If I understand correctly, this is about benchmarking robot models. Do you have a robot to do the benchmarking or is it all simulation?

vertix 87 days ago

All real hardware, no simulation. Franka FR3 arm with a Robotiq gripper, physical totes, real objects. Every run is recorded with synced video and telemetry (you can watch any episode on the site).

That's the whole point – simulation benchmarks exist, but operators deploying robots care about real-world performance.

anna_pozniak 87 days ago

I'm curious! What other models are you planning to add to the leaderboard?

vertix 87 days ago

We're working on adding DreamZero (NVIDIA's latest) next. The leaderboard is open to any model – both open-source and closed-source. If you have a checkpoint, we'll run it on the same hardware under the same blind protocol. Closed-source participants can submit their model as a container and we evaluate it without accessing the weights. Reach out at hi@phail.ai if you want to submit.