Tesla seeks to work to standardize a new high-speed/low-latency fabric (be that TTPoE or otherwise) for AI/ML/Datacenters; however, there's nothing inherently abject about TCP as it exists today. RDMA over Converged Ethernet suffices perfectly well for whatever an "AI/ML/Datacenter" is and, if we're being fair, the lackadaisical approach to the documentation suggests that they may not be taking it as seriously as they could anyway.
If Tesla were really seeking to shake things up, they wouldn't have picked IPv4 to do it when the newer version, IPv6, has been around for nearly 30 years and has latency reduction baked in.
This smacks of a pandering attempt from a company that sees the quite mandarin writing on the wall and has decided (in true Muskovite fashion) they too are just a misunderstood font of futurism.
TCP has the wrong abstraction for truly high performance.
I wouldn't necessarily standardize what Tesla does here, but most of the big companies have their own layer 3 transport protocol for things that need truly high speed and are operating within a datacenter.
Cray/HPE has their own Ethernet-based protocol (Slingshot was an earlier version of it - not sure what its name is now) which seems to be better than whatever Tesla has, but is not necessarily published.
- no IP layer (there's a ttpip folder in that repo though)
- distributed congestion control (TCP has a "window" field + a bunch of tentative RFCs, this has a purposeful "congestion")
- 100% implementable in hardware (TCP can, but it's complex)
Not a general TCP replacement, but the README properly highlights a "many endpoints local link" use case:
> the protocol executed entirely in hardware and deployed to a very large multi-ExaFlops (fp16) supercomputer with over 10s of thousands of concurrent endpoints. This protocol does not need a CPU or OS to be involved in any way to link and execute.
In Tesla's presentation slides, "Tesla Transport Protocol Over Ethernet (TTPoE): A New Lossy, Exa-Scale Fabric for the Dojo AI Supercomputer", they mention that the network layer is optional (but not removed).
I think it's better to think of it as a fibre channel protocol rather than TCP. It's intended for use on managed internal data centre networks.
It skips OSI layers to gain speed and probably do 100% hardware routing with FPGAs.
It's of no interest on the internet or any small-scale network.
FC is not entirely lossless. One ticket I had the joy of dealing with involved a customer using a Fibre Channel network for their storage, with multipathd for failover. In theory it was a fully redundant configuration, with dual FC ports on the server, each one going to a different FC switch all the way back to the SAN. However, the system was generating I/O errors on large writes while small writes would succeed. Needless to say, ext4 failed horribly, and there were worries that it was a kernel bug in the FC driver.
After a good amount of back and forth with the customer, and several test programs run on the system in question, I eventually came up with a hypothesis that there was an error in the write path of the SAN as small writes succeeded while larger writes failed. The customer ultimately found there was a dirty fibre on one of the links in their FC fabric. It was dirty enough to corrupt large packets, but not so dirty that smaller writes and control packets were unable to get through. Since multipathd only checks to see if a given target can be read from, it would never fail over to the other path (which was fine). So much for trying to build a high availability system using an expensive SAN!
Lesson of the story: what you think is a lossless network is not always lossless. Using the IP stack has a lot of beneficial diagnostic tools that you really start missing when something goes awry in a non-IP network.
Broken hardware does not make the protocol lossy. I think you're misunderstanding what 'lossless' is intended to mean in this context; it does not mean that it is error-free. In a lossy protocol, missing data is not necessarily an error. In a lossless protocol, missing data is treated as an error, which is consistent with what you experienced.
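To make that distinction concrete, here is a minimal sketch of the two receiver policies. The function name, the exception, and the use of bare sequence numbers are all made up for illustration; this is not how FC or TTPoE actually frame or track data.

```python
class MissingDataError(Exception):
    """Raised when a lossless receiver notices a gap in the sequence."""


def receive(seq_numbers, lossless):
    """Return the sequence numbers handed to the application, in order.

    seq_numbers: sequence numbers of the frames that actually arrived, ascending.
    lossless:    if True, any gap is treated as an error; if False, it is tolerated.
    """
    delivered = []
    expected = 0
    for seq in seq_numbers:
        if seq != expected and lossless:
            # A lossless protocol treats missing data as a fault to report.
            raise MissingDataError(f"expected frame {expected}, got {seq}")
        # A lossy protocol just carries on with whatever did arrive.
        delivered.append(seq)
        expected = seq + 1
    return delivered


arrived = [0, 1, 3, 4]  # frame 2 was dropped somewhere on the path

print(receive(arrived, lossless=False))
try:
    receive(arrived, lossless=True)
except MissingDataError as err:
    print("lossless receiver reports an error:", err)
```

In both cases the frame is gone; the only difference is whether the protocol considers that a reportable fault.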
I do understand what lossless means. The point of my anecdote is a warning: when you go off and start designing new network protocols, especially one as bare-bones as TTPoE, you need to consider what happens when someone has to deal with things going wrong. Diagnostics and maintenance matter in the real world for people running large systems with thousands or millions of moving parts. IPv4 and IPv6 bring along lots of tools that help in these scenarios, and IPv4/v6 headers don't actually have all that much overhead to parse and generate in hardware. They are also protocols that have been around long enough to have many widely available hardware and software implementations, either open source or purchasable from vendors. I'm certain that there will be times when sysadmins will be cursing the fact that the folks who implemented TTPoE didn't have a ping-like tool available from the start.
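For a sense of how little a ping-like tool needs on a non-IP network, here is a rough sketch of an echo request/reply carried directly in an Ethernet II frame. Everything about the payload (opcodes, field layout, function names) is hypothetical and has nothing to do with the real TTPoE header; 0x88B5 is one of the IEEE "local experimental" EtherTypes. The sketch only builds and parses frames; actually sending them would need a raw socket and the privileges that go with it.

```python
import struct
import time

ETHERTYPE_EXPERIMENTAL = 0x88B5  # IEEE 802 local experimental EtherType
MIN_PAYLOAD = 46                 # minimum Ethernet payload, FCS not included
ECHO_REQUEST, ECHO_REPLY = 1, 2  # hypothetical opcodes
PAYLOAD_FMT = "!BxHQ"            # opcode, pad byte, sequence, timestamp (ns)


def mac_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def build_echo(dst, src, opcode, seq, timestamp_ns):
    """Build a 60-byte Ethernet II frame (no FCS) carrying an echo message."""
    header = dst + src + struct.pack("!H", ETHERTYPE_EXPERIMENTAL)
    payload = struct.pack(PAYLOAD_FMT, opcode, seq, timestamp_ns)
    return header + payload.ljust(MIN_PAYLOAD, b"\x00")


def parse_echo(frame):
    """Return (opcode, seq, timestamp_ns) or raise if the EtherType differs."""
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_EXPERIMENTAL:
        raise ValueError("not an echo frame")
    return struct.unpack(PAYLOAD_FMT, frame[14:26])


def make_reply(request):
    """Swap source and destination and echo the sequence and timestamp back."""
    dst, src = request[0:6], request[6:12]
    _, seq, timestamp_ns = parse_echo(request)
    return build_echo(src, dst, ECHO_REPLY, seq, timestamp_ns)


a = mac_bytes("02:00:00:00:00:01")
b = mac_bytes("02:00:00:00:00:02")
request = build_echo(b, a, ECHO_REQUEST, seq=7, timestamp_ns=time.monotonic_ns())
reply = make_reply(request)
print(len(request), parse_echo(reply)[:2])
```

The sender would compute the round-trip time by subtracting the echoed timestamp from its own clock when the reply arrives.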
FC should be able to detect errors. I've had alerts shout at me when an FC switch detects a dropped packet.
Moreover, the multipath should have stopped that! It should have detected a bad link and failed over to the other one (but the config for that is hard, so I can see why that might not have worked).
Last time I checked, multipathd does not and cannot detect faults on the write path as it only performs small reads to check the health of any given path. Checking writes would involve allocating space on the disk for multipathd to safely write to. Maybe someone has changed that in the past decade? I don't know as I'm not involved in anything SAN related anymore (and thank goodness for that!). SAN hardware is particularly awful as the underlying network is essentially hidden from the operating system most of the time. Storage subsystems built 30 years ago were built without any consideration that they might be running on top of networks.
These and many other performance issues left me with a particular hatred of SANs.
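The failure mode in that anecdote (small reads succeed, large writes get corrupted) can be reproduced with a toy simulation. This is only a sketch: `FlakyPath` and both checker functions are invented for illustration and do not reflect multipathd's actual checker code. It assumes the fault sits on the write path only and that a scratch area is available to write to.

```python
class FlakyPath:
    """Simulated storage path that corrupts writes above a size threshold."""

    def __init__(self, corrupt_above=None):
        self.corrupt_above = corrupt_above  # None means a healthy path
        self.blocks = {}

    def write(self, key, data):
        if self.corrupt_above is not None and len(data) > self.corrupt_above:
            # Flip the first byte to mimic corruption of a large transfer.
            data = bytes([data[0] ^ 0xFF]) + data[1:]
        self.blocks[key] = data

    def read(self, key, size):
        return self.blocks.get(key, b"\x00" * size)[:size]


def read_only_check(path):
    """Small read probe, similar in spirit to the check described above."""
    return len(path.read("probe", 512)) == 512


def write_verify_check(path, size):
    """Write a known pattern to scratch space and compare what comes back."""
    pattern = bytes(i % 251 for i in range(size))
    path.write("scratch", pattern)
    return path.read("scratch", size) == pattern


dirty = FlakyPath(corrupt_above=4096)
print("read probe:       ", read_only_check(dirty))
print("small write check:", write_verify_check(dirty, 512))
print("large write check:", write_verify_check(dirty, 64 * 1024))
```

Only the large write-and-verify probe notices the fault, which is why a read-only checker would never trigger a failover here.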
There was a talk about this earlier. This is used in place of TCP; where TCP is designed to run over unreliable networks, this protocol achieves speed and latency figures comparable to others while still being able to retain commodity IP switches in the cluster. By having a fixed buffer, no lingers, and faster opens, they increase speed and reduce latency without going to dedicated vendors or other stacks.
> Be interesting to see how this stacks up to the dominant protocol in supercomputers/ai clusters : Infiniband.
As mentioned in the README, this was submitted to the larger Ultra Ethernet Consortium for consideration:
> Deliver an Ethernet based open, interoperable, high performance, full-communications stack architecture to meet the growing network demands of AI & HPC at scale
How is this better than UDP? Or for that matter, just plain old Ethernet MAC addressing? You can achieve lower latency and higher speed (than this) if you don't care about reliability in your transport layer.
I don’t think mass adoption is their goal. They had a problem. They solved that problem. They shared how they solved said problem.
Every engineering company releases stuff like this. It’s not meant to change the world. It’s marketing to recruit other engineers who would find that problem interesting.
> Tesla also announced joining the Ultra Ethernet Consortium (UEC) to share this protocol and work to standardize a new high-speed/low-latency fabric (be that TTPoE or otherwise) for AI/ML/Datacenters
Also, it's a protocol; personally, I will only use a protocol that's fully spec'd. It's sometimes a pain to reach consensus among all contributors, but it's valuable.
> edit : I will only use a protocol that's fully spec'd IN PROD
Why does this not inspire confidence in Tesla? Their internal software stack is available to their own developers who can review what is actually there.
Why does it have to be perfectly documented in a public GitHub? Are all other car companies "properly" publicly documenting things in GitHub?
Does it inspire more confidence in VW's software stack if they don't share it? Is VW's confidential stack some big competitive advantage? I've used a VW ID electric vehicle. I did not come away that impressed.
CAN (or one of its more modern variants) is historically more common in automotive. However, with 2-wire Ethernet connections becoming more commonplace, I do think you're right that more and more cars will be moving to Ethernet fieldbus.
EtherNet/IP is not as robust for many applications as its competitors (PROFINET, EtherCAT) since it is not fully deterministic. EtherCAT is my personal favorite.
Please no EIP, it's utter crap and designed by an OOP huffing committee. The only serious protocol is EtherCAT with honorable mentions for Sercos 3 and Ethernet Powerlink (CANopen over Ethernet).
Of all the (current) industrial protocols they could have picked, EtherNet/IP would be the worst.
Its only advantage is that it can coexist with other TCP traffic and run over standard switches, but that just results in unreliable fieldbus performance.
In a sense this wasn't from Tesla the car company, but Tesla the IT department with a supercomputer. I don't know what they do on it though, might be lots of physics simulations (aerodynamics etc) or deep learning for assisted driving tech.
They train an end-to-end model to drive based on 8 camera streams and recorded input from human drivers, training on tens (if not hundreds now) of millions of 30-second clips from their consumer fleet. That's why they've bought one of the largest GPU clusters and are making their own chips and transport protocols.
It's not widely known, but Tesla probably has one of the largest training clusters, because practically all the GPUs they buy go towards training, while most of the GPUs for e.g. OpenAI go towards inference. Tesla does inference in the car.
Can you please not post like this? Regardless of who you're talking about or how you feel about them, it's not what this site is for, and destroys what it is for.
* https://news.ycombinator.com/item?id=41374663
* https://chipsandcheese.com/2024/08/27/teslas-ttpoe-at-hot-ch...