by cletus 1513 days ago
I worked on Google Fiber and one of the things I did was write a pure-JS speed test. At the time, speedtest.net still used Flash. Why did we need this? Installers used Chromebooks to verify an installation so we wanted to be able to tell if the install was successful. That means maxing out the connection (~940Mbps for a gigabit connection). This speed test is still up [1].

Actually figuring out the max speed for a connection is a surprisingly hard problem. Here are some of the things I found:

1. Latency is absolutely everything. With sub-2ms latency I could get 8.5Gbps downloads in a browser with JS over 10GbE on a MacBook Pro. Bump that up to 100ms and that plummets. I forget the exact numbers but this has real-world consequences. Australia, for example, rolled out its ridiculous NBN network with a max speed of 100Mbps. Well, Australia has a built-in latency of 150-200ms to the US just by distance, and the max effective download speed would be a mere fraction of that;

2. Larger blobs are better for overall throughput but depending on your device this may blow up your browser. Unfortunately for the Internet you're never really going to reliably get an MTU >1500 unless you control every node on the network;

3. This sort of traffic exposed a lot of weird browser bugs, even with Chrome. For example, Chrome could get into a state where, despite all my efforts, the temporary traffic would get cached and fill up your /tmp partition on Linux. It would then blow up with weird errors that don't really give you any clue that that's the problem, and only restarting Chrome would solve the issue. I could never figure out why. Not sure if it's still an issue;

4. The author I guess was talking about Linux defaults but there are a lot of kernel parameters that affect this (e.g. RPS [2] is absolutely essential for high-throughput TCP beyond a certain point);

5. BBR was in development at the time (ironically I sat next to that team for a few months) so I can't really speak to how this changes things. I was doing this development back in 2016-2017;

6. Among people who knew more about this than me the consensus seemed to be that BSD's TCP stack was superior to Linux's. Anecdotally this is backed up by real-world examples like Facebook having extreme difficulty moving away from FreeBSD to Linux for WhatsApp. That took many years apparently; and

7. I agree with the author here on the impact of packet loss. Its effect on throughput can be devastating and (again, pre-BBR) the recovery time for maximum throughput could be really long.

[1]: http://speed.googlefiber.net/

[2]: https://access.redhat.com/documentation/en-us/red_hat_enterp...
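
To put rough numbers on points 1 and 7, here is a minimal sketch using the Mathis et al. approximation for a single loss-based (Reno-style) TCP stream: rate <= (MSS / RTT) * (C / sqrt(p)). This is not what the speed test itself computed. The function name, the constant C = sqrt(3/2), the 1460-byte MSS (a 1500-byte MTU minus 40 bytes of IPv4 and TCP headers) and the sample loss rates are all assumptions for illustration, and the model does not describe BBR.

```python
import math


def mathis_bps(mss_bytes: float, rtt_s: float, loss_rate: float,
               c: float = math.sqrt(1.5)) -> float:
    """Approximate ceiling, in bits/s, for one Reno-style TCP stream.

    Mathis et al. model: rate <= (MSS / RTT) * (C / sqrt(p)).
    C = sqrt(3/2) is the usual value for periodic loss (an assumption here).
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    MSS = 1460  # bytes; assumes a 1500-byte MTU and no TCP options
    # Hypothetical RTTs and loss rates, chosen only to show the trend.
    for rtt_ms in (2, 100, 200):
        for loss in (1e-6, 1e-4, 1e-2):
            mbps = mathis_bps(MSS, rtt_ms / 1000, loss) / 1e6
            print(f"RTT {rtt_ms:>3} ms, loss {loss:.0e}: ~{mbps:,.1f} Mbps")
```

The shape matters more than the exact values: under this model throughput falls linearly with RTT and with the square root of the loss rate, so the same loss hurts a 200ms path far more than a 2ms one.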

4 comments

Netflix's speedtest https://fast.com/ avoids a lot of the TCP tuning issues of the client/server by dynamically scaling the number of streams during the test to try to provide an all around peak number instead of a "precisely single session to this exact server" number.

Regarding BBR I've also found it to be a lifesaver for individual streams over high latency internet links, particularly when there is loss.

In a big city, if you're on a last-mile service like Webpass (Google Fiber), you are most likely 2.5ms or less from a speedtest.net server that has a dedicated 10GbE port off some regional ISP's aggregation router. The server may even be hosted inside your own ISP, within the same ASN.

The speedtest looks quite similar to the one baked into Google's search results; did any of your codebase make it across?

Also, kudos (?) it outlived Google+ as linked from the footer :-D

The search result speedtest is unrelated and was being developed at the same time IIRC. It has a similar philosophy to the Ookla speedtest: just test whether your connection is sufficient rather than test the max capacity.

Why was latency such a killer on throughput? Was it TCP window sizing?

It's related to the bandwidth-delay product:

https://en.wikipedia.org/wiki/Bandwidth-delay_product

https://networklessons.com/cisco/ccnp-route/bandwidth-delay-...

Every now and then some network middlebox ends up screwing with your TCP window sizing and then the bandwidth delay product rears its ugly head.
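
A quick way to see why a clamped window hurts is to work the arithmetic. The sketch below is only an illustration: the function names are made up, and the 64 KiB window (roughly what you get when window scaling is stripped) and the 1 Gbps link are assumed inputs, not measurements from this thread.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep a link of this bandwidth full."""
    return bandwidth_bps * rtt_s / 8


def window_limited_bps(window_bytes: float, rtt_s: float) -> float:
    """Ceiling for one TCP stream: at most one window per round trip."""
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    LINK_BPS = 1e9       # assumed gigabit link
    WINDOW = 64 * 1024   # assumed window when scaling is not in effect
    for rtt_ms in (2, 100, 200):
        rtt = rtt_ms / 1000
        need_mb = bdp_bytes(LINK_BPS, rtt) / 1e6
        cap_mbps = window_limited_bps(WINDOW, rtt) / 1e6
        print(f"RTT {rtt_ms:>3} ms: need {need_mb:.2f} MB in flight, "
              f"64 KiB window caps one stream at {cap_mbps:.1f} Mbps")
```

With these assumed inputs, a 64 KiB window at 200ms works out to roughly 2.6 Mbps per stream no matter how fast the link is, which is one reason multi-stream tests report much higher numbers on long paths.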