by HedgeMage 3406 days ago
This is pretty much what happened. We spent a few months working with Mr. Stenn, and ultimately he did not agree to pursue strategies to correct the underlying problems that caused NTP's security and stability issues. Simply patching known vulns and moving on would have been a temporary solution: more vulns were lurking. NTPSec was born to give the code base another chance, to evolve with a different strategy. In the end, I tend to feel that this is a strength of OSS: different groups are free to do things different ways, and if people are paying attention, software quality should win out.

Since Eric and the rest of my team started working on the NTP code base in early 2015, we've eliminated over 50% of its vulnerabilities before they were disclosed simply by applying good software engineering practice where it hadn't been. In the year before my O'Reilly presentation, it was more like 80 or 85 percent. Everything we hadn't eliminated by disclosure or discovery time was fixed promptly.

There are other NTP protocol implementations besides NTP classic or NTPSec that are worth considering for some users. However, we felt that refactoring the reference implementation was necessary due to its use in many less-mainstream, but often highly critical (in a life-critical, economically critical, or critical-to-scientific-research sense) applications. The other implementations don't always do what high-speed trading houses need, or what scientific installations built on aging but extremely precise equipment need, or what control system interfaces need, and on and on and on. We just didn't have a drop-in replacement available for all of the things that weren't web servers, workstations, and other commodity applications.

The "rift" article is now subscriber-only, so I can't respond there to its many inaccuracies (I was passed a PDF by someone who cached it; this is the only way I was able to read it). I was never contacted about it by the author, and I don't feel it was a fair treatment of the subject. That's okay. I learned a long time ago that fixing a mess will make some people thank you and some people angry with you. It wouldn't have become a mess by the time I found it if there weren't a cost to fixing it. People who fear controversy will have a hard time making a difference in the world.

I'm at work, but I'll do my best to answer any questions fired at me today on this thread. If there's something you want to know, ask!

3 comments

NTPsec advocates keep saying "eliminated 50% of vulnerabilities <<<before they were disclosed>>>", as if there were another meaningful way to eliminate vulnerabilities from a codebase.

Can you provide a breakdown of the vulnerabilities NTPsec HAS and HAS NOT been vulnerable to, along with their severity (low: degrades time service, medium: provides a practical vector for corrupting integrity of time service, high: compromises integrity of the server itself) and whether they're exposed (a) in the default configuration, (b) in a configuration run widely on the Internet, or (c) in no configuration actually known to the project maintainers?

You clearly have the list somewhere, because everyone involved in the project has this statistic ready to quote.

If you don't have the severity and exposure breakdowns, that's OK. Post the list anyways. Maybe it'll be obvious what the severity and exposure is.

This business of counting vulnerabilities and claiming victories has been a problem for software security for two decades now. Ops people don't care about the vulnerability count, if the vulnerabilities left exposed in the codebase are the ones that get their servers popped.
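
A minimal sketch of the breakdown asked for above, assuming each vulnerability is recorded with the severity and exposure categories from this comment. The names (`Vuln`, `affects_fork`, `breakdown`, `exposed_high_risk`) are hypothetical, and no real vulnerability data is included:

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    LOW = "degrades time service"
    MEDIUM = "practical vector for corrupting integrity of time service"
    HIGH = "compromises integrity of the server itself"


class Exposure(Enum):
    DEFAULT_CONFIG = "a"          # exposed in the default configuration
    WIDELY_DEPLOYED_CONFIG = "b"  # exposed in a configuration run widely on the Internet
    NO_KNOWN_CONFIG = "c"         # no configuration known to the maintainers


@dataclass(frozen=True)
class Vuln:
    ident: str          # placeholder identifier, not a real advisory number
    affects_fork: bool  # True if the fork was ever vulnerable to it
    severity: Severity
    exposure: Exposure


def breakdown(vulns):
    """Count vulnerabilities by (affects_fork, severity, exposure)."""
    return Counter((v.affects_fork, v.severity, v.exposure) for v in vulns)


def exposed_high_risk(vulns):
    """Return the vulns an ops team would care about most: still affecting
    the fork, high severity, and reachable in a known configuration."""
    return [
        v for v in vulns
        if v.affects_fork
        and v.severity is Severity.HIGH
        and v.exposure is not Exposure.NO_KNOWN_CONFIG
    ]
```

With a list like this, the headline percentage becomes one number among several, and the rows that matter operationally (high severity, exposed by default or in wide deployment) can be read off directly.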

I'm sorry if I wasn't clear. I meant "before they were disclosed to NTP classic or NTPSec". In other words, by simply improving the software engineering practice, we eliminated classes of vulnerabilities without having to track them down individually. This is pretty common with ailing code bases, though often overlooked. I'm at work right now, so I don't have a comprehensive list handy. Going through NTP classic vulns and seeing how many never impacted NTPsec would recreate such a list.

The severity varies (many weren't that big, some were)... the point of claiming the victory is to demonstrate that I'm not just making a fuss about testing code, using static analysis tools, using an accessible code repository, and refactoring for lower attack surface and better separation of concerns because they are beautiful in the abstract. I like results. NTPSec, and before it the temporary "rescue" team, have been slowly chipping away at the big picture mess, making the code safer and more maintainable, because it's likely to remain in service for another decade or two.

Every time 14 vulns are disclosed and we are already immune to half of them, we get to put twice the effort into the half we do need to deal with, if we even need that much. We aren't just firefighting; NTPSec can develop proactively. That means something for our users.

Lots of personnel overlap here... the main difference being pre- and post-fork and where the funding came from, probably not interesting to most people.

No, I understood your meaning. I'm saying: that's what every code refactoring does. I'm saying that since you can't claim credit for eliminating vulnerabilities that are already disclosed, the emphasis you place on precluding vulnerabilities is strange.

Can you provide that list of vulnerabilities now? You're obviously keeping track of them, that being part of the premise of the project. I know you don't have them broken down, but we can help with that.

How about this: before I put the effort into generating the list myself, can you at least promise to confirm that I have the complete and accurate list once I do, and to fill in any gaps?
As a side note, I'd like to add a point that I highlighted in my O'Reilly Security Conference talk but previously forgot to mention here...

One of the coolest after-effects of this whole thing was that, after the fork, when NTP classic began feeling the pressure of competition, their speed in addressing security vulnerabilities increased incredibly. While I was sorry that it didn't happen on its own, I was pleased and impressed to discover what Mr. Stenn was capable of once his competitive hackles were raised.

Many people experience hurt feelings during a fork, and a fork represents a frustrating duplication of effort that I'd usually rather avoid. However, forking is a central tenet of the open source ethos for a reason. Competition can do incredible things. <3

If a primary purpose of forking ntpd was to give the original project a kick in the ass about fixing vulnerabilities, could it not be argued that your project has now served its purpose, and dollars could be better spent on building from the success of "NTP Classic" --- which, after all, is the version of NTP most likely to be deployed?
I would agree with you if NTP classic had fixed the total of its social and technological problems. Unfortunately, this is not the case. "Patching faster" is one small victory.
What percentage of "NTP Classic"'s problems are managerial/social and what percentage are raw technical?
> In the podcast, Sons depicted NTP as a faltering project run by out-of-touch developers. According to Sons, the build system was on one server whose root password had been lost.

> Stenn denied many of Sons's statements outright. For example, asked about Sons's story about losing the root password, he dismissed it as "a complete fabrication."

Unless either you or Stenn is outright lying (neither of which seems likely, on priors), this seems like a strange misunderstanding to crop up. Do you know what's going on with this?

I know what his side of the story is on that specific password. I don't think it's adequate, but... I also don't know that it's helpful to keep arguing this two years later. Casual contributors couldn't build the latest dev version of NTP due to repository access and build system problems, and the lead (effectively only active, at that point) maintainer couldn't or wouldn't fix the situation.

While the password problem made a good rhetorical flourish--it illustrated how the scaffolding supporting NTP development had been allowed to rot--the fact is that the server was in Mr. Stenn's control and he could have rebooted it to rescue media at any time, fixing the problem in a few minutes. Yet, the server was never properly brought up to good maintenance practice. I suspect that the majority of people reading this know how to reset a root password, so the password doesn't really matter that much in the grand scheme. The server was just another thing being neglected.

As I described in my O'Reilly talk, technical problems of this magnitude stem from social problems. The project didn't have a culture of sound engineering practice. I did what I could to work with Mr. Stenn to offer support and resources to bring that practice to his project. I didn't want to lose the years of institutional knowledge he'd acquired working on NTP. That's costly to replace. However, I wasn't going to forgo sound engineering practice to keep him on board: over time, smart people could learn the ins and outs of even the most tangled code base. The costs of bad engineering practice just keep coming, and I cannot force people to do the right thing, only lay out the costs and benefits then see what they choose.

That, and throw a little storytelling prowess at the problem now and again, in the hope of motivating people.

From the article (and the end result) it seems the "strategies" you talk about that were rejected include a total rewrite and an abandonment of features and platform support.

You can eliminate vulns and improve stability in a lot of ways. A total rewrite is definitely not the best way. Even if you're the best programmer in the world, rewrites often run into old bugs as well as new ones, and require a lot of testing and a lot of repeated effort.

And I can't speak for the other infosec nerds, but for me, name-dropping ESR does the opposite of inspiring confidence in a security-focused project. I wouldn't trust him to secure my shoelaces.

If the old codebase was really bad, perhaps an eventual rewrite would have been useful. But what would help existing users more is fixing the existing product so they can upgrade in place and be more secure, rather than forcing them to go through a whole product migration cycle just for better security.

The specific measures that were refused, from the rescue plan that Mr. Stenn rejected and my notes from that meeting:

* Moving NTP development from a private Bitkeeper repo to a public git repository accessible to everyone. The Bitkeeper repo requires everyone accessing it (10 people at most without purchasing private licenses, given that Network Time Foundation has only 10 licenses) to agree to a restrictive license that may interfere with their other development work. Stenn felt that tarball releases were sufficient, and did not agree that giving the public an opportunity to see code prior to release was important.

* Releasing patches for NTP vulnerabilities to everyone at the same time. NTP had a practice (Mr. Stenn never explained the reason for it to me) of releasing vulnerability patches to a closed group months or more ahead of the public release. These patches were typically leaked fairly rapidly and turned into exploits which were then used against NTP deployments in the wild.

There were other disagreements, but these were the big two technical disagreements upon which Stenn walked away. They were not points upon which I was willing to compromise, especially given that neither I nor other people in a position to help NTP could possibly have signed Bitkeeper licenses while maintaining our primary employment. This was a massive roadblock for increasing contribution to NTP, from us or anyone else.

If you look at the slides from my O'Reilly presentation here: http://slides.com/hedgemage/savingtime you will see that even when the rescue proceeded without Stenn, we did not do a major refactor! Slide #20 outlines the original rescue, which had 4 points:

* migration to git

* replacing the build system (when Stenn had been on board, we'd intended to repair the build system in place, but without the mystery scripts residing on his build box, we decided that a from-scratch replacement was more reliable and efficient than reverse-engineering and repairing the old one)

* updating documentation so that new developers could be onboarded

* fixing what vulns we could given limited resources

That is it. Refactors came later when, after this "rescue" work, Mr. Stenn declined to use these work products and the NTPSec fork was born.

We did make every effort to avoid a fork, but I could only offer help; I could not force anyone to take it. Forking is, in the end, the OSS community's last protection from failing projects.

So why not maintain a patchset to be applied to the original and keep it in your own repo? There's nothing you can do if someone is withholding security patches from you, but you could certainly release your own to the public.

Honestly, both of those sound like very common issues which do not result in whole new product forks. Large projects maintain patchsets and private security lists all the time. To me the ends don't justify the divergence.