by annowiki 1337 days ago
I come from a Python background so libraries were always just a pip install away.

In the past year, however, I've been working on a large C++ codebase (couple million lines) and the result has been a considerably greater amount of "roll your own."

This has filtered back into my Python. If it's not in the standard library, I don't install it unless I really need it.

My boss always tells me that you can probably write it faster than a library, something I never used to believe. Until I tried it. I needed to check which of two version strings, in XX.XX.XX.XXXX format, was bigger. I tried the most recommended version number library, then I tried writing my own solution, the simplest one I could think of:

    def version_compare(v1: str, v2: str) -> int:

        """
        Compares two versions.

        Parameters
        ----------
        v1: str
        v2: str

        Returns
        -------
        int
            1 if v1 is greater, -1 if v2 is greater, 0 if equal.
        """
        for el1, el2 in zip(v1.split('.'), v2.split('.')):
            if el1 != el2:
                return 1 if int(el1) > int(el2) else -1
        return 0

My code was about 20x faster. Libraries are bloated and you probably only need a small subset of the functionality, so "write your own code" has become my mantra.
8 comments

Another benefit, which IMO is as big if not bigger, is that you can easily make changes to this code, whereas changes on the library side may be hidden behind configuration, impossible altogether, or, god forbid, require a wrapper anyway. In the above example, if you wanted this to return the larger version string instead, that's an easy change, and the PR makes it very obvious what the change is doing.
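A rough sketch of that change, keeping the original post's fixed XX.XX.XX.XXXX assumption. The name `larger_version` and the choice to return `v1` on a tie are made up for illustration:

```python
def larger_version(v1: str, v2: str) -> str:
    """Return the larger of two 'XX.XX.XX.XXXX' version strings (v1 on a tie)."""
    for el1, el2 in zip(v1.split('.'), v2.split('.')):
        if el1 != el2:
            # Same numeric comparison as the original; only the return value changes.
            return v1 if int(el1) > int(el2) else v2
    return v1  # all components equal: arbitrary choice, assumed for this sketch
```

The diff against the original is two lines, which is the point being made above.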
As a long time C++ developer, the modern trend of library package managers has always seemed insane to me. In C and C++ integrating with a dependency is no trifling matter. Beyond just simply making your executables bigger, you're also making the build process more complex, you're locking yourself to an interface, and you probably need to keep special considerations in mind; for example, some libraries need you to call a function when you load them. Therefore, people don't make trivial (i.e. things anyone can do in a few minutes) libraries, and you don't add dependencies unless you really need to.
When I started playing with C/C++, the first thing I complained about was how difficult package management was. Conan is not as simple as cargo or pip. It's often simpler to find a header-only library and plop it into your repo, but you still have to modify build configurations to include it.

This is the first time I thought to myself "Maybe that's actually a good thing."

When you're prototyping, a package manager can be really convenient, because it lets you try things out really quickly without investing a lot of time. But once you have most things in place you really just want something that will remain stable for a long time.
This returns 0 for '1.2.3' and '1.2'. That might be fine if you are sure they have the same number of components, but otherwise it is worth checking.
Another, slightly more general approach turns each string into a tuple of numbers and then uses tuple comparison. This gets "1.2.3" > "1.2" right, but it does yield "1.2.0" > "1.2", which may or may not be what you want:

  v1 = tuple(map(int, v1.split('.')))
  v2 = tuple(map(int, v2.split('.')))
  return (v1 > v2) - (v1 < v2)
(In Python 2 days, the last line could have been `return cmp(v1, v2)`, but sadly cmp() and .__cmp__() were removed in Python 3.)
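To make the difference concrete, here is a small side-by-side sketch. The function names are hypothetical, and the zero-padding variant is just one possible answer to the "1.2.0" vs "1.2" question, not something from the original post:

```python
from itertools import zip_longest


def compare_zip(v1: str, v2: str) -> int:
    # Original approach: zip() stops at the shorter list,
    # so trailing components of the longer version are ignored.
    for el1, el2 in zip(v1.split('.'), v2.split('.')):
        if el1 != el2:
            return 1 if int(el1) > int(el2) else -1
    return 0


def compare_tuple(v1: str, v2: str) -> int:
    # Tuple approach: a longer tuple wins when the shared prefix is equal.
    t1 = tuple(map(int, v1.split('.')))
    t2 = tuple(map(int, v2.split('.')))
    return (t1 > t2) - (t1 < t2)


def compare_padded(v1: str, v2: str) -> int:
    # Assumed semantics: missing components count as 0, so "1.2.0" == "1.2".
    p1 = [int(x) for x in v1.split('.')]
    p2 = [int(x) for x in v2.split('.')]
    for a, b in zip_longest(p1, p2, fillvalue=0):
        if a != b:
            return 1 if a > b else -1
    return 0
```

With these, `compare_zip("1.2.3", "1.2")` gives 0, `compare_tuple("1.2.0", "1.2")` gives 1, and `compare_padded("1.2.0", "1.2")` gives 0. Which of those is "right" depends on your versioning scheme.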

And both of these implementations demonstrate an advantage of using someone else’s library: it has probably had more care put into it. And implementing these sorts of things yourself often leads you to compromise on functionality—though at the same time, using a misfitting library also leads to compromises. It can always go both ways.

I think you're demonstrating why libraries are slower.

For the specific use-case the OP had, the sorts of comparisons you're doing are unnecessary, so a custom-tuned compare is faster.

My experience is that it’s frightfully common for people to unintentionally oversimplify their implementation, or to forget that their implementation is too specific and improperly use it in a more general context. For example, to think they are only dealing with /^[0-9][0-9][.][0-9][0-9][.][0-9][0-9][.][0-9][0-9][0-9][0-9]$/ (and if the syntax really is that simple, then simple str comparison would be enough, as another comment suggested), but then discover somewhere down the line that perhaps some of the two-digit components can actually grow to three digits, or perhaps some suffix is added, or an additional component; or perhaps just use it for a different type of version string somewhere else in the program. Perhaps years later. (Mind you, if it violates your format expectations it may be better to immediately raise ValueError rather than using potentially-different semantics as you’d get if you used, say, packaging.version.parse().) As it stands, the provided implementation didn’t mention its limitations in its documentation. That is bad and makes it much more likely to be used improperly. It should have said something like “compares version numbers in 'XX.XX.XX.XXXX' format”.
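A minimal sketch of what that could look like, assuming the format really is exactly XX.XX.XX.XXXX with ASCII digits. The names `parse_fixed_version` and `version_compare_strict` are hypothetical:

```python
import re
from typing import Tuple

# Assumption: exactly two, two, two and four ASCII digits, separated by dots.
_FIXED_VERSION = re.compile(r"[0-9]{2}\.[0-9]{2}\.[0-9]{2}\.[0-9]{4}")


def parse_fixed_version(v: str) -> Tuple[int, ...]:
    """Parse a version in 'XX.XX.XX.XXXX' format; raise ValueError otherwise."""
    if _FIXED_VERSION.fullmatch(v) is None:
        raise ValueError(f"expected 'XX.XX.XX.XXXX' version, got {v!r}")
    return tuple(int(part) for part in v.split('.'))


def version_compare_strict(v1: str, v2: str) -> int:
    """Compares version numbers in 'XX.XX.XX.XXXX' format.

    Returns 1 if v1 is greater, -1 if v2 is greater, 0 if equal.
    Raises ValueError if either string is not in that format.
    """
    t1, t2 = parse_fixed_version(v1), parse_fixed_version(v2)
    return (t1 > t2) - (t1 < t2)
```

This costs a regex match per call, so it will be slower than the bare loop. The trade is that a format change fails loudly instead of silently comparing wrong.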

(I’m not speaking for BYO or library philosophies, merely describing considerations and caveats of both.)

"I think you're demonstrating why libraries are slower."
… because they’re probably better thought-out and less fragile?
GP had made clear the format was fixed, and writing 1 or 2 tests for it takes seconds, so I don't really agree with your conclusion.

Personally I mostly wanted to offer a warning to somebody else who might take this code.

For our purposes, we never have strings in any form other than XX.XX.XX.XXXX, so there was no reason to generalize. That improves the speed, makes the code easier to read, and provides all the more reason not to use a library.
Meanwhile I've had to make changes to hand-rolled version comparison code in multiple systems because the original developers didn't account for the second segment of the version being greater than 9.

I'm absolutely not saying that you should have used a library, but pointing out the other side of the problem. If they had used an existing library to handle the versions, it wouldn't have had this issue. And test cases wouldn't help here because if they didn't think to code for multiple digits in the second spot, they likely wouldn't have thought to test for it.
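A sketch of that failure mode, for illustration only. The actual hand-rolled code isn't shown here, so `naive_compare` (plain string comparison) is a made-up stand-in for that class of bug, and `numeric_compare` is one possible fix; both names are hypothetical:

```python
def naive_compare(v1: str, v2: str) -> int:
    # Hypothetical buggy version: compares the raw strings character by character.
    return (v1 > v2) - (v1 < v2)


def numeric_compare(v1: str, v2: str) -> int:
    # Compares component by component as integers.
    t1 = [int(p) for p in v1.split('.')]
    t2 = [int(p) for p in v2.split('.')]
    return (t1 > t2) - (t1 < t2)


# Pairs that cross a digit-count boundary; the left side is the older version.
BOUNDARY_CASES = [("1.9", "1.10"), ("1.99", "1.100"), ("9.0", "10.0")]

for older, newer in BOUNDARY_CASES:
    assert numeric_compare(older, newer) == -1
    # The naive version gets every one of these backwards.
    assert naive_compare(older, newer) == 1
```

Tests like these only help if someone thinks to write them, which is the point above: crossing 9 to 10 or 99 to 100 is exactly the case that tends to get missed.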

There are enough benefits and drawbacks to library vs roll-your-own that I don't rigidly stick to one way or the other.

Or you would have the same issue and the library developer wouldn't have the time or inclination to fix it...

Also, good software engineering practice would dictate putting it in one function that is used everywhere, so you only need to fix it once.

Now think about how (some of) these libraries came to be:

Someone like you wrote something like the above for their needs. Then they needed it again in another program. Then in another. They figured, why not put this into a library I can use across all my projects. Maybe this was internal only. Maybe at some point they thought: why not push this out there to github and upload it to a public repo and let other people use it.

Then someone (maybe the same person) needs not just XX.XX.XX.XXXX with actual numbers. For whatever reason their standard is vXX.XX.XX.XXXX[-RCX] or other variations that aren't just a simple number format.
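Here is a hedged sketch of what the code might turn into once that vXX.XX.XX.XXXX[-RCX] format shows up. The ordering rule (a release candidate sorts before the final release of the same number) and the name `version_key` are assumptions for illustration, not anyone's actual library:

```python
import re
from typing import Tuple

# Optional leading "v", dot-separated numbers, optional "-RC<n>" suffix.
# re.ASCII restricts \d to the digits 0-9.
_VERSION = re.compile(r"v?(\d+(?:\.\d+)*)(?:-RC(\d+))?", re.ASCII)


def version_key(v: str) -> Tuple[Tuple[int, ...], float]:
    """Return a sortable key for versions like 'v01.02.03.0004-RC2'."""
    m = _VERSION.fullmatch(v)
    if m is None:
        raise ValueError(f"unrecognized version string: {v!r}")
    numbers = tuple(int(p) for p in m.group(1).split('.'))
    # Assumption: a final release outranks every release candidate,
    # so a missing RC suffix is treated as "infinitely late".
    rc = int(m.group(2)) if m.group(2) is not None else float('inf')
    return numbers, rc
```

Usage would be something like `sorted(versions, key=version_key)` or `version_key(a) < version_key(b)`. Already there are three policy decisions baked in that the original five-line loop never had to make.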

Of course, if you never need that kind of version number (or whatever other "problem area" we apply this principle to) and only ever write one piece of software, you're fine rolling your own. But this is (one way) how libraries get "bloated". Because they are libraries (or frameworks, if we think a little wider), they need to take care of "all the problems" and not just one very specific one. That is especially true of an open-source library, as opposed to an in-house one, since it attracts lots of people with lots of different variations on the same problem. It's very hard to hold a very opinionated stance of "this is what you get, change your version number standards, I'm not changing my library". Of course you can do that, but your library will probably end up in obscurity. The "can deal with anything you throw at us" library is probably going to be more popular, and if there's no Linus-type guy doing code reviews before accepting contributions, you end up with sub-optimal code very quickly.

I've done the opposite before. I came to a code base that had many variations on sending email notifications for batch jobs. Like literally 20 jobs with 20 variations (well, maybe it was 15 different ones with different types of bugs and quirks). I extracted them into an in-house library for sending a standardized version of these emails. In the end it had accumulated quite a few features for adjusting the email template, attaching various formats of output, automatically zipping it up etc. I'm pretty sure you might call it "bloated". In reality it made things better (consistent) for the users receiving those emails, made errors easier for us to understand, and made new jobs faster to code, with just one standard function to call that everyone knew how to use and that had had its bugs ironed out over time. Import library, call function. That beats deciding which other project to copy the email sending code from, finding out where in the mess of code it was, and then throwing it away again because "Oh yeah right, that project's version of it can't deal with this type of attachment" etc.

TIL Python type hints: https://docs.python.org/3/library/typing.html

Thanks.

Little bit off topic perhaps but why encode the types a second time in the comment when you already have the annotations?
It's a NumPy-style docstring. PyCharm can be configured to auto-generate them. I guess I just add the types to be thorough.

Ideally I want to generate API documentation via a tool. A long time ago I wrote a script to parse the AST and generate markdown from the docstrings and function information, but it was mediocre, and I haven't wanted to use Sphinx because it's too heavy and doesn't seem to produce ideal markdown output.

numpy seems to sometimes use type hints in the docstring, other times not: https://numpydoc.readthedocs.io/en/latest/format.html#parame...

This was one of the interview questions for my first ever programming internship at 19.
Why do you split the strings instead of comparing them directly?

  >>> '99' > '11'
  True
  >>> '100' > '11'
  False
Oh, I understood that the number of digits was always the same.
As described it probably is, but using string comparison would make what is already fairly fragile (they must have the same number of components) even more fragile and sensitive to change (they must now always have dots and digits in the same place, and the failure mode is now completely invisible). More stuff to be aware of. In practice I find precious few formats like this that are actually fixed-width, and more than a few pieces of software have struggled to go from version 9 to 10 or 99 to 100 because of bad assumptions. The same goes for date formats: a two-digit year was fine until 2000, when you either wrapped around to zero again or went up to 100 (and some date APIs have modelled years as the number of years since 1900 in this way).