Hacker News
by rburhum 3797 days ago
Yesterday I was being a bit of an ass to a few people about how "the whole point of using git is so that we can do decentralized code management, and why are these dependencies being pulled from our private GitHub if they could be sent point to point, yadda yadda yadda". Then they proceeded to go over the list of package managers and dependencies we used and I had to shut up. Even when we host our own Docker Hub and package managers (we do), if you dig far enough, you can find some dependency of a dependency of dependency that relies on GitHub. Brew/npm/build script/whatever. It is crazy how everything has changed so much in the past few years. GitHub went from something that was really nice to have to a core requirement for complex systems that rely heavily on open source.
6 comments

If this is a problem worth solving, you can absolutely solve it. The easiest would be through the use of a caching proxy and/or load balancing system.

The caching proxy system could be as simple as setting up a squid cache for apt. Multiple projects exist which do this already.

The load balancing system would involve keeping a private mirror of every repository in the dependency graph, and falling back to the mirror when GitHub fails. To automate this, proxy all git requests. If github is up, let the request pass through. If no mirror exists for the repository, create one. If GitHub fails, fall back to the mirror.
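
A rough sketch of that pass-through-then-fall-back logic, in Python. Everything named here is hypothetical (`MirroringProxy`, `UpstreamDown`, the injected `fetch_upstream` callable), and the "mirror" is just an in-memory dict; a real setup would presumably keep actual git mirrors on disk. It also refreshes the mirror on every successful request, which goes slightly beyond "create one if none exists".

```python
from typing import Callable, Dict


class UpstreamDown(Exception):
    """Raised by the (hypothetical) upstream fetcher when GitHub is unreachable."""


class MirroringProxy:
    """Toy model: pass requests through when upstream works, else serve the mirror."""

    def __init__(self, fetch_upstream: Callable[[str], bytes]) -> None:
        # fetch_upstream stands in for "ask GitHub"; it is injected so the
        # sketch makes no claims about any real API.
        self.fetch_upstream = fetch_upstream
        self.mirrors: Dict[str, bytes] = {}

    def fetch(self, repo: str) -> bytes:
        try:
            data = self.fetch_upstream(repo)
        except UpstreamDown:
            # Upstream failed: fall back to the mirror if we have one.
            if repo in self.mirrors:
                return self.mirrors[repo]
            raise
        # Upstream answered: create (or refresh) the mirror, then pass through.
        self.mirrors[repo] = data
        return data
```

A repository that was never requested while upstream was healthy has no mirror, so the failure still surfaces for it; pre-seeding the mirrors from the dependency graph would close that gap.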

> The easiest would be through the use of a caching proxy and/or load balancing system.

Sorry if I sound like an asshole, but before we actually use the word "easiest", can you please share with us how to do all of that? It is not as easy as you claim, to be honest. It's not just some /etc/hosts hijack.

They didn't say it was easy, just that it was easiest. Ostensibly comparing it to other solutions that they are aware of.
For the load balancing system I would assume you would also want to keep the mirrors up to date as well? So for every request if mirror exists and is out of date, update it.
Presumably that would be the responsibility of the mirroring system once a mirror is initiated.
The package system for the Rust language actually relies on GitHub, as many found out during the outage. I don't know if that will change; it probably will, with a read copy in a different git service. But I thought it was interesting because I use GitHub for everything save a few private projects, as I imagine most do. I'm not sure what to think of this. It seems backwards and grossly incompetent, yet here we are using it almost exclusively. It might be smart to decentralize some of this with torrents, if that's possible. Even if it were only the read portion of a repository, it seems like something to consider, if it hasn't been already.
> The package system for the rust language actually relies on github, as many found out during outage.

This is not quite correct (although close to it). Cargo doesn't rely on GitHub, but it expects that there is some publicly-accessible git repository from which it can pull the source for any crate, and most crates use GitHub. So it's not a particular choice of Cargo, but a side-effect of GitHub's popularity in the community, and the fact that Cargo does not host source code itself.

You're mixing up cargo features and the crates.io package system. Cargo does allow git dependencies but primarily you're supposed to use versioned crates.io packages, which are indeed provided by crates.io (even if they are actually hosted on S3 or whatever) not GitHub.
The crates.io index is in a GitHub repository. Does Cargo fetch it directly from GitHub or from a set of redundant mirrors?
If crates.io hosted the content themselves, it would just be the same problem, only with a service potentially less reliable than GitHub.
Unless you set up a system of mirrors. There are plenty of examples[1] to draw from.

1: http://mirrors.cpan.org/

Not if they acted like a mirror. Put it on GitHub, Bitbucket, and crates.io
It's not just the Rust language. To the best of my knowledge, even Packagist, the PHP package manager, relies heavily on GitHub for sourcing its packages. But I think they have other resources too, apart from GitHub.
Ruby's bundler doesn't entirely rely on Github, but pulling from a Github repo is a supported option that many take advantage of.
Rust's package manager doesn't source packages from GitHub (though it will pull packages from a git repo if you ask it to). What does live on GitHub is the source for its index of packages, which is a git repository: https://github.com/rust-lang/crates.io-index
> if you dig far enough, you can find some dependency of a dependency of dependency that relies on GitHub. Brew/npm/build script/whatever

But really, why?

Is it just institutional laziness on the part of all developers? We had reliable rsync CPAN mirrors in 1995. In the early days of the Internet, companies would mutually host secondary DNS for each other to be more reliable. For some reason, we've forgotten all about reliability and disaster recovery and geographical distribution. Now the collective programmer mindset with regards to global infrastructure seems to be "lol, we're too dumb to make things work, let's just outsource everything to closed source, for-profit companies and hope for the best."

I think a large part of this is that cloud hosting has allowed us to abstract those problems - reliability, disaster recovery, geographical distribution - away, and we don't really think of computers as computers anymore. It's a service or a platform or what have you, and the expectation is that it will always be there. I wouldn't say this is laziness, just a byproduct of changing how we view Internet architecture. We used to build systems to take care of reliability etc. because everyone had those problems. Now, those are only things you'll experience if you host your own stuff, or work for one of the big providers. (Broad assertion, I know, but I think it's mostly true.)
One has to keep in mind that there is no cloud. It's just someone else's computer.
Except that it is not. It is a redundant array of computers: if one goes down, another takes its place and all the apps running on it are migrated to the new hardware. And if the whole zone goes down, the apps are migrated to a different zone. If the whole region goes down, the apps can be migrated to a different region. The 9s are so high that you don't have to worry about hardware issues anymore, unlike when you are running your own hardware.
That's the theory (or the marketing pitch, depending upon perspective).

The reality can be rather different[1][2][3].

1. http://money.cnn.com/2011/04/21/technology/amazon_server_out...

2. http://www.zdnet.com/article/amazon-web-services-suffers-out...

3. http://www.theregister.co.uk/2015/09/20/aws_database_outage/

Or it could be literally an old desktop computer sitting in someone's damp basement on a DSL connection. The problem with just saying "the cloud" is you can't tell the difference.
Generally when people say the cloud, they mean one of the big Public/Private cloud providers, not someone's basement.
Exactly right, but over the past six years there's been a strong (and accelerating) trend among developers of "lalala we don't want to know how anything works! give us an API and go away."

Most developers I've seen reject even learning about networks or DNS or operating systems or databases. Such willful ignorance boggles the mind, but they are praised because their goals are shipping half-broken things as rapidly as possible to flip upwards for those oh-so-tasty acquihire payouts.

We even saw this week how overconsumption of convenience APIs can put entire companies in danger when those privately controlled convenience APIs just decide to shut down one day. Convenience of immediacy always seems to trump convenience of long-term stability.

>Exactly right, but over the past six years there's been a strong (and accelerating) trend among developers of "lalala we don't want to know how anything works! give us an API and go away."

I will argue that this trend has always existed. I'm sure you can find an x86/68k/z80 developer complaining that developers are going "lalala we don't want to know how anything works! give us the C language and go away."

I'm sure there are developers who couldn't imagine learning C without learning x86, and saw developers learning C without learning x86 as "willful ignorance".

Good abstractions will cause developers to simply gloss over how they work.

As programmers, we need to know at least one level below the abstractions to which we are programming. For example, if you program in C you need to know a little bit of assembly, how objects are laid out in memory, etc. This is how you write fast code, and it helps with debugging too.

But if you are programming in C and notice that something goes wrong with the hardware (for example, an instruction does not do something that it is supposed to do), you will have to ask for help, since it is someone else's work that is faulty. Sounds reasonable?

At least one level below. Hmm. That sounds more reasonable than full understanding. See my reply to nemo, though, for an alternative that I think is more reasonable. Basically, heuristics and simplified models.
Or work safely, effectively, and productively when taught how to properly use the abstractions. They can optionally be taught how they work underneath for better results. Yet, I don't have to teach people caches to tell them to group variables closely for performance. I likewise can give very basic explanations of stacks and heaps plus heuristics for using them. People still get the job done.

Functional programming proves my point even more where they don't know how the hardware functions or even use the same model. Yet, with good compiler and language design, they can make robust, fast, and recently parallel programs staying totally within their model. Most problems we pick up outside the abstraction gaps can be fixed in the tooling or with interface checks.

So, I think the common perception of people doing crap code while working within an abstraction is unjustified and even disproven by good practices in that area. Much like I would be unjustified in accusing assembly coders of being "willfully ignorant" or working within foolish abstractions because they didn't know underlying microprogramming or RTL. They don't need it: just knowledge of how to effectively use the assembly. Actually, I saw one commenting so let me go try that real quickly. :)

> couldn't imagine learning C without learning x86

One difference: C->x86 is a static translation layer. Other network/system things dynamically change out from under your "designed" system and alter threat/security/disaster/reliability/consistency models in a potentially unpredictable combinational fashion.

Saying "cloud abstraction" or "I trust this API and don't care how it works" is basically committing every https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu... and just saying "X can't break because we use provider Y who guarantees they can violate the laws of physics for us!"

The good reverend Laphroaig preaches:

If the 0day in your familiar pastures dwindles, despair not! Rather, bestir yourself to where programmers are led astray from the sacred Assembly, neither understanding what their programming languages compile to, nor asking to see how their data is stored or transmitted in the true bits of the wire. For those who follow their computation through the layers shall gain 0day and pwn, and those who say “we trust in our APIs, in our proofs, and in our memory models and need not burden ourselves with confusing engineering detail that has no scientific value anyhow” shall surely provide an abundance of 0day and pwnage sufficient for all of us.

An assembler elitist with a semi-fallacious argument. Let's rewrite that in view of a lower-level elitist to show it still looks true, shows love for assembler as foolish pride, and still fails to matter in face of good, high-level tools.

If the 0day in your familiar pastures dwindles, despair not! Rather, bestir yourself to where programmers are led astray from the sacred RTL/Transistor language, neither understanding what their assembly languages and microprograms compile to, nor asking to see how their data is stored or transmitted in the true bits of the CPU's network-on-a-chip and memory plus analog values and circuitry many run through at interfaces. For those who follow their computation through the layers shall gain 0day and pwn, and those who say “we trust in our assemblers, our C compilers, our APIs, in our proofs, and in our memory models and ISA models and need not burden ourselves with confusing engineering detail that has no scientific value anyhow” shall surely provide an abundance of 0day and pwnage sufficient for all of us.

Source: LISP, Forth, and Oberon communities who did hardware to microcode to language & OS all integrated & consistent. :P

Isn't this the nature of abstraction though? As the high level tools get increasingly powerful at solving common problems people will invest less in learning their underlying implementations.

I'm sure all the assembly programmers were complaining that the C programmers had no respect for "how anything works".

I mean a really simple solution (simple to say, maybe not to do) would be for package managers to require a "backup" repository from a different domain; then if you get a 500 error, try the second remote repository. Use git for its advantages.
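
Something like the following Python sketch is what I have in mind. It is not how any real package manager works: `fetch_with_backup` and the injected `http_get` callable (which returns a status code and a body) are made-up names for illustration.

```python
from typing import Callable, Iterable, Optional, Tuple


def fetch_with_backup(
    remotes: Iterable[str],
    http_get: Callable[[str], Tuple[int, bytes]],
) -> bytes:
    """Try each remote in order, moving to the next one on a 5xx answer."""
    last_status: Optional[int] = None
    for url in remotes:
        status, body = http_get(url)
        if status == 200:
            return body
        if 500 <= status < 600:
            # Server-side failure: this is the case where the backup helps.
            last_status = status
            continue
        # Anything else (404, 403, ...) is probably not an outage, so a
        # different remote is unlikely to fix it; fail loudly instead.
        raise RuntimeError(f"{url} answered {status}; not trying other remotes")
    raise RuntimeError(f"every remote failed; last status was {last_status}")
```

Treating only 5xx answers as "try the backup" is a design choice of this sketch; a real tool would also have to decide what to do about timeouts and DNS failures.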
I think you mean a mirror, and many package managers use them.
And people give me shit when I argue that open source projects should include 100% of dependencies.
I think that's a bit crazy as well. This is a problem if your build process happens often and requires pulling external data. Ideally, you want a way to cache that external data, and a way to force invalidation of that cache.

Building, at least after the first time, should not require external access. There are security reasons for this as well.

So your proposed solution is one of the only two hard problems in computer science? That should be a solid clue that you're wrong.

"There are only two hard things in Computer Science: cache invalidation and naming things."

-- Phil Karlton

By "a way to force invalidation of that cache" I didn't mean automatic invalidation, I meant a way to flag that you want it to re-download dependencies and store them for later use. I'm not sure where you got the requirement that it needs to be automatically determined by a computer from my comment. I was thinking the "cache" could be as simple as the person setting up the build environment downloading the dependencies and configuring the build to use them. That's a local cache, when discussing automatic downloading of dependencies during building.

Set up your build environment with whatever manual intervention is required so that it can run without downloading remote resources. Build as needed. There is no reason for, and many reasons against, downloading dependencies during the build process, but that doesn't necessitate duplicating those dependencies within your own source tree. As long as there are directions on how to download a specific, definitive version of the dependency, whether that is automated or not isn't really a big deal if it's done infrequently.
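
To make the "local cache with a manual refresh flag" idea concrete, here is a minimal Python sketch. The function name, the `cache_dir` layout, and the injected `download` callable are all assumptions for illustration, not any real build tool's behavior.

```python
from pathlib import Path
from typing import Callable


def get_dependency(
    name: str,
    version: str,
    cache_dir: Path,
    download: Callable[[str, str], bytes],
    refresh: bool = False,
) -> bytes:
    """Return a pinned dependency from the local cache, downloading only when asked or missing."""
    cached = cache_dir / f"{name}-{version}"
    if cached.exists() and not refresh:
        # Normal builds never touch the network once the cache is populated.
        return cached.read_bytes()
    # First setup, or a human explicitly asked for a re-download.
    data = download(name, version)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(data)
    return data
```

There is no automatic invalidation here at all: the only way the cached copy changes is someone passing `refresh=True`, which is the point.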

wow, i never realized cache invalidation was one of the ONLY two hard problems in CS
The quote is supposed to be, two hard problems: cache invalidation, naming conventions and off-by-one errors.
It's not. If you cannot even run that first build, then you actually have nothing to work on.

Also, non-frozen dependencies mean you are at the mercy of any dependency change breaking your build at any time.

With that, even if your first build runs, fetches those deps, and builds at T1, it is not guaranteed at all that the build will work at T1+n.

There is a big difference between your team working from trunk and your team being dependent on other projects' trunks.

Just because you're downloading your dependencies at runtime doesn't mean you have to have non-frozen dependencies or non-repeatable builds... that's one of the advantages of pulling dependencies out of a Git repository; specify a specific revision to build against and that code is guaranteed* to not change. Pulling dependencies from Git doesn't mean you're working against trunk.

Now, if you're doing this with mission-critical software, you should probably be maintaining mirrors of those dependencies locally on infrastructure you control, but, again, that's another of the things that Git makes easy.

You should never be dependent on a reference that can move, unless you're willing to accept the consequences (that includes branches in any version control system, tags if you don't have infrastructure to verify that they haven't changed, external non-version-controlled downloads, etc.).

Basically, what you should learn here is that you shouldn't build your business around a third-party service's continued availability. Especially if it's a third-party service where you're not paying for an SLA, like Github. Reproducibility of builds is a different issue, and including 100% of your dependencies in your own source repository is not the only solution to it.

* Barring a SHA-1 collision, which is highly unlikely with Git.
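
For the curious, the reason a pinned id can't silently change is that Git names objects by a hash of their content. For a file (a "blob"), the id is the SHA-1 of the header `blob <size>\0` followed by the bytes. A small Python sketch follows; the `matches_pin` helper is my own hypothetical addition, and commits are hashed along the same lines over a different header and payload that this sketch does not model.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the id Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def matches_pin(content: bytes, pinned_id: str) -> bool:
    """Hypothetical check: does the downloaded content match the id we pinned?"""
    return git_blob_id(content) == pinned_id
```

As a sanity check, `git_blob_id(b"")` gives `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, the id Git reports for an empty file.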

> It's not, if you can not even run that first build then you actually have nothing to work on.

Obviously you can run the first build. You wouldn't be using Github if you never got it working in the first place.

To clarify, setting up the build environment may require network access, but if the process of building requires it, there are many places where it can go wrong, both operationally and security wise.

> Also, not frozen dependencies means you are at the mercy on any dependencies changes breaking your build at any time. ...

I agree, but that's a separate discussion and doesn't really apply here. There's nothing preventing the pulling of a specifically tagged version for builds. If someone's build process that used Git for dependencies is not doing this, whether they are using Github or some internal server is irrelevant, the same problems apply.

how far down the stack do you go? do open source projects need to include their own compiler? what would compile it?
I suggested how far they need to go in context of Debian's reproducible builds posts:

https://news.ycombinator.com/item?id=10182282

That would solve readability, plenty of subversion, verifiability, much of portability, and perform anywhere from OK to good. Not going to happen but academics and proprietary software already did it to varying degrees. As post noted, traceability & verification from requirements to specs to code to object code is a requirement for high assurance systems. My methods, mostly borrowed from better researchers, are the easiest ones to use.

I don't have to bootstrap anything that my distro is already shipping. If I'm using GCC, my .spec file has a BuildRequires tag that tells rpmbuild to make sure an acceptable version is present (from my RPM mirror).

If I'm using some obscure tool that my distro doesn't package, that's when I mirror the version I'm using, and build my own RPM from source if it needs to be deployed to prod servers rather than merely run from rpmbuild.

* source code

* static libraries

* dynamic libraries

Provide compiled libs for the platforms of your choice. Preferably all three of Windows, OS X, and Linux. Users can issue pull requests if there is a platform or variant they wish to add.

Same here, but I don't care: all deps HAVE TO be in the repo, period.

In fact I go further than that: anything that a project depends on HAS TO be "saved" somehow, somewhere. Use a special commercial tool? Save it. Use some particular OS? Save the ISO. Need a particular version of a compiler / SDK? Have an installer ready, etc.

But nowadays it seems devs program temporary stuff meant to last just a few months.

If you personally run software for which reliability is important, absolutely you should maintain your own vendor repos. Open source projects are not in that position, and following your advice would lead to much harmful coupling and repetition.
That's a good point. I've been ignoring learning Git as long as I can but almost everything on my todo list heavily uses it. Or ties into it as you said. So, I'm going to have to bite the bullet and learn it.

Yet, I swore Git fans told me its decentralized design avoids single points of failures where everyone has a copy and can still work when a node is down, just not necessarily coordinate or sync in a straightforward way. This situation makes me think that, either for Git or just Github, there's some gap between the ideal they described and how things work in practice. I mean, even CVS or Subversion repos on high availability systems didn't have 2 hours of downtime in my experience.

When I pick up Git/Github, I think I'll implement a way to constantly pull anything from Git projects into local repos and copies. Probably non-Git copies as a backup. I used to also use append-only storage for changes in potentially buggy or malicious services. Sounds like that might be a good idea, too, to prevent some issues.

I'm sorry to be rude, but, it sounds like you should go learn Git and come back to this conversation.

The decentralized design does avoid single points of failures, and everyone does have a copy. So - check, check, great. Unfortunately (maybe..) everyone has put their master repos in the same place, which somewhat counteracts the decentralization. But there is certainly no immediate coupling between the Git repository on your computer and the Github repository it's pulling from. It's not like Github being down in any way prevents you from working on code you've already checked out, unless you need to go check out more code.

(The same obviously may not be true for package managers and build scripts that are not running in isolation from your upstream repository, which is where the problems have arisen.)

"I'm sorry to be rude, but, it sounds like you should go learn Git and come back to this conversation."

It looks like it.

"The decentralized design does avoid single points of failures, and everyone does have a copy. "

So, like many decentralized systems I've used, a master node gets worked around by other nodes who communicate in another way? Or would some retarded situation be possible where...

"Unfortunately (maybe..) everyone has put their master repos in the same place, which somewhat counteracts the decentralization."

...one node going down could prevent collaboration? Oh, you answered that. That sounds better than CVS but shit by distributed systems standards. I'll still learn it anyway since everyone is using it. Probably in next week or two.

No, it's not the same as a distributed system with master/slave nodes. The child nodes can function entirely in isolation from the parent. If you wanted to, you could treat another coworker's node as your master and download/upload to that. It's usually easier to have a tree structure where the root is your master repo, its children are your build servers or whatever, and the leaves are development machines. But that's entirely reconfigurable.

It's not surprising at all that if you make a master repo at the root of the tree, and it goes down, then you can't communicate with it. But it doesn't prohibit any communication between other nodes, or re-wiring the tree, and it definitely doesn't inherently block development work on any of the other nodes.

It just so happens, though, that people's build scripts and package managers like to refresh packages from the root and don't handle failure modes of that operation very well. That's the only place problems emerge, besides the obvious fact that if your public releases of software go through the root, and the root is down, then you can't release until it's up. But you could easily make a new root if you wanted to.

"It just so happens, though, that people's build scripts and package managers like to refresh packages from the root and don't handle failures modes of that operation very well. "

That's the critical part. So, countering this risk is apparently a manual thing if one uses off-the-shelf tooling for Git. I'll just have to remember to look at that if I do a deployment. Put it on a checklist or something.

>So, countering this risk is apparently a manual thing if one uses off-the-shelf tooling for Git.

Not so much off-the-shelf tooling for Git; it's more off-the-shelf tooling for Node/Ruby/Go/Rust/PHP.

Nothing about Node's npm really requires it to depend on GitHub alone; in fact, I think you can use any Git repo. It's just that most tend to use a single Git repo, and there is no way to configure mirrors.

This is a social problem, not a technical one.
It's a pebkac issue. The software is fully capable of having multiple remotes, but it's rarely used that way.
Is there an easy config for that? Suppose I want to push to eg github and bitbucket (without sharing my creds with ifttt or similar)? Is a post-receive hook on a local pseudo-master the way to go?
See, for example, here: http://stackoverflow.com/questions/14290113/git-pushing-code...

    git remote set-url --add --push origin git://original/repo.git
    git remote set-url --add --push origin git://another/repo.git
Lol. Nicely put.
Git works as advertised, but when your build processes all start with a sync from the upstream master (the equivalent of "svn up"), and a lot of build scripts require that to work, they've thrown away that advantage when building.

Everyone with a checked out repo should have been able to develop and commit, branch and merge locally fine though.

Thanks for the clarification. This is the exact sort of thing I was wondering about.
> either for Git or just Github, there's some gap between the ideal they described and how things work in practice

The hub-spoke topology is the easiest way of distributing source code to a lot of people. If the hub goes down, this is what happens. If that leads to a halt in productivity, then that is a failure in contingency planning. Git gives you many tools to distribute your workflow, but that won't save you if your workflow is centralized around Github.

Granted, sometimes you don't really have a choice whether to depend on Github, such as when working with language package managers. Perhaps that goes to show that mirroring and resiliency should be a design consideration in those tools, but it's not a shortcoming of Git itself.

> even CVS or Subversion repos on high availability systems didn't have 2 hours of downtime

It's easier than ever to have HA with a DVCS: clone the repository somewhere else and keep it in sync with commit hooks.

Large FOSS projects (should) do this by keeping a self-hosted repository and mirroring it somewhere else like Github, Bitbucket, etc. Internally, an org should be able to quickly stand up an SSH or HTTP server for the purpose, or have collaborators push and pull directly from each other. Worst case? Send patches. Git apply works really well, and you might be surprised at how clever git-merge is when everyone finally syncs up.

That's what it means to be distributed: there is no real concept of a "central" node, unlike Subversion. Every local checkout has a full copy of the repository history. Any centralization is a (somewhat understandable) incidental artifact of how Git is being used.

Makes sense. I'll try to remember that for my future checklist. Thanks for the details. Btw, your site is down on my end from 2 browsers on my desktop and one on mobile. Might want to look into that as the rest are working.
> Btw, your site is down on my end

Hah, because it's been defunct for a while now. Thanks for the reminder, removed it from my profile.

Cool
> I used to also use append-only storage for changes in potentially buggy or malicious services. Sounds like that might be a good idea, too, to prevent some issues.

In a certain sense, git is "append-only". If you change a commit in history, that commit and every descendant commit will have its SHA hash changed. Naturally this will conflict with other copies of the repository.
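
A toy model of why that happens, in Python. This is deliberately NOT Git's real commit format (real commits also hash the tree, author, dates and so on); `commit_id` and `history_ids` are made-up names. The only property it keeps is that each id is computed from the parent's id.

```python
import hashlib
from typing import List, Optional


def commit_id(message: str, parent: Optional[str]) -> str:
    """Toy commit id: a hash over the parent id and the message (not Git's format)."""
    payload = f"parent {parent or ''}\n\n{message}".encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def history_ids(messages: List[str]) -> List[str]:
    """Ids for a linear history, oldest first; each id depends on the one before it."""
    ids: List[str] = []
    parent: Optional[str] = None
    for message in messages:
        parent = commit_id(message, parent)
        ids.append(parent)
    return ids
```

Comparing `history_ids(["init", "add feature", "fix bug"])` with the same list where only the middle message is edited, the first id stays the same while the second and third both change, which is why a rewritten history can't be passed off as the old one.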

For backups you should do a "git clone --bare" which checks out the internal git structure with data and history, but not the actual files.

I figure it's append only at protocol level. Usually a smart idea for SCM. Is that still true when the whole datacenter goes down in mid-operation? Typically varies from implementation to implementation of the concept.
Git is to GitHub as JavaScript is to Java. Though their names are similar they are very different things.
git != github
Hence Git/Github in my comment. I already know there's a difference. I just don't know much more than that until I learn the two.
Github is to git as Sourceforge is (used to be) to subversion, but with a better UI.

And yes, there have been concerns raised about what would happen if Github took a turn like Sourceforge, which usually get brought up when information about new shady practices at Sourceforge come up (or they get rehashed here).

Makes sense. I'm quite interested in seeing where it goes over time. I think it will depend a lot on the nature of the company. If it's VC-funded & aiming for acquisition, then there's a decent chance of Sourceforge history repeating. Otherwise, it might stick around as a beneficial ecosystem. Time will tell.
If you understand the difference between the two, you'd realize your comment makes no sense. The fact that github went down due to a power failure has nothing to do with git as a solution.

The fact that everyone uses git more or less the same as svn is the problem. Git is decentralized, but because so many people rely on github most don't ever use the decentralized aspect to it.

If you understood my comment, you'd know I don't understand the differences between the two that much since I haven't studied them yet. Been clear in a few comments on that. The reason I associate them here is that most projects I see don't just use Git: they use Github, too. So, I briefly wonder and get feedback about how inherent Github-style downtime was or if it was configuration/deployment issues.

Several commenters helpfully described how Git can easily prevent stuff like this and that project-level stuff is why this is a liability. That's good to know as it's already a selling point to management types for a solution like it. Can just ensure the problem doesn't show up in a local deployment by a wiser configuration.

I understood your comment just fine, but the opinion you had formed was based on false assumptions, so I was trying to correct it, that's all.

Personally I try not to form strong opinions about things I haven't actually learned or understood yet.