by bkanber 3652 days ago
So, you build a web app and it gets popular. It needs one load balancer, 5 app servers, at least two database nodes for replication, a redis cluster for caching and queuing, an elasticsearch cluster for full text search, and a cluster of job worker servers to do async stuff like processing images, etc.

In the ancient past, like when I'm from, you'd write up a few different bash scripts to help you provision each server type. But to set it all up, you'd still have to run around, create 20 servers, and provision each one as one of 5 different types.

Then there's chef/puppet, which takes your bash script and makes it a little more maintainable. But there are still issues: huge divide between dev/prod environments, and adding 5 new nodes ASAP is still tedious.

Now you have cloud and container orchestration. Containers are like the git repos of the server world. You build a container to run each of your apps (nginx, redis, etc), configure each once (like coding and committing), and then they work identically on dev and prod after you launch them (you can clone/pull onto hardware). And what's more, since a container image is pre-built, it launches on metal in a matter of seconds, not minutes. All the apt-get install crap was done at image build time, not container launch time.

Things are a lot easier now, but you still have a problem. You're scaling to 30, maybe 50 different servers running 6 or 7 different services. More and more you want to treat your hardware as a generic compute cloud, but you can't escape that, even with docker, your servers have identities and personalities. You still need to sit and think about which of your 50 servers to launch a container on, and make sure it's under the correct load balancer, etc.

That's where Kubernetes steps in. It's one level of abstraction above Docker and works at the cluster level. You define services around your Docker containers and let Kubernetes initialize the hardware and abstract it away into a giant compute cloud. Then all you have to do is tell Kubernetes to scale a certain service up or down; it automatically figures out which servers to act on and updates that service's load balancer accordingly.
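To make the "figures out which servers" part concrete, here is a toy sketch in Python. It is not Kubernetes' actual scheduling algorithm; the `Node` class, the `schedule` function, and the slot capacities are hypothetical names made up for illustration. All it does is place each new replica on whichever node currently has the most free capacity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical server with a fixed number of container slots."""
    name: str
    capacity: int
    containers: list = field(default_factory=list)

    @property
    def free(self) -> int:
        # Slots not yet occupied by a container.
        return self.capacity - len(self.containers)


def schedule(nodes, service, replicas):
    """Place `replicas` copies of `service`, each on the node with the most free slots.

    Returns the names of the chosen nodes, in placement order.
    """
    placements = []
    for _ in range(replicas):
        # max() returns the first node among ties, so placement is deterministic.
        target = max(nodes, key=lambda n: n.free)
        if target.free <= 0:
            raise RuntimeError("cluster is full")
        target.containers.append(service)
        placements.append(target.name)
    return placements


if __name__ == "__main__":
    cluster = [Node("a", 2), Node("b", 3), Node("c", 1)]
    print(schedule(cluster, "web", 4))
```

The point of the sketch is only that the caller says "4 replicas of web" and never names a server; a real orchestrator weighs far more than free slots.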

At the scale of "a few servers", Kubernetes doesn't help much. At the scale of dozens or hundreds, it definitely does. "Orchestration" isn't just a buzzword, it's the correct term here; all those containers and services and pieces of hardware DO need to be wrangled. In the past it was a full time sysadmin job, now it's just a Kubernetes or Fleet config file.

Disclosure: I'm currently writing a book on Docker. Disclaimer: I have not had my coffee yet.

Edit: Since someone asked, I'm writing a book called "Complete Docker" which will be published by Apress. I don't know the exact pub date that Apress will launch it on, but I expect it'll be available in October.

16 comments

Given the quality submission, I feel like it would be fair to include a link to the book you're writing (or a title so we can search for it when it's done). Disclosure: I want to buy it.
Wow, thanks, that is an excellent explanation. I appreciate that it starts with things I know and understand ("so you build a web app"/"bash scripts"/"puppet") and builds on that, explaining what problem each consecutive step (layer of abstraction) solves.

Now I wonder... how many projects actually need this kind of solution when even StackOverflow can do without it (they are in the range of a few servers)? I would imagine it would be only a few of the most popular web apps/services, but judging by the popularity of these posts it is probably a lot more...

How many people actually need it? Nobody! We managed to run servers for years without containers or orchestration tools. Docker is a new technology.

Does it make things better, though? Yes. Yes it does.

Kubernetes was designed at Google where they reaaaaally feel their scale problems day to day.

I see Docker/Kube/CoreOS/ etc as the natural evolution of where we were already going. Bash -> puppet -> vagrant -> docker -> kubernetes. Less abstract to more abstract.

So it's actually "only" an incremental evolution in terms of managing the server ecosystem. But it's a revolutionary improvement in how we think about server ecosystems, which is why many people struggle with Docker et al at first; it's a brand new mental model.

I'd like to add that this isn't necessarily even about huge scale. It's about scaling deployment patterns too, which is different than scaling services.

Sometimes you need the ability to run more services when you're "web-scale", but even if you're not at that point, scaling how you deploy new versions and new services is still important.

Tools like Docker and Kubernetes really help with the application delivery aspect, and really enable you to rapidly iterate on your project.

You might have a site with 100-1000 users, not huge by any means, and 1 server could probably handle all of your needs. But once you start adding other components, perhaps redis, or a runtime like nodejs, those can all be managed, but if you need to rapidly iterate, something like Kubernetes/docker can make updating or deploying these easier in the long run.

Very few actually need them, but many companies have scale-envy and instead of getting bored with their tech stacks, stay inspired/passionate by playing with the shiny new tech running their low/medium traffic website.
In addition to bkanber's comment, this technology is helpful sooner if you're using a microservices architecture, where you have very many small servers instead of a few beefy ones. Instead of "Time to spin up a new box" it might be "Time to spin up 3 box A's, 2 box B's, and a box C, D, and E...".
I think it's clear that the vast majority of profitable businesses, online or off, don't need the scale these services provide. But providers to these businesses can benefit from the scale to provide better, cheaper services. For example, the resurgence of the Static Web: site generators (e.g. Jekyll) and vendors offering hosting for these services (e.g. bitballoon, netlify, aerobatic, neocities, etc.).

Really, why should it cost anything to have a basic, low-traffic site on the internet today? That's the question I ask myself.

Containerization on its own can have benefits even in smaller deployments - separating your different server concerns to the bare minimum can be very useful.

There are also other aspects, like being able to spin up the newest PostgreSQL within minutes, and when you're done you rm the container and image and have no trace and/or side effects.

There are some pretty useful multi-container configurations out there too, like a dockerized gitlab environment: https://github.com/sameersbn/docker-gitlab

I think stackoverflow.com is not a good example, because most of the data on stackoverflow.com is static.
True, although I'd say most users are logged in, so that does make caching much harder.

stackoverflow is just a nice example that you don't need dozens of servers, even when you reach their scale.

I do believe Brooks's The Mythical Man-Month also applies to computers, in that communication overhead soon outweighs the benefits of adding more computer nodes.

We need a HN Best-of comment section and we need to put your comment into it.
You could start a whole separate website of just "Best HN comments".
There's no need for a separate website, HN already provides a collection of best comments:

https://news.ycombinator.com/threads?id=cloakandswagger

I see what you did there
That's awesome. Are these just comments with a high karma, or is there some other magic going on?
Seems like a combination of length and karma — or maybe length and karma just happen to correlate? There are short ones in there too. Hmm.
My guess is it's a simple time-decay-weighted karma-based formula. That is, karma * e ^ -age. "Best" vs "top" may be the difference in net karma vs gross positive karma.
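For what it's worth, that guessed formula is easy to sketch in Python. This is only the guess above written out, not HN's actual ranking code; `decayed_score` and the `rate` parameter are hypothetical names, and the units of `age` are whatever you choose.

```python
import math


def decayed_score(karma: float, age: float, rate: float = 1.0) -> float:
    """Return karma * e^(-rate * age); `rate` is an assumed tuning knob."""
    return karma * math.exp(-rate * age)


def rank(comments):
    """Sort (karma, age) pairs from highest to lowest decayed score."""
    return sorted(comments, key=lambda c: decayed_score(*c), reverse=True)
```

With a formula like this, an older comment needs exponentially more karma to outrank a fresh one, which is the property the guess relies on.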
> All the apt-get install crap was done at image build time, not container launch time.

Something I never understood with containers: where do they store persistent data, e.g. MySQL's /var/lib/mysql - and how does upgrading work, i.e. when the apt-get postinstall script runs a transformation on the persistent data, how is the transformation applied to the "clones"?

"Volumes" are the Docker construct used to store persisted data. You use a volume when you want to decouple the lifecycle of your data from the lifecycle of the app. You can either map the volume to a directory on the host (i.e., map /var/lib/mysql to ~/data/mysql), or you can allow Docker to manage it (where it'll live in /var/lib/docker/volumes/blah/blah).

You don't upgrade a running container. Imagine that containers are immutable; to launch a new, upgraded version, you re-build the image in your build/dev environment, and re-launch the image into production. If you're using a volume, you get to use the same backing data.

It's rare that apt-get postinstall will affect any data that you would persist -- app-specific data you'd keep in the image/container, and mysql data for instance you don't really want apt touching anyway. But if a data migration is necessary, you'd manage it with a "utility container" (an image that's designed to run a script and then stop, rather than run and keep running).
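As a rough illustration of "replace the container, keep the volume", here is a small Python sketch that only builds the `docker` command lines and does not run them. The helper names and the image names `example/db:1` and `example/db:2` are hypothetical; the `docker run -d --name ... -v host:container`, `docker stop`, and `docker rm` commands themselves are the standard CLI.

```python
import os


def docker_run_args(name, image, host_dir, container_dir):
    """Build a `docker run` command that bind-mounts host_dir at container_dir."""
    # Bind mounts need an absolute host path, so expand ~ and relative paths.
    host_path = os.path.abspath(os.path.expanduser(host_dir))
    volume = f"{host_path}:{container_dir}"
    return ["docker", "run", "-d", "--name", name, "-v", volume, image]


def upgrade_commands(name, new_image, host_dir, container_dir):
    """Replace a container with one from a newer image, reusing the same data directory."""
    return [
        ["docker", "stop", name],  # stop the old container
        ["docker", "rm", name],    # remove it; the host directory is untouched
        docker_run_args(name, new_image, host_dir, container_dir),
    ]
```

On a machine with Docker installed, each list could be handed to `subprocess.run(cmd, check=True)`. The data directory on the host outlives every container that mounts it, which is the decoupling described above.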

Is it true that HN runs on a single server, running FreeBSD?
Yes, it runs active passive, and it's written in a Lisp dialect.
What does "active passive" mean?
Two servers: one "active", serving the website, and one "passive", ready to serve the website when you bring the first down and perform some switching (DNS or moving an IP address).

Load balancing across two separate servers can be considered active-active, in the same terminology.
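If it helps, the difference between the two modes can be sketched in a few lines of Python. This models only the routing decision, not the DNS or IP switching itself; the function names and the `is_up` health check are hypothetical.

```python
def active_passive(servers, is_up):
    """All traffic goes to the first server that is up; the rest stand by."""
    for server in servers:
        if is_up(server):
            return server
    raise RuntimeError("no server available")


def active_active(servers, is_up, request_number):
    """Traffic is spread round-robin across every server that is up."""
    live = [s for s in servers if is_up(s)]
    if not live:
        raise RuntimeError("no server available")
    return live[request_number % len(live)]
```

In the active-passive case the standby sees no traffic until the primary fails its check; in the active-active case both servers take requests all the time.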

Just because of how clearly you have explained it, I would surely want to buy your book when it releases. As of now, I know nothing about Docker except one or two buzz lines. And by reading your book when it releases, I am sure I can get a full picture. BTW, who will be the target audience for your book?
Since the book is called "Complete Docker" it's broad in scope; I pitched the book to my editor as spanning beginner, intermediate, and production topics. I won't really go into advanced Docker in the book.

Right now the first several chapters serve as a "I'm a programmer but know nothing about Docker" guide, the middle of the book dives into Docker-specifics (how exactly volumes work, how Docker networking works, etc), and then it graduates into the Docker Ecosystem, dedicating some time to covering tools like Kubernetes, CoreOS, Fleet, Amazon ECS.

The latter portion of the book is a number of recipes: how to get a Ghost blog running behind an nginx proxy; how to launch WordPress with MariaDB; how to launch an ELK stack, etc.

Hopefully your book will mention in big bold Red letters why it's crucial that all the files except for data inside of a Docker image be OS packaged, every single one of them, especially the configuration files, and show how configuration management can be performed efficiently and at scale with OS packaging, as well as how one might perform change management with the whole lot.

And hopefully your book will mention SmartOS and Triton and zones providing full isolation for Docker.

As someone more in the development end of things, this was an excellent summary, thank you.
Despite your lack of coffee, that was a pretty good read. Thanks for your write up.
Most helpful sentence:

"At the scale of "a few servers", Kubernetes doesn't help much. At the scale of dozens or hundreds, it definitely does."

So now I know I don't need to learn much about Kubernetes unless I have "dozens or hundreds" of servers to manage.

I'd say: you should learn as much as you can! That way you can decide for yourself whether or not Kubernetes is for you.

One thing Kubernetes IS helpful with at small scales: portability. If you're fully Kubernetes/Docker, then you won't get locked into (eg) AWS's ecosystem. It's relatively easy to pick up an entire Kubernetes cluster and move it from AWS to DigitalOcean to private hardware.

So even if you're at small scale today but want to design for portability, I'd definitely look into Kubernetes.

Do you have a wait list or place I can go to get notified once the book is published?
Hrm not yet. I'll probably put a mailchimp form on my blog. If you use twitter follow me @bkanber; I tweet rarely but I will tweet about any mailing lists or impending book releases. Appreciate the interest!
The way I think about it: I remember in the 1990s when you needed to put up some web pages, you had to bring up an entire "web server", and that's all the server would do.

Now, think about it from a reductionistic engineering perspective. What do I really want this server to do? Well, it accepts TCP connections, parses a request to figure out which file (at the time, it was all files) to serve, sticks an HTTP header on the file, and shoves it down the socket.

This task is so simple that a skilled network programmer can nowadays literally bash together a 1990s-level static HTTP server in a day, with nothing but a socket library and some basic string handling. (It may not be great and it probably is insecure, but, well... see also "1990s web server"....) The code to do this is perhaps in the dozens or hundreds of kilobytes.
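As a rough illustration of how little is involved, here is a minimal sketch in Python using only the standard `socket` and `os` modules. The function names `build_response` and `serve`, the default port, and the `index.html` fallback are my own assumptions for the example. It handles one connection at a time, reads the request with a single `recv`, ignores query strings and content types, and should not be exposed to a real network.

```python
import os
import socket


def build_response(request: bytes, root: str) -> bytes:
    """Turn a raw HTTP request into a raw HTTP/1.0 response for a file under `root`."""
    try:
        # The request line looks like: GET /path HTTP/1.0
        request_line = request.split(b"\r\n", 1)[0].decode("ascii")
        method, path, _version = request_line.split(" ")
    except (ValueError, UnicodeDecodeError):
        return b"HTTP/1.0 400 Bad Request\r\n\r\n"
    if method != "GET":
        return b"HTTP/1.0 405 Method Not Allowed\r\n\r\n"
    # Resolve the path and refuse anything that escapes the document root.
    root = os.path.realpath(root)
    target = os.path.realpath(os.path.join(root, path.lstrip("/") or "index.html"))
    if os.path.commonpath([root, target]) != root or not os.path.isfile(target):
        return b"HTTP/1.0 404 Not Found\r\n\r\n"
    with open(target, "rb") as f:
        body = f.read()
    # Stick an HTTP header on the file and hand it back.
    header = f"HTTP/1.0 200 OK\r\nContent-Length: {len(body)}\r\n\r\n"
    return header.encode("ascii") + body


def serve(root: str, host: str = "127.0.0.1", port: int = 8000) -> None:
    """Accept connections one at a time and answer each with a single file."""
    with socket.create_server((host, port)) as listener:
        while True:
            conn, _addr = listener.accept()
            with conn:
                conn.sendall(build_response(conn.recv(65536), root))
```

Calling `serve("./public")` and visiting http://127.0.0.1:8000/ would return `./public/index.html` if it exists. The whole thing is a few kilobytes of source, which is the contrast the rest of this comment is drawing.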

But that's not what I had. I have a full computer that physically needs to live somewhere. It has hardware ethernet and hardware graphics cards and a physical monitor and a power supply and RAM and, basically, hardware hardware hardware, the failure of any one of which means the system is either difficult to change or outright down. I have an entire Windows operating system, which even in the 1990s was hundreds of megabytes of code, endless code. Code for a windowing system, for pete's sake. Code for the audio subsystem. Code for accessing the hard drive. Code for access code that accesses code. Code code code code code, a bug in any one of which means the system may be down or insecure. My website, which at the time was quite likely in single-digit megabytes in size, was a tiny directory lost in a sea of files on the hard drive.

Over the past 20 years, the commodity hardware world [1] has been slicing away at the fact that several dozen kilobytes of code are being accompanied by hundreds of megabytes of support and literal pounds of physical hardware. Hardware went first with VMs. VMs got lighter and lighter. Lightweight hypervisor solutions sliced away at the heaviness of the VM. Containers slice away at the OS. Things like Kubernetes slice away at the idea of a container living somewhere physically.

We're trying to free that several dozen kilobytes of code to be just several dozen kilobytes of code, as flexible and easy-to-deploy as several dozen kilobytes should be, if you weren't mired in the world of hardware and OSes and code and strong physical connections.

(Data storage is more complicated, but in a lot of ways, the same principles are in play.)

Operationally, containers are very exciting. However, in terms of "magic technologies I don't understand", I don't think they're worth stressing about "getting old" or anything. It's mostly "just" a big pile of practical considerations in trying to not just build that world, but in some sense also undo decades of grinding-in of the physical world to our operational considerations. If you want to worry about getting old and out-of-date, worry about technologies that let you do something that you couldn't before, like GPU programming or deep learning.

[1]: Which must be specified because mainframes beat us all here decades ago, in a lot of ways.

^^^THIS is an absolutely great comment. I realise this isn't the major thrust of it, but it's the single best description of what containerisation is, and what problem it solves, that I've ever seen.
Thanks. Not sure you need coffee. Your comment was awesome.
This is why I read HN...
This is one of the best comments on HN! Love it!