| Good write-up, considering that you had to manage the adventure of learning and scaling Docker along with the pressure of client commitments. I would like to share some pointers on how my team and I deal with the issues you mentioned in this post.
1. Orchestration :- For us, Swarm was never an option, as we started adopting Docker in early 2015, when there was no Swarm. We resorted to Mesos and Marathon for our orchestration, and both have worked out well for us in the long run. We have had production systems running on this setup for the last few months. Swarm is mature now, and with Docker 1.12 it has been made easier to use. The good thing about Swarm is that, if your requirements are reasonably simple, you can avoid adding another system such as Mesos or Kubernetes to your infrastructure. We found that orchestration also gave us service discovery and routing capabilities. We use HAProxy and Mesos DNS for our routing and discovery needs. The Marathon-lb project lets us reconfigure HAProxy every time a new Docker container is deployed by the CD pipeline. Marathon manages our service ports across the cluster, and every new Docker container gets its own unique service port. This service port is then passed to HAProxy, and a reload happens. This setup has worked well for us, although we had some initial trouble. We also practice zero-downtime deployment for our stateless services using the ZDD script inside the Marathon-lb project.
2. Running out of disk space :- This is a common problem, especially with the idea of rebuilding and deploying disposable containers from the CD pipeline. In our case, we use Monit to gather system-wide metrics at all times. We use a garbage collection script that we developed in house to remove old Docker images and containers periodically, whenever Monit detects file system usage beyond the set thresholds.
We do continuous production deployments as often as we need, which keeps our Docker image diffs minimal. We avoid big-bang releases so that the latency of docker push on the build server and docker pull on the cluster stays low. The Spotify Docker-GC project is a good choice, in my opinion.
3. Docker registry :- We run the Docker Registry as a container on the Mesos cluster via Marathon. The registry is backed by a shared volume that is mounted on all Docker hosts in our cluster. So, if the registry crashes on one host, Marathon can redeploy the registry container on another host that has access to the shared registry volume. We tried moving our registry backend to S3, but never in production. For the systems we manage, we need the Docker images in house due to compliance requirements. Therefore, we could not use the GitLab registry or Docker Hub for our production deployments. But I have heard good things about the GitLab registry.
4. Logging :- We run Logspout on each Docker host. It forwards the logs to our managed Logstash and on to an Elasticsearch service. We use Kibana for log dashboards. Logs are rolled over on each Docker container, so that we avoid storing logs on the host for a long time. However, any distributed logging setup introduces log ordering and latency issues, which we are currently tackling through some optimisations.
5. Dependency and Base Images :- We use a hierarchical model for managing base images: one top-level (global) registry, and an isolated Docker registry for each project. Every base image goes into our curated top-level Docker registry, and the associated Dockerfile for that base image is checked into our Git repository. We insist on using these base images from our registry for all projects. Each project can then inherit the base image and customize it to its local needs. We follow CI and CD for our base images as well.
Each project gets a notification when the base image changes, and is free to opt in or opt out. This model works for us, but may not be that interesting for smaller setups. I have written about it here:- http://thenewstack.io/bakery-foundation-container-images-mic...
6. DB and Persistence :- We avoid running stateful services on Docker containers in production, as we have qualms about the persistence support in Docker. However, we do use Docker volumes for all purposes in non-prod environments, including CI, Elasticsearch and other services. We are very interested in pursuing this further with the ClusterHQ and Flocker based offerings in the Docker ecosystem. I may blog about this in the coming days. So far, I don't have any production experience with databases in Docker, but I am optimistic that this will happen soon.
7. Longer build times :- Your assessment is correct, but this varies widely across deployments. As said earlier, we want the team to have faster build times, so we build and release as often as possible. This means we do not have to deal with build latency. We use lightweight base images like Alpine, and do not allow configuration and package managers like Puppet inside the container. We keep our base images from bloating by not installing irrelevant packages. This has backfired at times in production, but we have mostly found ways to work around it.
Overall, I am aware of the challenges that you had, and can relate to all of them. Docker is not a panacea for all infrastructure woes, but it certainly gives me and our team a taste of how software development and delivery will change for good in the coming days. |
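The threshold-triggered cleanup from point 2 could be sketched roughly as below. This is not the in-house script mentioned there, only a minimal Python illustration under stated assumptions: the 80% threshold, the `/var/lib/docker` path, the 48h image age and all function names are hypothetical, and it assumes a Docker version that ships the `docker container prune` and `docker image prune` subcommands (1.13 or later).

```python
import shutil
import subprocess
from typing import List

# Hypothetical values; tune them per host.
USAGE_THRESHOLD = 0.80           # clean up once usage exceeds 80%
DOCKER_ROOT = "/var/lib/docker"  # default Docker data directory on Linux
IMAGE_MAX_AGE = "48h"            # unused images older than this are removed


def usage_fraction(used: int, total: int) -> float:
    """Return used/total as a fraction; an empty filesystem counts as 0."""
    return used / total if total else 0.0


def needs_gc(used: int, total: int, threshold: float = USAGE_THRESHOLD) -> bool:
    """True when filesystem usage is strictly above the threshold."""
    return usage_fraction(used, total) > threshold


def gc_commands(max_age: str = IMAGE_MAX_AGE) -> List[List[str]]:
    """Build the docker CLI calls: stopped containers first, then unused images."""
    return [
        ["docker", "container", "prune", "--force"],
        ["docker", "image", "prune", "--all", "--force",
         "--filter", f"until={max_age}"],
    ]


def run_gc(path: str = DOCKER_ROOT) -> bool:
    """Run the cleanup if usage is above the threshold; report whether it ran."""
    usage = shutil.disk_usage(path)
    if not needs_gc(usage.used, usage.total):
        return False
    for cmd in gc_commands():
        subprocess.run(cmd, check=True)
    return True


if __name__ == "__main__":
    print("cleanup ran" if run_gc() else "below threshold, nothing to do")
```

A monitoring tool such as Monit could invoke a script like this when its filesystem check trips, or it could simply run from cron; because of the threshold check, repeated runs do nothing while usage stays below the limit.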