Our Nightmare on Amazon ECS


Here at Appuri, we have a large number of small, single-purpose services that make up our ETL pipeline, API and UI. We started from large, monolithic repos and gradually migrated to this microservices pattern, not because of any philosophical bias but because it fit our work style. By and large, this has worked well, with all the known pros and cons of microservices. But I’m not here to debate microservices. I’m here to tell you about our nightmare on Amazon EC2 Container Service (ECS) and how we saved ourselves by moving to Kubernetes.

NOTE: In general, we love AWS. Also, your mileage with ECS may vary. For example, Segment had a great experience with ECS and apparently none of our complaints.

There’s also the wonderful Convox project which contains a lot of great workflows on top of ECS. When we started using ECS, Convox wasn’t far enough along to meet our needs.

And so, it begins, with a love of managed services

I love managed services. For example, we don’t run our own Postgres server – we use Amazon RDS. We also don’t run our own hypervisor or bare metal servers – we use Amazon EC2. With managed services, you trade control for peace of mind and, in an ideal world, you can focus on building differentiated value add. Everyone wins. In fact, we have had exactly this experience with most managed services.

In June 2015, we started looking into a PaaS where we could deploy our services. I wanted to stay close to Docker, but maintain a degree of control. As an AWS customer, we considered AWS Elastic Beanstalk and the shiny new Amazon EC2 Container Service (ECS).

Amazon ECS fit the bill because of several promises:

  • With ECS, you simply launch Docker containers.
  • ECS is aware of multiple availability zones (AZs). As long as EC2 instances are set up in multiple AZs, ECS will try to distribute containers to maintain high availability.
  • You can do rolling deploys. Neato, deployments with zero downtime!
  • API clients. All AWS services have (sadly auto-generated) API clients for all languages we use.
  • ECS works with vanilla EC2 instances. This is a nice plus, as we don’t have to learn a new PaaS – just install the ECS agent on any plain old EC2 instance running Amazon Linux and have it join an ECS cluster.

First impression: wow, it’s missing a LOT of stuff.

My first impression on seeing an ECS demo was how much it was missing. We use a lot of AWS services and are well aware of how Amazon releases incremental updates. That’s all good; we do that, too. However, it was sad to see that these key features were missing:

  • No service discovery. In ECS, the recommended way to do service discovery is to use internal load balancers. This is a bigger deal than it sounds, because an internal ELB is the only way to make a service in ECS network-accessible: even with a single instance you HAVE to run an ELB for the service to be discoverable. For a microservice architecture, this adds cost with every service you deploy even though you add no hardware.
  • No central config. ECS doesn’t have a way to pass configuration to services (i.e. Docker containers) other than with environment variables. Great, how do I pass the same environment variable to every service? Copy and paste it. We considered setting up Consul, but instead decided to stick with native ECS environment variables to start using the service.
  • Mediocre CLI. Compared to competitors like Kubernetes, ECS has a mediocre CLI at best. You can scale from the command line (aws ecs update-service --service NAME --desired-count N) but the ECS CLI is just not very powerful.
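To make the central config gap concrete: a container definition in an ECS task definition carries its environment as a list of name/value pairs, so sharing a variable means repeating it in every definition. Below is a minimal sketch, in Python, of scripting that copy and paste instead of doing it by hand. The helper with_shared_env and all sample names and values are hypothetical; nothing here is provided by ECS itself.

```python
import copy


def with_shared_env(container_definitions, shared_env):
    """Return a copy of the container definitions with shared_env merged in.

    Values already set on a container take precedence over the shared defaults.
    """
    merged = copy.deepcopy(container_definitions)
    for container in merged:
        existing = {e["name"]: e["value"] for e in container.get("environment", [])}
        combined = {**shared_env, **existing}
        # ECS expects the environment as a list of {"name": ..., "value": ...} pairs.
        container["environment"] = [
            {"name": name, "value": value} for name, value in sorted(combined.items())
        ]
    return merged


# Hypothetical shared settings and services, for illustration only.
shared = {"LOG_LEVEL": "info", "STATSD_HOST": "statsd.internal"}
containers = [
    {
        "name": "api",
        "image": "example/api:1",
        "environment": [{"name": "LOG_LEVEL", "value": "debug"}],
    },
    {"name": "etl-worker", "image": "example/etl:1"},
]
result = with_shared_env(containers, shared)
```

The merged definitions would still have to be registered as new task definition revisions, one per service, which is exactly the overhead a central config store avoids.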

Despite these missing features, we decided to move ahead.

I have made a huge mistake

Our first “oh crap” moment with ECS in production was when we noticed that it was leaking environment variables to CloudTrail, and on to Datadog and other third-party services that consume CloudTrail events and logs. ECS, like a good AWS citizen, logs events to CloudTrail. When you start a new service, it logs the service definition, including environment variables, to CloudTrail!

We opened a forum post, and the response from the team missed the point. Apparently they don’t believe in treating environment variables as sensitive values.

Now, we could have built yet more infrastructure to encrypt secrets using AWS Key Management Service (KMS) and decrypt them at service start – in fact, this is exactly what Convox does. But why would we build this infrastructure when there was so much more interesting work in our domain to do?
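For a sense of what that extra infrastructure involves, here is a minimal sketch of the decrypt-at-start half in Python. It assumes secrets are stored in environment variables as base64 ciphertext behind a "kms:" prefix; that prefix, and the names resolve_secrets and kms_decrypt, are hypothetical conventions chosen for illustration, and this is not a description of how Convox implements it. The real decryption path needs boto3 and AWS credentials, so the decrypt function is injectable.

```python
import base64

PREFIX = "kms:"  # hypothetical marker for values that hold KMS ciphertext


def kms_decrypt(ciphertext):
    """Decrypt a ciphertext blob with AWS KMS (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    response = boto3.client("kms").decrypt(CiphertextBlob=ciphertext)
    return response["Plaintext"]


def resolve_secrets(environ, decrypt=kms_decrypt):
    """Return a copy of environ with every 'kms:<base64>' value decrypted."""
    resolved = {}
    for name, value in environ.items():
        if value.startswith(PREFIX):
            ciphertext = base64.b64decode(value[len(PREFIX):])
            resolved[name] = decrypt(ciphertext).decode("utf-8")
        else:
            # Plain values pass through untouched.
            resolved[name] = value
    return resolved
```

A service would call something like os.environ.update(resolve_secrets(dict(os.environ))) before reading its configuration. With this arrangement only ciphertext appears in the service definition, and therefore only ciphertext reaches CloudTrail.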

What killed ECS for us

We ran ECS in production for nearly a year. In that time, we watched every single feature announcement, participated in opening GitHub issues and so on. Finally, we gave up on ECS when three issues remained unaddressed:

  • ECS agent disconnects periodically, making it impossible to launch new containers. Recall that ECS works by installing an agent on every EC2 instance that’s part of an ECS cluster. This agent interacts with the Amazon API as well as Docker. This agent has a horrible tendency to disconnect, and when this happens your deployments will fail – this kills your services. This problem is tracked in this GitHub issue and despite it being a closed issue, we have seen it happen repeatedly. It happens at least twice a day on our clusters and despite our best efforts, we haven’t been able to nail the root cause. To my knowledge, it remains unaddressed by the ECS team.

A Slack search shows just some of the times we’ve seen this problem happen. It became so pervasive that we started restarting agents periodically to get around the failure.

[Screenshot: Slack search results]

You know you’re going crazy when you restart a service every hour to fix its bugs.

  • Lack of traction on GitHub issues. This issue is an example of how many features and customer requests remain unaddressed. It has been the most-commented feature request for a year and remains unaddressed. Incidentally, we hit this issue as well.
  • Bad architecture. I expect modern deployment and operations infrastructure to support twelve-factor apps in a meaningful, robust way. ECS simply lacks the fundamentals.
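On the agent disconnects in the first item: the ECS API reports an agentConnected flag for each container instance, so a disconnected agent can at least be detected rather than restarted blindly. The following is a minimal sketch in Python of such a check, offered as an illustration rather than the tooling we ran. The function names are hypothetical, the live query needs boto3 and AWS credentials, and pagination beyond the first page of instances is ignored.

```python
def disconnected_instances(describe_response):
    """Return EC2 instance IDs whose ECS agent reports agentConnected as false."""
    return [
        instance["ec2InstanceId"]
        for instance in describe_response.get("containerInstances", [])
        if not instance.get("agentConnected", False)
    ]


def find_disconnected(cluster):
    """Query ECS for one cluster (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the pure helper above runs without it

    ecs = boto3.client("ecs")
    # Only the first page of results is read in this sketch.
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return []
    response = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )
    return disconnected_instances(response)


# Hand-written sample in the shape of a DescribeContainerInstances response.
sample = {
    "containerInstances": [
        {"ec2InstanceId": "i-0aaa", "agentConnected": True},
        {"ec2InstanceId": "i-0bbb", "agentConnected": False},
    ]
}
stale = disconnected_instances(sample)
```

A check like this only narrows down which hosts need an agent restart; it does nothing about the root cause.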

Adios ECS, hello Kubernetes

After much grumbling at ECS, we decided to try out Kubernetes (k8s). Having flipped the switch in production two weeks ago, we are delighted. It seems that the contributors to this open source project really thought through deployments and operations at scale. From the CLI to service discovery and configuration management, it has been a pleasure to use. We ran into an odd issue with kube-proxy not routing traffic correctly, but a restart fixed it and it hasn’t cropped up since. We haven’t looked back!