by b5u 2691 days ago
I've also been deploying services on ECS for close to a year now and would like to address some inaccuracies the author seems to have made: 1) in 'Surprise 1' the author gives examples where the CPU utilization (or target) is between 80% and 95% without mentioning the reserved CPU/memory (aka size) of those tasks (assuming he's using the Fargate launch type). The 'size' of a task also influences the average CPU target utilization. For instance, if a task reserves 4 vCPUs, then a spike from 80% to 95% is handled differently than when a task reserves 1 or 2 vCPUs. The same goes for memory.

In an example setup I'd use tasks sized at 1-2 vCPUs with a service-wide target average CPU utilization of 70%, along with a StepScaling policy which adds 10% more tasks if the service average CPU utilization falls between 70-80%, 20% if between 80-90%, and 25% if above 90%. My strategy has been to use smaller-sized tasks, a lower service average CPU utilization (compared to 80%-90%) and shorter evaluation periods/fewer datapoints for the scale-out CW alarms (minimum being 60 seconds IIRC). The short evaluation periods/low number of datapoints of the CW alarms allowed me to handle spikes reasonably fast.
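To make the step policy above concrete, here is a rough Python sketch of the arithmetic only. It is not how AWS evaluates a StepScaling policy internally. The function name, the handling of the exact 80/90 boundaries, and rounding up to at least one task are all assumptions made for illustration.

```python
# Hypothetical model of the step policy described above.
# Each step: (lower bound inclusive, upper bound exclusive, percent of tasks to add).
# Boundary handling is an assumption; the comment only gives the ranges.
STEPS = [
    (70, 80, 10),
    (80, 90, 20),
    (90, float("inf"), 25),
]


def tasks_to_add(current_tasks: int, avg_cpu: float) -> int:
    """Return how many tasks to add for a given service-wide average CPU %."""
    for lower, upper, percent in STEPS:
        if lower <= avg_cpu < upper:
            # Integer ceiling of current_tasks * percent / 100 (assumed rounding),
            # with a floor of one task so small services still scale out.
            return max(1, (current_tasks * percent + 99) // 100)
    return 0  # below the 70% target: no scale-out


if __name__ == "__main__":
    for tasks, cpu in [(10, 65), (10, 75), (20, 85), (8, 95)]:
        print(tasks, cpu, tasks_to_add(tasks, cpu))
```

With smaller tasks the same percentage step adds capacity in finer increments, which is part of why the smaller task sizes help here.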

2) in 'Surprise 3' the author claims that Terraform's aws_appautoscaling_policy 'is rather light on documentation'. As a Terraform user of several years, I find that inaccurate, mostly because of the several examples available in the documentation https://www.terraform.io/docs/providers/aws/r/appautoscaling... and because a GitHub exact search for "aws_appautoscaling_policy" language:HCL reveals many more examples from open-source repos (some with permissive licenses too). I've created a custom ecs-service TF module which (optionally) creates for each service an ALB along with listeners, the attached ACM-issued TLS certs and TGs, the scale-in/out CW alarms with configurable thresholds/policies, SGs, Route53, etc., allowing one to configure and launch an ECS service quickly and reliably.

Regarding scale-in, I typically also use intervals of 5-15 minutes for it, to avoid an erratic scale-in/scale-out 'zig-zag', even at the cost of briefly over-provisioning.
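As a small illustration of why a longer scale-in interval damps the zig-zag, here is a Python sketch that models the interval as a cooldown: scale-in is refused until enough time has passed since the last scaling event. The class and method names are hypothetical, and treating the 5-15 minutes as a cooldown (rather than, say, an alarm evaluation period) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ScaleInGate:
    """Hypothetical gate that blocks scale-in for a while after any scaling event."""

    cooldown_seconds: float = 600.0  # assumed 10 minutes, within the 5-15 minute range
    last_scale_event: float = float("-inf")  # no scaling event seen yet

    def record_scale_event(self, now: float) -> None:
        # Call this whenever the service scales out or in.
        self.last_scale_event = now

    def may_scale_in(self, now: float) -> bool:
        # Scale-in is allowed only once the cooldown has fully elapsed.
        return now - self.last_scale_event >= self.cooldown_seconds


if __name__ == "__main__":
    gate = ScaleInGate(cooldown_seconds=600)
    gate.record_scale_event(now=0)       # a scale-out just happened
    print(gate.may_scale_in(now=300))    # still inside the cooldown
    print(gate.may_scale_in(now=600))    # cooldown elapsed
```

The trade-off is the one already stated: extra tasks stay up for a few minutes longer than strictly needed.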