by spydertennis 4524 days ago
Doesn't it make more sense to scale based on New Relic response times? Why are you guys using # of users? Depending on the pages they are requesting and how the requests are clustered, that could produce vastly inferior results.
2 comments

We find that neither New Relic nor Google Analytics (GA) gives the full picture: some of our pages are heavily cached, while others (e.g. checkout processing) are computationally expensive, database-heavy and communicate with other systems (e.g. payment processors) that can be a big bottleneck. Both New Relic and GA tend to just average those together (although with GA you can create new views that focus on specific pages). You are right that 'number of visitors on site' does not reflect our site performance in every respect.

We first conceived of Dynosaur as a plugin-based autoscaler (with GA and New Relic plugins to start with). But we've found that the times we really need to scale fast are the times we get a lot of traffic from press stories etc. (like this from today, if you will excuse the shameless plug: http://dealbook.nytimes.com/2014/01/21/a-start-up-run-by-fri...), and using the analytics live API lets us react a little quicker than if we waited for New Relic to tell us our response times are getting slow. So far, we're happy enough with just a Google Analytics plugin.

One possible improvement would be to scale differently based on different traffic / performance metrics across the site. I think New Relic or other performance instrumentation would be very useful for that.
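To make that idea concrete, here is a rough Python sketch of weighting active visitors by the kind of page they are on. It is not how Dynosaur works; the page groups, the weights, `load_per_dyno` and the dyno limits are all made-up numbers for illustration, and real values would have to come from your own measurements.

```python
import math

# Hypothetical relative cost of one active visitor in each page group.
# These weights are illustrative assumptions, not measured values.
PAGE_WEIGHTS = {"cached": 1, "browse": 10, "checkout": 50}


def weighted_load(active_users_by_group, weights=PAGE_WEIGHTS, default_weight=10):
    """Sum active visitors, counting expensive page groups more heavily."""
    return sum(
        count * weights.get(group, default_weight)
        for group, count in active_users_by_group.items()
    )


def dynos_for_load(load, load_per_dyno=1000, min_dynos=1, max_dynos=20):
    """Turn a weighted load into a dyno count, clamped to [min_dynos, max_dynos]."""
    wanted = math.ceil(load / load_per_dyno)
    return max(min_dynos, min(max_dynos, wanted))


# 1000 visitors on cached pages cost less here than 40 visitors in checkout.
load = weighted_load({"cached": 1000, "checkout": 40})
print(load, dynos_for_load(load))
```

With these assumed weights, a small crowd in checkout moves the dyno count more than a large crowd on cached pages, which is the behaviour a plain visitor count cannot give you.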

Nice work, looks pretty solid. If you want to offload the responsibility of determining which GA metric events signify a potential spike, you could abstract it out and instead have the plugin use GA intelligence event alerts set up by your analytics team. That would keep the respective subject-matter experts in their realms of expertise, ideally allowing for a more tailored, ongoing approach to what, where, and how scaling gets triggered, and the dev team wouldn't be responsible for ongoing management of the scaling trigger rules (well, to a certain extent).

Just a thought. I know you abstracted it out the way you did so as not to tie it to just GA, but if GA is your analytics platform of record, it could be worth pursuing. Cheers!

Performance engineer here.

Scaling should occur on actual traffic/user throughput statistics, not on response time. Response times can increase for a variety of reasons, only one of which is increased traffic load. For example, response times can also increase because of back-end database contention or thread contention.

Of course, whichever configuration is chosen, it is important to load test your application so that you understand its performance profile and all of its contention points.
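
As a rough illustration of throughput-based scaling, here is a minimal Python sketch. It assumes a load test has already told you roughly how many requests per second one dyno can sustain; `capacity_per_dyno`, `headroom` and the dyno limits are hypothetical parameters, not values from any particular tool.

```python
import math


def target_dynos(requests_per_sec, capacity_per_dyno, headroom=0.25,
                 min_dynos=1, max_dynos=50):
    """Pick a dyno count from observed throughput, not from response time.

    capacity_per_dyno is an assumption: the sustained requests/sec one dyno
    handled in your own load test. headroom adds spare capacity on top.
    """
    if capacity_per_dyno <= 0:
        raise ValueError("capacity_per_dyno must be positive")
    needed = requests_per_sec * (1 + headroom) / capacity_per_dyno
    # Clamp so a metrics glitch cannot scale to zero or to an unbounded count.
    return max(min_dynos, min(max_dynos, math.ceil(needed)))


# 400 req/s with 25% headroom, at an assumed 50 req/s per dyno.
print(target_dynos(400, 50))
```

Because the input is throughput, a slowdown caused by database or thread contention alone does not add dynos here; that kind of problem still needs its own alerting.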