by osigurdson 1340 days ago
>> deployed to over 2000 locations

Were there 2000 independent systems / SQL Server instances running, or just one? 2K separate deployments to manage (with 1K users each) does sound a little scary. Of course, perhaps that is not what is going on at all.


Which is actually kinda funny, because some of the "complex" technology the OP is railing against allows us today to manage thousands of databases both easily and efficiently... IF the systems are built with a more current approach. This is why I try to understand ALL of the motivations for disruptive change and not immediately assume incompetence and self-interest bordering on criminal.
Sometimes complexity is conflated with lack of familiarity. Instead of using the term "complexity", we should state what the actual problem is.
Option one:

Write a CloudFormation / Terraform template that involves O(1) machines and deploy 2000 identical copies.

Option two:

Write a template that deploys O(N = 2000) interdependent services across roughly 3-10x as many machines, and deploy one copy.

From what I can tell, you are arguing for option 2. It is strictly worse than option 1. In addition to being more complex, it has a few nines less reliability and costs 3-10x more in hardware. The dev and CI hardware budgets are going to be 10x larger because you can't test it on one machine, and it has bugs that only manifest at scale.

Source: I do this for a living, and have been on both sides of this fence. Option 1 typically has 5-6 nines (measured as the chance that a given customer sees a 10-second outage); option 2 never gets past 3-4 nines (measured as at least N% of customers not seeing an outage).
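
A simplified way to see why chaining services costs nines: if a request needs every one of N interdependent services to be up, and failures are assumed independent, the availabilities multiply. This is not the per-customer measurement described above, and the per-service availability and service count below are hypothetical.

```python
import math


def series_availability(per_service: float, n_services: int) -> float:
    """Availability when all n services must be up (independent failures assumed)."""
    return per_service ** n_services


def nines(availability: float) -> float:
    """Express availability as a count of nines, e.g. 0.999 -> 3.0."""
    return -math.log10(1.0 - availability)


# Hypothetical: every service is "five nines" on its own.
single = series_availability(0.99999, 1)
chained = series_availability(0.99999, 100)

print(f"1 service:    {single:.6f} ({nines(single):.1f} nines)")
print(f"100 services: {chained:.6f} ({nines(chained):.1f} nines)")
```

Under these assumptions, one five-nines service stays at five nines, while a chain of 100 of them drops to roughly three.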

The modern vs old technology debate has nothing to do with this tradeoff. If you want, you can build option 2 with EJB + CORBA on an IBM mainframe, and option 1 with Rust and JSON on an exokernel FaaS.

I'd argue for Option 3, which is to try to understand the workloads placed on the original system and then design the new system based on this. I think having 2K independent database servers would not normally be optimal for 2M users, but it is possible.
If the old system is exceeding uptime SLAs, meeting all business needs, and coming in under the budget for such an investigation (it sounds like the total operations budget was less than 10% of one engineer's time), then why bother?
I don’t know the situation, not touching it may have been optimal. I’m suggesting that if it was going to get re-written, I would at least study the basic parameters of the problem by reviewing the workload of the current system.
I wasn’t clear in my description, unfortunately.

This was a multi-tenant centrally hosted application. There were 2000 sites served, each with kiosk PCs and some associated special-purpose hardware.

The actual application code ran in just four virtual machines in two data centres.

No templates, no Terraform, no microservices, etc…

Just vanilla ASP.NET on IIS with SQL Server as the back end.

The efficiency stemmed from having a single consolidated schema for all tenants with a tenant ID as a prefix to every primary key.

Shared tables (reference data) simply didn’t have a prefix.
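
A minimal sketch of that layout, using Python's built-in sqlite3 purely for illustration (the original ran on SQL Server). The table and column names (`orders`, `country`, `tenant_id`) are hypothetical. Tenant-owned tables lead their composite primary key with the tenant ID, while the shared reference table has no tenant column and is stored once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Shared reference data: stored once, no tenant prefix.
    CREATE TABLE country (
        country_code TEXT PRIMARY KEY,
        name         TEXT NOT NULL
    );

    -- Tenant-owned data: tenant_id leads the composite primary key.
    -- WITHOUT ROWID stores rows in primary-key order, roughly like a
    -- clustered primary key in SQL Server, so a tenant's rows sit together.
    CREATE TABLE orders (
        tenant_id    INTEGER NOT NULL,
        order_id     INTEGER NOT NULL,
        country_code TEXT NOT NULL REFERENCES country(country_code),
        amount       REAL NOT NULL,
        PRIMARY KEY (tenant_id, order_id)
    ) WITHOUT ROWID;
""")

conn.executemany("INSERT INTO country VALUES (?, ?)",
                 [("AU", "Australia"), ("NZ", "New Zealand")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 [(1, 1, "AU", 10.0), (1, 2, "NZ", 25.0),
                  (2, 1, "AU", 99.0)])  # order_id 1 reused by tenant 2


def orders_for_tenant(tenant_id: int) -> list[tuple]:
    """Every tenant query filters on the leading key column."""
    return conn.execute(
        """SELECT o.order_id, c.name, o.amount
           FROM orders AS o
           JOIN country AS c ON c.country_code = o.country_code
           WHERE o.tenant_id = ?
           ORDER BY o.order_id""",
        (tenant_id,),
    ).fetchall()


print(orders_for_tenant(1))
print(orders_for_tenant(2))
```

The design choice is that isolation comes from the key prefix rather than from separate databases, so the reference tables exist exactly once no matter how many tenants there are.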

The vendor product that replaced this was not multi-tenant in this sense. They deployed a database-per-tenant, and lots of application servers. Not one per tenant, but something like one per ten, so two hundred large virtual machines running twenty instances of their app.

Multiply the above for HA and non-production. The end result was something like a thousand virtual machines that took several racks of tin to host.

Management of the new system took serious automation, template disk image builds, etc…

The repetition of the reference data bloated the database from 50GB to terabytes.
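
A back-of-the-envelope model of that bloat: with a database per tenant, the shared reference data is copied once per tenant. The thread gives only the 50GB starting point and "terabytes" as the outcome; the split between reference and tenant data below is a hypothetical assumption chosen for illustration.

```python
def total_size_gb(tenant_data_gb: float, reference_gb: float,
                  tenants: int, per_tenant_db: bool) -> float:
    """Total storage when reference data is shared once vs. copied per tenant."""
    copies = tenants if per_tenant_db else 1
    return tenant_data_gb + reference_gb * copies


# Hypothetical split of the ~50 GB consolidated database.
TENANT_DATA_GB = 49.0
REFERENCE_GB = 1.0
TENANTS = 2000

shared = total_size_gb(TENANT_DATA_GB, REFERENCE_GB, TENANTS, per_tenant_db=False)
per_tenant = total_size_gb(TENANT_DATA_GB, REFERENCE_GB, TENANTS, per_tenant_db=True)
print(f"shared schema:       {shared:,.0f} GB")
print(f"database-per-tenant: {per_tenant:,.0f} GB")
```

With these made-up numbers, just 1 GB of reference data turns a 50 GB database into roughly 2 TB once it is repeated across 2000 tenant databases.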

It “worked” but it was very expensive, slow, and difficult to maintain. It took them several years to upgrade the database engine, for example.

That task for my version was a single after-hours change. Backup or rollback was about an hour, simply because the data volume was so much lower.

The simplicity in my solution stemmed from a type of mechanical sympathy. I tailored the app to the customer’s specific style of multi-tenant central hosting, which made it very efficient.

Both approaches are valid for multi-tenancy, with their own pros and cons.
Of course, it is hard to say without knowing more about it, but it seems that jiggawatts' solution is closer to optimal than the second one. The 50GB database could fit on a USB drive, after all, and we know empirically that a single SQL Server database was able to handle the requests, since the old system worked.

Also, the fact that a consulting company was able to turn a part time gig for one person into a $100M+ project at the taxpayer's expense is very frustrating.

Typical technical forum: thinking they know the best solution based on a one-paragraph description.
That’s all we have presumably.
2000 SQL Server licenses sounds terrifying.
Both the old and new systems were using licensing based on processor cores, not VMs or instances.

If I remember correctly, my version had something like 8 + 8 cores in an active/passive configuration where the passive node is free. There was also a single dev/test server also with 8 cores, but that's free too.

The replacement used a few hundred cores shared by the various instances and environments. If I remember correctly, they had something like 10-20 databases per virtual machine, and then about 5 virtual machines per physical host. The cores in the physical host were licensed, not the logical layers on top. (I can't remember the exact ratios, but the approach is the point, not the numbers.)
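
To make the licensing point concrete: when the physical host's cores are licensed, stacking more VMs and databases on that host adds no licence cost, whereas licensing each VM's virtual cores scales with the VM count. The host and VM sizes below are hypothetical, and real SQL Server licensing has additional rules that this sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class Host:
    physical_cores: int
    vm_vcpus: list[int]  # vCPUs of each VM running on this host


def licensed_cores_per_host(hosts: list[Host]) -> int:
    """License the physical cores; the VMs on top are covered."""
    return sum(h.physical_cores for h in hosts)


def licensed_cores_per_vm(hosts: list[Host]) -> int:
    """License every VM's virtual cores individually."""
    return sum(sum(h.vm_vcpus) for h in hosts)


# Hypothetical: 10 hosts of 32 cores, each running 5 VMs of 8 vCPUs
# (40 vCPUs overcommitted onto 32 physical cores per host).
hosts = [Host(physical_cores=32, vm_vcpus=[8] * 5) for _ in range(10)]
print("per-host licensing:", licensed_cores_per_host(hosts))
print("per-VM licensing:  ", licensed_cores_per_vm(hosts))
```

The gap widens as VMs are packed more densely, which is why licensing the host layer suited a design with many databases per VM and several VMs per host.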

The "modern" cloud approach of having dedicated VMs for a single thing is actually terribly inefficient, and that approach would have bloated out the above to thousands of VMs instead of "merely" a few hundred.
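
The VM arithmetic behind that claim, using the rough ratios quoted above (10-20 databases per VM, about 5 VMs per host) and assuming one database per site for the 2000 sites. HA and non-production copies would multiply every result.

```python
import math


def vms_needed(databases: int, dbs_per_vm: int) -> int:
    """VMs required when each VM hosts up to dbs_per_vm databases."""
    return math.ceil(databases / dbs_per_vm)


def hosts_needed(vms: int, vms_per_host: int) -> int:
    """Physical hosts required at a fixed VM density."""
    return math.ceil(vms / vms_per_host)


DATABASES = 2000  # assumption: one database per site (database-per-tenant)
VMS_PER_HOST = 5

for label, dbs_per_vm in [("dedicated", 1), ("10 per VM", 10), ("20 per VM", 20)]:
    vms = vms_needed(DATABASES, dbs_per_vm)
    print(f"{label:>10}: {vms:>5} VMs on {hosts_needed(vms, VMS_PER_HOST):>4} hosts")
```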

The correct architecture for something like this -- these days -- might be to use Kubernetes. This provides the required high availability and instancing, while efficiently bin-packing and deduplicating the storage.
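
Bin-packing here means placing many small workloads onto as few nodes as their resource requests allow. The sketch below uses first-fit decreasing on a single CPU dimension to illustrate the idea; it is not how the Kubernetes scheduler actually scores nodes, and the instance sizes and node capacity are hypothetical.

```python
def first_fit_decreasing(requests: list[float], node_capacity: float) -> list[list[float]]:
    """Pack CPU requests onto nodes, largest first; open a new node only when none fits."""
    nodes: list[list[float]] = []
    free: list[float] = []
    for req in sorted(requests, reverse=True):
        if req > node_capacity:
            raise ValueError(f"request {req} exceeds node capacity {node_capacity}")
        for i, remaining in enumerate(free):
            if req <= remaining:
                nodes[i].append(req)
                free[i] -= req
                break
        else:
            # No existing node has room: open a new one.
            nodes.append([req])
            free.append(node_capacity - req)
    return nodes


# Hypothetical: 200 app instances with mixed CPU requests, 16-core nodes.
requests = [0.5] * 120 + [1.0] * 60 + [2.0] * 20
packed = first_fit_decreasing(requests, node_capacity=16.0)
print(f"{len(requests)} instances on {len(packed)} nodes")
```

With one dedicated VM per instance, the same workload would need 200 machines; packed by request, it fits on 10 nodes under these made-up sizes.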

Still, you can't Helm-chart your way out of an inefficient application codebase.

Again, for comparison, my version could run on a laptop and had about half a dozen components, not thousands.