| HN Mirror


by OriPekelman 1369 days ago

So, I do have a war story about something like this, possibly in a worse state, and possibly with somewhat higher stakes (around $400M/year at the time). I came in as a consultant with my own "parallel implementation team". In my case I was somewhat lucky because most of the system was composed of batch jobs. They did have "frameworks" with "ORMs", but they had 4 or 5 of them, with many files pinned to some older version, which meant that there were actually dozens.

There were thousands and thousands of business rules, and no one knew why they were there or whether they were still relevant. I remember one fondly: if product=="kettle" and color=="blue" and volume=="1l" then volume=1.5l... This rule, like many others, would run on the millions of product lines they imported daily. And the cutest thing in the system was that if any single exception happened during a batch run... the whole run would fail. And every run took close to 15 hours (sometimes more).
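To make that failure mode concrete, here is a minimal Python sketch of a rule like the kettle one running inside an all-or-nothing batch loop. The real system was PHP; the function names and the record layout below are made up for illustration and are not taken from the original code.

```python
def apply_kettle_rule(line):
    """Hypothetical rendering of the kettle rule quoted above."""
    if (line["product"] == "kettle" and line["color"] == "blue"
            and line["volume"] == "1l"):
        line = {**line, "volume": "1.5l"}
    return line


def run_batch(lines):
    """All-or-nothing run: the first exception aborts the whole batch."""
    processed = []
    for line in lines:
        # A malformed line (e.g. a missing key) raises here, and nothing
        # that was already processed is returned to the caller.
        processed.append(apply_kettle_rule(line))
    return processed


lines = [
    {"product": "kettle", "color": "blue", "volume": "1l"},
    {"product": "toaster", "color": "red", "volume": "n/a"},
]
result = run_batch(lines)

# One bad product line (no "volume" key) is enough to fail the run.
broken = lines + [{"product": "kettle", "color": "blue"}]
try:
    run_batch(broken)
    run_failed = False
except KeyError:
    run_failed = True
```

With a 15-hour run, a single line like the last one anywhere in the import means starting over.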

Not going into details... but they couldn't afford the run going over 24 hours, and every day they were inching closer.

Similar to OP they extensively used EAV + "detail tables" to be able to add "things" to the database.
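For anyone who hasn't met EAV (entity-attribute-value), here is a rough sketch of the general pattern using Python's built-in sqlite3 module. The table and column names are my own invention for illustration, not the schema of the system I'm describing, and it only shows the EAV part, not the "detail tables".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY, kind TEXT NOT NULL);
    -- One row per (entity, attribute); every value is stored as text.
    CREATE TABLE entity_attribute (
        entity_id INTEGER NOT NULL REFERENCES entity(id),
        name      TEXT NOT NULL,
        value     TEXT,
        PRIMARY KEY (entity_id, name)
    );
""")
conn.execute("INSERT INTO entity (id, kind) VALUES (1, 'product')")
conn.executemany(
    "INSERT INTO entity_attribute (entity_id, name, value) VALUES (?, ?, ?)",
    [(1, "product", "kettle"), (1, "color", "blue"), (1, "volume", "1l")],
)

# Adding a new "thing" needs no schema change, just another row.
conn.execute(
    "INSERT INTO entity_attribute (entity_id, name, value) VALUES (?, ?, ?)",
    (1, "warranty_months", "24"),
)


def load_entity(conn, entity_id):
    """Reassemble one entity from its attribute rows."""
    rows = conn.execute(
        "SELECT name, value FROM entity_attribute WHERE entity_id = ?",
        (entity_id,),
    ).fetchall()
    return dict(rows)


kettle = load_entity(conn, 1)
```

The flexibility is the appeal; the cost is that the database no longer knows what a product looks like, so that knowledge ends up scattered through application code.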

The web application itself was similar but less of a time-bomb. It was using some proprietary search engine that was responsible for structuring much of the interaction (a lot of it was drill-down in categories).

Any change on the system had to happen live with no downtime. Every minute of downtime was $1,000 in lost revenue.

The assumptions we had were: 1. At some point the system will catastrophically fail, so 100% of the revenue will be lost for a long time. 2. Even if it were possible to rewrite the system to the same specs (which it wasn't, because no one knew what the system actually did), such a rewrite would probably be delivered after the catastrophe.

The approach we used was to:
1. Instrument the code, to see what was used and what wasn't. We set some thresholds, and we explained to the stakeholders that they were potentially going to be losing revenue/functionality. And we started NoOping PHP files like crazy. Remember, whatever they did, the worst thing they could do is raise.
2. Transform all batch jobs to async workers (we initially kept the logic the same), which, together with #1, allowed us to group things by frequency.
3. Rewrite the most frequent jobs in a different language (we chose Ruby) to make sure no debt could be carried over. NoOp the old code.
4. Proxy all HTTP traffic and group coherent things together with front controllers that actually had 4 layers: "unclean external" (whatever mess we got from the outside), "clean internal" (the new implementation), "clean external", and "unclean internal" (which would do whatever hacks were needed to recreate side effects that were actually necessary). The simple mandate was that whenever someone made any change to frontend code, they needed to move the implementation to "clean external".
5. We ported over the most crucial, structuring parts to Ruby as independent services (not really micro-services, just reasonably well structured chunks that were sufficiently self-contained). If I remember correctly this was something of the size of "User" and "Catalog browser"; the other things stayed as PHP scripts.
6. And with savagery, any time we got the usage levels of anything low enough... we'd NoOp them.
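Steps 1 and 6 above boil down to "count calls, then switch off whatever falls under a threshold". A minimal Python sketch of that idea follows. It is not the tooling we used (that was PHP-side instrumentation), and every name in it (`instrumented`, `noop_candidates`, the two entry points, the threshold of 10) is hypothetical.

```python
from collections import Counter
from functools import wraps

usage = Counter()   # call counts per entry point
NOOPED = set()      # entry points that have been switched off


def instrumented(name):
    """Count calls to a legacy entry point; skip it once it is NoOp'd."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            usage[name] += 1
            if name in NOOPED:
                return None  # NoOp: the legacy code no longer runs
            return func(*args, **kwargs)
        return wrapper
    return decorator


def noop_candidates(threshold, names):
    """Entry points called fewer than `threshold` times so far."""
    return sorted(n for n in names if usage[n] < threshold)


@instrumented("export_report")
def export_report():
    return "report"


@instrumented("import_products")
def import_products():
    return "imported"


# Simulated observation window: one job is hot, the other barely used.
for _ in range(1000):
    import_products()
export_report()

candidates = noop_candidates(10, ["export_report", "import_products"])
NOOPED.update(candidates)
```

Counting continues after something is NoOp'd, so you can still see whether anyone keeps calling it and have that conversation with the stakeholders.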

Around a year in there was still a huge mess of PHP around but most of it was no longer doing any critical business functions. Most of the traffic was going through the new clean interfaces that had unit tests, documentation etc. I think that 100% of the "write path" was ported over to Ruby. A lot of reports (all of them?) and some pages were still in PHP.
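For a picture of how traffic ended up on the clean interfaces, here is a rough Python sketch of the four-layer front controller from step 4. How the layers call each other is my simplification for illustration, and the handler names, the `/user` path and the `legacy_session_touch` side effect are all hypothetical. The real thing sat in front of PHP scripts and Ruby services.

```python
side_effects = []  # stands in for state that legacy code still reads


def legacy_handler(raw):
    """Stand-in for an untouched legacy script."""
    return {"source": "legacy", "raw": raw}


def unclean_external_user(raw):
    """'Unclean external': accept whatever callers send today."""
    return int(str(raw.get("uid") or raw.get("user_id")).strip())


def clean_internal_user(user_id):
    """'Clean internal': the new implementation, no legacy quirks."""
    return {"id": user_id, "name": f"user-{user_id}"}


def unclean_internal_user(user):
    """'Unclean internal': recreate a side effect that is still needed."""
    side_effects.append(("legacy_session_touch", user["id"]))


def clean_external_user(user_id):
    """'Clean external': the interface changed frontend code moves to."""
    user = clean_internal_user(user_id)
    unclean_internal_user(user)
    return user


# Paths that have been ported; everything else falls through to legacy.
PORTED = {
    "/user": lambda raw: clean_external_user(unclean_external_user(raw)),
}


def front_controller(path, raw):
    """Proxy entry point: route ported paths to the new code."""
    handler = PORTED.get(path, legacy_handler)
    return handler(raw)


ported = front_controller("/user", {"uid": " 42 "})
fallback = front_controller("/report", {"month": "01"})
```

Old callers keep sending their mess through the "unclean external" adapter, while anything touched by a developer gets pointed straight at the "clean external" function.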

I don't think anyone ever noticed all the functionality that went away. We had time to replace the search engine with Elasticsearch. It wasn't clean by any means but it was sturdy enough not to have catastrophes.

The company was bought by some corp around that time... and they transitioned the whole thing to a SaaS solution. I was no longer involved for quite a while, so I only heard about it later. But we bought them that extra year or more.

So, as far as recommendations go:
1. Instrument the code (backfire.io !)
2. Find bang for the buck and some reasonable layer separation and do it chunk by chunk.
3. Don't try to reproduce everything you have. Go for major use-cases.
4. Communicate clearly that this is coming with functionality loss.
5. Be emotionally ready for this being a long long journey.