by ergest 2623 days ago
They built a pipeline that complicated for 100GB? That’s insanely over-engineered! Very typical of engineers who just want to pad their resume at the expense of unsuspecting business people. I’ve worked with single-server data warehouses on SQL Server that were 10x that size and served the entire company.

I don’t know what your data looks like, whether it’s just transactional or a combination of transactional and raw server/app logs. You could ETL the raw logs into an RDBMS like Postgres, but then you have to worry about maintaining it, and it doesn’t sound like you have enough resources for that. To do that you need help from IT/ops to set up a replica of the live server so it can be queried without disrupting transactional operations, and then write ETL code or use a service like Stitch or Panoply.
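
For illustration, here is a minimal sketch of the "write ETL code" route. Everything in it is an assumption: the log format, the `app_logs` table, and the helper names are made up, and `sqlite3` stands in for Postgres so the snippet runs without a server.

```python
import sqlite3

# Hypothetical raw app-log lines in the form "timestamp level message".
RAW_LOGS = [
    "2018-11-01T10:00:00 INFO user=42 action=login",
    "2018-11-01T10:00:05 ERROR user=42 action=checkout",
    "not a valid log line",
]


def parse_line(line):
    """Split a raw line into (timestamp, level, message); None if malformed."""
    parts = line.split(" ", 2)
    if len(parts) != 3 or parts[1] not in {"INFO", "WARN", "ERROR"}:
        return None
    return tuple(parts)


def load_logs(conn, lines):
    """Create the target table if needed and insert every parseable line."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS app_logs (ts TEXT, level TEXT, message TEXT)"
    )
    rows = [row for row in map(parse_line, lines) if row is not None]
    conn.executemany("INSERT INTO app_logs VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for a real Postgres connection
loaded = load_logs(conn, RAW_LOGS)
errors = conn.execute(
    "SELECT COUNT(*) FROM app_logs WHERE level = 'ERROR'"
).fetchone()[0]
print(loaded, errors)
```

Once the logs are in a table, the questions people used to answer by grepping files become plain SQL.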

You can also use a cloud platform like Google BigQuery or AWS Redshift to dump raw data in and then create views and table extracts for all the commonly used business functions. That’s still overkill though and a simple RDBMS should suffice.
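
To make the view versus table extract distinction concrete, here is a small sketch. The `raw_orders` table and its columns are hypothetical, and `sqlite3` is only a local stand-in for BigQuery or Redshift; the SQL idea carries over, the exact syntax may not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # local stand-in for a cloud warehouse
conn.executescript(
    """
    CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount REAL, day TEXT);
    INSERT INTO raw_orders VALUES
        (1, 'acme', 100.0, '2018-11-01'),
        (2, 'acme',  50.0, '2018-11-02'),
        (3, 'zeta',  75.0, '2018-11-02');

    -- A view: recomputed from raw_orders every time it is queried.
    CREATE VIEW revenue_by_customer AS
        SELECT customer, SUM(amount) AS revenue
        FROM raw_orders
        GROUP BY customer;

    -- A table extract: a snapshot that must be rebuilt to pick up new rows.
    CREATE TABLE revenue_by_customer_extract AS
        SELECT * FROM revenue_by_customer;
    """
)

# New raw data arrives after the extract was taken.
conn.execute("INSERT INTO raw_orders VALUES (4, 'zeta', 25.0, '2018-11-03')")

view_rows = dict(conn.execute("SELECT customer, revenue FROM revenue_by_customer"))
extract_rows = dict(
    conn.execute("SELECT customer, revenue FROM revenue_by_customer_extract")
)
print(view_rows)     # reflects the new order
print(extract_rows)  # still the snapshot taken before the insert
```

The view is always current but pays the aggregation cost on every query; the extract is cheap to read but needs a refresh job.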

And if you want to raise awareness see this article by Stitch Fix and the HN comments https://news.ycombinator.com/item?id=11312243

7 comments

> Very typical of engineers who just want to pad their resume at the expense of unsuspecting business people.

Or they were given the same PR crap you always get from sales people that they’re just days away from tripling the number of clients and by next year they should be 10-20x the number, so they went ahead and “built it right” so they wouldn’t run into the inevitable scaling issues they were supposedly assured to hit in short order?

A simple architecture should be able to carry this to 10x and even to 100x if you really want to push it.
And I’m not really saying otherwise, though I would somewhat disagree. I’m just saying that they weren’t necessarily (or even likely) thieving contractors who were just looking out for themselves. They built a respectable, usable, system.

Honestly the contractors I see in IT are usually the far opposite end: it works well enough that they’re happy and pay my bill and by the time it doesn’t work anymore I’ll be off to another gig, so who cares?

In many cases, the cause of the problem may not be contractors.

There are lots of clients that clearly set their expectations for contractors, whom they see as an expensive necessary evil: they want you to deliver fast and now, and they do not want to hear that babble about how it will take longer to deliver a robust system.

In this case no one technically competent was here to manage them. There were no expectations.
Spot on!!

Everyone has cargo-culted distributed file databases, and they’re good in specific use cases: if you have a large volume of data with a very high number of writes. Hardware and RDBMS performance have improved over the years to the point where if you’re not Google (or certain scientific applications), you probably don’t need much more than Postgres. It’s completely within the bounds of feasibility of modern systems to store a 100GB database and its indexes entirely in memory. The only reason you need to scale beyond a single server in most business contexts is when you’re topping out IOPS.

If you just have a lot of data and are doing mostly reads, an RDBMS will almost always be faster for that reason. It’s also FAR easier / faster to write complex queries for an RDBMS.

Oh, and even Google has gone back to a more relational design with Spanner again.
That misses the point, doesn't it. The point isn't "maybe you don't really need nosql/non-relational", it's "maybe you don't need an expensive managed storage solution built for massive scale."

Spanner was indeed built for massive scale, which is reflected in the price.

Hmm, you are probably right.

Spanner does have somewhat less scale than their NoSQL offerings, and even Google says internally to go for the somewhat less scale-y Spanner over them. (Because it's easier to react to needs for scale later than it is to live without transactions and relational querying.)

Yeah... you can go to any DBA/database developer and say "I have a 100GB dataset that might grow to 1TB within 10 years" and they will just pick the RDBMS they are familiar with and you are 90% of the way there.

I work on an ELT process for something that's doing that about now on SQL Server, and not much query tuning is needed tbqh.
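
As a back-of-the-envelope check on the 100GB-to-1TB-in-10-years figure, here is a quick sketch. It assumes steady compound growth and 1TB = 1000GB; the function name is just for illustration.

```python
def annual_growth_rate(start_gb, end_gb, years):
    """Compound annual growth rate needed to go from start_gb to end_gb."""
    return (end_gb / start_gb) ** (1 / years) - 1


rate = annual_growth_rate(100, 1000, 10)

# Project the size year by year under that constant rate.
size = 100.0
sizes = []
for year in range(1, 11):
    size *= 1 + rate
    sizes.append(round(size))

print(f"{rate:.1%} per year")
print(sizes)
```

That works out to roughly 26% growth per year, which a single well-provisioned server can usually absorb for a long time.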

I really disagree that this is over-engineered. This sounds like the problem is under-engineering. You suggest setting up proper infrastructure, rather than what they have now, which sounds like a shared drive and various different processes written in whatever the person knew to make something quickly.

It's currently one step up from people running notebooks locally and having no shared space for the data.

The term over-engineered has been sufficiently diluted to just mean "poorly constructed" at this point.

In my opinion, if your solution is currently not working well, then it cannot be over-engineered. Over-engineering leads to good solutions that are too expensive, not bad solutions.

I'm interested to hear other views on what over-engineering is. At the very least to get some form of enumeration.

Well, when several people are working on the same project they "share" the transformed data by connecting to the same EC2 instance. The way data is transformed is via 4 scripts, 2 notebooks and a bunch of manual operations, so no one really wants to touch that. I spent my 4th day working with a contractor to write a Makefile that reproduces all the steps for ONE project.
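
To show the idea behind the Makefile (rebuild an output only when it is missing or older than its inputs), here is a rough Python sketch. The file names and the two transformation steps are hypothetical stand-ins, not the actual scripts or notebooks.

```python
import tempfile
from pathlib import Path


def run_step(target, sources, build):
    """Rebuild `target` only if it is missing or older than any source."""
    stale = not target.exists() or any(
        src.stat().st_mtime > target.stat().st_mtime for src in sources
    )
    if stale:
        build(target, sources)
    return stale


def clean(target, sources):
    # Hypothetical transformation: strip whitespace and drop empty lines.
    lines = [l.strip() for l in sources[0].read_text().splitlines() if l.strip()]
    target.write_text("\n".join(lines))


def count(target, sources):
    # Hypothetical report: number of cleaned lines.
    target.write_text(str(len(sources[0].read_text().splitlines())))


workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw.txt"
raw.write_text(" a \n\n b \n")
cleaned = workdir / "cleaned.txt"
report = workdir / "report.txt"

# Steps are listed in dependency order, like targets in a Makefile.
steps = [(cleaned, [raw], clean), (report, [cleaned], count)]
first_run = [run_step(*step) for step in steps]   # everything is built
second_run = [run_step(*step) for step in steps]  # nothing is rebuilt
print(first_run, second_run, report.read_text())
```

Writing the steps down in order, with explicit inputs and outputs, is most of the value; the tool that runs them matters less.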

I talk about adding infrastructure in my original post, but I'm very well aware that my time is currently better spent consolidating the existing as much as I can so the clients can get correct results faster.

As a disclaimer, I work on the BigQuery team, but I wanted to point out that there is now support for transferring data from S3 to BigQuery: https://cloud.google.com/bigquery/docs/s3-transfer-intro
I did use BigQuery in the startup I was working for before, and it worked wonders for our 12TB of data. I think it would be a bit overkill in our situation, even though not having to manage a DB is great.
That’s the beauty of BQ - it scales well, but it works just fine in smaller use cases. It doesn’t get simpler than SQL.

Another item to consider is that BQ now has ML (simpler) models built in, further reducing the complexity of your pipeline: https://cloud.google.com/bigquery/docs/bigqueryml-intro

If you are not on GCP, then I’d consider AWS Athena for querying the parquet files, but you still have to structure these efficiently beforehand.

I will consider that. How about Redshift?
We had Redshift for our 23TB+ dataset and it worked great. The downside is it can get pricey, so do a cost analysis before you commit. Also know that views in Redshift are not materialized, so it’s more efficient to create physical tables of the views, which then adds maintenance overhead. The last thing I’ll add is that you’ll need to experiment with compression settings for your data. For us, a combination of ZSTD and bytedict was all we needed.
One thing I don't understand regarding resume padding like this (which I do think totally happens) is how do you justify it when someone asks questions about whether it was necessary? It could be very subtle too if they know their stuff and want to see if you know it.

It seems like this would come back to bite in any decent interview.

> how do you justify it when someone asks questions about whether it was necessary?

I know of a local company whose data solution consists of dumping into Segment > S3 files > Pentaho (IIRC) > Redshift, and then using two different BI solutions, depending on the analyst. It needs two full-time data engineers just to keep it alive.

Now the funny part: a dump of their production database is less than 2GB and that isn't going to change any time soon: they don't make that much data to begin with, and their business model doesn't scale.

The argument used for building this new infrastructure is that users used to query directly into the production database and that would allegedly slow down their web app. So they decided they should take an "industry standard" path of handling data. C-levels were too afraid to "just use SQL" and instead asked "what is Amazon doing?".

It is an absolute mess and cost the engineering team three months just to set up the application to generate the right events, but at least business people have access to data without having to stop an engineer in the hallway.

I don't think this will ever come back to bite anyone in an interview because the fact that the dataset is less than 2GB will never come up: interviewers charitably assume that it wasn't overkill or that the person isn't padding the resume.

I frankly believe that a lot of places are like that. We criticize web developers all the time for over-engineering simple apps, but everyone is doing the same in other areas, we just can't see it like we do with web apps.

I think it is worth noting that although some will over-engineer to pad their resume, there are other valid reasons why this may have happened.

It is entirely possible that the folks hired to do the job were better specialized at creating large scale solutions. Client/supplier may have assumed that as a big corp, this segment would scale quickly and a smaller solution would have to be re-engineered at a higher cost later on.

Unfortunately there is insufficient information from stakeholders to make a clear argument.