Hacker News
by u678u 1989 days ago
Incidentally, does anyone have resources for SMALL data? E.g. a few MB at a time, but requiring the same ETL, scheduling, and traceability. I'd love some lite versions of big-data tools, but they need to be simple, small, and cheap.
7 comments

Most orgs working with small data that I've seen will just fall back to the tutorial version of some big data tool (oftentimes just eating the unused infrastructure cost for something like Hadoop when they're generating biweekly reports). Most project managers have a dream of their project scaling up and want to be prepared should that dream become a reality. And if you "under-engineer" (by which I mean engineer specifically for the problem the company is facing), you'll get called out by every armchair developer for not going with the "obvious, best solution."

I'm not bitter; you're bitter. /s

Yeah, I'm bitter. I want to bring structure, tools, and discipline to small-scale data gathering, but the big tools are just too time-consuming to get up and running, and keep running, in just a few hours.

Depending on what your sources and sinks are:

* Microsoft SSIS is still there, kind of a granddaddy tool but perfectly capable of single-machine ETL

* Trifacta's Wrangler has a free version with limits

* Talend's Open Studio is free, a little clunky but works fine

* Some new players that I've played around with are Airbyte (immature but evolving quickly) and Fivetran (consumption-based pricing model, fairly extensible, but kind of biased about the sources/sinks they're interested in supporting)

* I haven't tried StreamSets or Stitch yet, but I've watched a few videos. Again, they're a little more focused on cloud and streaming data sources than traditional batch ETL, but they seem fair enough for those use cases as well

* If you want to roll your own SQL/Python/etc ETL, Airflow and Luigi are good and simple orchestrators/schedulers

The cloud providers have pretty cheap consumption-based ETL PaaS offerings, too: Azure Data Factory, AWS Glue, Google Cloud Data Fusion

Unless what you're doing is highly bespoke ETL, I'd recommend trying out the new kids on the block and seeing if you can build pipelines that suit your needs from those, because they're at the forefront of a lot of evolving data architecture patterns that are about to dominate the 2020s.

Take a look at our https://www.easydatatransform.com tool. It is a drag-and-drop data munging tool for datasets up to a few million rows. It runs locally on Windows or Mac. You should be able to install it and start transforming your data within a few minutes. It doesn't have a built-in scheduler (yet), but you can run it from the command line.

Excel Power Query is also quite lightweight, but it is pretty clunky in my (biased) opinion.

Thanks, there are enough workstation tools, but I want an automated tool that runs on a server.

I have been working with small data for a few years. We built an internal library to move data in and out of systems and then schedule this into jobs. We mostly leverage S3 and Spectrum. The major complexities we found were in scheduling, proxies, and fetching raw data from legacy applications.

No reason you can't still use Airflow for the orchestration/dependency management/scheduling. And for the processing and storage: Pandas and sqlite.

Also, I would highly, highly recommend looking into Kedro (which has Airflow integration, or you could just run your pipelines with crontab).
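To make the "small processing plus sqlite" suggestion concrete, here is a minimal sketch of one batch job that also keeps a run log for traceability. It uses only the Python standard library (`csv` and `sqlite3`) rather than Pandas so it runs anywhere; `pandas.read_csv` and `DataFrame.to_sql` could stand in for the parse and insert steps. The table names (`readings`, `etl_runs`), the `run_etl` function, and the sample data are all made up for illustration.

```python
import csv
import io
import sqlite3
from datetime import datetime, timezone


def run_etl(conn, source_name, csv_text):
    """Load one small CSV batch into SQLite and record the run for traceability."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (source TEXT, day TEXT, value REAL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etl_runs "
        "(source TEXT, started_at TEXT, rows_loaded INTEGER)"
    )
    started_at = datetime.now(timezone.utc).isoformat()

    # Extract: parse the CSV. Transform: drop rows with a missing value.
    rows = [
        (source_name, r["day"], float(r["value"]))
        for r in csv.DictReader(io.StringIO(csv_text))
        if r["value"].strip()
    ]

    # Load: replace this source's rows so a rerun does not duplicate data.
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM readings WHERE source = ?", (source_name,))
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
        conn.execute(
            "INSERT INTO etl_runs VALUES (?, ?, ?)",
            (source_name, started_at, len(rows)),
        )
    return len(rows)


conn = sqlite3.connect(":memory:")  # use a file path for a real job
sample = "day,value\n2021-01-01,3.5\n2021-01-02,\n2021-01-03,4.0\n"
loaded = run_etl(conn, "sensor_a", sample)
```

A script like this can be run from crontab on a server; the `etl_runs` table then answers "what ran, when, and how many rows did it load" without any extra infrastructure.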

Take a look at Luigi [1], which is a lightweight task orchestrator with minimal dependencies.

[1] https://github.com/spotify/luigi
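For anyone wondering what a task orchestrator of this kind buys you, the core idea is that each task declares what it requires and what file it outputs, and a task whose output already exists is skipped on rerun. The sketch below borrows the `requires` / `output` / `run` method names from Luigi, but it is a hand-rolled, standard-library illustration of the pattern, not Luigi's API or scheduler; the `Extract` and `Summarize` tasks and the `build` helper are hypothetical.

```python
import tempfile
from pathlib import Path


class Task:
    """A unit of work that is complete once its output file exists."""

    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


class Extract(Task):
    def output(self):
        return self.workdir / "raw.txt"

    def run(self):
        self.output().write_text("3\n4\n5\n")  # stand-in for a real fetch


class Summarize(Task):
    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        return self.workdir / "total.txt"

    def run(self):
        raw = self.requires()[0].output().read_text()
        total = sum(int(line) for line in raw.split())
        self.output().write_text(str(total))


def build(task, executed):
    """Run dependencies first, and skip any task whose output already exists."""
    if task.output().exists():
        return
    for dep in task.requires():
        build(dep, executed)
    task.run()
    executed.append(type(task).__name__)


workdir = tempfile.mkdtemp()
first_run, second_run = [], []
build(Summarize(workdir), first_run)   # runs Extract, then Summarize
build(Summarize(workdir), second_run)  # nothing to do: outputs exist
total = (Path(workdir) / "total.txt").read_text()
```

Because completion is tied to the output file, a job that failed halfway can simply be rerun and only the missing steps execute, which is most of what a small nightly pipeline needs.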

In AWS: Lambdas and Step Functions.

https://aws.amazon.com/step-functions/
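As a rough idea of what that looks like, a Step Functions workflow is described by a JSON state machine definition in which each `Task` state points at a Lambda function. The sketch below builds such a definition as a Python dict and serializes it; the two ARNs, the state names, and the retry numbers are placeholders I made up, and actually creating the state machine (console, CLI, or SDK) is left out.

```python
import json

# Hypothetical Lambda ARNs; replace with functions that exist in your account.
EXTRACT_ARN = "arn:aws:lambda:us-east-1:123456789012:function:extract"
LOAD_ARN = "arn:aws:lambda:us-east-1:123456789012:function:load"

definition = {
    "Comment": "Two-step ETL: extract, then load",
    "StartAt": "Extract",
    "States": {
        "Extract": {
            "Type": "Task",
            "Resource": EXTRACT_ARN,
            # Retry the extract step a couple of times before failing the run.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }
            ],
            "Next": "Load",
        },
        "Load": {"Type": "Task", "Resource": LOAD_ARN, "End": True},
    },
}

# JSON text of the state machine definition.
definition_json = json.dumps(definition, indent=2)
```

Each execution keeps a per-step history, which covers the traceability part; scheduling would come from something external, for example a scheduled EventBridge rule that starts the state machine.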

Thanks, presumably this is similar to AWS's managed Airflow too.