| HN Mirror



	by u678u 1989 days ago
	Incidentally, does anyone have resources for SMALL data? E.g. a few MB at a time, but requiring the same ETL, scheduling, and traceability. I'd love some lite versions of big-data tools, but they need to be simple, small and cheap.

7 comments

musingsole 1989 days ago

Most orgs working with small data I've seen will just fall back to the tutorial version of some big data tools (often times just eating the unused infrastructure cost for something like Hadoop when they're generating biweekly reports). Most project managers have a dream of their project scaling up and want to be prepared should a dream become a reality. And if you "under-engineer" (by which I mean specifically engineer for the problem the company is facing), you'll get called out by every armchair developer for not going with the "obvious, best solution."

I'm not bitter; you're bitter. /s


u678u 1989 days ago

Yeah, I'm bitter. I want to bring structure, tools and discipline to small-scale data gathering, but the big tools are just too time-consuming to set up and keep running when you only have a few hours.


kthejoker2 1988 days ago

Depending on what your sources and sinks are:

* Microsoft SSIS is still there, kind of a granddaddy tool but perfectly capable of single-machine ETL

* Trifacta's Wrangler has a free version with limits

* Talend's Open Studio is free, a little clunky but works fine

* Some new players that I've played around with are Airbyte (immature but evolving quickly) and Fivetran (consumption-based pricing model, fairly extensible, but kind of biased about the sources/sinks they're interested in supporting)

* I haven't tried Streamsets or Stitch yet, but I've watched a few videos, again, a little more focused on cloud and streaming data sources than traditional batch ETL, but seem fair enough for those use cases as well

* If you want to roll your own SQL/Python/etc ETL, Airflow and Luigi are good and simple orchestrators/schedulers

The cloud services have pretty cheap consumption-based ETL PaaS offerings, too: Azure Data Factory, AWS Glue, Google Cloud Data Fusion

Unless what you're doing is highly bespoke ETL, I'd recommend trying out the new kids on the block and seeing if you can build pipelines that suit your needs from those, because they're at the forefront of a lot of evolving data architecture patterns that are about to dominate the 2020s.


hermitcrab 1989 days ago

Take a look at our https://www.easydatatransform.com tool. It is a drag and drop data munging tool for datasets up to a few million rows. It runs locally on Windows or Mac. You should be able to install it and start transforming your data within a few minutes. It doesn't have a built in scheduler (yet), but you can run it from the command line.

Excel Power Query is also quite lightweight, but it is pretty clunky in my (biased) opinion.


u678u 1989 days ago

Thanks. There are enough workstation tools, but I want an automated tool that runs on a server.


kfk 1989 days ago

I have been working with small data for a few years. We built an internal library to move data in and out of systems and then schedule this into jobs. We mostly leverage S3 and Spectrum. The major complexities we found were in scheduling, proxies and fetching raw data from legacy applications.


imcoconut 1989 days ago

No reason you can't still use Airflow for the orchestration/dependency management/scheduling. And for the processing and storage: Pandas and SQLite.

Also would highly, highly recommend looking into Kedro (which has Airflow integration, or you could just run your pipelines with crontab).
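A minimal sketch of that kind of small, cron-friendly pipeline, using the standard library's csv module in place of Pandas to stay dependency-free. The `run_etl` function, the `readings` and `etl_runs` tables and the column names are all assumptions for illustration; the `etl_runs` table is just one possible way to get the traceability asked about at the top of the thread.

```python
import csv
import hashlib
import io
import sqlite3
from datetime import datetime, timezone


def run_etl(conn, source_name, csv_text):
    """Load one small CSV batch into SQLite and record the run (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (sensor TEXT, value REAL, batch_id TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etl_runs "
        "(batch_id TEXT PRIMARY KEY, source TEXT, row_count INTEGER, loaded_at TEXT)"
    )
    # A content hash identifies the batch, so re-running the same file is a no-op.
    batch_id = hashlib.sha256(csv_text.encode("utf-8")).hexdigest()[:16]
    seen = conn.execute(
        "SELECT 1 FROM etl_runs WHERE batch_id = ?", (batch_id,)
    ).fetchone()
    if seen:
        return 0

    # Transform: normalise sensor names, skip rows that are malformed.
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        try:
            sensor = record["sensor"].strip().lower()
            value = float(record["value"])
        except (KeyError, TypeError, ValueError, AttributeError):
            continue
        rows.append((sensor, value, batch_id))

    with conn:  # one transaction: data rows and the audit row commit together
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
        conn.execute(
            "INSERT INTO etl_runs VALUES (?, ?, ?, ?)",
            (batch_id, source_name, len(rows), datetime.now(timezone.utc).isoformat()),
        )
    return len(rows)
```

Wrapped in a script that reads a file and opens an on-disk database, something like this could be run from crontab, and `etl_runs` then answers "what was loaded, from where, and when".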


throwaway7281 1988 days ago

Take a look at Luigi, which is a lightweight task orchestrator with minimal dependencies.

[1] https://github.com/spotify/luigi
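For a feel of the model, here is a standard-library sketch of the idea Luigi is built around: each task declares what it requires and what file it outputs, and a task whose output already exists is treated as done. This is not Luigi's API; the `Task` base class, the `build` function and the two example tasks are hypothetical names made up for illustration.

```python
import tempfile
from pathlib import Path


class Task:
    """Hypothetical base class; NOT Luigi's API, only the same completion rule."""

    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Skip a task whose output exists; otherwise build its upstream tasks, then run it."""
    if task.output().exists():
        return
    for upstream in task.requires():
        build(upstream, log)
    task.run()
    log.append(type(task).__name__)


class Extract(Task):
    def output(self):
        return self.workdir / "raw.txt"

    def run(self):
        # Stand-in for pulling a small batch from a source system.
        self.output().write_text("3\n4\n5\n")


class Summarise(Task):
    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        return self.workdir / "total.txt"

    def run(self):
        raw = self.requires()[0].output().read_text()
        numbers = [int(line) for line in raw.split()]
        self.output().write_text(str(sum(numbers)))
```

Calling `build(Summarise(some_dir), [])` twice only does work the first time, which is what makes this style safe to re-run from a scheduler after a failure.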


ABeeSea 1989 days ago

In AWS, Lambda functions and Step Functions.

https://aws.amazon.com/step-functions/
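As a rough sketch of what one step could look like: a Python Lambda handler takes an `event` and a `context`, and when it is invoked as a state machine task, the state's input typically arrives as `event` and the return value becomes the state's output. The `transform_handler` name and the `records`/`amount` payload shape below are assumptions for illustration, not anything AWS prescribes.

```python
def transform_handler(event, context):
    """Lambda-style handler: clean one small batch and pass it to the next step."""
    records = event.get("records", [])
    # Drop records with a missing amount; coerce the rest to rounded floats.
    cleaned = [
        {"id": r["id"], "amount": round(float(r["amount"]), 2)}
        for r in records
        if r.get("amount") not in (None, "")
    ]
    return {
        "records": cleaned,
        "dropped": len(records) - len(cleaned),
    }
```

Because the handler is a plain function, it can be exercised locally with a dict and `None` for the context before wiring it into a state machine.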


u678u 1989 days ago

Thanks, presumably this is similar to AWS Airflow too.
