by ellimilial 1861 days ago
Very interesting how GitHub keeps coming up with more and more interesting 'actions' that turn repos into 'platforms' and move us closer to a serverless future.

@idan how does it scale with the size (including storage)? Is 'a billion rows' a goal or an actual tested use case?


Hi! Jason, CTO @ GitHub, here

You’re getting at the heart of Actions. Actions was never intended to be “CI” or any such vertical capability. It has always been intended to be a platform that exposes capabilities like CI or packages out to the world, but the underlying serverless, very flexible workflow platform is the bedrock upon which we want to build the future.

My long-held view has been that the only real ‘competitor’ to what I want GitHub to be is AWS and the other major cloud infra companies, and if you believe in that view along with me, you likely see why the past four years of GitHub and the next few years of GitHub make a lot of sense.

And it makes even more sense when you squint just a bit and realize what Codespaces + repos + Actions (CI/security/packages + other things) + automated workflows would eventually do. Now imagine a bit further out into the future and what it would mean if we understood your production workloads a bit more.

Hi Jason, thank you very much for the background and the explanation. It is fascinating to see the progress in this direction.

I started raising my eyebrow (in the best possible sense) upon seeing parts of the tooling that are very similar to ours but simpler and, more importantly, without moving parts. We operate in the biomedical data space and deal with flat/static data a lot; for example, we power https://biokeanos.com with data-in-repo, so Flat Data was immediately interesting.

It is really inspiring to see GitHub Actions making a foray in this direction, definitely something to keep an eye on.

If this is the vision, please let us write actions directly in TypeScript or some legitimate programming language (not YAML). It is currently impossible to debug and reuse action code.

I am working on a company-wide migration off GitHub Actions because it cannot scale. Full programmatic control and local debugging that allow me to reuse and test code in a single repository would have justified staying with GH.

You can already do that in a few ways like with docker or the run command. It’s been there from the beginning. YAML is just the config.

https://docs.github.com/en/actions/reference/workflow-syntax...

Those only work at the actions level. I need them at the workflow level. The biggest issue here is the differentiation between actions and workflows: I need my workflows to be treated as actions so I can reuse entire segments of them. This isn't possible without copy/pasting code.

I also need arbitrary logic to configure and run my workflows (like an else branch... that would be nice).

This is in large part why TeamCity and Jenkins beat out Actions when we reevaluated.

The YAML file is not a config file in the workflows I have written; it is the top-level program calling many other programs. The syntax limitations (and unsafety of its interpreter) make that unwieldy. But it's not possible to work around this without giving up every other action in the marketplace, which kind of defeats the purpose of using actions at all.

It doesn't scale! This isn't a replacement for databases.

Our take on this is about "working sets" of data — if you have billions of rows, that's a lot bigger than a working set! At some point, you have to query, filter, and aggregate to get your data down to a chewable size for work.

You can do that in your code too, and sometimes that's absolutely the right approach! But often it's easier to push that work to "outside your code," and that is what Flat is great for.
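To make the "working set" idea concrete, here is a minimal sketch of the filter-then-aggregate step in plain Python. It is not how Flat itself does it; the `working_set` helper, the column names, and the sample rows are all hypothetical stand-ins for a much larger upstream source.

```python
import csv
import io
from collections import defaultdict


def working_set(rows, keep, key, value):
    """Filter rows with `keep`, then sum the `value` column per `key`."""
    totals = defaultdict(float)
    for row in rows:  # streams row by row, never holds the full input
        if keep(row):
            totals[row[key]] += float(row[value])
    return dict(totals)


# Hypothetical sample data; a real source would be far larger.
SAMPLE = """region,year,cases
north,2020,10
north,2021,5
south,2021,7
south,2021,3
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
summary = working_set(reader, lambda r: r["year"] == "2021", "region", "cases")
print(summary)  # {'north': 5.0, 'south': 10.0}
```

The point is only that the small `summary` dict is what you would commit to the repo, not the raw rows behind it.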

Thank you for the response and for clearing up the 'billion rows' / surly bonds confusion I had from reading the project's Why Flat Data? section. I think I understand the target use case slightly better now.

One of the strong arguments for object-like storage (S3 etc) in the context of plain / flat data is scalability and availability for large scale processing frameworks. Databases are only occasionally relevant.

It's storing the files in the repository, which has a file size limit of 100MB. I think the repositories themselves have a soft limit of 5GB and a hard limit of 100GB.
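If you want to check a checkout against that per-file limit before committing, something like the sketch below would do it. The `oversized_files` helper is hypothetical, and treating "100MB" as 100 * 1024 * 1024 bytes is an assumption; adjust the constant if the real limit is counted differently.

```python
import os

# Assumption: the 100MB per-file limit mentioned above, counted in binary megabytes.
FILE_LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(root, limit=FILE_LIMIT_BYTES):
    """Return sorted (path, size) pairs for files under `root` larger than `limit`."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip git's own object store; only working-tree files matter here.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit:
                found.append((path, size))
    return sorted(found)
```

Running `oversized_files(".")` from the repo root lists anything that would be rejected under that assumed limit.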