by frou_dh 2017 days ago
I remember seeing this used inside config files for the Caddy webserver.

Google seem to like these executable config languages because they've got another open source one ("Starlark") a few notches up in expressivity.

5 comments

I'm an ardent supporter of executable config languages, especially for the infrastructure-as-code space. The only thing special about this space is that configs tend to be very large, so you're more likely to run into reuse issues. The space markets itself as "it's just YAML!", but inevitably all of that copy/pasted YAML becomes unwieldy and you want reusability. At that point, you have a few distinct options:

1. Build an AST on top of YAML a la CloudFormation. Now you're programming in YAML, hurray!

2. Extend your static language with executable features a la Terraform/HCL, basically reinventing (and very badly, at that) more traditional language features

3. Use text templates, a la Helm--now you can generate syntactically invalid configuration! (and absolutely trivially, at that)

4. Use an expression language (familiar, ergonomic) a la Pulumi, Starlark, Nix, Nickel, Dhall, etc

Note that in these conversations, someone inevitably shouts "use the simplest tool for the job!" ignoring that static configuration languages (and options 1-3 above) are strictly more complex for the reusability use cases outlined above.

EDIT: Pulumi isn't an expression language; rather, it lets you use real languages to generate configuration, and these languages often include powerful expression features. AWS's CDK is also in this category.
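
A minimal sketch of the gap between option 3 and option 4, using only the Python standard library. The template, the field names, and the helper functions are hypothetical, and this is not how Helm itself renders charts; it only illustrates that pasting values into text can yield syntactically invalid output, while serializing a structured value cannot.

```python
import json
from string import Template

# Hypothetical template for a tiny JSON config; the field names are illustrative.
TEMPLATE = Template('{"name": "$name", "replicas": $replicas}')


def render_with_template(name: str, replicas: int) -> str:
    """Text templating: values are pasted into text and nothing validates the result."""
    return TEMPLATE.substitute(name=name, replicas=replicas)


def render_with_data(name: str, replicas: int) -> str:
    """Structured generation: build a value and let the serializer handle escaping."""
    return json.dumps({"name": name, "replicas": replicas})


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


tricky = 'web "blue"'  # a value that contains quote characters
templated = render_with_template(tricky, 3)
structured = render_with_data(tricky, 3)
```

Here `is_valid_json(templated)` is false because the embedded quotes are never escaped, while `structured` parses back to the original value.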

> someone inevitably shouts "use the simplest tool for the job!"

I think one problem is that configuration needs start out simple and evolve to complex - as opposed to being obvious from the start you'll need dozens of discrete components to deploy. At that early stage, anything beyond a few lines of YAML seems like definite overkill.

Eventually, it becomes clear that it's very hard to maintain, but by then there's thousands of lines of hard-knocks battle-tested, working production config to try and basically recreate from scratch, without breaking anything.

Until you've gone through this process once (or maybe a couple times..) it's hard to see why you should make that initial leap to a much more complex, tools-required workflow. "But eventually, we will probably need..." is a tough sell against someone arguing to do the simplest thing.

That’s where experience comes in. We should know that certain domains (e.g., Kubernetes configs) are going to get unruly fast and we oughtn’t waste time with YAML. I don’t think this will be a controversial opinion in a few years' time.

Also, more generally: it makes sense to KISS, and the key risk here isn't somebody using YAML or JSON or whatever initially. Even where experience shows that's insufficient, it's just not that costly either. The question is what to do when that becomes unwieldy. And I think it's pretty clear that kinda-sorta-programming that tries to incrementally extend static config languages, but only slightly, doesn't work well and is a bad idea. It's inconvenient, it results in many of the same issues as a full programming language, and it's often really inconsistent in its expressiveness: for any given application you're likely to run into limitations.

I think it's wise to try and skip as many of those intermediate stages as possible. Of course, that's not a clear-cut strategy either, because what's "as possible"? Exactly how high up the language chain do you need to go? Conversely, which language (and environment) features are too powerful, rendering the language difficult to contain?

> I'm an ardent supporter of executable config languages

Me too. I feel like there should be some eponymous law about this. Every declarative language that starts out trumpeting "simplicity and not being Turing complete is a feature!" ultimately grows features until it is an imperative language, or gets replaced by one that is.

If you're gonna get there anyway, you may as well design for that instead of bolting on features poorly after the fact.

What about the option of just writing a regular, one-off program in a regular programming language, the output of which is your baked YAML config; and then having a pipeline that involves running that config-generator program, piping its output to your orchestrator of choice?

Nearly every programming language has a YAML serialization library†. And before that serialization happens, your config can be expressed using the regular-ass coding features of your program, however you like. (For optimal clarity, I personally would suggest creating a builder DSL and using it.)

† Technically a language doesn't even need a YAML serialization library to emit valid YAML; because valid JSON is also valid YAML. You can just serialize to JSON on your end, and feed the result into anything that's expecting YAML.
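
A minimal sketch of that workflow, using only the Python standard library. The `service` builder, the schema, and the `registry.example` image names are made up for illustration; the output is JSON, relying on the point above that a YAML consumer will accept it.

```python
import json
import sys


def service(name: str, image: str, replicas: int = 1) -> dict:
    """Build one hypothetical service entry; the schema is invented for this sketch."""
    return {"name": name, "image": image, "replicas": replicas}


def build_config() -> dict:
    """Use ordinary language features (loops, functions, defaults) for reuse."""
    names = ["api", "worker", "cron"]
    return {
        "services": [
            service(n, image=f"registry.example/{n}:1.0", replicas=2 if n == "api" else 1)
            for n in names
        ]
    }


def render(config: dict) -> str:
    """Serialize to JSON; sorted keys keep the output stable between runs."""
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Pipe this output to whatever tool expects the baked config.
    sys.stdout.write(render(build_config()) + "\n")
```

The generator has no knowledge of the consumer beyond the shape of the data, so the same pattern applies to any tool that reads JSON or YAML.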

At that point why use YAML at all? If it's generated by a program and fed to a program, you're better off using protobuf or something like that. In fact, since you're probably using the same language on both ends, why not just write a regular value in your language?

This probably sounds like a strawman, but it's not. It's how a lot of e.g. Python projects are configured - the "config" file is just a normal bit of code that gets run to produce a value. Unless you're using a programming language that absolutely sucks at expressing plain values (e.g. C or Java), it's much better than separate config files, IMO.

> At that point why use YAML at all?

Ideological answer: For the same reason HTTP/2.0’s binary protocol didn’t instantly obviate/deprecate HTTP/1.0’s text protocol. Text has advantages: text is debuggable, and prototypable. If the interface between two programs is a text based declarative language, you can audit that text, diff that text, edit that text to see how changes affect the result, mock one side or the other by producing or consuming that text, etc. “GitOps” style config management would never work if config was all opaque binary blobs. These are all reasons that major software projects standardize on YAML or other widely-supported textual data serialization formats for their config.

Pragmatic answer: because we’re talking about production configuration management here, which is, 99% of the time, about configuring and managing the third-party black-box components in your stack, not your own components. Your own business layer usually can be configured conventionally, with minimal explicit config, for your use case, since you built it to work idiomatically for that use case. It’s all the third-party stuff that has an impedance mismatch to your use-case assumptions, translating to needing tons of config to do what you need.

And, obviously, if you don’t control the other end, you don’t decide how the other end does its config. Usually, these days, it’s YAML (or TOML) — for the ideological reasons mentioned above.

Example: Kubernetes. Big consumer of complex YAML. Many people try to template that YAML. Much simpler and less error-prone to just write a program to generate said YAML. No reason to assume you’re writing in whatever language the k8s orchestrator is written in. (In fact, there are multiple orchestrators, written in different languages, and the shared YAML resource spec is the only formal interface they share.)

> Ideological answer: For the same reason HTTP/2.0’s binary protocol didn’t instantly obviate/deprecate HTTP/1.0’s text protocol. Text has advantages: text is debuggable, and prototypable. If the interface between two programs is a text based declarative language, you can audit that text, diff that text, edit that text to see how changes affect the result, mock one side or the other by producing or consuming that text, etc.

I can see the argument for using a textual format (although I think it's weaker than you say; if we're generating this config with code then we don't want to diff or edit the generated config), but YAML seems like a singularly poor choice if you want reliable diffs and editing; it's like picking tag-soup HTML. Straight JSON (ideally with a schema), TOML or even XML seems like a better bet if you're generating it programmatically.

> And, obviously, if you don’t control the other end, you don’t decide how the other end does its config.

Right, in that case it's all moot. I took GP to be talking about what formats these tools should use. IMO if the tool is intended to consume a machine-generated config then it would be better to use a machine-oriented config format. I think the option of something like protobuf (which is language-independent) is underappreciated, but even restricting ourselves to textual options, something stricter than YAML seems like a better bet.

But the third-party tool frequently isn’t intended to (only) consume machine-generated config. It’s usually built to consume a format that could equally be machine-generated or hand-authored. Usually with an emphasis on hand-authoring, where machine-generation is an automation over hand-authoring that will only need to happen as one scales; and so high-complexity machine-generation will only be relevant to the most enterprise-y of integrators.

Other examples of formats like this, that are hand-authored in the small but generated in the large: RSS, SQL, CSV.

Again, Kubernetes is a prime example of this. K8s config YAML is designed with the intention of being hand-authored and hand-edited. It’s only when devs or their tools need to auto-generate entire k8s cluster definitions, that you begin needing to machine-generate this YAML. This generated YAML is expected to still be audited by eye and patched by hand after insertion, though, so it still needs to be in a format amenable to those cases, rather than in a format optimal for machine consumption.

> if we're generating this config with code then we don't want to diff or edit the generated config

Look more into GitOps. The idea behind it is that whatever tooling you’re using to generate config is run and the resulting config is committed to a “deployment” repo as a PR; ops staff (who don’t necessarily trust the tooling that generated the config) can then audit the PR, and the low-level changes it describes, before accepting it as the new converged system state. It puts a human veto in the pipeline between machine-generated config and continuous deployment; and allows for debugging when upstream tweaks aren’t having the low-level side-effects on system state one would expect.
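
A rough sketch of the review step, assuming the generated config is rendered as text with a stable key order so that a diff shows only real changes. The file names and config fields are hypothetical, and a real GitOps setup would rely on the repository's own PR diff rather than `difflib`.

```python
import difflib
import json


def render(config: dict) -> str:
    """Render with sorted keys and fixed indentation so output is stable between runs."""
    return json.dumps(config, indent=2, sort_keys=True) + "\n"


def review_diff(old: dict, new: dict) -> str:
    """Return a unified diff of two rendered configs, roughly what a reviewer would see."""
    return "".join(
        difflib.unified_diff(
            render(old).splitlines(keepends=True),
            render(new).splitlines(keepends=True),
            fromfile="deploy/config.json (committed)",
            tofile="deploy/config.json (generated)",
        )
    )


old = {"service": "api", "replicas": 2, "image": "api:1.0"}
new = {"service": "api", "replicas": 3, "image": "api:1.0"}
diff_text = review_diff(old, new)
```

Only the `replicas` line changes in `diff_text`, which is the kind of low-level change an ops reviewer can accept or veto.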

Things like AWS CloudFormation require YAML (or JSON) input, so there's no real choice on what you emit.

But writing the YAML is fiddly and annoying, so that's a good example of something where it is better to generate it via troposphere (a Python module) or some similar system.

To be less specific, I guess the answer is that sometimes you don't control both ends (the part that emits and the part that consumes). Having fought Ansible and similar tools, I'd never want to write YAML by hand for non-trivial purposes if I could script it instead.

Just write JSON and pretend it's YAML. YAML is a superset of JSON so there's no need to generate "nice" YAML if there isn't a human reading or writing it.

It’s still good for humans to be able to debug it, and there’s no downside to generating YAML over JSON (I say this as someone who typically prefers JSON).

This is even more true for Ruby. The language is famous for the ease of creating DSLs because of block passing and optional parentheses. Examples: Puppet, Chef, Vagrant, Rails' configuration files. I still remember the joy of not configuring a project with XML, coming from Java Struts in 2006.

> It's how a lot of e.g. Python projects are configured - the "config" file is just a normal bit of code that gets run to produce a value.

Which is the root of a ton of different problems and issues, and is generally regarded as a bad idea. See PEP 518 and pyproject.toml vs setup.py.

After trying to write complex loop statements and conditionals in YAML for Ansible I had this thought as well. It's nice when declarative configurations work, but once they don't and you have to try to write a real program in YAML you'll want to pull your hair out.

This was my exact experience as well. I was an early user of Ansible and loved that it was YAML. Then had to deal with Jinja inside YAML. And finally a syntax for loops appeared!

I have run into the same type of issues with salt.

Yeah, I do this for quite a few things. It works pretty well. In my mind that was implicit in the fourth point, but I didn’t spell it out.

> I'm an ardent supporter of executable config languages

I agree, but it becomes very important to limit scope. For example, Azure templates allow looping and conditionals and all sorts of fancy stuff. Approaching a typical library of ARM templates is a massive undertaking: reading a JSON (or YAML) if statement is not ergonomic in any way - causal relationships can be multiple screens (or files) apart because of the sheer amount of JSON required to represent executable code.

It should be kept relatively lightweight, with stuff like CloudFormation GetAtt to glue deployed things together. Anything more complex should be solved with tooling designed for computation, i.e. programming languages (that emit config, e.g. Pulumi).

I am with you, I wish the world would just adopt Lua as their config files.

I haven't used Lua, but I've used Starlark extensively and I will say that static typing is a boon, especially in the infra-as-code space where the feedback loop can be very long.

Good news. Starlark-go finally supports protocol buffers, so at least the output of your script gets some type checking.

Then you'll need config files for your config files, and we're back to square one.

The scope and capabilities of the config language need to be limited, otherwise we lose the ergonomic benefits of configuration in the first place.

I sort of agree with this, except that I don't think it's square one. The ability to change config without rebuilding your artifact is one of the advantages of separate config files, and using an embedded language like Lua wouldn't remove this advantage.

Completely agree. Executable configuration in a language with strong declarative programming support is superb. I've had great success embedding Lua into a C++ application for exactly this purpose.

The moment you encode alternation in the configuration file, it's time to think about biting the whole Turing-complete bullet.

I would love to know your thoughts about https://github.com/pragmalang/pragma

From point 1, this is my favorite wat example in Ansible...

  - file:
      path: /etc/foo.conf
      owner: foo
      group: foo
      mode: 0644

Starlark is Python (thanks Guido!), while CEL is designed specifically to not be Turing complete or have constructs like loops, etc.

"CEL evaluates in linear time, is mutation free, and not Turing-complete. This limitation is a feature of the language design, which allows the implementation to evaluate orders of magnitude faster than equivalently sandboxed JavaScript."

As mentioned, the goals are security policies (it was first used internally as the Security Rules for Cloud Storage for Firebase and the Cloud Firestore) and proto contracts (e.g. you could define addons to your proto to specify the data matched certain behavior):

I forget the exact syntax for the contract, but it looked something like this...

```
message person {
  @contract(matches(/* RE2 phone number regex */))
  string phone_number = 1;
  ...
}
```

That data could enforce client side checks as well as be used server side (in different implementation languages).

I always wanted to see it combined with the proto to Firebase Security Rules generator (https://firebaseopensource.com/projects/firebaseextended/pro...) to do client and server validation.

> Starlark is Python (thanks Guido!), while CEL is designed specifically to not be Turing complete or have constructs like loops, etc.

Sort of. Starlark doesn't (or at least didn't originally) support recursion or while loops or a number of other structures. There's also a few other differences that make starlark "better" for configs (some immutability is different, there's no such thing as a `class`, etc.)

I still support loops in a configuration language

    for x in sequence:
      generate_complex_thing(x)

or

    [generate_complex_thing(x) for x in seq]

are better than a lot of the more declarative approaches (such as the various contextual approaches of a number of alternative langs) which get hard to reason about because they represent implicit global state.

Starlark is also designed specifically to not be Turing complete (it is only a subset of Python).

It's not really a config language. It's directed at fast execution of expressions and being able to provide some measure of type safety.

"Executable" configuration languages (most "non-executable" configuration language parsers are push down automatons that execute the configuration) are handy, but without strong coding standards and good discipline, the line between business logic and configuration tends to blur over time.

Cartesian product, map, and reduce operations over finite sets and lists are really handy in configuration. ("For each server in SetA look at each path in SetB and ...") But, if you find yourself starting to write general loops (as opposed to loops implementing map, reduce, and Cartesian product in languages that don't have them built-in), it's a sign you're starting to blur the line between configuration and business logic.
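
A small sketch of those three operations over finite lists, in plain Python. The server names, paths, and the `add_path` helper are hypothetical; the point is only that each step is a bounded transformation of fixed inputs rather than a general loop.

```python
from functools import reduce
from itertools import product

servers = ["web1", "web2"]         # SetA (hypothetical hosts)
paths = ["/var/log", "/srv/data"]  # SetB (hypothetical paths)

# Cartesian product: one rule per (server, path) pair.
rules = [{"server": s, "path": p} for s, p in product(servers, paths)]

# Map: derive a new field for every rule without mutating the inputs.
labelled = [{**r, "id": f"{r['server']}:{r['path']}"} for r in rules]


def add_path(acc: dict, rule: dict) -> dict:
    """Fold step: return a new summary with this rule's path added to its server."""
    return {**acc, rule["server"]: acc.get(rule["server"], []) + [rule["path"]]}


# Reduce: fold the rules into a per-server summary.
paths_by_server = reduce(add_path, rules, {})
```

All three results are determined entirely by `servers` and `paths`, so the config stays deterministic and easy to test.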

Unit-testing configurations is difficult, especially if they can be non-deterministic (depend on date/time/random()) and aren't modular.

In some sense, all programs with configuration files are really interpreters for the language of their configuration files. (As mentioned before, many of these abstract machines are just push down automata.) Taken too far, the configuration becomes the real program.

I've seen a (now retired) automated trading system with a powerful XML-based configuration language where a few times people got themselves into trouble (and caused trading losses) when their complex tower of configuration fell over. Part of the problem was there existed a few people who weren't trusted to write application logic, but who were trusted to "just update configurations". When the only tool some of your people are allowed to use is a hammer, hammer marks start mysteriously showing up everywhere. Additionally, this was over 10 years ago, and prior to these trading losses, configuration underwent less stringent review. I don't think my experience was atypical.

I've also seen configuration loading get stuck because someone added some code to the config to hit a REST endpoint in the middle of the configuration file. Ideally, you'd leave any I/O to the main program logic, where it's easier to perform the I/O asynchronously, or otherwise non-blocking.

Deterministic non-Turing-complete immutable "executable" configuration languages (or at least ones where it's difficult to get unbounded recursion) tend to be a happy medium. Also, declarative rather than imperative configuration languages tend to be easier to read.

Back when I was a developer in web search infra at Google, I vaguely remember once or twice using a language (maybe Borg's config language, borgconfig) that completely lacked mutability and essentially used object prototyping (A is created as a copy of B, with differences specified at object creation time.)
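
The prototyping idea can be approximated in Python with frozen dataclasses. This is only an illustration of the pattern described above, not the syntax or semantics of borgconfig; the `Job` type and its fields are invented for the sketch.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Job:
    """Hypothetical job description; the field names are illustrative only."""
    name: str
    binary: str
    replicas: int = 1
    memory_mb: int = 512


# B is the prototype.
base = Job(name="search-base", binary="/bin/searchd")

# A is created as a copy of B, with the differences given at creation time.
canary = replace(base, name="search-canary", replicas=2)

try:
    canary.replicas = 10  # frozen dataclasses reject mutation after creation
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because nothing can be changed after creation, every difference between `canary` and `base` is visible at the one place where `canary` is defined.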

> When the only tool some of your people are allowed to use is a hammer, hammer marks start mysteriously showing up everywhere.

This ... yes. Being the ops guy backing up second line support at an ISP for a while brought me many examples of this and inspired many in-house tools for them to use instead.