by lolxmlhateonhn 1004 days ago
That’s called “schema.” It’s a pretty out there concept, I know, the idea that you should be burdened to document the structure and intent of your data for both human and mechanical consumption. I realize I’m being forceful here, but keep in mind, you are compensated incredibly well. If it takes you another hour to save ten down the road, earn the pay. I don’t understand this aversion to tough stuff - which seems to be pretty popular here - and I’m starting to think I should interview for it a bit harder than I already do.

The problem with this thinking is that you, personally, are then forbidden from arguing for the use of a strictly typed language for development because it’s the opposite position to the one you’re holding here. The exact reasons we use languages like those are the same reasons we should be explicit with our schemas. It’s unfortunate that many people try to argue both sides due to the convenience, as you say, of a single line parse, when years of experience have taught that duck anything is a bug fountain. (Not saying you are arguing both, by the way, it’s just common.)

Try reading back your gripe with the following in mind: do I have a stronger complaint than “it’s difficult” here? I think you’ll find that you don’t convey one effectively.

6 comments

I have seen the same aversion and lack of pragmatism so many times it has started to impact my motivation.

Examples:

1. When taking over a project, developers glanced at the code and decided it would be better to spend 6 months rewriting from scratch. The end result was not more readable than the original solution and introduced a new set of issues.

2. Many put too much emphasis on the worst case scenario and do not consider the average case. I worked a lot with many different XML formats and most of them were OK. Not "fun", but simply OK. I have to admit that I did struggle with some complex files, but there were plenty of times where the XML was simple, readable and easy to work with.

3. When comparing programming languages they often focus on a few features and don't think about productivity in general. Languages like Java can actually be very productive, even if your favorite language can reduce null checks.

There's a happy medium to be found here - balancing ease of use, while avoiding the bug fountain. Having said that, IMHO we should err on the side of avoiding the bug fountain. Make it as simple as possible, but not simpler.

Robbing the future with deceptive over-simplicity, by creating a bunch of difficult future debugging scenarios and possible footguns, is a far worse evil than missing out on maximally convenient onboarding (which can be foolishly optimized for, for the sake of short-term popularity). All such crap-tastic solutions will eventually need to be replaced again anyway, creating an endless, hellish, slow churn.

I like rigid schemas for write and supporting both loose and rigid reads. Generally the friction in a system like that is something like the ad hoc SRE trying to load up a type stack just to interpret a protobuf log. Or Tableau. Stuff like that. That’s where people get annoyed. I think you’re right and there’s a lot of unexplored directions of simplicity.

Computers that understand the shape of your data are very helpful friends when you’re pursuing goals like data locality.

Agreed. This is the same mentality that brought us MongoDB.

"Don't want to have to create a schema? Normalization is confusing? No problem, just chuck JSON in here instead!"

The problem is that XML doesn't map that well to objects and classes, even with annotations. And XSD is quite a heavyweight format with terrible UX.

I'm pretty much consistently on the strict/type-safe side of the "should we have a schema" debate, but there are better options out there to maintain a consistent schema for data interchange.

JSON is simpler to map, faster to parse, simpler, more lightweight, and less dangerous to add to an online app[1]. You can also use a schema like JSON Schema for inter-app compatibility. It has replaced XML as the standard data interchange format for a reason. It's not great for configuration files, and it's definitely being overused nowadays, but it is a solid data interchange format.

Then you've got binary formats like Protocol Buffers which are even more lightweight and faster to parse and (generally) have schemas that map better to typed languages.

I think the OP has it right: XML is very well-suited as a generic document format. I wouldn't compare it to YAML, because in a perfect world they shouldn't compete in the same categories: Nobody should use XML for configuration or YAML for documents. And I also agree that there are better formats than YAML. I like the ease of writing indented multiline strings in YAML, but the fuzzy typing is pretty terrible. At least YAML 1.2 fixed the Norway problem.

[1] https://en.wikipedia.org/wiki/Billion_laughs_attack

> JSON is simpler to map, faster to parse, simpler, more lightweight

JSON is no simpler (nor more complicated) than XML, if you're using a library. It certainly isn't faster - a SAX parser is faster than a JSON DOM parser (and JSON streaming parsers equivalent to SAX are rare).
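
For anyone who hasn't touched SAX: a minimal sketch of the event-driven style, using Python's stdlib xml.sax. The element and attribute names (item, qty) are made up for illustration, and this shows only the shape of the approach, not a benchmark.

```python
import xml.sax


class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> elements and sums their qty attribute as events arrive."""

    def __init__(self):
        super().__init__()
        self.items = 0
        self.total_qty = 0

    def startElement(self, name, attrs):
        # Called once per opening tag; no document tree is built in memory.
        if name == "item":
            self.items += 1
            self.total_qty += int(attrs.get("qty", "0"))


def count_items(xml_bytes):
    handler = ItemCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.items, handler.total_qty


# Hypothetical sample document.
SAMPLE = b"<order><item qty='2'/><item qty='5'/><note>rush</note></order>"
print(count_items(SAMPLE))
```

The handler only keeps two counters, so memory use stays flat no matter how long the document is. A DOM-style parse (XML or JSON) has to hold the whole structure at once.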

> It has replaced XML as the standard data interchange format for a reason.

The reason isn't technical. It's competency (or lack thereof). Most interchange formats are for websites in browsers, where JSON performs well, since there's no native way to parse XML in the browser. So that mindshare from the web has leaked out to other arenas.

That's crazy talk. JSON is very simple. The irreducible complexity of JSON is lower than the irreducible complexity of XML, and the irreducible complexity of XML is lower than that of YAML.

JSON is simple, but the common idiom is to read it all at once into memory and forget the difficult stuff (which encoding is this JSON in, anyway?). Then you skip the other difficult stuff (is this a string or a date type?)... So it's faster (maybe) to parse, but it doesn't scale to large datasets and your application will need additional deserialization logic. A broad, sweeping performance statement like this is just spreading FUD.
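
To make the "string or date" point concrete, here is a small sketch using only the stdlib. The field names (order, shipped) and the parse_order helper are hypothetical; the point is only that the typing step lives in application code.

```python
import json
from datetime import date

# Hypothetical payload; JSON itself has no date type.
raw = '{"order": 17, "shipped": "2023-08-06"}'
doc = json.loads(raw)

# The parser can only hand back a plain string here.
print(type(doc["shipped"]).__name__)


def parse_order(d):
    """The extra deserialization logic the format does not do for you."""
    return {
        "order": int(d["order"]),
        "shipped": date.fromisoformat(d["shipped"]),
    }


order = parse_order(doc)
print(order["shipped"].year)
```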

JSON is hardly very simple given that it is a subset of JavaScript. There are also problems with different behaviors of JSON parsers: https://portswigger.net/daily-swig/research-how-json-parsers...

Consider that YAML is a superset of JSON. Then realize that any YAML parser suffers from JSON's faults plus YAML's own.

Regarding security: entity expansion bugs were fixed long ago. On the other hand, people still use eval() on JSON objects to parse them. So I don't get that. JSON schema: which one? AFAIK it's not there, and there is no single JSON schema with the tooling depth and breadth of XSD. Protobuf: nice but unreadable for humans. Might as well use CORBA or ASN.1 encoding.

- say you have an incoming data document. Say you need to programmatically read it / scan it / extract from it (think: cli and pipes).

You want to access this data for any of dozens of reasons. Graphs, logs, data points, transformations, data feeds, whatever.

I can do that task in JSON and YAML 10000% faster than with XML. With XML, you may have a schema (hope the document matches the putative schema!). Oh, the schema is an HTTP reference? Hope that still exists out there, the internet never breaks links. If you don't have a schema, well shit, is this tag beginning a list or a "subdocument"? Am I REALLY using the DOM API to step through nodes and attributes and CDATA? Guess I have to. There goes a day of coding.

Oh, in JSON and YAML, it's ONE LINE OF CODE to get it into something I can easily read, manipulate, analyze?
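
Roughly what I mean, as a sketch with Python's stdlib. Both documents and their field names are invented for the comparison, and ElementTree stands in for "some XML API", so treat it as an illustration rather than the only way to do it.

```python
import json
import xml.etree.ElementTree as ET

# Two hypothetical encodings of the same data.
JSON_DOC = '{"host": "db1", "ports": [5432, 5433]}'
XML_DOC = (
    "<server><host>db1</host>"
    "<ports><port>5432</port><port>5433</port></ports></server>"
)

# JSON: one call, and the result is already dicts, lists, strings and ints.
cfg = json.loads(JSON_DOC)

# XML: parsing is one call too, but the result is an element tree. The caller
# has to decide that <ports> is a list and that its children are integers.
root = ET.fromstring(XML_DOC)
xml_cfg = {
    "host": root.findtext("host"),
    "ports": [int(p.text) for p in root.iterfind("ports/port")],
}

print(cfg == xml_cfg)
```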

- say you have an upgrade program. It just needs to read in the old config file, rename some keys, add some new default values, etc. JSON/YAML? I can do that in stupid-simple code. XML? Well, I better hope there exists a library that loads this shit for me in my preferred language, or otherwise lots of fun with DOM. I forget, can I use regex to parse XML? (that is a joke)
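
A sketch of the "stupid-simple" version for JSON. The key names and the RENAMES / NEW_DEFAULTS tables are hypothetical; a real migration would carry its own.

```python
import json

# Hypothetical migration tables: old key -> new key, plus defaults
# introduced by the new version.
RENAMES = {"db_host": "database_host", "db_port": "database_port"}
NEW_DEFAULTS = {"timeout_s": 30}


def upgrade_config(old_text):
    old = json.loads(old_text)
    # Rename known keys; leave everything else untouched.
    new = {RENAMES.get(key, key): value for key, value in old.items()}
    # Only fill in defaults the user has not already set.
    for key, default in NEW_DEFAULTS.items():
        new.setdefault(key, default)
    return json.dumps(new, indent=2, sort_keys=True)


print(upgrade_config('{"db_host": "db1", "db_port": 5432, "debug": true}'))
```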

- say I want to serialize an object graph pretty quickly for over the wire between languages. Do I want to write a complete XML mapping in two languages, or just do the one-line serialize, one-line deserialize? Yeah.

- say I want my config files to be somewhat extension friendly for plugins / extensions. XML parsing code? Yeah, that will be a ton of custom code. YAML/JSON deserializing to a map/dictionary? Oh, look at that, extension friendly code. Allow them to specify whatever json/yaml struct in their plugin section and pass it to the extension.
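
Sketch of that plugin-section idea, stdlib only. The config layout, the plugin names, and the Thumbnailer class are all made up; the host just hands each plugin its own dict without interpreting it.

```python
import json

# Hypothetical config file with a free-form section per plugin.
CONFIG = """
{
  "app": {"name": "demo"},
  "plugins": {
    "thumbnailer": {"max_px": 256, "formats": ["png", "jpeg"]},
    "notifier": {"channel": "#ops"}
  }
}
"""


def plugin_sections(config_text):
    """Return each plugin's raw section; the host never looks inside."""
    return json.loads(config_text).get("plugins", {})


class Thumbnailer:
    # Hypothetical plugin: it alone knows what its keys mean.
    def __init__(self, settings):
        self.max_px = settings.get("max_px", 128)
        self.formats = settings.get("formats", ["png"])


sections = plugin_sections(CONFIG)
plugin = Thumbnailer(sections["thumbnailer"])
print(plugin.max_px, plugin.formats)
```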

This stuff happens with such frequency that I never, ever think "man I wish this was XML".

Do YAML/JSON have some issues? Do I wish XPath and some XML features had JSON equivalents? Sure ... very occasionally. Actually, never.

Where to begin... If the schema is not there you are in the same position as with receiving some JSON or YAML: a bad one. Is 20230806 an integer or someone's idea of sending a day? Parsing the data into a memory structure is a one-liner in any language. Assigning meaning, however... Object graphs between languages as you describe work only in a very small number of cases. Send your data to a mainframe and see it disappear faster than you can send it. Extensions are where XML shines: use an extension namespace. Hell, use a namespace per plugin. Unknown namespaces are normally ignored during deserialization if you use a schema, so no line of code needed at all. All the issues you describe boil down, in my book, to: I don't know how to do this properly with XML, so XML is bad.
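
A sketch of the namespace-per-plugin idea using Python's stdlib ElementTree. The namespace URIs and element names are made-up identifiers, and the stdlib does no XSD validation, so here the host does the "ignore what isn't mine" step by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one namespace for the app, one per plugin.
DOC = """
<config xmlns="urn:example:app"
        xmlns:thumb="urn:example:plugin:thumbnailer"
        xmlns:notify="urn:example:plugin:notifier">
  <name>demo</name>
  <thumb:settings max_px="256"/>
  <notify:settings channel="#ops"/>
</config>
"""

APP_NS = "urn:example:app"


def split_by_namespace(xml_text):
    """Separate the host's own elements from extension elements."""
    root = ET.fromstring(xml_text)
    core, extensions = [], {}
    for child in root:
        # ElementTree spells qualified names as "{namespace-uri}localname".
        ns, _, local = child.tag[1:].partition("}")
        if ns == APP_NS:
            core.append(local)
        else:
            # Not ours: keep it aside, keyed by namespace, for the plugin.
            extensions.setdefault(ns, []).append(child)
    return core, extensions


core, extensions = split_by_namespace(DOC)
print(core, sorted(extensions))
```

Each plugin can then be handed only the elements in its own namespace, and a host that knows nothing about a plugin simply never touches them.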

I think there's a happy middle ground that I'm shocked doesn't exist (or is obscure as hell, because I've never found it in the wild): a "default" schema that is compatible with JSON types and can be serialized the same way JSON is. Whenever this comes up, the on-disk format seems pretty irrelevant; people just want an xml.load that maps to their language primitives in a sane way.

There's this kind of stuff but it's niche and more convention than actual schema.

https://www.xml.com/pub/a/2006/05/31/converting-between-xml-...

https://untangle.readthedocs.io/en/latest/

https://github.com/martinblech/xmltodict

I think there could be, but it's ugly. The XStream library wasn't horrid, but it basically provides a way to serialize/deserialize from the java object defs, which is still a schema of sorts.

You know <key name="blah">key value</key>. It just highlights the extra verbiage, and pushes people towards "just use json/yaml".
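
For illustration, a rough sketch of what such a "default" mapping could look like on top of stdlib ElementTree. The conventions here ("@name" for attributes, "#text" for mixed text, repeated tags become lists) are my assumption of one possible convention, not a standard and not the API of any of the libraries linked above.

```python
import xml.etree.ElementTree as ET


def to_primitive(elem):
    """Map an element to str/dict/list using one ad hoc convention."""
    children = list(elem)
    text = (elem.text or "").strip()
    # A leaf with no attributes collapses to its text.
    if not children and not elem.attrib:
        return text
    # Attributes become "@name" keys.
    out = {"@" + key: value for key, value in elem.attrib.items()}
    for child in children:
        value = to_primitive(child)
        if child.tag in out:
            # A repeated tag turns the entry into a list.
            if not isinstance(out[child.tag], list):
                out[child.tag] = [out[child.tag]]
            out[child.tag].append(value)
        else:
            out[child.tag] = value
    if text:
        out["#text"] = text
    return out


def load(xml_text):
    root = ET.fromstring(xml_text)
    return {root.tag: to_primitive(root)}


# Hypothetical input, reusing the <key name="blah"> shape from above.
SAMPLE = '<cfg><key name="blah">key value</key><port>1</port><port>2</port></cfg>'
print(load(SAMPLE))
```

Note the two things this can't fix without a real schema: every value is still a string, and a single <port> comes back as a scalar while two come back as a list.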