| HN Mirror


	by polotics 1360 days ago
	Word! Storing JSON is so often the most direct and explicit way of accruing technical debt: "We don't really know what structure the data we'll get should have, just specify that it's going to be JSON"...

3 comments

gtowey 1360 days ago

I like to say that when you try to make a "schemaless" database, you've just made 1000 different schemas instead.

Gh0stRAT 1360 days ago

Yeah, "Schemaless" is a total misnomer. You either have "schema-on-write" or "schema-on-read".

layer8 1360 days ago

Schemaless means there’s no assurance that the stored data matches any consistent schema. You may try to apply a schema on read, but you don’t know if the data being read will match it.
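
To make that concrete, here is a minimal schema-on-read sketch. The schema (`user_id`, `event`) and the function name `read_with_schema` are hypothetical, invented only for illustration; the point is that the reader can report mismatches but cannot prevent them, because nothing was enforced at write time.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
EXPECTED = {"user_id": int, "event": str}


def read_with_schema(raw, schema=EXPECTED):
    """Parse one stored JSON document and return (document, problems).

    An empty problems list means the document happened to match the schema.
    Nothing guaranteed that it would.
    """
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return None, ["top-level value is not an object"]

    problems = []
    for field, expected_type in schema.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected_type):
            actual = type(doc[field]).__name__
            problems.append(f"wrong type for {field}: {actual}")
    return doc, problems
```

Every reader of the data has to carry a check like this (or silently assume it passes), which is where the "1000 different schemas" end up living.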

TickleSteve 1360 days ago

"schema in code" covers all bases.

stingraycharles 1360 days ago

But if you’re not storing data as JSON, can you really say you’re agile? /s

weego 1360 days ago

Look, we'll just get it in this way for now; once it's live we'll have all the time we need to change the schema in the background

stingraycharles 1360 days ago

We don’t have a use case yet, but let’s just collect all the data and figure out what to do with it later!

It’s funny how these cliches repeat everywhere in the industry, and it’s almost impossible for people to figure this out beforehand. It seems like everyone needs to deal with data lakes (at scale) at least once in their life before they truly appreciate the costs of the flexibility they offer.

beckingz 1360 days ago

The Data Exhaust approach is simultaneously bad and justifiable. You should measure what matters and think about what you want to measure and why before collecting data. On the other hand, collecting data in case what you want to measure changes later is usually a lowish-cost way of maybe already having the right data when you need it.

stingraycharles 1360 days ago

Oh I agree, that's why I was careful to put "at scale" in there -- these types of approaches are typically good when you're still trying to understand your problem domain, and have not yet hit production scale.

But I've met many a customer that's spending seven figures a year on data that they have yet to extract value from. The rationale is typically "we don't know yet what parameters are important to the model we come up with later", but even then, you could do better than storing everything in plaintext JSON on S3.

kevindong 1360 days ago

You can't realistically expect every log format to get a custom schema declared for it prior to deployment.

xwolfi 1360 days ago

If you never intend to monitor them systematically, absolutely!

If you're a bit serious you can at least impose date, time to the millisecond, pointer to the source of the log line, level, and a message. Let's be crazy and even say the message could have a structure too, but I can feel the weight of effort on your shoulders and say you've already saved yourself the embarrassment a colleague of mine faced when he realized he couldn't give me millisecond timestamps, which made a latency calculation on past logs impossible.
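
A rough sketch of that minimum, assuming Python and a made-up `LogRecord` type (the names `LogRecord`, `make_record` and `latency_ms` are hypothetical, not from any logging library). The message stays free-form; only the envelope around it is fixed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class LogRecord:
    timestamp: str  # ISO 8601, UTC, millisecond precision
    source: str     # pointer to the emitting code, e.g. "orders.py:42"
    level: str      # e.g. "INFO", "ERROR"
    message: str    # free-form; no schema imposed here


def make_record(source, level, message, now=None):
    """Build a record, stamping it with the current UTC time by default."""
    if now is None:
        now = datetime.now(timezone.utc)
    # timespec="milliseconds" truncates the fractional part to three digits.
    stamp = now.isoformat(timespec="milliseconds")
    return LogRecord(timestamp=stamp, source=source, level=level, message=message)


def latency_ms(start, end):
    """Milliseconds elapsed between two records, from their timestamps alone."""
    delta = datetime.fromisoformat(end.timestamp) - datetime.fromisoformat(start.timestamp)
    return delta / timedelta(milliseconds=1)
```

With second-resolution timestamps, `latency_ms` could only ever return multiples of 1000, which is the situation the colleague ended up in.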

kevindong 1360 days ago

Sorry if I was ambiguous before. When I said "log format", I was referring to the message part of the log line. Standardized timestamp, line in the source code that emitted the log line, and level are the bare minimum for all logging.

Keeping the format of the message part of the log line in sync with some external store is deviously difficult, particularly when the interesting parts of the log are the dynamic portions that can take on multiple shapes.
