HN Mirror


by exhaze 415 days ago

Food for thought, a snippet from a highly specialized project I created two months ago:

https://gist.github.com/eugene-yaroslavtsev/c9ce9ba66a7141c5...

I spent several hours searching online for existing solutions - couldn't find anything (even when exploring the idea of stitching together multiple different tools, each in a different programming language).

This took me ~3-4 hours end-to-end. I haven't seen any other OSS code that can convert unstructured JSON into normalized, structured JSON with a schema, while also using a statistical-sampling sliding-window method for handling all of these:

- speculative SIMD prediction of end of current JSON entry
- distinguishing whether two "similar" looking objects represent the same model or not
- normalizing entities based on how often they're referenced
- ~5-6 GB/s throughput on a Macbook M4 Max 24GB
- arbitrary horizontal scaling (though shared entity/normalization resource contention may eventually become an issue)
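To make the "same model or not" item concrete, here is a minimal sketch of one common heuristic: compare the two objects' top-level key sets with Jaccard similarity. This is an assumption for illustration only. The gist's actual approach is not described in this thread, and `keySimilarity`, `looksLikeSameModel`, and the 0.8 threshold are hypothetical names and values.

```typescript
// Hypothetical helpers: none of these names come from the gist.
type JsonObject = Record<string, unknown>;

// Jaccard similarity of the top-level key sets: |A ∩ B| / |A ∪ B|.
function keySimilarity(a: JsonObject, b: JsonObject): number {
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  // Two empty objects are treated as identical in shape.
  if (keysA.length === 0 && keysB.length === 0) return 1;
  const shared = keysA.filter((key) =>
    Object.prototype.hasOwnProperty.call(b, key)
  ).length;
  return shared / (keysA.length + keysB.length - shared);
}

// The default threshold is an arbitrary illustrative value, not a tuned one.
function looksLikeSameModel(
  a: JsonObject,
  b: JsonObject,
  threshold: number = 0.8
): boolean {
  return keySimilarity(a, b) >= threshold;
}

const userA: JsonObject = { id: 1, name: "Ada", email: "ada@example.com" };
const userB: JsonObject = { id: 2, name: "Bob", email: "bob@example.com", phone: "555" };
const order: JsonObject = { id: 9, total: 12.5, items: [] };

console.log(keySimilarity(userA, userB)); // 3 shared keys out of 4 distinct
console.log(looksLikeSameModel(userA, order)); // only "id" is shared
```

Key overlap alone ignores value types and nesting, so a real pipeline would likely combine it with other signals, which matches the post's remark that several approaches were used.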

I didn't write this code. I didn't even come up with all of these ideas in this implementation. I initially just thought "2NF"/"BNF" probably good, right? Not for multi-TB files.

This was spec'd out by chatting with Sonnet for ~1.5 hours. It was the one that suggested statistical normalization. It suggested using several approaches for determining whether two objects are the same schema (that + normalization were where most of the complexity decided to live).

I did this all on my phone. With my voice.

I hope more folks realize this is possible. I strongly encourage you and others to reconsider this assumption!

1 comment

victorNicollet 415 days ago

The snippet you shared is consistent with the kind of output I have also been seeing out of LLMs: it looks correct overall, but contains mistakes and code quality problems, both of which would need human intervention to fix.

For example, why is the root object's entityType being passed to the recursive mergeEntities call, instead of extracting the field type from the propSchema?
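A rough sketch of the fix this question points at, assuming a schema where each property may name the entity type it holds. The shapes below (`PropSchema`, `EntitySchema`, `SchemaRegistry`) and this `mergeEntities` signature are hypothetical, since the gist's real types are not shown here. The point is only that the recursive call receives the nested field's type rather than the parent's.

```typescript
// Hypothetical shapes: the gist's real types are not shown in the thread.
type Entity = { [key: string]: unknown };
interface PropSchema {
  entityType?: string; // set when the property holds a nested entity
}
interface EntitySchema {
  props: Record<string, PropSchema>;
}
type SchemaRegistry = Record<string, EntitySchema>;

function isEntity(value: unknown): value is Entity {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function mergeEntities(
  target: Entity,
  source: Entity,
  entityType: string,
  schemas: SchemaRegistry
): Entity {
  const result: Entity = { ...target };
  const schema = schemas[entityType];
  for (const key of Object.keys(source)) {
    const incoming = source[key];
    const existing = result[key];
    // Look up the type of this field from its own property schema.
    const nestedType = schema?.props[key]?.entityType;
    if (nestedType !== undefined && isEntity(existing) && isEntity(incoming)) {
      // Recurse with the field's type, not the parent's entityType.
      result[key] = mergeEntities(existing, incoming, nestedType, schemas);
    } else if (incoming !== null && incoming !== undefined) {
      result[key] = incoming;
    }
  }
  return result;
}

const schemas: SchemaRegistry = {
  User: { props: { name: {}, address: { entityType: "Address" } } },
  Address: { props: { city: {}, geo: { entityType: "Geo" } } },
  Geo: { props: { lat: {}, lng: {} } },
};

const merged = mergeEntities(
  { name: "Ada", address: { city: "Paris", geo: { lat: 1 } } },
  { address: { geo: { lng: 2 } } },
  "User",
  schemas
);
console.log(JSON.stringify(merged));
```

If `"User"` were passed down unchanged, the `geo` lookup at the second level would consult the `User` schema, find no nested type, and overwrite `geo` instead of merging it.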

Several uses of `as` (as well as repeated `result[key] === null` tests) could be eliminated by assigning `result[key]` to a named variable.
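As an illustration of that refactor (not the gist's actual code; `Row`, `appendTagBefore`, and `appendTag` are made-up names), compare a version that re-reads `result[key]` and casts with one that binds the value once and lets TypeScript narrow the local variable:

```typescript
type Row = Record<string, unknown>;

// Before (hypothetical): repeated lookups of result[key] plus an `as` cast.
function appendTagBefore(result: Row, key: string, tag: string): void {
  if (result[key] === null || result[key] === undefined) {
    result[key] = [tag];
  } else if (Array.isArray(result[key])) {
    (result[key] as unknown[]).push(tag);
  }
}

// After: one lookup, one named variable, and no cast.
function appendTag(result: Row, key: string, tag: string): void {
  const current = result[key];
  if (current === null || current === undefined) {
    result[key] = [tag];
  } else if (Array.isArray(current)) {
    // Array.isArray narrows `current` to an array type here.
    current.push(tag);
  }
}

const before: Row = { tags: null };
appendTagBefore(before, "tags", "x");
appendTagBefore(before, "tags", "y");

const after: Row = { tags: null };
appendTag(after, "tags", "x");
appendTag(after, "tags", "y");

console.log(JSON.stringify(before), JSON.stringify(after));
```

Both versions behave the same; the second just gives the type checker a stable binding to narrow, so the null test is written once and the cast disappears.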

Yes, it's amazing that LLMs have reached the level where they can produce almost-correct, almost-clean code. The question remains of whether making it correct and clean takes longer than writing it by hand.
