| HN Mirror


by leptons 86 days ago

It can be, but $500k/year is absurd. It's as if they went from the most inefficient system imaginable to an ordinary one that an average programmer could manage.

I have no idea if they are doing orders of magnitude more processing, but I regularly crunch through 60GB of JSON data in about 3000 files on my local 20-thread machine, using Node.js workers to do deep and sometimes complicated queries and data manipulation. It's not exactly lightning fast, but it's free and it gets through any task in 3 or 4 minutes at most.
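For readers who haven't used Node.js workers this way, here is a minimal sketch of the general pattern, not the actual setup described above. The function names (`chunkRoundRobin`, `runOnWorkers`) and the per-file "query" (counting top-level keys) are hypothetical placeholders, and it assumes each file parses to a JSON object.

```javascript
// Sketch: fan a list of JSON files out across worker threads.
const { Worker } = require('node:worker_threads');
const os = require('node:os');

// Worker source kept inline (eval: true) so the sketch stays in one file.
// Placeholder query: count the top-level keys of each file's root object.
const workerSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  const fs = require('node:fs');
  const results = workerData.files.map((file) => {
    const data = JSON.parse(fs.readFileSync(file, 'utf8'));
    return { file, keys: Object.keys(data).length };
  });
  parentPort.postMessage(results);
`;

// Deal files round-robin into at most `n` chunks (never more chunks than files).
function chunkRoundRobin(files, n) {
  const count = Math.max(1, Math.min(n, files.length));
  const chunks = Array.from({ length: count }, () => []);
  files.forEach((file, i) => chunks[i % count].push(file));
  return chunks;
}

// Start one worker per chunk and resolve with the flattened results.
function runOnWorkers(files, workerCount = os.cpus().length) {
  const jobs = chunkRoundRobin(files, workerCount).map(
    (chunk) =>
      new Promise((resolve, reject) => {
        const worker = new Worker(workerSource, {
          eval: true,
          workerData: { files: chunk },
        });
        worker.once('message', resolve);
        worker.once('error', reject);
        worker.once('exit', (code) => {
          if (code !== 0) reject(new Error(`worker exited with code ${code}`));
        });
      })
  );
  return Promise.all(jobs).then((parts) => parts.flat());
}
```

Usage would look something like `runOnWorkers(listOfPaths).then(console.log)`. Round-robin is the simplest split; if file sizes vary a lot, a shared queue that hands each idle worker the next file would probably balance the load better.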

The main cost is downloading the compressed files from S3, but if I really wanted to I could process it all in AWS. It also could go much faster on better hardware. If I have a really big task I want done quickly, I can start up dozens or hundreds of EC2 instances to run the task, and it would take practically no time at all... seconds. Still has to be cheaper than what they were doing.


makapuf 84 days ago

Curious about the workload, but as I'm trying to make a tool for JSON: what are those files compressed with? What is the size of the average file? What is their structure (NDJSON? A dict with some huge data structure a few levels deep?)


leptons 84 days ago

In S3 the JSON is stored in plain old .zip files. While downloading to my local machine, the files are unzipped to plain old JSON. Each file is basically an object containing tons of data about one of the websites I manage, including all the fragments of HTML and metadata used on the site. It can get quite large; some sites have thousands of pages. We often need to find things stored many levels deep in the JSON that can be tricky to locate: there usually isn't a specific path, and lots of iterable arrays and objects are involved. The files range from ~20MB to ~400MB, depending on how much content each site has. And we have ~9000 total sites.
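To illustrate the "no specific path" kind of lookup, here is a minimal sketch of a generic deep search over a parsed JSON document. It is an assumption about how such a query could be written, not the code actually used; `findDeep` and the field names in the usage example (`pages`, `blocks`, `html`) are hypothetical.

```javascript
// Walk every nested value in a parsed JSON document and yield each value
// that satisfies `predicate`, together with the path that leads to it.
// Uses an explicit stack instead of recursion so very deep documents
// cannot overflow the call stack.
function* findDeep(root, predicate) {
  const stack = [{ value: root, path: [] }];
  while (stack.length > 0) {
    const { value, path } = stack.pop();
    if (predicate(value, path)) yield { path, value };
    if (value !== null && typeof value === 'object') {
      // Object.entries covers arrays too (keys are the indices as strings).
      const entries = Object.entries(value);
      // Push in reverse so values are visited in document order.
      for (let i = entries.length - 1; i >= 0; i--) {
        const [key, child] = entries[i];
        stack.push({ value: child, path: [...path, key] });
      }
    }
  }
}

// Usage: find every string containing a <p> tag, wherever it lives.
const site = {
  pages: [
    { title: 'Home', html: '<h1>Hi</h1>' },
    { title: 'About', blocks: [{ html: '<p>Contact us</p>' }] },
  ],
};
const hits = [
  ...findDeep(site, (v) => typeof v === 'string' && v.includes('<p>')),
];
```

Because it is a generator, a caller can stop at the first match instead of walking a whole 400MB document.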
