by amelius
4084 days ago
Computation-intensive tasks often take large amounts of data as input, and sharing data with a worker always has to be done by serializing that data into a message. So for large inputs this approach doesn't work: the main thread would tie up the CPU while serializing the messages. But my biggest problem with workers is that they don't have an event loop, so I can't share asynchronous code between the main thread and the workers.
|
The way it works is you open a download stream from S3, pipe it into a Node.js transform stream, and then pipe that stream into an upload stream that uploads the data back to S3 using the multipart upload API.
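To make the shape of that pipeline concrete, here is a minimal sketch using only Node's built-in stream module. The `createDownloadStream` and `createUploadStream` helpers are hypothetical stand-ins I made up for the S3 download and multipart upload streams (they just read from an array and collect chunks in memory), and the upper-casing transform is an arbitrary placeholder for whatever processing you need. Only the wiring is the point here.

```javascript
const { Readable, Transform, Writable } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// Hypothetical stand-in for an S3 download stream: emits the given chunks.
function createDownloadStream(chunks) {
  return Readable.from(chunks);
}

// Transform stage: upper-cases each chunk as it passes through.
// Placeholder for the real per-chunk processing.
function createUppercaseTransform() {
  return new Transform({
    transform(chunk, _encoding, callback) {
      callback(null, chunk.toString().toUpperCase());
    },
  });
}

// Hypothetical stand-in for a multipart upload stream: collects each chunk
// it receives as a "part" instead of sending it anywhere.
function createUploadStream(parts) {
  return new Writable({
    write(chunk, _encoding, callback) {
      parts.push(chunk.toString());
      callback();
    },
  });
}

// Wires download -> transform -> upload and resolves when all data has
// been flushed through, or rejects if any stage fails.
async function processObject(chunks) {
  const parts = [];
  await pipeline(
    createDownloadStream(chunks),
    createUppercaseTransform(),
    createUploadStream(parts),
  );
  return parts;
}

module.exports = { processObject };
```

I used `pipeline` rather than chained `.pipe()` calls because it propagates errors from every stage and cleans up the other streams when one fails. With real S3 streams you would swap the two stand-ins and keep the middle untouched.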
The Node.js design is very much like using Unix pipes. You can pipe a huge multi-terabyte file through grep without blocking anything. The data just streams from disk into the grep process, grep filters it down to the lines that match, and the results stream onto the screen.
Computation on huge streams in Node.js works the same way. Your event loop remains unblocked even when operating on a stream terabytes in size, because you are only ever touching a portion of the dataset at a time. Additionally, if you do it properly, your overall memory usage remains low, since you are exporting the data back out of the machine as fast as it comes in. I've used this technique to process streaming data many GB in size while keeping the Node process under 200 MB of memory, as measured from the system's perspective.
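A rough way to see this for yourself is the sketch below, again with built-in modules only. `generateChunks` is a hypothetical source I invented that produces chunks lazily, with a 1 ms pause between them to imitate data arriving from disk or the network. The sink only ever holds one chunk at a time, and a timer running alongside keeps firing, which it could not do if the stream work were blocking the event loop.

```javascript
const { Readable, Writable } = require('node:stream');
const { pipeline } = require('node:stream/promises');
const { setTimeout: sleep } = require('node:timers/promises');

// Hypothetical source: yields `count` chunks of `size` bytes lazily, pausing
// briefly between them to imitate data arriving from disk or the network.
async function* generateChunks(count, size) {
  for (let i = 0; i < count; i++) {
    await sleep(1);
    yield Buffer.alloc(size, 0x61); // chunk filled with the byte for "a"
  }
}

// Streams all chunks through a byte-counting sink while a timer runs
// alongside. Returns the byte total and how many times the timer fired.
async function countBytes(count, size) {
  let totalBytes = 0;
  let timerTicks = 0;

  // This callback can only run when the event loop is free between chunks.
  const timer = setInterval(() => {
    timerTicks++;
  }, 1);

  const sink = new Writable({
    write(chunk, _encoding, callback) {
      totalBytes += chunk.length; // only the current chunk is in hand
      callback();
    },
  });

  try {
    await pipeline(Readable.from(generateChunks(count, size)), sink);
  } finally {
    clearInterval(timer);
  }
  return { totalBytes, timerTicks };
}

module.exports = { countBytes };
```

The chunk count and size are just parameters, so the dataset never exists in memory as a whole. One caveat on my own example: if a single chunk needs heavy synchronous CPU work, that work still blocks the loop for its duration, so this pattern fits I/O-bound processing best.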
Recommended reading: https://nodejs.org/api/stream.html
Here is an example of an upload stream that I created for the use case of processing a large multi-terabyte data set and piping the result up to Amazon S3: https://www.npmjs.com/package/s3-upload-stream