


	by amelius 4084 days ago
	Computation-intensive tasks often take large amounts of data as input, and sharing data with a worker always means serializing that data into a message. For large inputs this approach doesn't work, because the main thread would block the CPU while serializing the messages. But my biggest problem with workers is that they don't have an event loop, so I can't share asynchronous code between the main thread and the workers.


NathanKP 4083 days ago

There is no need to serialize large amounts of data. Node is designed for you to use a stream instead. For example, let's say you have a multi-TB data dump in Amazon S3, you want to process it, and then upload a transformed multi-TB result set back to Amazon S3. (This is something I've worked on before.)

The way it works is you open a download stream from S3, pipe it into a Node.js transform stream, and then pipe that stream into an upload stream that uploads the data back to S3 using the multipart upload API.

The Node.js design is very much like using Unix pipes. You can pipe a huge multi-TB file through grep without blocking anything. The data just streams from disk into the grep process, grep filters it down to the lines that match, and then streams the results onto the screen.

Computation on huge streams in Node.js works the same way. Your event loop remains unblocked even when operating on a stream terabytes in size, because you are only ever touching a portion of the dataset at a time. Additionally, if you do it properly, your overall memory usage remains low, as you are exporting the data back out of the machine as fast as it comes in. I've used this technique to process streaming data many GB in size while keeping the Node process under 200 MB of memory used from the system's perspective.
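
To make the download, transform, upload chain concrete, here is a minimal sketch using only Node's built-in `stream` module. The S3 ends are replaced by stand-ins: `fakeDownload`, `createUppercaseTransform`, `createCountingSink`, and `runPipeline` are hypothetical names invented for this illustration, not part of any S3 library, and the uppercase step is just a placeholder for real processing.

```javascript
const { Readable, Transform, Writable, pipeline } = require('stream');

// Hypothetical stand-in for an S3 download stream: a generator that
// produces chunks lazily, so the full dataset never sits in memory.
function* fakeDownload(chunkCount) {
  for (let i = 0; i < chunkCount; i++) {
    yield `record-${i}\n`;
  }
}

// The transform step only ever sees one chunk at a time.
function createUppercaseTransform() {
  return new Transform({
    transform(chunk, encoding, callback) {
      callback(null, chunk.toString().toUpperCase());
    },
  });
}

// Hypothetical stand-in for an upload stream: it only counts what arrives.
function createCountingSink(stats) {
  return new Writable({
    write(chunk, encoding, callback) {
      stats.chunks += 1;
      stats.bytes += chunk.length;
      callback(); // signals readiness for the next chunk (backpressure)
    },
  });
}

// Wires source -> transform -> sink and resolves with the sink's counters.
function runPipeline(chunkCount) {
  const stats = { chunks: 0, bytes: 0 };
  return new Promise((resolve, reject) => {
    pipeline(
      Readable.from(fakeDownload(chunkCount)),
      createUppercaseTransform(),
      createCountingSink(stats),
      (err) => (err ? reject(err) : resolve(stats))
    );
  });
}
```

In a real job the source and sink would be the S3 download and multipart upload streams; the middle of the pipeline would look the same, which is the point of the design.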

Recommended reading: https://nodejs.org/api/stream.html

Here is an example of an upload stream that I created for the use case of processing a large multi-TB data set and piping the result up to Amazon S3: https://www.npmjs.com/package/s3-upload-stream


amelius 4083 days ago

For streaming, I can see that this can work.

But basically, what I wanted to do is implement a module that works as an index shared between threads (e.g., a search tree for fast lookup). However, since in Node.js each thread runs in a separate process, it is (afaict) impossible to make this efficient, as processes do not share data.


NathanKP 4083 days ago

So in Node.js this would be accomplished by using a shared data store like Redis. For example, I run eight processes per c3.xlarge instance, and the instances share a Redis server which contains data like that. Indexes in particular could be stored in the Redis hash structure.
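
As a rough sketch of what an index kept in a Redis hash could look like from one of those processes: the helper below takes any client object that exposes `hSet` and `hGet`, which is an assumption modeled on the command names in version 4 of the `redis` npm package. `createSharedIndex` and `FakeHashClient` are hypothetical names for this illustration; the fake client is an in-memory stand-in so the example runs without a server.

```javascript
// Hypothetical in-memory stand-in exposing the two hash commands used
// below. A real deployment would pass a connected Redis client instead.
class FakeHashClient {
  constructor() {
    this.hashes = new Map();
  }

  async hSet(key, field, value) {
    if (!this.hashes.has(key)) this.hashes.set(key, new Map());
    const hash = this.hashes.get(key);
    const isNew = !hash.has(field);
    hash.set(field, String(value));
    return isNew ? 1 : 0; // number of fields newly created
  }

  async hGet(key, field) {
    const hash = this.hashes.get(key);
    return hash && hash.has(field) ? hash.get(field) : null;
  }
}

// Stores each index entry as one field of a single hash, so every
// process that talks to the same store sees the same index.
function createSharedIndex(client, hashKey) {
  return {
    put(term, location) {
      return client.hSet(hashKey, term, JSON.stringify(location));
    },
    async get(term) {
      const raw = await client.hGet(hashKey, term);
      return raw === null || raw === undefined ? null : JSON.parse(raw);
    },
  };
}
```

Every lookup is a network round trip rather than a pointer dereference, so this trades the raw speed of an in-process structure for the ability to share it across processes and machines.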

Basically, Node.js is designed around the concept of microservices and separation of concerns. Rather than doing everything in one giant, multithreaded, monolithic process, you break your service up into loosely coupled components that talk to each other via messaging and share common datastores. Some people really like this pattern (I'm a strong advocate of it myself) because it scales really, really well.


amelius 4083 days ago

Well, the "index" was merely an example. Actually, what I want to do is implement persistent data structures (a.k.a. functional or immutable data structures) in a combination of JavaScript and C++. See [1]

[1] http://en.wikipedia.org/wiki/Persistent_data_structure


warfangle 4083 days ago

The `servicebus` module is a really cool way to coordinate events between microservices, especially if they don't necessarily "know" about each other.


sigzero 4083 days ago

It sounds like Node.js is not what you want, then.
