Hacker News
by amelius 4084 days ago
I recently got interested in nodejs. However, then I discovered that:

1. It doesn't support threads (facilitating structural sharing of large data-structures between parallel tasks, which cannot be done using ordinary processes).

2. The module-loading mechanism ("require()") natively doesn't support delayed loading, which is needed when loading from within a browser. Yes, there is the "browserify" package, but, come on, something as basic as this should be supported out of the box. Especially considering the fact that there is an "http" module hardwired inside nodejs (why isn't this a separate npm module, btw?)

3. To make my own privately held modules and install them properly, I have to run an npm server? This seems like an awful lot of work for something as basic as this. Ok, so now I can use the cloud for this, but come on, I should be able to do this just from within the filesystem, like e.g. git does it.

For people interested, one can use the package "sinopia" for hosting your own private modules. It seems to be a pretty decent package, but be aware that the authentication settings out of the box are completely insecure.

3 comments

1. Node.js is (for the most part) single-threaded; that's its draw. It's not trying to be a Swiss Army knife, and if your use case requires a threaded language then Node.js certainly isn't the tool for that job. But it might find a useful place in your toolbox for other tasks.

2. require() is part of the CommonJS spec, and how it physically works is dependent on the implementation. You point out that Node's implementation doesn't work well in the browser, but Node itself does not work in the browser so that point is moot. I agree that it might be interesting to load remote modules in Node, but keeping that operation synchronous does simplify the language quite a bit.

3. You can also map modules to public or private git repositories in the package.json, as long as the private key used during npm install has access. If the git repo has tags, a tag can be specified in the git uri as well. Private npm repos are the superior way to distribute private modules with wider access, but I think this is handled fairly cleanly already.

Thanks for your comments. Some remarks here.

1. Node.js is a tool for building servers. On a server you generally cannot afford to have the event loop blocked by a computationally intensive task. You need threads.

2. It would only require a "promise" to make the module-loading asynchronous. Leaving that out is not what I would call "quite a bit of a simplification", especially if using asynchronous callbacks is the "modus operandi" of programming on the Node.js platform itself.

3. Okay, I stand corrected. I remember that I waded through the documentation quite a bit though, trying to figure this out.
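
To illustrate point 2 above, here is a rough sketch of what a promise-shaped loading interface could look like. `requireAsync` is a hypothetical helper, not part of Node.js, and the loading underneath is still the ordinary synchronous `require()`; only the calling convention is asynchronous.

```javascript
// Hypothetical helper: wraps the synchronous require() in a promise.
// This only changes the interface; it does not make loading non-blocking.
function requireAsync(name) {
  return new Promise((resolve, reject) => {
    // Defer to a later turn of the event loop so callers always
    // observe asynchronous behaviour.
    setImmediate(() => {
      try {
        resolve(require(name));
      } catch (err) {
        reject(err);
      }
    });
  });
}

// Usage: load a built-in module through the promise interface.
requireAsync('path').then((path) => {
  console.log(path.basename('/tmp/example.txt'));
});
```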

> On a server you generally cannot afford to have the event loop blocked by a computationally intensive task.

You are not supposed to use your main event loop for computationally intensive tasks.

Offload those tasks to separate workers and use queues.

That's basic Node knowledge. It's a trade-off that you're supposed to be aware of when using Node.

The problem with workers is that they don't have an event-loop (like the main thread). So it is not possible to use asynchronous code written for the main thread in those worker threads, which is of course quite limiting.

EDIT: I mean workers which run in a thread (as opposed to in a process). An example is given by the webworker-threads npm module. Threads allow one to structurally share large data-structures, so one does not have to serialize them when calling a worker (serializing large structures would block the main thread).

Sorry you are getting a lot of downvotes. For what it is worth I don't think you deserve them, as your comments just show inexperience and lack of understanding of Node.js, and aren't trolling. However, I think you would be well served by doing some research into what Node.js is and how it works. Basically every Node.js process has an event loop. Your workers have an event loop just like your servers do.

Here is how a typical node stack works:

Nginx load balancer talks to a cluster of node server processes, one per core. The server processes handle all incoming web requests that won't block the event loop. On a typical REST server this is 99% of your tasks, and each node process can handle thousands of concurrent requests due to the way that the event loop works.

If there is a heavy, blocking task like processing an image or PDF file (although even these things should be doable in a nonblocking, streaming manner), the server processes send a message through a background queue such as RabbitMQ or Amazon SQS to a background process whose sole purpose is to process heavy tasks pulled from that queue. Fundamentally, if you are using Node.js properly you don't need multiple threads. Instead you use multiple processes, and the processes are essentially "threads" that can talk to each other using parent/child process communication, HTTP, Redis pub/sub, or any other mechanism you want.
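
The parent/child variant of this can be sketched in one file. This is a minimal illustration, not a production setup: `childSource` and `runHeavyJob` are hypothetical names, the summing loop stands in for a real heavy task, and a real deployment would put a queue such as RabbitMQ or SQS between the two processes instead of a direct IPC channel.

```javascript
const { spawn } = require('child_process');

// Source of the background process, inlined so the sketch fits in one file.
// In practice this would be its own script pulling jobs from a queue.
const childSource = `
  process.on('message', (job) => {
    // Stand-in for a heavy, CPU-bound task.
    let sum = 0;
    for (let i = 1; i <= job.n; i++) sum += i;
    process.send({ result: sum });
  });
`;

// Hypothetical helper: runs one job in a separate Node.js process and
// resolves with its result. The parent's event loop stays free meanwhile.
function runHeavyJob(n) {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, ['-e', childSource], {
      // The 'ipc' entry opens a message channel between parent and child.
      stdio: ['ignore', 'inherit', 'inherit', 'ipc'],
    });
    child.once('error', reject);
    child.once('message', (msg) => {
      child.kill();
      resolve(msg.result);
    });
    child.send({ n });
  });
}

// Usage: the job and its result are serialized on the way in and out.
runHeavyJob(100).then((result) => console.log('sum:', result));
```

Note that the message passed to `child.send()` is serialized, which is exactly the cost the original poster is worried about for large inputs.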

But there is no reason why anything should block a Node.js process if it is written properly. I've even done heavy video transcoding in a streaming manner in a Node.js process without blocking the event loop.

The reason for the downvotes, I suspect, is that this looks like an attempt to derail a thread to get tech support on a barely-related topic. Worse, the initial comment was worded as "this thing sucks because..." instead of a question, despite showing very little knowledge about the thing it complained about.

Thanks for the explanation and the moral support :)

I think most people here misread the line "facilitating structural sharing of large data-structures between parallel tasks, which cannot be done using ordinary processes" in my first post.

And by large data-structures, I don't necessarily mean structures which can be "naturally" streamed. I'm thinking more of a large index, for example, which can be used for fast lookup, and be used from several threads at the same time.

Having processes (here named workers) is a nice feature, but doesn't cut it when you want to share large amounts of data between threads (serializing that data would completely block the main thread). In my view, it is unfortunate that the designers of Node.js didn't opt for having multiple threads as opposed to putting every thread in a separate process.
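
For what it's worth, Node.js versions released after this discussion ship a `worker_threads` module, and a `SharedArrayBuffer` handed to such a worker is shared rather than copied. The sketch below assumes such a Node.js version; the sorted-array "index", `workerSource`, and `lookupInWorker` are hypothetical names for illustration. It only covers flat typed arrays, not arbitrary object graphs like a pointer-based search tree.

```javascript
const { Worker } = require('worker_threads');

// Hypothetical "index": a sorted Int32Array living in shared memory.
const SIZE = 1000;
const shared = new SharedArrayBuffer(SIZE * Int32Array.BYTES_PER_ELEMENT);
const index = new Int32Array(shared);
for (let i = 0; i < SIZE; i++) index[i] = i * 2; // 0, 2, 4, ...

// Worker code, inlined as a string so the sketch fits in one file.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  // A view onto the same memory as the main thread; nothing is copied.
  const index = new Int32Array(workerData.shared);

  function binarySearch(arr, target) {
    let lo = 0;
    let hi = arr.length - 1;
    while (lo <= hi) {
      const mid = (lo + hi) >>> 1;
      if (arr[mid] === target) return mid;
      if (arr[mid] < target) lo = mid + 1;
      else hi = mid - 1;
    }
    return -1;
  }

  parentPort.on('message', (key) => {
    parentPort.postMessage(binarySearch(index, key));
  });
`;

// Hypothetical helper: looks up one key on a worker thread and resolves
// with its position in the index, or -1 if the key is absent.
function lookupInWorker(key) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { shared },
    });
    worker.once('error', reject);
    worker.once('message', (pos) => {
      worker.terminate();
      resolve(pos);
    });
    worker.postMessage(key);
  });
}

// Usage: only the key and the resulting position cross the thread boundary.
lookupInWorker(84).then((pos) => console.log('position:', pos));
```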

Node workers are just new processes. They do have an event-loop.

@1. You don't need threads if workers are enough for you, that's how you should do computation intensive tasks...

Computationally intensive tasks often take large amounts of data as input. And sharing data with a worker always has to be done by serializing this data (in a message). So for large inputs, this approach doesn't work (the main thread would block the CPU while serializing the messages).

But my biggest problem with workers is that they don't have an event-loop, so I can't share asynchronous code between the main thread and the workers.

There is no need to serialize large amounts of data. The way it is designed to work in Node is that you use a stream. So for example, let's say you have a multi-TB data dump in Amazon S3, you want to process it, and then upload a transformed multi-TB result set back to Amazon S3. (This is something I've worked on before.)

The way it works is you open a download stream from S3, pipe it into a Node.js transform stream, and then pipe that stream into an upload stream that uploads the data back to S3 using the multipart upload API.
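
The download, transform, upload shape described above can be sketched with in-memory stand-ins for the two S3 streams. Everything here is illustrative: `makeSource`, `makeUpperCase`, `makeSink`, and `run` are hypothetical names, and upper-casing stands in for whatever the real transform does.

```javascript
const { Readable, Transform, Writable, pipeline } = require('stream');

// Stand-in for the S3 download stream: emits the given chunks one by one.
function makeSource(chunks) {
  return Readable.from(chunks);
}

// The transform step: upper-cases each chunk as it passes through.
function makeUpperCase() {
  return new Transform({
    transform(chunk, encoding, callback) {
      callback(null, chunk.toString().toUpperCase());
    },
  });
}

// Stand-in for the S3 upload stream: collects chunks into an array.
function makeSink(collected) {
  return new Writable({
    write(chunk, encoding, callback) {
      collected.push(chunk.toString());
      callback();
    },
  });
}

// Wires the three stages together; pipeline() forwards errors and
// applies backpressure, so only a few chunks are in memory at a time.
function run(chunks, done) {
  const collected = [];
  pipeline(makeSource(chunks), makeUpperCase(), makeSink(collected), (err) => {
    done(err, collected.join(''));
  });
}

// Usage
run(['hello ', 'world'], (err, output) => {
  if (err) throw err;
  console.log(output);
});
```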

The Node.js design is very much like using Unix pipes. You can pipe a huge multi TB file through grep without blocking anything. The data just streams from disk into the grep process, grep filters it down to things that match, and then streams the results onto the screen.

Computation on huge streams in Node.js works the same way. Your event loop remains unblocked even when operating on a stream terabytes in size because you are only ever touching a portion of the dataset at a time. Additionally, if you do it properly your overall memory usage remains low, as you are exporting the data back out of the machine as fast as it comes in. I've used this technique to process streaming data many GB in size while keeping the node process under 200 MB of memory used from the system perspective.

Recommended reading: https://nodejs.org/api/stream.html

Here is an example of an upload stream that I created for the use case of processing a large multi TB data set and piping the result up to Amazon S3: https://www.npmjs.com/package/s3-upload-stream

For streaming, I can see that this can work.

But basically, what I wanted to do, is implement a module that works as an index between threads (e.g., a search-tree for fast lookup). However, since in Node.js all threads are in a separate process, it is (afaict) impossible to make this efficient, as processes do not share data.

It sounds like nodejs is not what you want then.

One can also install modules from a git repo, from which access to the module can be controlled in the regular git fashion:

  "dependencies": {
    "private-module": "git+ssh://..."
  }

or:

  "dependencies": {
    "private-module": "git+https://<user>:<password>@..."
  }

If one chooses to use GitHub, there is also the option to use an auth token in the URL scheme instead of needing to distribute an SSH private key (bad) or having login credentials in the package.json:

  "dependencies": {
    "private-module": "git+https://<token>:x-oauth-basic@github.com/<account>/<repo>.git"
  }

One could also just have a folder of private modules mounted from some shared file server or whatever. An NPM server is not an absolute requirement to use npm.

> (why isn't this a separate npm module, btw?)

It is[0]. It's a dependency of browserify.

If you don't use it, part of the concat+minification step prunes dead code.

> To make my own privately held modules and install them properly, I have to run a npm server? This seems like an awful lot of work for something as basic as this. Ok, so now I can use the cloud for this, but come on, I should be able to do this just from within the filesystem, like e.g. git does it.

If you don't care about having your package published somewhere other than your local machine, you can use npm link[1], or `npm install <directory>`[2].

0. https://github.com/substack/http-browserify/

1. https://docs.npmjs.com/cli/link

2. https://docs.npmjs.com/cli/install