by ralusek 3636 days ago
None of the problems you have seem to be related to language, let alone whether it's asynchronous, synchronous, multithreaded or single threaded. What you've described here sounds simple enough from an outside perspective, so it's more than likely just time to reconsider your architecture.

So let's talk about solving this with Node.js. First of all, this has become overwhelming because you're treating the whole thing as one monolithic service, when it's really just time to break it into modules. It sounds like you have a scraper, a parser, some application which needs the scraped data, and a web API for that application. Cool, easy enough.

So the first thing I would do is pull your scraper out into a completely separate service which has nothing to do with your API. Just make it a JavaScript class which takes some configuration to manage the scraping settings, and that's about it. All it really needs is a method which takes an input telling it what site to scrape and gives you back the output.
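To make that concrete, here's a minimal sketch of what such a class could look like. Everything in it is an assumption on my part: the `Scraper` name, the `userAgent`, `timeoutMs` and `fetchImpl` options, and the use of the global `fetch` and `AbortSignal.timeout` (so Node 18 or newer). Treat it as a starting shape, not a finished scraper.

```javascript
// Hypothetical scraper module; names and defaults are illustrative only.
class Scraper {
  constructor(config = {}) {
    this.config = {
      userAgent: 'example-scraper/0.1',
      timeoutMs: 10000,
      // Injectable so the class can be exercised without network access.
      fetchImpl: (...args) => globalThis.fetch(...args),
      ...config,
    };
  }

  // Describe the request for a target URL; throws if the URL is invalid.
  buildRequest(target) {
    const url = new URL(target);
    return {
      url: url.toString(),
      headers: { 'User-Agent': this.config.userAgent },
      timeoutMs: this.config.timeoutMs,
    };
  }

  // Fetch the page and resolve with its raw HTML.
  async scrape(target) {
    const { url, headers, timeoutMs } = this.buildRequest(target);
    const response = await this.config.fetchImpl(url, {
      headers,
      signal: AbortSignal.timeout(timeoutMs),
    });
    if (!response.ok) {
      throw new Error(`Scrape failed for ${url}: HTTP ${response.status}`);
    }
    return response.text();
  }
}
```

Usage would be something like `const html = await new Scraper({ timeoutMs: 5000 }).scrape('https://example.com')`. Note that the class knows nothing about your API or your parser, which is the whole point.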

Cool, now you need a parser. You need to go through the HTML returned by your scraper and format it as JSON. Fortunately, JavaScript has a lot of tools well suited to dealing with HTML. One I like is Cheerio, which makes it very easy to parse HTML and pull out the data you need. So now you have a parser.

All that's left now is integrating with your application. Since you said you need to "scrape data live and based on certain queries", what we want is to expose some routes in your application which make that possible.
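As a very rough idea of what one such route could look like, here's a sketch using only Node's built-in `http` module. The `/scrape` path, the `url` query parameter, and the `createApi` and `matchScrapeRoute` names are all assumptions on my part, and `scrape` and `parse` stand in for whatever scraper and parser functions you end up with.

```javascript
const http = require('node:http');

// Return the target URL for GET /scrape?url=..., or null for anything else
// (including a /scrape request with no url parameter).
function matchScrapeRoute(method, rawUrl) {
  const url = new URL(rawUrl, 'http://localhost');
  if (method !== 'GET' || url.pathname !== '/scrape') return null;
  return url.searchParams.get('url');
}

// scrape(target) resolves with HTML; parse(html) returns JSON-serializable data.
function createApi({ scrape, parse }) {
  const json = { 'Content-Type': 'application/json' };
  return http.createServer(async (req, res) => {
    const target = matchScrapeRoute(req.method, req.url);
    if (!target) {
      res.writeHead(404, json);
      res.end(JSON.stringify({ error: 'Not found' }));
      return;
    }
    try {
      const data = parse(await scrape(target));
      res.writeHead(200, json);
      res.end(JSON.stringify({ target, data }));
    } catch (err) {
      // The upstream site failed or the HTML could not be parsed.
      res.writeHead(502, json);
      res.end(JSON.stringify({ error: err.message }));
    }
  });
}
```

You'd start it with something like `createApi({ scrape, parse }).listen(3000)`. In a real app you'd probably use a framework and restrict which targets can be scraped, but the structure stays the same: the route just wires the scraper and parser together.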

If you tell me a little bit more about what your application does, I can give you some idea of how I would structure this from here.