| I’m biased since I’m an owner of a web scraping agency (https://webscrapingsolutions.co.uk/). I was asking myself the same question in 2019.
You can use any programming language, but we have settled on this tech stack: Python, Scrapy (https://github.com/scrapy/scrapy), Redis, and PostgreSQL, for the following reasons:
[1] Scrapy is a well-documented framework, so any Python programmer can start using it after 1 month of training. There are a lot of guides for beginners.
[2] Lots of features are already implemented and open-source, so you won’t have to waste time and money on them.
[3] There is a strong community that can help with most questions (I don't think any other alternative has that).
[4] Scrapy developers are cheap. You will only need junior+ to mid-level software engineers to pull off most projects. It’s not rocket science.
[5] Recruiting is easier:
- there are hundreds of freelancers with relevant expertise
- if you search on LinkedIn, there are hundreds of software developers who have worked with Scrapy in the past, and you don’t need that many
- you can grow expertise in your own team quickly
- developers are easily replaceable, even on larger projects
- you can use the same developers on backend tasks
[6] You don’t need DevOps expertise in your web scraping team, because Scrapy Cloud (https://www.zyte.com/scrapy-cloud/) is good and cheap enough for 99% of projects.
[7] If you decide to run your own infrastructure, you can use https://github.com/scrapy/scrapyd.
[8] The entire ecosystem is well maintained and steadily growing. You can integrate a lot of third-party services into your project within hours: proxies, captcha solving, headless browsers, HTML parsing APIs.
[9] It’s easy to integrate your own AI/ML models into the scraping workflow.
[10] With some work, you can use Scrapy for distributed projects that scrape thousands (or millions) of domains. We are using https://github.com/rmax/scrapy-redis.
[11] Commercial support is available.
There are several companies that can develop an entire project for you or take over an existing one if you don’t have the time or don’t want to do it on your own.
We have built dozens of projects in multiple industries:
- news monitoring
- job aggregators
- real estate aggregators
- ecommerce (anything from 1 website to monitoring prices on 100k+ domains)
- lead generation
- search engines in a specific niche (SEO, PDF files, ecommerce, chemical retail)
- macroeconomic research & indicators
- social media, NFT marketplaces, etc.
So, most projects can be finished using these tools. |
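
As a rough illustration of point [10], the snippet below sketches the Scrapy `settings.py` entries that the scrapy-redis README describes for sharing one request queue and one duplicate filter across several crawler processes. The Redis URL is a placeholder assumption, and the item pipeline is optional; treat it as a starting point under those assumptions, not as any particular production configuration.

```python
# settings.py (sketch): scrapy-redis options as described in its README.

# Replace Scrapy's default scheduler with the Redis-backed one, so that
# every worker process pulls requests from the same shared queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests through Redis, so that workers do not re-crawl
# URLs another worker has already seen.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and the dedup set in Redis between runs, which allows
# pausing and resuming a crawl.
SCHEDULER_PERSIST = True

# Assumption: a local Redis instance; point this at your own server.
REDIS_URL = "redis://localhost:6379"

# Optional: also push scraped items into Redis for downstream processing.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}
```

With settings like these, spiders typically subclass `scrapy_redis.spiders.RedisSpider` and read their start URLs from a Redis list, so adding capacity mostly means starting more worker processes against the same Redis server.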