by livando 2898 days ago
"Multi-threading is a must, when scraping at scale."

I disagree on this point. Starting with a single-threaded model allowed my team to scale quickly and with little additional overhead. What we lost in performance we gained in simplicity and developer productivity. That said, tuning the app and porting portions of it to a multi-threaded system is slotted to take place within the next year.

Start with single threaded and simple, move to multi-threaded scrapers when the juice is worth the squeeze.


Or use a language where fully utilizing all CPU cores is transparent, like Elixir? There's zero complexity, you basically add 4-5 lines of code and that's it. Honestly, not exaggerating.

I've written several very amateur scrapers over the last several years, and I am never going back to languages with a global interpreter lock, ever.

I'm assuming you're talking about Python, where using multithreading or multiprocessing is also "4-5 lines". Can you explain what's wrong with GIL languages?

Now that I think about it, it's even less than 4 lines:

from multiprocessing.pool import Pool  # or ThreadPool

pool = Pool()

pool.map(scrape, urls)

When the pooled functions are I/O bound then the GIL is not a problem. Any GIL language will do.
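
A minimal sketch of the I/O-bound case, for what it's worth. The `fake_fetch` function and the example.com URLs are made up for illustration, and `time.sleep` stands in for a real network request (it releases the GIL the way a blocking socket read does), so the threads overlap:

```python
import time
from multiprocessing.pool import ThreadPool


def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request: time.sleep releases
    # the GIL, just as a blocking socket read would.
    time.sleep(0.2)
    return url, len(url)


def scrape_all(urls, workers=8):
    # ThreadPool.map keeps the input order and blocks until every task is done.
    with ThreadPool(workers) as pool:
        return pool.map(fake_fetch, urls)


urls = [f"https://example.com/page/{i}" for i in range(8)]
start = time.perf_counter()
results = scrape_all(urls)
elapsed = time.perf_counter() - start
print(f"{len(results)} pages in {elapsed:.2f}s")
```

Run one after another, the eight simulated requests would take about 1.6 seconds; with eight threads the total should land close to a single request's latency.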

However, when generating reports, for example, try using the same approach to serialize 4 pages of DB records into 4 pieces of a big CSV file, each on its own CPU core. That is where languages without a GIL truly shine, and languages like Python and Ruby struggle unless their GIL implementations compromise and yield without waiting for an I/O operation to complete.

I'm not sure you understand how the GIL works in Python. If you're using multiprocessing, there's no locking across the code executing on each core. Also, if you're writing to the same file from four processes, you're going to need locking.
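
One way to sidestep the locking question, as a rough sketch: have each worker process write its own part file, then concatenate the parts in a single process. The function names (`write_part`, `merge_parts`, `make_jobs`) and the sample rows are hypothetical, not anything from a real reporting codebase:

```python
import csv
import os
import tempfile
from multiprocessing import Pool


def write_part(job):
    # Runs in a worker process, which has its own interpreter and its own GIL.
    part_path, rows = job
    with open(part_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return part_path


def merge_parts(part_paths, out_path):
    # Concatenation happens in one process, so no cross-process file lock is needed.
    with open(out_path, "w", newline="") as out:
        for path in part_paths:
            with open(path, newline="") as part:
                out.write(part.read())


def make_jobs(directory, pages):
    # One (part file path, rows) pair per page of records.
    return [(os.path.join(directory, f"part{i}.csv"), rows)
            for i, rows in enumerate(pages)]


if __name__ == "__main__":
    # Made-up data: 4 "pages" of 3 records each.
    pages = [[(p, r, f"row-{p}-{r}") for r in range(3)] for p in range(4)]
    with tempfile.TemporaryDirectory() as tmp:
        jobs = make_jobs(tmp, pages)
        with Pool(4) as pool:
            parts = pool.map(write_part, jobs)
        report = os.path.join(tmp, "report.csv")
        merge_parts(parts, report)
        with open(report) as f:
            print(sum(1 for _ in f), "rows written")
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes start by re-importing the main module.
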

Last I knew, GIL languages work well in multicore scenarios as long as all N tasks have I/O calls that serve as yielding points for the interpreter, and they do not use preemptive scheduling the way the BEAM VM (Erlang, Elixir, LFE, Alpaca) does.

Am I mistaken?

As far as Python goes, yes. Multicore implies multiple processes, which means that each process will have its own Python interpreter, each with its own GIL.

If you were to use multithreading instead, you would generally have a problem if you were doing non-I/O work.
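
To see this for yourself, here is a small sketch assuming a standard CPython build with the GIL enabled. `count_primes` is just a made-up pure-Python CPU-bound task; on such a build the threaded run should take roughly as long as the serial one, since only one thread executes Python bytecode at a time:

```python
import time
from multiprocessing.pool import ThreadPool


def count_primes(limit):
    # Pure-Python CPU-bound work with no I/O, so threads cannot overlap it
    # under the GIL.
    found = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            found += 1
    return found


def timed(label, fn):
    # Print how long fn() took and pass its result through.
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result


limits = [20_000] * 4
serial = timed("serial", lambda: [count_primes(n) for n in limits])
with ThreadPool(4) as pool:
    threaded = timed("4 threads", lambda: pool.map(count_primes, limits))
```

Swapping `ThreadPool` for `multiprocessing.Pool` (with the usual `if __name__ == "__main__":` guard) is the change that lets the four tasks actually run on four cores.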

Any further information on this? Last I looked (which was a while ago), the infrastructure like HTML parsers seemed surprisingly tricky in Elixir.

The only complication is if you want to use Meeseeks (https://github.com/mischov/meeseeks), which requires the Rust compiler and runtime to be installed because it has native bindings. Meeseeks is useful because it's a bit faster than the default Floki (https://github.com/philss/floki) and because it can handle very malformed HTML.

As for Elixir itself, here's a quick example:

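```
# Assume this contains 1000 URLs
urls = [....]

# This runs up to 100 concurrent tasks (lightweight BEAM processes); if the
# max_concurrency option is omitted, it defaults to the number of online
# schedulers, normally the number of CPU cores. For I/O bound tasks however
# it's pretty safe to use much more.
results =
  urls
  |> Task.async_stream(&YourScrapingModule.your_scraping_function/1, max_concurrency: 100)
  # The stream is lazy; Enum.to_list/1 runs it and collects {:ok, result} tuples.
  |> Enum.to_list()
```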

It's honestly that simple in Elixir. For finer-grained control the line count is a little bigger, but only a little. Not hundreds of lines for sure.

Meeseeks's speed difference with Floki is not that significant, and my initial findings are they've leveled out even more with OTP 21, sometimes even swinging in favor of Floki.

The better handling of malformed HTML by default is the much bigger deal.

Thank you man (I know you are the author of Meeseeks), I didn't know that. As far as I knew, Meeseeks was faster than Floki, but it seems OTP 21 largely eliminated that, as you said.

Valuable info, thanks!

It was pretty interesting to see Floki get a lot faster and Meeseeks actually get a little slower with OTP 21. I'll enjoy figuring out why. I hope to get a chance to work on the OTP 21 performance of Meeseeks before too long.

On the plus side there were some nice memory improvements for Meeseeks in OTP 21.