Hacker News
by pdimitar 1006 days ago
Does not seem exactly like an easy way to me. Not super hard, surely, but not "easy". More like "moderately easy to do and a bit annoying to implement".

Probably 20% of the effort shown in this post could have been expended to just write something very similar in Golang, and it would have taken less time, too. Because the way I see it this is trying to emulate futures / promises (and it looks like it's succeeding, at least on the surface). That can spiral out of comfortable maintainable code territory pretty quickly.

But especially for something as trivial as a crawler, I don't see the appeal of Python. You got a good deal of languages with lower friction for doing parallel stuff nowadays (Golang, Elixir, Rust if you want to cry a bit, hell, even Lua has some parallel libraries nowadays, Zig, Nim...).

2 comments

If you already know Python, the advice in this article is certainly a lot easier and more actionable than "just learn Go or Rust or Zig instead".
Certainly. My point is that if you need to write that much code and/or do that much research, at some point the effort of doing it in another language will be less than the effort of insisting on a tool that's not designed for it.

It happened to me and to many former colleagues.

Though obviously, everyone decides for themselves when that point comes -- or if it comes at all.

The point of the article is a handful of lines. The rest is accoutrement like the URL list and timing code. But sure, if

    tasks = {}
    for url in URLs:
        future = executor.submit(fetch_url, url)
        tasks[future] = url
bothers you, this is perfectly (some would say more so even than the original) Pythonic:

    tasks = {executor.submit(fetch_url, url): url for url in URLs}
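For anyone who wants to try the comprehension in isolation, here is a minimal sketch of how that dict is typically consumed with `as_completed`. The `fetch_url` below is a hypothetical offline stand-in (the article's version does a real HTTP request), and `fetch_all` and the example URLs are names I made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_url(url):
    # Hypothetical stand-in for the article's fetch_url so the sketch runs offline.
    return f"title of {url}"


URLs = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]


def fetch_all(urls):
    results = {}
    with ThreadPoolExecutor() as executor:
        # Map each future back to the URL it was submitted for.
        tasks = {executor.submit(fetch_url, url): url for url in urls}
        # as_completed yields futures in completion order, not submission order,
        # which is why the future -> url lookup is needed.
        for future in as_completed(tasks):
            results[tasks[future]] = future.result()
    return results


if __name__ == "__main__":
    for url, title in fetch_all(URLs).items():
        print(f"URL: {url}\nTitle: {title}")
```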
I have found another way in the documentation for `concurrent.futures`. You can use `Executor.map` (https://docs.python.org/3/library/concurrent.futures.html#co...). It eliminates the need to wait on the futures explicitly.

  def main():
      with ThreadPoolExecutor(max_workers=len(URLs)) as executor:
          for url, title in zip(URLs, executor.map(fetch_url, URLs)):
              print(f"URL: {url}\nTitle: {title}")
The default value of `max_workers` since Python 3.8 has been

  min(32, os.cpu_count() + 4)
You should probably avoid

  max_workers=len(items_to_process)
It will not save memory or CPU time when you have few items (workers are created as necessary) and may waste memory when you have many.
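A rough way to see the "workers are created as necessary" part for yourself, as a sketch rather than a benchmark: count how many distinct threads actually ran the tasks. `default_max_workers` and `threads_used` are hypothetical helper names, and the formula is just the one quoted above with a guard for `os.cpu_count()` returning `None`.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def default_max_workers():
    # The formula quoted above; os.cpu_count() may return None, hence the fallback.
    return min(32, (os.cpu_count() or 1) + 4)


def threads_used(n_items, max_workers):
    """Run n_items trivial tasks and count the distinct worker threads that ran them."""

    def task(_):
        return threading.get_ident()

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return len(set(executor.map(task, range(n_items))))


if __name__ == "__main__":
    print("default max_workers:", default_max_workers())
    # With only 3 items, a pool capped at 32 never needs more than 3 threads.
    print("threads used for 3 items:", threads_used(3, max_workers=32))
```

So a generous cap costs nothing when the input is small, while `len(items_to_process)` removes the cap exactly when the input is large.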
As a side note, using a future as a map key struck me as a bit weird, though perfectly valid. It'd be more natural IMO to use a list for the futures, and have the fetch_url function return a (url, result) tuple. Or use the url as the map key and just iterate over the map items instead of using as_completed on the keys.
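A minimal sketch of the first alternative, assuming a stand-in `fetch_url` that returns a `(url, result)` tuple (the article's function returns only the title, so this signature is my assumption):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_url(url):
    # Hypothetical stand-in: returns the URL together with its result,
    # so no future -> url lookup table is needed.
    return url, f"title of {url}"


def fetch_all(urls):
    with ThreadPoolExecutor() as executor:
        futures = [executor.submit(fetch_url, url) for url in urls]
        # Results still arrive in completion order; each one carries its own URL.
        return [future.result() for future in as_completed(futures)]


if __name__ == "__main__":
    for url, title in fetch_all(["https://example.com/a", "https://example.com/b"]):
        print(f"URL: {url}\nTitle: {title}")
```

The trade-off is that the worker function now has to know about the bookkeeping, which is probably fine for a small script.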
What “much research” are you talking about?

The amusing part is that the article calls out two groups of people into which your advice falls.

It’s not that much code, it’s about 4 lines of code, creating a “pool” and calling a wait on future objects.

This is a perfect solution for Python developers who have been perfectly happy using Django for years, and just need to scrape some API or download multiple files.

No, they shouldn’t switch to a different language the moment they need to optimize something embarrassingly parallel, they can see whether a simple solution in stdlib is enough, and probably move on.

If this is too much research for you, wait until you have to deal with the many problems of Go channels in the real world. (Reasonably well-known though controversial article: [1]) Don't even get me started on Rust. Concurrency and parallelism are hard.

Yes, I've written a shit ton of code in all aforementioned languages.

[1] https://www.jtolio.com/2016/03/go-channels-are-bad-and-you-s...

> and/or do that much research,

Is reading the official docs section on concurrency lots of research?

Python is surprisingly bad at parallelism, for a data or framing workhorse.

What TFA doesn't say is that process pools are quite fragile, certainly on Mac and Windows, but Linux also. They rely on pickling which is also fragile.
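To make the pickling point concrete, here is a small sketch (the helper name `can_pickle` is mine). Anything sent to a process pool has to survive `pickle.dumps`, and functions are pickled by name, so lambdas fail while things importable by name work.

```python
import pickle


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, which process pools need."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


if __name__ == "__main__":
    # Functions are pickled by reference to their importable name.
    print(can_pickle(len))            # True
    # A lambda has no importable name, so it cannot be sent to a worker process.
    print(can_pickle(lambda x: x))    # False
```

The same restriction applies to arguments and return values, which is usually where it bites in practice.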

That said, asyncio works surprisingly well if what you want is non-blocking execution and you are happy with one CPU. There is no parallel speedup, though.
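A minimal sketch of what that looks like, using `asyncio.sleep` as a stand-in for network I/O (`fake_fetch`, `fetch_all` and `timed_run` are hypothetical names, and the delay is arbitrary):

```python
import asyncio
import time


async def fake_fetch(url, delay=0.2):
    # Hypothetical stand-in for a non-blocking request; sleep simulates I/O wait.
    await asyncio.sleep(delay)
    return url, f"title of {url}"


async def fetch_all(urls):
    # gather runs the coroutines concurrently on one thread and
    # returns results in the same order as the inputs.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed_run(urls):
    start = time.perf_counter()
    results = asyncio.run(fetch_all(urls))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_run(["https://example.com/a", "https://example.com/b"])
    print(results, f"{elapsed:.2f}s")
```

The waits overlap, so the total is roughly one delay rather than the sum, but any CPU-bound work inside the coroutines would still run on a single core.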

After learning Clojure, I found Python's approach to concurrency terrible at best. Clojure is extremely easy to understand. It has basically three solutions, each for a clearly defined use case. It's much easier to judge what you should implement for a particular problem and how to do it.

I wish Python had similar solutions.