by qsort 1057 days ago
The big missing item from the list: generators!

Using "yield" instead of "return" turns the function into a generator, a restricted form of coroutine that can pause and resume. This is useful in all sorts of cases and works very well with the itertools module of the standard library.

One of my favorite examples: a very concise snippet of code that generates all primes:

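  from collections import defaultdict
  from itertools import count

  def primes():
      # ps maps an upcoming composite to the primes that will cross it off
      ps = defaultdict(list)
      for i in count(2):
          if i not in ps:
              yield i
              ps[i**2].append(i)  # smaller multiples are covered by smaller primes
          else:
              for n in ps[i]:
                  # move each prime to its next multiple (odd primes skip even multiples)
                  ps[i + (n if n == 2 else 2*n)].append(n)
              del ps[i]  # drop the entry once we have passed it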

And this is a presentation explaining why generators may be extremely useful for all kinds of data pipelines: https://www.dabeaz.com/generators/Generators.pdf

If you don't know it already, it is really worth looking into. I am a Python dev with nearly a decade of experience and I knew generators, and yet this was still an eye-opener.

Note that despite this being a Python-specific slide deck, generators and iterators are also present in many other languages, including but not limited to Rust and JS.

The concepts matter more than the chosen language in this deck.

I learned a lot! Looks like I can apply this to a PHP trace/profile parser project, especially the pipelined parsing and the query language idea.

Wow, thanks for that -- that's an excellent slide deck.
But wait, there's more: you can send data back into the function with send()! (The sent value becomes the result of the yield expression.)

https://stackoverflow.com/questions/20579756/passing-value-t...

And don't forget "yield from": it yields every value from another iterable, and if that iterable is itself a generator, it stays live, so values you send() are passed through to it!
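A minimal sketch of both features together, in case it helps. The names `averager` and `delegating` are made up for illustration; the point is only that a value passed to `send()` shows up as the result of the `yield` expression, and that `yield from` forwards it to the inner generator.

```python
def averager():
    """Yield the running average of every value sent in so far."""
    total, count = 0.0, 0
    average = None
    while True:
        value = yield average   # send(value) resumes here with that value
        total += value
        count += 1
        average = total / count


def delegating():
    # yield from forwards next() and send() calls to the inner generator
    yield from averager()


gen = delegating()
next(gen)  # prime the generator: run it up to the first yield
results = [gen.send(v) for v in (10, 20, 60)]
print(results)  # [10.0, 15.0, 30.0]
```

Note the initial `next(gen)`: a freshly created generator has not reached a `yield` yet, so it cannot receive a value until it has been advanced once.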

Anyone have good examples of how/when to actually use this? I've personally never interacted with or written a generator that expects to receive values.
I actually had a great use case for this last week. Needed to flatten a list of nested dicts, e.g.:

  [
    {"name": "/dev/loop0"},
    {"name": "/dev/loop1"},
    {"name": "/dev/loop2"},
    {
      "name": "/dev/sda",
      "children":
        [
          {
            "name": "/dev/sda1",
            "children":
              [{"name": "/dev/mapper/lubuntu--vg-root"}, {"name": "/dev/mapper/lubuntu--vg-swap_1"}],
          },
        ],
    },
    {"name": "/dev/sdb", "children": [{"name": "/dev/sdb1"}, {"name": "/dev/sdb2"}]},
    {"name": "/dev/sdc", "children": [{"name": "/dev/sdc1"}, {"name": "/dev/sdc9"}]},
  ]
Wound up writing a recursive generator (with some help from #python on IRC):

  def flatten(items):
      for item in items:
          yield {k:v for k,v in item.items() if k != 'children'}
          if 'children' in item:
              yield from flatten(item['children'])
which, once wrapped in list(), results in:

  [{'name': '/dev/loop0'},
   {'name': '/dev/loop1'},
   {'name': '/dev/loop2'},
   {'name': '/dev/sda'},
   {'name': '/dev/sda1'},
   {'name': '/dev/mapper/lubuntu--vg-root'},
   {'name': '/dev/mapper/lubuntu--vg-swap_1'},
   {'name': '/dev/sdb'},
   {'name': '/dev/sdb1'},
   {'name': '/dev/sdb2'},
   {'name': '/dev/sdc'},
   {'name': '/dev/sdc1'},
   {'name': '/dev/sdc9'}]
I see your function and "yield" (pun definitely intended) the following:

    def flatten(children=(), **other):
        if other: yield other
        for child in children: yield from flatten(**child)
That's pretty brilliant to use `children` as the keyword name, thanks!
Thanks for the example, but I was more looking for something that uses "generator.send(...)". I definitely agree that yielding items out of generators is extremely useful, but not so sure on examples of generators that are sent values.
This is the basis of most older async frameworks (see: Tornado, Twisted). A while ago I put together a short talk on how to go from this feature -> a very basic version of Twisted's @inlineCallbacks decorator.

https://github.com/ltavag/async_presentation/tree/master

Anything with feedback control. Updating a priority queue's weights, adaptive caching, adaptive request limiting, etc. Ironically it looks like HN itself rate limited me the first time I tried to reply lol
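As a rough illustration of the feedback-control idea, here is a sketch of an adaptive delay for request limiting. Everything here (the name `backoff`, its parameters, the True/False protocol) is hypothetical and not taken from any library: the generator yields the delay to wait, and the caller sends back whether the last request succeeded.

```python
def backoff(initial=1.0, factor=2.0, maximum=30.0):
    """Yield the delay to wait before the next request.

    The caller sends True after a success and False after a failure.
    """
    delay = initial
    while True:
        succeeded = yield delay
        if succeeded:
            delay = initial                       # reset after a success
        else:
            delay = min(delay * factor, maximum)  # grow after a failure


delays = backoff()
observed = [next(delays)]          # first delay, before any feedback
for outcome in (False, False, True, False):
    observed.append(delays.send(outcome))
print(observed)  # [1.0, 2.0, 4.0, 1.0, 2.0]
```

The state (the current delay) lives in the generator's local variables, so the caller only has to report outcomes.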
I am a python noob and this is going to take me some time to process.
Best way to think about it is that a generator can throw some questions back to the caller. It always looks a bit messy though.

    question_bank={'1+1' : '2', '2+3' : '5'}

    def Quiz():
        for question, correct_answer in question_bank.items():
            answer = yield question
            if answer == correct_answer:
                print('Correct!')
            else:
                print('Wrong.')
        yield 'Finished!'
                
    question = Quiz()
    q = next(question)
    while q != 'Finished!':
        q = question.send(input(q))
I like using generators when querying APIs that paginate results. It's an easy way to abstract away the pagination for your caller.

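  import requests

  def get_api_results(query):
    # URL and the field names are placeholders; sending query as a parameter is an assumption
    params = { "query": query, "next_token": None }
    while True:
      response = requests.get(URL, params=params)
      data = response.json()  # not named json, to avoid shadowing the json module
      yield from data["results"]
      if data.get("next_token") is None:
        return
      params["next_token"] = data["next_token"]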
  
  for result in get_api_results(QUERY):
    process_result(result)  # No need to worry about pagination
Thanks! I tried to add mostly the stuff I don't encounter that often in blogs/tutorials etc. But I guess you are right. Generators, or at least the 'yield' keyword, are often misunderstood, and we can't emphasize them enough.
Just to clarify, I don't mean your article is bad or incomplete -- quite the contrary, I enjoyed it a lot. Generators are one of my favorite Python features and they're kind of underused, mostly because people simply don't know about them.

A couple more along the same lines:

- Metaclasses and type. (This is admittedly dark magic, but useful in library code, less so in application code)

- Magic methods! Everyone knows about __init__, but you can override all sorts of behaviors (see: https://docs.python.org/3/reference/datamodel.html)

My favorite example (I have a lot of favorite examples :)) is __call__, which emulates function calling and is the equivalent of C++'s operator().

Why is it my favorite? Because as the old adage goes, "a class is a poor man's closure, a closure is a poor man's class":

  class C:
      def __init__(self, x):
          self.x = x
      def __call__(self, y):
          return self.x + y
 
  >>> a = C(2)
  >>> a(3)
  5
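For comparison, the closure side of the adage might look like this. `make_adder` is just an illustrative name; it does the same job as the class `C` above, with `x` captured by the inner function instead of stored on `self`.

```python
def make_adder(x):
    # x is captured by the inner function, playing the role of self.x
    def add(y):
        return x + y
    return add


a = make_adder(2)
print(a(3))  # 5
```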
Thanks a lot! Really appreciate it. Love the example! Haven't used the dunder __call__ yet (like many magic methods I guess), but that's a nice one!

I didn't have to use Metaclasses, either, though I have read about them, especially in Fluent Python. But I guess I belong to the 99% who haven't had to worry about them, yet :P

I find that __call__ is very confusing, but maybe because I'm not used to seeing it often.

What is the benefit compared to having a method named "add" that also explains the behavior?

If an object is callable you can use it in places that might conventionally expect functions. The utility of that is very situational, though. I've only used it a handful of times myself over the years I've known and used Python.

It may also give you a "clearer" (in quotes because subjective) presentation for something you're trying to do.
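One small sketch of "places that conventionally expect functions": the `key` argument of `sorted`. The class `DistanceFrom` is invented for this example; any object that defines `__call__` would work the same way.

```python
class DistanceFrom:
    """Callable object that measures how far a number is from a target."""

    def __init__(self, target):
        self.target = target

    def __call__(self, value):
        return abs(value - self.target)


distance_from_ten = DistanceFrom(10)
print(callable(distance_from_ten))                    # True
print(sorted([1, 12, 7, 20], key=distance_from_ten))  # [12, 7, 1, 20]
```

A named method such as `distance_from_ten.measure` could be passed as the key just as well, so this is mostly a matter of taste.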

I see it a lot in HuggingFace, and use it myself for classes that are used like a function, especially when the obvious method name is the verb form of the class name

    processor = SomeProcessor.load("path/to/config")

    # with __call__
    processed_inputs = processor(inputs)

    # less awkward than
    processed_inputs = processor.process(inputs)
The only benefit is to the human, same as @property or even @dataclass.
Thanks for writing that up! I disagree though, I prefer the processor.process for clarity, and for not adding another way of doing things that regular methods already do.
I think I figured out that count(2) is from itertools? I'm new to Python.

I think you could simplify the rest like so:

    from collections import defaultdict
    from itertools import count

    def primesHN():
        yield 2
        ps = defaultdict(list)
        for i in count(3, 2):
            if i not in ps:
                yield i
                ps[i**2].append(2*i)
            else:
                for n in ps.pop(i):
                    ps[i + n].append(n)
> I think I figured out that count(2) is from itertools?

It is. Itertools is a masterpiece of a module. It has a lot of functions that operate on iterators and will work both on standard iterables (lists, tuples, dicts, range(), count() etc.) and on your own generators. It forms a sort of "iterator algebra" that makes working with them very easy.
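A quick sketch of that "iterator algebra", using a throwaway generator called `squares` (not from the thread) alongside a plain list. `islice` and `takewhile` treat both the same way.

```python
from itertools import count, islice, takewhile


def squares():
    """An infinite generator of perfect squares: 1, 4, 9, ..."""
    for n in count(1):
        yield n * n


# Take a fixed number of items from an infinite generator
first_five = list(islice(squares(), 5))
# Take items for as long as a condition holds
below_fifty = list(takewhile(lambda s: s < 50, squares()))
# The same function works on an ordinary list
first_two = list(islice([10, 20, 30, 40], 2))

print(first_five)   # [1, 4, 9, 16, 25]
print(below_fifty)  # [1, 4, 9, 16, 25, 36, 49]
print(first_two)    # [10, 20]
```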

> I think you could simplify the rest like so:

Sounds good, but with a caveat: you do need to call "del" at the end for memory deallocation purposes. The garbage collector isn't smart enough to know you won't be using those dictionary entries any longer. Technically the code still works, but keeping everything in memory defeats the purpose of writing a generator.

> you do need to call "del" at the end

The garbage collector doesn't understand "pop"? That seems...dumb? ¯\_(ツ)_/¯
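For what it's worth, a quick check suggests the pop() version does not keep the entries around: `dict.pop` removes the key and hands back the value, much like reading the value and then calling `del`. A minimal sketch:

```python
from collections import defaultdict

ps = defaultdict(list)
ps[9].append(6)

popped = ps.pop(9)   # removes the key and returns its value
print(popped)        # [6]
print(9 in ps)       # False
print(len(ps))       # 0
```

Once the loop over the popped list finishes, nothing refers to that list any more, so it can be reclaimed.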

Can you explain how generators work with multiprocess (thread-based pool)?

Is the ps internal variable unique for each thread or the same?

Is it safe to execute your primes() from different threads?

> can you explain how generators work with multiprocess

The best way to think of a generator is as an object implementing the iteration protocol. They don't really interact with concurrency; as far as multiprocess is concerned, they're just regular objects. So the answer is that it depends on how you plan to share memory between the processes.

> is ps internal variable unique for each Thread or same?

ps is local to the generator instance.

  def f():
      x = 0
      while True:
          yield (x := x + 1)
 
  >>> f()
  <generator object f at 0x10412e500>
  >>> x = f()
  >>> y = f()
  >>> next(x)
  1
  >>> next(x)
  2
  >>> next(y)
  1
> is it safe to execute your primes() from different threads?

For this specific generator, which is CPU-bound, you would run into the GIL, so threads would not make it faster. More generally, if you're talking about non-CPU-bound operations, you need to synchronize the threads. It's worth looking into asyncio for those use cases.

Calling a function that contains a yield simply returns a generator object, which holds the information needed to produce the next value and to resume the function's execution. That's why you need to consume such functions with a loop, next(), or list(...).

If you run it from different threads, I guess it will be the same as calling the function multiple times: each call returns a new generator that starts from the top.

    def sum():
        yield 1
        yield 2
    print(repr(sum()))
    print(next(sum()))
    print(next(sum()))
Prints

    <generator object sum at 0x7fc6f14823c0>
    1
    1
So a thread-based pool will have the same instance of the generator, while a process-based pool will have a unique instance of the generator?
In this example, calling sum() creates a generator and returns it. Say g = sum(). If you share g between threads, they will all use the same generator object! If you call sum() separately per thread, they will be different generators.

If you try to send g to a different process, you will get an error, because it doesn't serialize.
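The serialization failure is easy to reproduce with `pickle`, which is what the standard process pools use to ship objects between processes. A minimal sketch, with `numbers` as a stand-in generator:

```python
import pickle


def numbers():
    yield 1
    yield 2


g = numbers()
try:
    pickle.dumps(g)
    error = None
except TypeError as exc:
    # CPython refuses to pickle generator objects
    error = exc

print(type(error).__name__)  # TypeError
```

The usual workaround is to send the arguments to the other process and create the generator there.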

I don't know if a generator can be shared across threads, but in that case ... I have no idea :/

You'll need to search, or try!
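On sharing one generator across threads: if two threads call next() on it at the same moment, CPython can raise "ValueError: generator already executing". One common workaround is to guard it with a lock. The sketch below uses an invented wrapper class, `LockedIterator`, and is only one way to do it.

```python
import threading


class LockedIterator:
    """Wrap an iterator so only one thread at a time can advance it."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:
            return next(self._it)  # StopIteration propagates as usual


def numbers(n):
    for i in range(n):
        yield i


shared = LockedIterator(numbers(1000))
seen = []


def worker():
    for value in shared:
        seen.append(value)  # list.append is thread-safe in CPython


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every value is consumed exactly once, in some interleaved order
print(len(seen), sorted(seen) == list(range(1000)))  # 1000 True
```

Each value goes to exactly one thread, so this fits work distribution; it does not give every thread its own copy of the sequence.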