Show HN: Array – A Better Python List | HN Mirror


	Show HN: Array – A Better Python List (github.com)
	87 points by lauriat 1992 days ago

14 comments

pedrovhb 1992 days ago

I think this is neat but I'm not sure it's the best way to go about things.

> all(map(func3, filter(func2, map(func1, zip(a, b)))))

> a.zip(b).map(func1).filter(func2).forall(func3)

The original is indeed terrible and the second version is a bit better. A lot better than either one, though, is splitting your logic into multiple lines and assigning a descriptive identifier to each step. Maybe even throw in some inline comments if you're particularly respectful of others' time.

As tempting as it is to do something super clever and cram a ton of functionality into a small number of lines or characters (it does feel good), it's just better to be a bit more verbose and write simple, obvious code. I feel like code should be read like a book, not a puzzle.
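A rough sketch of the "named steps" style described above, next to the nested one-liner. The functions `to_area`, `is_large`, and `is_even` are hypothetical stand-ins for `func1`, `func2`, and `func3`; they do not come from the original post.

```python
# Hypothetical stand-ins for func1, func2, and func3.
def to_area(pair):
    height, width = pair
    return height * width


def is_large(area):
    return area > 10


def is_even(area):
    return area % 2 == 0


heights = [1, 2, 3]
widths = [4, 5, 6]

# One step per line, each with a descriptive name.
dimensions = zip(heights, widths)        # pairs (1, 4), (2, 5), (3, 6)
areas = map(to_area, dimensions)         # lazily yields 4, 10, 18
large_areas = filter(is_large, areas)    # lazily yields 18
all_large_areas_even = all(map(is_even, large_areas))

# The nested one-liner computes the same value.
nested = all(map(is_even, filter(is_large, map(to_area, zip(heights, widths)))))
```

Both forms stay lazy until `all` consumes them, so the named version costs nothing extra at runtime; the trade-off is purely about what a reader sees.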

brundolf 1992 days ago

What I like about "cramming a ton of functionality into [a single expression]" is that it doesn't leak any intermediates to the rest of the block, and it doesn't allow for mutation. There's a single output exposed; you can't accidentally use the wrong value downstream. You could wrap it all in an inner function, I guess, but that seems like overkill unless you plan to reuse it.

Though to be fair, having explicit intermediate variables is idiomatic in Python, from what I've seen. It's one of my biggest pet-peeves about the language, but it's not without precedent.

techdragon 1992 days ago

This is exactly the main situation where I'll happily "get clever" with my code.

It's not being reused and one of the following is true... I don't want to leave behind intermediary objects for whatever reason is relevant, or I feel it's worth it to compress the logic to make it possible to use a language feature that requires an expression, like lambdas or list/dict comprehensions.

bko 1992 days ago

> a.zip(b).map(func1).filter(func2).forall(func3)

Let's make this a somewhat concrete example.

---

heights = Array(1, 2, 3)

widths = Array(4, 5, 6)

# printing areas greater than 10

# functional

heights.zip(widths).map(to_area).filter(lambda area: area > 10).forall(lambda a: print(f"Area {a}"))

# verbose way

hw_zipped = heights.zip(widths)

areas = hw_zipped.map(to_area)

big_areas = areas.filter(lambda a: a > 10)

for a in big_areas: print(f"Area {a}")

---

Which do you prefer? I would argue the right level of abstraction is the functional way in this example, and it's often the case in my experience, especially in python where you don't often use a namespace to store these intermediary variables and you can't rely on typing.

claytonjy 1992 days ago

As another point of comparison, as of python 3.8 you can do this in one list comp without nesting or double-computing areas with the walrus:

    result = [area for x,y in zip(heights,widths) if (area := to_area(x,y)) > 10]

I don't think that's very easy to read; I'd opt for two list comps like

    areas = [to_area(x,y) for x,y in zip(heights,widths)]
    result = [area for area in areas if area > 10]

But I agree with OP that map+filter is easier to read.

bko 1992 days ago

I agree. My main problem is I don't want intermediary variables floating around. Especially something like "areas". If python localized variables to a block-scoped namespace, I wouldn't mind.

In scala:

---

val widths = Seq(1,2,3)

val heights = Seq(4,5,6)

widths.zip(heights).foreach { case (w, h) => {

  val area = w * h

  if (area > 10) {

    println(s"Area: ${area}")

  }

}}

println(area) // error: not found: value area

---

joshuamorton 1992 days ago

You can do this without the walrus in a one liner as well, I believe:

    [area for area in (to_area(x, y) for x, y in zip(h, w)) if area > 10]

or generally, you can take a multiline statement like the one you have and replace named value with its expression. Add some indentation and it's not too bad:

    [area for area in 
     (to_area(x, y) for x, y in zip(h, w))
     if area > 10]

syrrim 1992 days ago

  for x, y in zip(a,b):
      area = to_area(x, y)
      if area > 10:
          print(f"Area {area}")

>in python where you don't often use a namespace to store these intermediary variables

Hm? Most python code is within a function, in my experience.

bko 1992 days ago

You can abstract it out to a function but I think it's overkill, even if you generalize to something like print_area_filter(heights, widths, value, cmp) or whatever.

If it's not in a function, your example may create a floating variable called area out there (or may not, if either a or b has a length of zero).
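The scoping behaviour under discussion can be sketched as follows. `last_area` is a hypothetical helper written for illustration only. A `for` loop does not open a new scope in Python, so the name assigned in the body outlives the loop, unless the body never ran.

```python
def last_area(a, b):
    for x, y in zip(a, b):
        area = x * y
    # The loop did not create its own scope, so `area` is still visible here...
    try:
        return area
    except UnboundLocalError:
        # ...unless the loop body never executed because an input was empty.
        return None


non_empty = last_area([1, 2, 3], [4, 5, 6])
empty = last_area([], [4, 5, 6])
```

Inside a function the leftover name disappears when the function returns. The same loop at module level would leave `area` behind as a global, which is the case bko is worried about.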

lauriat 1992 days ago

I agree, and yes, the line may be a bit excessive. The idea of Arrays is not just to cram a heap of functions to a single line. The readability (at least to me) is improved even with e.g. a single map

  arr.map(func)

vs.

  list(map(func, arr))

snicker7 1992 days ago

> assigning a descriptive identifier to each step

Working with data scientists, in practice, these identifiers are usually "arr1", "arr2", &c. I'd rather have method chaining. Often the intermediates are not meaningful.

disgruntledphd2 1992 days ago

I agree with you in general, people (especially data scientists) are bad at naming things.

It's probably the core skill of good programmers though, so it should be taught more. I don't think anyone sets out to use misleading names, but it's easy for name and code to diverge, and it's crippling to readability.

However, often when refactoring/updating such data scientist code (or even understanding), I need to break apart the long method chains, and this is much, much more annoying than dealing with crummy names.

At least I can print the values associated with the names, which is not easily possible in the really long method chain.

derwiki 1992 days ago

Code is read more often than it’s written; optimize for reading.

dragonwriter 1992 days ago

> As tempting as it is to do something super clever and cram a ton of functionality into a small number of lines or characters (it does feel good), it's just better to be a bit more verbose and write simple, obvious code.

I find fluent style often clearer as well as more terse than with superfluous intermediate variables. Verbosity isn't the same thing as clarity.

(But in Python, comprehensions/genexps are often clearer than either.)

ElevenPhonons 1992 days ago

Are these really the same?

The idiomatic Python 3 version uses generators to compose the computation and to avoid unnecessary memory allocations. Does funct.Array also do this?

- https://docs.python.org/3/library/functions.html#map
- https://docs.python.org/3/library/functions.html#filter
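For readers unfamiliar with the laziness being referred to, here is a minimal sketch of how the built-in `map` defers work until elements are requested. The `double` function and the `calls` list are illustrative only; the sketch says nothing about how funct.Array behaves.

```python
calls = []


def double(x):
    calls.append(x)  # record every time the function actually runs
    return x * 2


lazy = map(double, [1, 2, 3])     # nothing has been computed yet
calls_before = list(calls)

first = next(lazy)                # computes only the first element
calls_after_first = list(calls)

rest = list(lazy)                 # drains the remaining elements
```

An eager container would call `double` three times as soon as the map is applied and would allocate a full intermediate list at every step of a chain.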

6gvONxR4sf7o 1992 days ago

You can split the a.b.c.d onto different lines and comment each, which is a decent middle ground sometimes (a\n.b\n.c\n.d). A problem, still, is exceptions and debugging. You get paged and see that something went wrong in that expression that does so many different things, and it’s much more frustrating to track down the bug. It makes step debugging trickier too. I’d love better error message/debugger support for that kind of programming.

rowanG077 1992 days ago

I disagree with this. Splitting this simple pipeline into more variables makes stuff a lot less readable. Splitting it into variables would very clearly indicate to me the intermediate computations are used elsewhere. Which wouldn't be the case here.

Phemist 1992 days ago

This feels like a strawman example. I feel like list comprehension results in a much more readable example here. I think, at least.

> all(func3(a) for h,w in zip(a,b) for a in func1(h,w) if func2(a))

lauriat 1992 days ago

Fair enough. Readability is subjective but I understand the sentiment. Constructing list comprehensions of such long chained expressions can be rather tedious and error prone, though (as your example shows).
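For reference, one way to write the pipeline as a single generator expression that matches the nested `map`/`filter` version is to bind the intermediate value with a one-element list. This is only a sketch: `func1`, `func2`, and `func3` are given hypothetical bodies here so the two forms can be compared.

```python
def func1(pair):      # hypothetical: area of a (height, width) pair
    h, w = pair
    return h * w


def func2(area):      # hypothetical: keep areas above 10
    return area > 10


def func3(area):      # hypothetical: final predicate
    return area % 2 == 0


def nested(a, b):
    return all(map(func3, filter(func2, map(func1, zip(a, b)))))


def single(a, b):
    # `for area in [func1(pair)]` binds the intermediate result exactly once.
    return all(
        func3(area)
        for pair in zip(a, b)
        for area in [func1(pair)]
        if func2(area)
    )
```

The quoted comprehension instead iterates over the return value of `func1` and reuses the name `a` for two different things, which is the kind of slip lauriat is pointing at.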

Immortal333 1992 days ago

Chaining has its own benefits. But I think this doesn't fit the definition of "Pythonic". Again, "Pythonic" is highly debatable. But, You can always break down big chain of operations, into smaller chain using good variable naming in-between.

Many operations on lists are implemented as iterators in Python, like filter and groupby. Looking at your code, it looks like you're not doing lazy computation (correct me if I'm wrong). This could have a huge performance impact, depending on the use case.

lauriat 1992 days ago

I understand the unpythonic nature of Arrays may startle some hardcore pythonistas, but the ability to chain functions was one of the main reasons why I wrote the package as I find nested function calls ugly and sometimes rather hard to decipher.

Regarding the performance, Arrays aren't meant to be super high performing but rather a simple way to manipulate sequences. For the best performance you should go with generic python, toolz or other.

nerdponx 1992 days ago

I am with you on this. Personally, I would rather continue using Toolz (https://github.com/pytoolz/toolz), and contribute additional helper/utility methods to that library.

The whole point of some things being functions versus methods is that they are generic rather than specialized. The generic iterator protocol is probably the best feature about the Python language, and it's both a damn shame and bad design to not use it.

If you really wanted to make an improvement over built in lists, the thing to do would be to implement some kind of fully lazy "query planning" engine, like what Apache Spark has. Every method call registers a new method to be applied with the query planner, but does not execute it. Execution only occurs when you explicitly request it. That way you can effectively compile in efficient but readable code that takes multiple passes over the data into efficient operations internally that only make one pass, or at least fewer passes. This also naturally lends itself to parallelization/concurrency.
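A toy sketch of the "record now, execute later" idea described above. `LazyPipeline` is a hypothetical class invented for illustration; it does no real query planning or optimisation. It only shows that method calls can register steps and that all steps can then run in a single pass over the data when results are requested.

```python
class LazyPipeline:
    """Hypothetical sketch: records steps, runs them in one pass on demand."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, func):
        # Register the step; nothing is executed here.
        return LazyPipeline(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def __iter__(self):
        # Single pass: each item flows through every recorded step in order.
        for item in self._source:
            keep = True
            for kind, func in self._steps:
                if kind == "map":
                    item = func(item)
                elif not func(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        return list(self)


seen = []


def traced_square(x):
    seen.append(x)
    return x * x


pipeline = LazyPipeline(range(6)).map(traced_square).filter(lambda v: v % 2 == 0)
seen_before_collect = list(seen)   # still empty: building the chain ran nothing
result = pipeline.collect()
```

A real engine in the spirit of Spark or Dask would also inspect the recorded steps and reorder or merge them before running; that part is left out here.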

jmuhlich 1992 days ago

Dask does the lazy evaluation and query planning thing on numpy arrays and pandas dataframes, and can execute in parallel. It mimics most of their native interfaces which makes it a pretty easy drop-in.

https://docs.dask.org/en/latest/

feanaro 1992 days ago

> But, You can always break down big chain of operations, into smaller chain using good variable naming in-between.

I don't think so. Very frequently the intermediate values represent nothing in particular and naming them simply results in visual noise.

I think this is comparable to SQL or LINQ statements. Consider what those would look like if you had to name every intermediate value instead of being able to filter and group on-the-fly.

Of course you can make a mess out of those too, by building huge unreadable expressions, but that's also an extreme, similar to naming every intermediate step.

nerdponx 1992 days ago

I feel obligated to point out the existence of the "array" package in the Python standard library: https://docs.python.org/3/library/array.html

I'm sure the author is aware of it, but readers might not be.

lauriat 1992 days ago

That's why the A is capitalised ;)

jamespwilliams 1992 days ago

Looks cool.

    bool (__bool__) Returns whether all elements evaluate to True.

I’d be worried that this will trip people up who use the

    if l:
        print(l[0]) # or whatever

pattern
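The pitfall can be reproduced with a small stand-in class. `AllTruthyList` is hypothetical and is not funct's code; it only assumes the documented behaviour quoted above, namely that truthiness means "all elements evaluate to True".

```python
class AllTruthyList(list):
    """Hypothetical list subclass whose truthiness means 'all elements truthy'."""

    def __bool__(self):
        return all(self)


plain = [0, 1, 2]
custom = AllTruthyList([0, 1, 2])

plain_is_truthy = bool(plain)      # built-in list: truthy because it is non-empty
custom_is_truthy = bool(custom)    # falsy because it contains 0

# all() of an empty iterable is True, so an empty instance is truthy and
# `if l: print(l[0])` would raise IndexError under this definition.
empty_custom_is_truthy = bool(AllTruthyList())
```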

fantod 1992 days ago

To be fair, using "if something" in Python is pretty much always a good way to trip yourself up.

nemetroid 1992 days ago

I've yet to see a (popular) style guide recommend against "if something:".

fantod 1992 days ago

I don't really see how this is a reason not to carefully consider the type of an object when using truthiness.

pansa2 1992 days ago

PEP8 recommends using `if seq:` instead of more verbose alternatives like `if len(seq):`.

fantod 1992 days ago

If you know that "something" is a sequence, then it is the idiomatic thing to do. The point is that any time you rely on truthiness of a value, you need to think (sometimes quite carefully) about what type of value you're dealing with.

orf 1992 days ago

For good reason - len(something) or alternatives might be expensive to compute, bool(something) is actually what you are trying to do and can be optimised depending on the container.

fantod 1992 days ago

For the basic sequence types (list, tuple, and range), len is definitely not expensive to compute. For custom types, it will depend on your implementation of __len__ (but then, computation of bool(...) will also depend on your implementation of __bool__).

orf 1992 days ago

Yes, len() with the basic types is not expensive, and it can vary based on the container implementation, but that’s not really the point.

The reason you should use “if x” is the same reason you should use “if x not in y” rather than “if not x in y”. It better expresses the semantics of your operation with the side effect that it may be faster.

lauriat 1992 days ago

Thanks!

Good point. However setting

  def __bool__(self): return self.nonEmpty

would mess up certain methods e.g. .index for nested Arrays as __eq__ is computed elementwise and bool(Array(False, False)) would evaluate to True.

Maybe a warning would be appropriate? (as is the case with ndarrays)

pansa2 1992 days ago

> bool(Array(False, False)) would evaluate to True

Isn't that consistent with the built-in `list`, though, because `bool([False, False])` is True?

lauriat 1992 days ago

My explanation was pretty poor, let me rephrase

For example, when calling

  Array((x, y), (z, w)).index((z, w))

the following piece of code is executed

  bool(Array((x, y)).__eq__((z, w)))
  = bool(Array(False, False))

If __bool__ returned whether the Array is nonempty, bool(Array(False, False)) would evaluate to True and the method would wrongly return 0.

You're right that it would be more clear if __bool__ would behave similarly, but since Array computes operations element-wise, it isn't possible.

wodenokoto 1991 days ago

I think the complaint is that you are using `bool` as an `all()` call on your Arrays.

If you used `all()` in your implementation instead, you could be compatible with the idiomatic use of `bool(my_list)` and the _very_ common `if my_list:` structure could be used with Arrays too (like most people probably would expect from a "better list type")

Regardless, Pandas struggles with the same problem, so you are at least in good company :)
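A sketch of what wodenokoto's suggestion could look like: keep `==` element-wise, but have `index` reduce the comparison with an explicit `all()` so that `bool()` can keep the usual "non-empty" meaning. `ElementwiseList` is a hypothetical class, not funct's implementation, and it assumes the searched value has the same length as the items it is compared against.

```python
class ElementwiseList(list):
    """Hypothetical sketch: element-wise ==, list-like truthiness."""

    def __eq__(self, other):
        # Element-wise comparison that returns a list of booleans.
        return ElementwiseList(x == y for x, y in zip(self, other))

    def index(self, value):
        for i, item in enumerate(self):
            result = item == value
            if isinstance(result, ElementwiseList):
                # Reduce explicitly with all() instead of relying on bool().
                matches = all(result)
            else:
                matches = bool(result)
            if matches:
                return i
        raise ValueError(f"{value!r} is not in list")


nested = ElementwiseList([ElementwiseList([1, 2]), ElementwiseList([3, 4])])
position = nested.index([3, 4])

# Truthiness is inherited from list, so only emptiness matters.
non_empty_is_truthy = bool(ElementwiseList([False, False]))
```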

orf 1992 days ago

> all(map(func3, filter(func2, map(func1, zip(a, b)))))

You definitely wouldn’t do this in “traditional Python”. You’d use a comprehension of some kind, or even the walrus operator, which is quite possibly faster and more readable than several chained lambdas.

lauriat 1992 days ago

Fair enough, the example is a bit exaggerated. You could implement it with comprehensions

  all(func3(y) for y in (func1(x) for x in zip(a, b)) if func2(y))

It most likely is a bit faster, but I wouldn't say it's more readable.

aldanor 1992 days ago

Why the JS-like naming, weird method naming conventions with strange underscores, and capitalized module name? I can't remember a single commonly used Python library with naming this strange.

E.g.

    def removeByIndex(self, b):
        """ Removes the value at specified index or indices. """
        ...

    def removeByIndex_(self, b):
        """ Removes the value at specified index or indices in-place. """
        ...

If you were to follow typical naming conventions, these would be either

    def remove_by_index(self, b): ...
    def remove_by_index_inplace(self, b): ...

Or pandas-like:

    def remove_by_index(self, b, inplace=False): ...

Or, one more step, use explicit typing as well (which also makes it more clear that the method returns self), and give a better name to the method argument rather than 'b':

    from typing import Iterable, Union

    def remove_by_index(
        self,
        index: Union[int, Iterable[int]],
        inplace: bool = False,
    ) -> 'Array': ...

Explicit type signatures in libraries like this make many things self-explanatory, like the one above.

goodside 1992 days ago

I know this is an early/experimental project, but the README could use more motivation before diving into basic usage. Asking someone to change their general-purpose containers is a big ask.

It looks like Array mostly consolidates functional features already available in standard libraries, and the main innovation is a redesigned swiss-army-knife API.

Good APIs are important, but my instinct is they aren’t this important. Using enhanced versions of built-in container types sounds nice, but do you really want to be keeping track of whether something is a normal list or an Array? Do you want to force people who read your code to learn this library to work with something as fundamental as lists? It’s not an impossible bar to clear (e.g. NumPy, Pandas, Dask, xarray) but it’s a high one.

lauriat 1992 days ago

Thanks for the feedback! Redesigned swiss-army-knife is well put.

I’m sure Array’s not for everyone, but for some, including me, it’s a nifty tool. I don’t expect people to memorise all the features of the library - the aim was to name and document each feature clearly such that finding the right method would be easy with the help of an IDE.

RocketSyntax 1992 days ago

Thank you for improving things and sharing.

I use numpy & pandas, lists & dicts every day. I read your docs/github page, but can you help me see the value?

However, I do think there are lots of common tasks that need to be done with lists that should be methods rather than fancy footwork =)

For example: https://stackoverflow.com/questions/3462143/get-difference-b...

As you allude to w your zip loop: https://stackoverflow.com/questions/1919044/is-there-a-bette...

lauriat 1992 days ago

Thank you for taking the time to check it out!

Naturally if you're dealing with big arrays/tensors, numpy is the best choice for operating on sequences.

However, ndarrays have downsides for certain use cases: as ndarrays are fixed size, adding elements is very slow. They also don't support functional methods (or rather you have to create a new array every time you apply e.g. a map), and ndarrays of any type other than numbers don't really make sense.

Many of the methods are wrappers for built-ins, but I find the syntax of Arrays cleaner than the weirdness of the builtins.

For example, while applying an async "starmap" to an Array is just a method call, with built-in lists you would have to go through the whole hassle of importing both ThreadPoolExecutor and starmap, creating an executor, scheduling the function, and finally converting the result back to a list.
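For comparison, a sketch of the standard-library route alluded to above. `to_area` and `pairs` are made-up example inputs, and this does not claim to mirror how Array implements its async methods.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import starmap


def to_area(height, width):   # hypothetical example function
    return height * width


pairs = [(1, 4), (2, 5), (3, 6)]

# Sequential starmap, converted back to a list.
sequential = list(starmap(to_area, pairs))

# Threaded version: Executor.map takes one iterable per positional argument,
# so the pairs are unzipped with zip(*pairs). Results keep the input order.
with ThreadPoolExecutor(max_workers=2) as executor:
    threaded = list(executor.map(to_area, *zip(*pairs)))
```

For pure-Python functions like this one the threads mostly take turns on the interpreter lock, so the benefit shows up mainly for I/O-bound work.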

lunixbochs 1992 days ago

asyncmap using a thread pool with more than one worker by default is a little silly. Unless you map to a C function, you're just spawning a bunch of threads to contend for the GIL anyway.

RocketSyntax 1992 days ago

ndarrays "create a new array every time you apply"

That resonates with me now that you explain that I can't do it.

I do like chaining things in pandas like `df.select_dtypes("float").head(100).plot.hist()`

nerdponx 1992 days ago

Be careful though. Numpy and Pandas go through some trouble to make sure that the data inside the array is not actually copied. For instance, reshaping and slicing just return memory views. Pandas emits a somewhat-infamous warning about it that often confuses newbies.

lunixbochs 1992 days ago

As this doesn't use `__slots__`, every empty Array() will be 176 bytes vs the 56 bytes of [], and incur a dict allocation per array.

This is due to classes without `__slots__` gaining a `__dict__` attribute for dynamic attribute assignment.

Currently:

    >>> sys.getsizeof([])
    56

    >>> a = Array()
    >>> sys.getsizeof(a)
    72
    >>> sys.getsizeof(a.__dict__)
    104

with `__slots__ = []` in the Array class definition:

    >>> a = Array()
    >>> sys.getsizeof(a)
    56
    >>> sys.getsizeof(a.__dict__)
    AttributeError: 'Array' object has no attribute '__dict__'

njharman 1992 days ago

> all(map(func3, filter(func2, map(func1, zip(a, b)))))

That is super readable to me. Working left to right or inside out. There is one clear, balanced, familiar, consistently used punctuation to guide you, parens, if you need it, but it adds little noise if you don't.

The “bunch of functions taking and returning an iterator” is a great paradigm. So clean, flexible, and powerful, especially combined with Python’s “many things are iterable” and how trivial it is to write your own iterator.

faitswulff 1992 days ago

I come from Ruby, but it’s pretty unreadable to me. Not that I couldn’t, I just don’t want to. So I doubt that it’s objectively super easy to read.

Any sort of reading inside out, right to left is a barrier to easy reading. This is why people like pipes in functional languages, right? You just read it in one direction.

lifthrasiir 1992 days ago

I have used Python for decades (not so much nowadays, but still) and it is very unreadable for me. It's clear that it is a data pipeline but the input and filters are all in a wrong order, thus backtracking is required for reading. I have the same complaint about str.join.

asimjalis 1992 days ago

I find this really useful and plan to use it. Thanks for writing and sharing.

One use case for the chaining/FP style that I find particularly powerful is building out logic on the REPL. The chaining style allows me to incrementally grow my chain like a unix pipeline, see the results, use that to tweak the chain, until I finally have what I want.

This type of instantaneous feedback loop is both highly productive and also extremely fun.

lauriat 1992 days ago

cheers!

topper-123 1992 days ago

I'd like to have a chaining operator in Python, like R is getting. Then the example could be:

> a |> zip(b) |> map(func1) |> filter(func2) |> forall(func3)

The advantages would be that this would work with all lists/iterables, so no need to make a special types.

brundolf 1992 days ago

The hard part with this is it sort of requires currying once you have >1 arguments, or something equivalent. I suppose Python could carve out an implicit behavior where the first or last argument is what gets fed into, but that feels potentially confusing as the calling syntax is now "lying" to you. In JavaScript doing a proper currying style isn't too hard because of arrow-syntax, but using python's function definition syntax to make a curried function would be hideous (not to mention, the standard library isn't done that way). Maybe you could have a "curryify" higher-order function. Or, the final option would be to have an explicit "insert previous value here" syntax as a part of the pipeline syntax, which is something the JS proposal has played with. Makes things more verbose (|> double(#) instead of |> double), but is maximally flexible and minimally confusing.

In short: it's a lot more complicated than it seems, but I agree that this style makes this type of thing 1000x more readable.

housecarpenter 1989 days ago

Python does have a "partial" function which does currying:

  from functools import partial

  a |> partial(zip, b) |> partial(map, func1) |> partial(filter, func2) |> partial(forall, func3)

Obviously it's a bit more verbose than if the currying was done implicitly, but it's not too bad, I think. You could also import partial under a shorter name if you want.

partial does have an advantage over implicit currying in that you can use keyword arguments to neatly curry on a parameter other than the first, although this isn't properly utilized by Python because most of the built-in functions have place-based rather than keyword arguments. In languages with implicit currying you have to use anonymous function expressions or functions like flip (flip(f, x, y) = f(y, x)) to deal with this.
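A short sketch of the keyword-argument point. Strictly speaking `functools.partial` performs partial application, which is what the comment calls currying. The names `parse_binary` and `sort_by_length` are illustrative.

```python
from functools import partial

# Pre-fill a keyword parameter instead of the first positional one.
parse_binary = partial(int, base=2)
five = parse_binary("101")

# sorted() accepts `key` as a keyword, so it can be fixed in advance too.
sort_by_length = partial(sorted, key=len)
words = sort_by_length(["ccc", "a", "bb"])
```

Each partial object is still a one-argument callable, so it would slot into a pipeline stage without needing `flip` or a lambda.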

It might also be worth noting that |> doesn't essentially need to be an operator, it would just be syntactic sugar:

  def chain(x, *fs):
      y = x
      for f in fs:
          y = f(y)
      return y

  chain(a, partial(zip, b), partial(map, func1), partial(filter, func2), partial(forall, func3))

Obviously having it as an infix operator is nicer, and produces fewer parentheses.

topper-123 1992 days ago

Just letting it implicitly be the first parameter would be good enough IMO, and a nice symmetry to `self` in methods. That'd be very simple, which would be a plus in my book.

Pandas allows the first param in a pipe to be a tuple[callable, str], where the second argument would signify the parameter location, e.g. `val |> (func, "param_name")` which gives some flexibility.

But yeah, if you open up to piping, there are a lot of possible choices to be made and easy to go overboard also IMO.

brian_herman 1992 days ago

Can you do the same thing with dicts and make it so d['non_existant_key'] does not create an exception?

basdftrewq 1992 days ago

    from collections import defaultdict

    d = defaultdict(int)

    d['non_existant_key']

st0le 1992 days ago

Not quite the same as what OP asked for. This will create the key and assign value 0 to it.

lauriat 1992 days ago

You can already do that with

  d.get("non_existant_key", default)
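The difference between the two suggestions can be checked directly. The sketch also includes a third option, a `dict` subclass that defines `__missing__`, for the case where `d[key]` itself should not raise. `QuietDict` is a hypothetical name, not part of any library mentioned in the thread.

```python
from collections import defaultdict

d = defaultdict(int)
value = d["missing"]                 # returns 0 and inserts the key
inserted = "missing" in d

plain = {}
fallback = plain.get("missing", 0)   # returns 0 without inserting anything
still_absent = "missing" not in plain


class QuietDict(dict):
    """Hypothetical dict subclass: missing keys yield None and are not stored."""

    def __missing__(self, key):
        return None


q = QuietDict(a=1)
quiet_value = q["missing"]           # no KeyError and no new key
quiet_absent = "missing" not in q
```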

notretarded 1992 days ago

Why would I use this over numpy?

lauriat 1992 days ago

If you're doing matrix multiplication or other math operations on fixed size sequences, you shouldn't.

If, however, you need the dynamic nature of the built-in list or functional methods with a touch of numpyness, you should give Array a spin.