The original is indeed terrible and the second version is a bit better. A lot better than either one, though, is splitting your logic into multiple lines and assigning a descriptive identifier to each step. Maybe even throw in some inline comments if you're particularly respectful of others' time.
As tempting as it is to do something super clever and cram a ton of functionality into a small number of lines or characters (it does feel good), it's just better to be a bit more verbose and write simple, obvious code. I feel like code should be read like a book, not a puzzle.
What I like about "cramming a ton of functionality into [a single expression]" is that it doesn't leak any intermediates to the rest of the block, and it doesn't allow for mutation. There's a single output exposed; you can't accidentally use the wrong value downstream. You could wrap it all in an inner function, I guess, but that seems like overkill unless you plan to reuse it.
Though to be fair, having explicit intermediate variables is idiomatic in Python, from what I've seen. It's one of my biggest pet-peeves about the language, but it's not without precedent.
This is exactly the main situation where I'll happily "get clever" with my code.
It's not being reused and one of the following is true...
I don't want to leave behind intermediary objects, for whatever reason is relevant, or I feel it's worth it to compress the logic to make it possible to use a language feature that requires an expression, like lambdas or list/dict comprehensions.
#Functional way
heights.zip(widths).map(to_area).filter(lambda area: area > 10).forall(lambda a: print("Area " + str(a)))
#Verbose way
hw_zipped = heights.zip(widths)
areas = hw_zipped.map(to_area)
big_areas = areas.filter(lambda a: a > 10)
for a in big_areas:
    print("Area " + str(a))
---
Which do you prefer? I would argue the right level of abstraction is the functional way in this example, and it's often the case in my experience, especially in Python, where you don't often use a namespace to store these intermediary variables and you can't rely on typing.
I agree. My main problem is I don't want intermediary variables floating around, especially something like "areas". If Python scoped variables to the enclosing block, I wouldn't mind.
In scala:
---
val widths = Seq(1,2,3)
val heights = Seq(4,5,6)
widths.zip(heights).foreach { case (w, h) => {
  val area = w * h
  if (area > 10) {
    println(s"Area: ${area}")
  }
}}
You can do this without the walrus in a one liner as well, I believe:
[area for area in (to_area(x, y) for x, y in zip(h, w)) if area > 10]
Or, more generally, you can take a multiline statement like the one you have and replace each named value with its expression. Add some indentation and it's not too bad:
[area for area in
(to_area(x, y) for x, y in zip(h, w))
if area > 10]
You can abstract it out to a function, but I think it's overkill, even if you generalize to something like print_area_filter(heights, widths, value, cmp) or whatever.
If it's not in a function, your example may create a floating variable called area out there (or may not, if either a or b has a length of zero).
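To make the scoping point concrete, here is a minimal sketch. The function names and sample values are made up for illustration; the behaviour is plain Python 3 scoping, where a `for` loop binds its names in the enclosing scope while a comprehension keeps its loop variables to itself.

```python
def loop_leaves_area(heights, widths):
    """Return True if `area` is still bound after the loop finishes."""
    for h, w in zip(heights, widths):
        area = h * w
    # Python has no block scope: whatever the loop body assigned stays
    # bound in the enclosing scope, provided the body ran at least once.
    return "area" in locals()


def comprehension_leaves_h(heights, widths):
    """Return True if the comprehension's loop variable `h` escaped."""
    areas = [h * w for h, w in zip(heights, widths)]
    return "h" in locals()


print(loop_leaves_area([1, 2], [3, 4]))        # True
print(loop_leaves_area([], []))                # False: the body never ran
print(comprehension_leaves_h([1, 2], [3, 4]))  # False
```

The same thing happens at module level, where the leftover name ends up as a global rather than a function local.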
I agree, and yes, the line may be a bit excessive. The idea of Arrays is not just to cram a heap of functions to a single line. The readability (at least to me) is improved even with e.g. a single map
Working with data scientists, in practice, these identifiers are usually "arr1", "arr2", &c. I'd rather have method chaining. Often the intermediates are not meaningful.
I agree with you in general, people (especially data scientists) are bad at naming things.
It's probably the core skill of good programmers though, so it should be taught more. I don't think anyone sets out to use misleading names, but it's easy for name and code to diverge, and it's crippling to readability.
However, often when refactoring/updating such data scientist code (or even understanding), I need to break apart the long method chains, and this is much, much more annoying than dealing with crummy names.
At least I can print the values associated with the names, which is not easily possible in the really long method chain.
> As tempting as it is to do something super clever and cram a ton of functionality into a small number of lines or characters (it does feel good), it's just better to be a bit more verbose and write simple, obvious code.
I find fluent style often clearer as well as more terse than with superfluous intermediate variables. Verbosity isn't the same thing as clarity.
(But in Python, comprehensions/genexps are often clearer than either.)
You can split the a.b.c.d onto different lines and comment each, which is a decent middle ground sometimes (a\n.b\n.c\n.d). A problem, still, is exceptions and debugging. You get paged and see that something went wrong in that expression that does so many different things, and it’s much more frustrating to track down the bug. It makes step debugging trickier too. I’d love better error message/debugger support for that kind of programming.
I disagree with this. Splitting this simple pipeline into more variables makes stuff a lot less readable. Splitting it into variables would very clearly indicate to me the intermediate computations are used elsewhere. Which wouldn't be the case here.
Fair enough. Readability is subjective but I understand the sentiment. Constructing list comprehensions of such long chained expressions can be rather tedious and error prone, though (as your example shows).
Chaining has its own benefits. But I think this doesn't fit the definition of "Pythonic". Again, "Pythonic" is highly debatable. But, You can always break down big chain of operations, into smaller chain using good variable naming in-between.
Many operations on lists, like filter and groupby, are implemented as iterators in Python.
Looking at your code, it looks like you're not doing lazy computation (correct me if I'm wrong). This could have a huge performance impact, depending on the use case of the list.
I understand the unpythonic nature of Arrays may startle some hardcore Pythonistas, but the ability to chain functions was one of the main reasons why I wrote the package, as I find nested function calls ugly and sometimes rather hard to decipher.
Regarding performance, Arrays aren't meant to be super high-performing but rather a simple way to manipulate sequences. For the best performance you should go with generic Python, toolz or similar.
I am with you on this. Personally, I would rather continue using Toolz (https://github.com/pytoolz/toolz), and contribute additional helper/utility methods to that library.
The whole point of some things being functions versus methods is that they are generic rather than specialized. The generic iterator protocol is probably the best feature about the Python language, and it's both a damn shame and bad design to not use it.
If you really wanted to make an improvement over built in lists, the thing to do would be to implement some kind of fully lazy "query planning" engine, like what Apache Spark has. Every method call registers a new method to be applied with the query planner, but does not execute it. Execution only occurs when you explicitly request it. That way you can effectively compile in efficient but readable code that takes multiple passes over the data into efficient operations internally that only make one pass, or at least fewer passes. This also naturally lends itself to parallelization/concurrency.
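A very rough sketch of the "record now, execute later" idea, to show the shape of it. `LazyPipeline` and `collect` are hypothetical names, and this is nowhere near a real query planner: it only fuses recorded map/filter steps into a single pass over the source.

```python
class LazyPipeline:
    """Records map/filter steps and runs them only when collect() is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, func):
        # Nothing is computed here; the step is only registered.
        return LazyPipeline(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyPipeline(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        """Apply every recorded step in one pass over the source."""
        results = []
        for item in self._source:
            keep = True
            for kind, func in self._steps:
                if kind == "map":
                    item = func(item)
                elif not func(item):  # a filter step rejected the item
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


pipeline = LazyPipeline(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(pipeline.collect())  # [0, 4, 16, 36, 64]
```

A real planner would also reorder or merge steps and could split the work across workers; this sketch only demonstrates deferred execution.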
Dask does the lazy evaluation and query planning thing on numpy arrays and pandas dataframes, and can execute in parallel. It mimics most of their native interfaces which makes it a pretty easy drop-in.
> But, You can always break down big chain of operations, into smaller chain using good variable naming in-between.
I don't think so. Very frequently the intermediate values represent nothing in particular and naming them simply results in visual noise.
I think this is comparable to SQL or LINQ statements. Consider what those would look like if you had to name every intermediate value instead of being able to filter and group on-the-fly.
Of course you can make a mess out of those too, by building huge unreadable expressions, but that's also an extreme, similar to naming every intermediate step.
If you know that "something" is a sequence, then `if something:` is the idiomatic thing to do. The point is that any time you rely on truthiness of a value, you need to think (sometimes quite carefully) about what type of value you're dealing with.
For good reason - len(something) or alternatives might be expensive to compute, bool(something) is actually what you are trying to do and can be optimised depending on the container.
For the basic sequence types (list, tuple, and range), len is definitely not expensive to compute. For custom types, it will depend on your implementation of __len__ (but then, computation of bool(...) will also depend on your implementation of __bool__).
Yes, len() with the basic types is not expensive, and it can vary based on the container implementation, but that’s not really the point.
The reason you should use “if x” is the same reason you should use “if x not in y” rather than “if not x in y”. It better expresses the semantics of your operation with the side effect that it may be faster.
I think the complaint is that you are using `bool` as an `all()` call on your Arrays.
If you used `all()` in your implementation instead, you could be compatible with the idiomatic use of `bool(my_list)` and the _very_ common `if my_list:` structure could be used with Arrays too (like most people probably would expect from a "better list type")
Regardless, Pandas struggles with the same problem, so you are at least in good company :)
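A small sketch of the difference, using two hypothetical list subclasses (neither is the library's actual class): one overloads truthiness to mean "all elements are truthy", the other leaves truthiness alone and exposes an explicit `all()` method.

```python
class AllBoolList(list):
    """Hypothetical: truthiness means 'every element is truthy'."""

    def __bool__(self):
        return all(self)


class AllMethodList(list):
    """Hypothetical: truthiness stays 'non-empty'; all() is an explicit method."""

    def all(self):
        return all(self)


print(bool(AllBoolList([0, 1])))     # False, although the list is not empty
print(bool(AllBoolList([])))         # True, because all([]) is True
print(bool(AllMethodList([0, 1])))   # True: the usual "non-empty" meaning
print(bool(AllMethodList([])))       # False
print(AllMethodList([0, 1]).all())   # False
```

With the first variant, `if my_list:` silently stops meaning "is there anything in here", and an empty container even counts as truthy.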
You definitely wouldn’t do this in “traditional Python”. You’d use a comprehension of some kind, or even the walrus operator, which is quite possibly faster and more readable than several chained lambdas.
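For comparison, a sketch of the walrus version next to the chained-lambda version with plain built-ins. `to_area` and the sample lists are assumptions for illustration, and the walrus operator needs Python 3.8 or later.

```python
def to_area(height, width):
    return height * width


heights = [1, 2, 3]
widths = [4, 5, 6]

# Comprehension with the walrus operator: compute each area once,
# filter on it, and keep it.
with_walrus = [
    area
    for h, w in zip(heights, widths)
    if (area := to_area(h, w)) > 10
]

# The same pipeline with built-in map/filter and lambdas.
with_lambdas = list(
    filter(lambda a: a > 10, map(lambda pair: to_area(*pair), zip(heights, widths)))
)

print(with_walrus)   # [18]
print(with_lambdas)  # [18]
```

One caveat: a name bound with `:=` inside a comprehension lands in the enclosing scope, so `area` is still visible afterwards.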
Why the JS-like naming, weird method naming conventions with strange underscores, and capitalized module name? I can't remember a single commonly used Python library with naming this strange.
E.g.
def removeByIndex(self, b):
    """ Removes the value at specified index or indices. """
    ...
def removeByIndex_(self, b):
    """ Removes the value at specified index or indices in-place. """
    ...
If you were to follow typical naming conventions, these would be either
Or, one more step, use explicit typing as well (which also makes it more clear that the method returns self), and give a better name to the method argument rather than 'b':
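A sketch of what a typed, PEP 8 style version might look like. The names `remove_by_index` and `remove_by_index_inplace`, the `indices` parameter, and the toy `Array` class are assumptions for illustration, not the library's actual API; negative indices are not handled.

```python
from __future__ import annotations

from typing import Iterable, Union


class Array(list):
    """Toy stand-in for the library's Array, only to show names and type hints."""

    def remove_by_index(self, indices: Union[int, Iterable[int]]) -> Array:
        """Return a new Array without the value(s) at the given index or indices."""
        drop = {indices} if isinstance(indices, int) else set(indices)
        return Array(v for i, v in enumerate(self) if i not in drop)

    def remove_by_index_inplace(self, indices: Union[int, Iterable[int]]) -> Array:
        """Remove the value(s) at the given index or indices in-place; return self."""
        drop = {indices} if isinstance(indices, int) else set(indices)
        self[:] = [v for i, v in enumerate(self) if i not in drop]
        return self


values = Array([10, 20, 30, 40])
print(values.remove_by_index([0, 2]))     # [20, 40]
print(values.remove_by_index_inplace(1))  # [10, 30, 40]
```

The return annotation makes it visible at a glance that both methods hand back an Array, and the in-place one hands back the same object.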
I know this is an early/experimental project, but the README could use more motivation before diving into basic usage. Asking someone to change their general-purpose containers is a big ask.
It looks like Array mostly consolidates functional features already available in standard libraries, and the main innovation is a redesigned swiss-army-knife API.
Good APIs are important, but my instinct is they aren’t this important. Using enhanced versions of built-in container types sounds nice, but do you really want to be keeping track of whether something is a normal list or an Array? Do you want to force people who read your code to learn this library to work with something as fundamental as lists? It’s not an impossible bar to clear (e.g. NumPy, Pandas, Dask, xarray) but it’s a high one.
Thanks for the feedback! Redesigned swiss-army-knife is well put.
I’m sure Array’s not for everyone, but for some, including me, it’s a nifty tool.
I don’t expect people to memorise all the features of the library - the aim was to name and document each feature clearly such that finding the right method would be easy with the help of an IDE.
Naturally if you're dealing with big arrays/tensors, numpy is the best choice for operating on sequences.
However, ndarrays have downsides for certain use cases: as ndarrays are fixed size, adding elements is very slow. They also don't support functional methods (or rather, you have to create a new array every time you apply e.g. a map), and ndarrays of any type other than numbers don't really make sense.
Many of the methods are wrappers for built-ins, but I find the syntax of Arrays cleaner than the weirdness of the builtins.
For example, while applying an async "starmap" to an Array is just a method call, with built-in lists you would have to go through the whole hassle of importing both ThreadPoolExecutor and starmap, creating an executor, scheduling the function, and finally converting the result back to a list.
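For reference, the built-in route looks roughly like this. It is a minimal sketch: `to_area` and the sample pairs are made up, and I am not claiming this is how the library implements its async starmap.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import starmap


def to_area(height, width):
    return height * width


pairs = [(1, 4), (2, 5), (3, 6)]

with ThreadPoolExecutor(max_workers=1) as executor:
    # Schedule the whole starmap as one background task. list() forces
    # the lazy starmap iterator inside the worker thread.
    future = executor.submit(lambda: list(starmap(to_area, pairs)))
    areas = future.result()  # blocks until the task has finished

print(areas)  # [4, 10, 18]
```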
asyncmap using a thread pool with more than one worker by default is a little silly. Unless you map to a C function, you're just spawning a bunch of threads to contend for the GIL anyway.
Be careful though. Numpy and Pandas go through some trouble to make sure that the data inside the array is not actually copied. For instance, reshaping and slicing just return memory views. Pandas emits a somewhat-infamous warning about it that often confuses newbies.
That is super readable to me, working left to right or inside out. There is one clear, balanced, familiar, consistently used punctuation to guide you, parens, if you need it, but it adds little noise if you don't.
The “bunch of functions taking and returning an iterator” is a great paradigm. So clean, flexible, and powerful, especially combined with Python’s “many things are iterable”, and it is trivial to write your own iterator.
I come from Ruby, but it’s pretty unreadable to me. Not that I couldn’t, I just don’t want to. So I doubt that it’s objectively super easy to read.
Any sort of reading inside out, right to left is a barrier to easy reading. This is why people like pipes in functional languages, right? You just read it in one direction.
I have used Python for decades (not so much nowadays, but still) and it is very unreadable for me. It's clear that it is a data pipeline but the input and filters are all in a wrong order, thus backtracking is required for reading. I have the same complaint about str.join.
I find this really useful and plan to use it. Thanks for writing and sharing.
One use case for the chaining/FP style that I find particularly powerful is building out logic on the REPL. The chaining style allows me to incrementally grow my chain like a unix pipeline, see the results, use that to tweak the chain, until I finally have what I want.
This type of instantaneous feedback loop is both highly productive and also extremely fun.
The hard part with this is it sort of requires currying once you have >1 arguments, or something equivalent. I suppose Python could carve out an implicit behavior where the first or last argument is what gets fed into, but that feels potentially confusing as the calling syntax is now "lying" to you. In JavaScript doing a proper currying style isn't too hard because of arrow-syntax, but using python's function definition syntax to make a curried function would be hideous (not to mention, the standard library isn't done that way). Maybe you could have a "curryify" higher-order function. Or, the final option would be to have an explicit "insert previous value here" syntax as a part of the pipeline syntax, which is something the JS proposal has played with. Makes things more verbose (|> double(#) instead of |> double), but is maximally flexible and minimally confusing.
In short: it's a lot more complicated than it seems, but I agree that this style makes this type of thing 1000x more readable.
Python does have a "partial" function which gets you most of the way there (strictly speaking it does partial application rather than currying):
from functools import partial
a |> partial(zip, b) |> partial(map, func1) |> partial(filter, func2) |> partial(forall, func3)
Obviously it's a bit more verbose than if the currying was done implicitly, but it's not too bad, I think. You could also import partial under a shorter name if you want.
partial does have an advantage over implicit currying in that you can use keyword arguments to neatly curry on a parameter other than the first, although Python doesn't take full advantage of this because most of the built-in functions take positional rather than keyword arguments. In languages with implicit currying you have to use anonymous function expressions or functions like flip (flip(f, x, y) = f(y, x)) to deal with this.
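A quick sketch of both halves of that point, using only the standard library (the variable names are mine): `partial` can fix a keyword parameter that is not the first one, but that doesn't help with built-ins like `map`, which won't take their iterable by keyword.

```python
from functools import partial

# Fix a keyword parameter that is not the first positional one.
parse_binary = partial(int, base=2)
print(parse_binary("101"))  # 5

longest_first = partial(sorted, key=len, reverse=True)
print(longest_first(["bb", "a", "ccc"]))  # ['ccc', 'bb', 'a']

# map() does not accept its iterable as a keyword argument, so this
# partial can be built but fails as soon as it is called.
map_over_numbers = partial(map, iterable=[1, 2, 3])
try:
    map_over_numbers(str)
except TypeError as error:
    print("TypeError:", error)
```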
It might also be worth noting that |> doesn't essentially need to be an operator, it would just be syntactic sugar:
def chain(x, *fs):
    y = x
    for f in fs:
        # feed the previous result (y), not the original input (x), into each step
        y = f(y)
    return y
chain(a, partial(zip, b), partial(map, func1), partial(filter, func2), partial(forall, func3))
Obviously having it as an infix operator is nicer, and produces less parentheses.
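Here is a runnable take on that, as a sketch. `to_area` and the sample lists are assumptions for illustration, and the pipeline ends in `list` so there is something to print.

```python
from functools import partial


def chain(value, *funcs):
    """Feed value through each function in turn, left to right."""
    for func in funcs:
        value = func(value)
    return value


def to_area(pair):
    first, second = pair
    return first * second


heights = [1, 2, 3]
widths = [4, 5, 6]

big_areas = chain(
    heights,
    partial(zip, widths),                     # zip(widths, heights)
    partial(map, to_area),                    # one area per pair
    partial(filter, lambda area: area > 10),  # keep the big ones
    list,                                     # materialize the lazy result
)
print(big_areas)  # [18]
```

Note that `partial(zip, widths)` puts the piped value second, which is the argument-position question discussed above.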
Just letting it implicitly be the first parameter would be good enough IMO, and a nice symmetry to `self` in methods. That'd be very simple, which would be a plus in my book.
Pandas allows the first argument of a pipe to be a tuple[callable, str], where the string names the parameter the piped value should be passed as, e.g. `val |> (func, "param_name")`, which gives some flexibility.
But yeah, if you open up to piping, there are a lot of possible choices to be made, and it's easy to go overboard too, IMO.
> all(map(func3, filter(func2, map(func1, zip(a, b)))))
> a.zip(b).map(func1).filter(func2).forall(func3)