Hacker News new | ask | show | jobs
by FreeHugs 1175 days ago
The most important part of the article seems to be that this Python code is taking "an avg of 293.41ms per iteration":

        def find_close_polygons(
            polygon_subset: List[Polygon], point: np.array, max_dist: float
        ) -> List[Polygon]:
            close_polygons = []
            for poly in polygon_subset:
                if np.linalg.norm(poly.center - point) < max_dist:
                    close_polygons.append(poly)

            return close_polygons
And after replacing it with this Rust code, it is taking "an avg of 23.44ms per iteration":

        use pyo3::prelude::*;
        use ndarray_linalg::Norm;
        use numpy::PyReadonlyArray1;

        #[pyfunction]
        fn find_close_polygons(
            py: Python<'_>,
            polygons: Vec<PyObject>,
            point: PyReadonlyArray1<f64>,
            max_dist: f64,
        ) -> PyResult<Vec<PyObject>> {
            let mut close_polygons = vec![];
            let point = point.as_array();
            for poly in polygons {
                let center = poly
                    .getattr(py, "center")?
                    .extract::<PyReadonlyArray1<f64>>(py)?
                    .as_array()
                    .to_owned();

                if (center - point).norm() < max_dist {
                    close_polygons.push(poly)
                }
            }

            Ok(close_polygons)
        }
Why is the Rust version 13x faster than the Python version?
6 comments

Yeah but the Python code is so bad that it's easy to get a 10x speedup using only numpy, as well. The current code essentially does:

    import numpy as np

    n_sides = 30
    n_polygons = 10000

    class Polygon:
        def __init__(self, x, y):
            self.x = x
            self.y = y
            self.center = np.array([self.x, self.y]).mean(axis=1)


    def find_close_polygons(
        polygon_subset: List[Polygon], point: np.array, max_dist: float
    ) -> List[Polygon]:
        close_polygons = []
        for poly in polygon_subset:
            if np.linalg.norm(poly.center - point) < max_dist:
                close_polygons.append(poly)

        return close_polygons

    polygons = [Polygon(*np.random.rand(2, n_sides)) for _ in range(n_polygons)]
    point = np.array([0, 0])
    max_dist = 0.5

    %timeit find_close_polygons(polygons, point, max_dist)
(I've made up number of sides and number of polygons to get to the same order of magnitude of runtime; also I've pre-computed centers, as they are cached anyway in their code), which on my machine takes about 40ms to run. If we just change the function to:

    def find_close_polygons(
        polygon_subset: List[Polygon], point: np.array, max_dist: float
    ) -> List[Polygon]:
        centers = np.array([polygon.center for polygon in polygon_subset])
        mask = np.linalg.norm(centers - point[None], axis=1) < max_dist
        return [
            polygon
            for polygon, is_pass in zip(polygon_subset, mask)
            if is_pass
        ]
then the same computation takes 4ms on my machine.

Doing a Python loop of numpy operations is a _bad_ idea... The new code hardly even takes more space than the original one.

(as someone else mentioned in the comments, you can also directly use the sum of the squares rather than `np.linalg.norm` to avoid taking square roots and save a few microseconds more, but well, we're not in that level of optimization here)

Python's for loop implementation is slow, also. You can use built in utils like map() which are "native" and can be a lot faster than a for loop with a push:

https://levelup.gitconnected.com/python-performance-showdown...

Nope. Map() is same speed as for loop.

Benchmarking methodology in the link is not good. Author should use timeit() or cProfiler or so. 0.01s of difference is mostly due to fluctuation. The order of execution also matters. Say you want to test A and B function, you need actually to run A, B, B, A to see if the ordering brings the different.

Yeah I guess this isn't true anymore, it looks like maybe it was true in 2.6 days.
I immediately verified both claims.

list(map(func, arr)) did bring 10% benefits if the func is builtin e.g. int(), str().

But if func is tuple(), list(), set() or any kind of user defined function, list(map()) is always slower.

You can try yourself to see list(map()) is not working well:

    import numpy as np
    a = np.arrange(100000, 100000)
    %%timeit
    b1 = [np.sum(x) for x in a]
    # repeat once
    %%timeit
    b2 = list(map(np.sum, a))
    # repeat once
    import gc
    gc.collect()
    %%timeit
    b2 = list(map(np.sum, a))
    # repeat once
    b1 = [np.sum(x) for x in a]
    # repeat once
I guess that's why I only use map() if and only if is it the case 'list(map(itemgetter, arr))', because generally there is no benefit to use it.
thanks!
I don't think it's the loop implementation. The stuff in the loop should take multiple orders of magnitude more time than the loop itself:

    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)
I don’t know if numpy fixed this, but it used to be that mixing Python numbers with numpy in a tight loop is horribly slow. Try hoisting max_dist out of the loop and replacing it with max_dist_np that converts it to a numpy float once.
Speaking of this, I once find that

    for x in numpy.array: 
is 9X slower than

    for x in numpy.array.tolist():
in 2021.
Its not the looping itself that is slow in the article you linked, its that every element is appended to the list. If you use a list comprehension its even faster and it still loops over all elements of the list.
Here is the decompilation of the listcomp

    [x for x in range(5)]
:

    RESUME 0
    BUILD_LIST
    LOAD_FAST
    FOR_ITER 4
    STORE_FAST (x)
    LOAD_FAST (x)
    LIST_APPEND
    JUMP_BACKWARDS 5
    RETURN_VALUE
As you can see from the third last instruction, a listcomp does append individual elements to the list. What it doesn’t need to do is call a method to do so (let alone lookup the corresponding method).
No, AFAIK each for loop iteration appends and pops the stack in the interpreter, while map loops all entirely in the native implementation of the interpreter itself.
I was surprised that the Rust version is _only_ 13x as fast as the Python version.
Probably because it wasn't pure Python to start with.
Shouldn't `close_polygons` be presized in both Python and Rust to avoid repeated allocation and copy ?
One carries the entire feature set of the python runtime, the other is compiled.
The time is spent in this 3-line loop:

    for poly in polygon_subset:
        if np.linalg.norm(poly.center - point) < max_dist:
            close_polygons.append(poly)
I don't think the entire feature set of the Python runtime is involved in this.
Without using every feature you still have to conform to the complexity of the runtime. Every variable in that loop is a hash map lookup into the locals. `np.linalg.norm` is two field accesses, necessitating more hash map lookups on the module objects. `-` and `<` are attribute lookups as well as full function calls.
> Every variable in that loop is a hash map lookup into the locals.

No, it's a LOAD_FAST bytecode instruction. (The other stuff is mostly right, and probably contributes.)

The final code takes just 2.90ms per iteration.
The rest is not a fair comparison, because it rewrites the used libraries, not the application code.

You can always speed up an application if you rewrite the used libraries to match your specific use case.

It's a fair comparison if the purpose is to guide people in fixing performance issues in their python code.

"That Rust library will be faster than the corresponding python library" is a useful thing to know here.

Usually not by 10x though, unless the original implementation involved some really bad decisions.
The Rust code is still only brute force - using suitable persistent acceleration structures you can probably get a 10x again or maybe even 100x, in 2D a kd-tree is really fast for NNs.

So much faster that the allocations for the result will probably be the bottleneck.