by el_oni 1058 days ago
I found explorer quite frustrating. I've used polars in python and loved it, but I brought in some financial data and couldn't strip off "£" from the start of a string so I could go on to cast it to a number.

As far as I could tell I would have to bring that data into elixir, do the text processing and put it back into explorer, which to me defeated the whole point of a dataframe library.

I imagine it's good for precleaned data; using it with the built-in datasets has been fine.

3 comments

José from the Nx/Explorer team here.

Feel free to open up an issue! We have been focused more on high-level features (such as integration with S3, Postgres, Snowflake, SQLite, etc) and therefore we are missing many functions that already exist on Polars. Good news is that it is very quick to add them, so just let us know. :)

Thanks José!

Yeah I noticed that, it looks like the numeric manipulations are well represented, less so for strings.

I'll dig out the code later on and get an issue raised.

Seems that for that case you would need to use `mutate_with`, which as you say does pass data between systems.

My understanding is you can use any Series operations without penalty (ie. they get passed into some rust NIF call), https://hexdocs.pm/explorer/Explorer.Series.html#functions-s..., which does include whitespace trimming, but not arbitrary strings, but I imagine it wouldn't be too much of a jump to add arbitrary constant strings. Might just need to expose `str.slice` or `str.replace`.

The docs do imply that `mutate_with` operates lazily, so you only pay the transfer cost once per row, no matter how many mutations you're applying, but whether that's performant enough would depend on the case.

mutate_with is still lazy and therefore you can't transform it. If you need to use transform, then you need to do:

    new_column = Series.transform(df["column"], fn arg -> ... end)
    DF.put(df, "column", new_column)

which _is annoying_ since you are not supposed to use it. The correct way is to extend the Series API, which we will be very happy to do!

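For the "£" case from the top of the thread, the function passed to `Series.transform` could look roughly like the sketch below. It is plain Elixir with no Explorer dependency, so it runs on a list here. The module name `CurrencyClean`, the function `parse_gbp/1`, and the handling of thousands separators and unparseable values are all assumptions made for illustration; none of it is part of Explorer.

```elixir
defmodule CurrencyClean do
  # Hypothetical helper, not part of Explorer.
  # Strips a leading "£" and any "," separators, then parses a float.
  # Returns nil for nil input or for anything that is not a clean number.
  def parse_gbp(nil), do: nil

  def parse_gbp(value) when is_binary(value) do
    cleaned =
      value
      |> String.trim()
      |> String.trim_leading("£")
      |> String.replace(",", "")

    case Float.parse(cleaned) do
      # Accept only a full parse with nothing left over.
      {number, ""} -> number
      _ -> nil
    end
  end
end

# Made-up sample values standing in for the financial column.
raw = ["£12.50", "£1,200", nil, "n/a"]

cleaned = Enum.map(raw, &CurrencyClean.parse_gbp/1)
IO.inspect(cleaned)
# prints [12.5, 1200.0, nil, nil]
```

With Explorer loaded, the same function would presumably slot into the snippet above as `Series.transform(df["column"], &CurrencyClean.parse_gbp/1)`, at the cost of passing every value through Elixir.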
You can use str.slice or str.extract to clean the data:

https://pola-rs.github.io/polars/py-polars/html/reference/ex...

Yes, for Polars in Python. This was an issue I had with Explorer in Elixir. There is no Elixir binding to the str.slice or str.extract functions yet.