by Matumio 3137 days ago
For comparison, I just implemented the same thing as a C SWIG extension [1]. It's about 10% faster, but it's cheating by comparing bytes instead of UTF-8-encoded characters. The more interesting part to me is the comparison of the amount of boilerplate code required.

[1] https://github.com/martinxyz/rust-python-example/commit/f8e3...

6 comments

One thing though that gets very complicated about using SWIG is ownership semantics. With anything more complicated than passing scalar values, it is very easy to introduce a memory leak or double-free if you don't get the flags right. I wonder if Rust types naturally allow a much better inference of ownership semantics across the language boundary?
If you try to wrap any non-trivial type using SWIG typemaps, you will quickly go insane. However, to speed up an inner loop you can often get away with a few PyObject* arguments/returns. SWIG will pass those through and you can use the Python/C API directly, e.g. to return a numpy array. Let SWIG handle only the simple types. The Python/C API is relatively sane, but you'll have to learn the reference counting conventions.
Agreed! On my current project (C++) I found that things get extremely complicated with shared_ptrs and directors. I even ended up contributing some solutions to SWIG.

It all appears to be due to a lack of semantics in the C header. SWIG depends on specifying this stuff in the interface file, but I've often wondered if it wouldn't be better to enhance the C-side, either by standard parameter name conventions or by some Doxygen-like standard comments to indicate ownership and other stuff.

SWIG has this nice potential to generate wrappers for (m)any language(s), but in practice as you said it's often just easier to use the Python API directly instead of trying to make it too general. Shame.

I have a C++ project that is available in multiple languages (Python, Rust, Fortran, Julia and JS), and I decided not to use SWIG partly because of this issue. Instead, I manually maintain a clean C API with everything that I need, and manually wrap this API using whatever is available in the other languages. It is a bit more work (and thus incentivizes me not to break the API ^^), but it allows me to mix and match various ownership semantics throughout the API.
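As a rough illustration of what "explicit ownership in a hand-written C API" can look like, here is a minimal sketch in Rust (the language the thread is about, not the commenter's C++ project). The `Counter` type and the `counter_*` function names are hypothetical. The convention assumed here is that whoever calls `counter_new` owns the pointer and must hand it back to `counter_free` exactly once. On toolchains older than Rust 1.82, the attribute is spelled `#[no_mangle]`.

```rust
/// Hypothetical opaque type; callers only ever see a pointer to it.
pub struct Counter {
    total: u64,
}

/// Allocates a Counter and transfers ownership to the caller.
#[unsafe(no_mangle)]
pub extern "C" fn counter_new() -> *mut Counter {
    Box::into_raw(Box::new(Counter { total: 0 }))
}

/// Borrows the Counter for the duration of the call; ownership stays
/// with the caller. Returns the new total, or 0 for a null pointer.
///
/// # Safety
/// `c` must be null or a live pointer obtained from `counter_new`.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn counter_add(c: *mut Counter, n: u64) -> u64 {
    match unsafe { c.as_mut() } {
        Some(counter) => {
            counter.total = counter.total.wrapping_add(n);
            counter.total
        }
        None => 0,
    }
}

/// Takes ownership back and frees the Counter. Null is ignored.
///
/// # Safety
/// `c` must be null or a pointer from `counter_new` that has not
/// already been freed.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn counter_free(c: *mut Counter) {
    if !c.is_null() {
        drop(unsafe { Box::from_raw(c) });
    }
}
```

Inside Rust the transfer is visible in the types (`Box::into_raw` gives ownership away, `Box::from_raw` takes it back), but the C signature still carries none of that, so each wrapper language has to follow the documented convention.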
> It's about 10% faster, but it's cheating by comparing bytes instead of utf-8 encoded characters.

I'm really glad that you acknowledged this — I work with a lot of non-ASCII text and have run into that more than a few times in real code.

I find pybind11 [1] to be perfect for my C++ code. There's so little boilerplate, and I get RAII-guaranteed memory safety and all the speed my C++ development can bring.

For example, the binding of an accelerated HyperLogLog implementation requires only a tiny amount of work, plus a line in my Makefile:

  #include <pybind11/pybind11.h>
  // hll_t comes from the library's own header (not shown).
  namespace py = pybind11;

  PYBIND11_MODULE(_hll, m) {
      m.doc() = "pybind11-powered HyperLogLog"; // optional module docstring
      py::class_<hll_t> (m, "hll")
          .def(py::init<size_t>())
          .def("clear", &hll_t::clear, "Clear all entries.")
          .def("resize", &hll_t::resize, "Change old size to a new size.")
          .def("sum", &hll_t::sum, "Add up results.")
          .def("report", &hll_t::report, "Emit estimated cardinality. Performs sum if not performed, but sum must be recalculated if further entries are added.")
          .def("add", &hll_t::add, "Add a (hashed) value to the sketch.")
          .def("addh_", &hll_t::addh, "Hash an integer value and then add that to the sketch.");
  }
[1] https://github.com/pybind/pybind11
If working in C++ land, I'd agree this is the nicest approach. It does, however, require linking against a specific libpython version [1], unlike Milksnake. But I'm not sure that's a bad thing...

[1] http://pybind11.readthedocs.io/en/master/basics.html#creatin...

True. It's not that bad, though -- "python3-config --extension-suffix" or "python-config --extension-suffix" is all it takes to generate the suffix you want, and you can drop it straight in your site-packages folder. At that point, you're just dropping the 3 or not depending on your version.

It's not as simple as milksnake. I would like to see some smarter extensions added to pybind11, but I'm okay with that for now.

I'm not familiar with Rust libraries, but I would guess it's just counting code points and not characters, so strictly speaking both are cheating.

I would love to see people show how to do simple string processing, like counting characters at the proper grapheme-cluster level, in their favorite programming language.

   for (c1, c2) in val.chars().zip(val.chars().skip(1))
chars() iterates by Unicode scalar values. It'd be bytes() for bytes.

If you wanted to do it by grapheme clusters, you'd add https://crates.io/crates/unicode-segmentation to your Cargo.toml, add the relevant imports you see on that page to your code, and change the above line to

   for (c1, c2) in UnicodeSegmentation::graphemes(val, true).zip(UnicodeSegmentation::graphemes(val, true).skip(1))
... possibly splitting that up into variables because dang, that's a long line.

Then you're getting &strs instead of chars for the iteration, but I think the body still stays the same, as == checks by value.
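To make the chars() versus bytes() distinction concrete, here is a minimal sketch. It assumes the loop body simply counts adjacent equal pairs; the two function names are made up for illustration.

```rust
/// Counts adjacent equal pairs, comparing Unicode scalar values.
fn count_doubled_chars(val: &str) -> usize {
    val.chars()
        .zip(val.chars().skip(1))
        .filter(|(c1, c2)| c1 == c2)
        .count()
}

/// Same loop, but comparing the raw UTF-8 bytes.
fn count_doubled_bytes(val: &str) -> usize {
    val.bytes()
        .zip(val.bytes().skip(1))
        .filter(|(b1, b2)| b1 == b2)
        .count()
}
```

For pure ASCII input the two agree. They diverge in both directions on other text: "éé" is one doubled pair by chars but none by bytes (C3 A9 C3 A9), while "H₂O" has no doubled chars but one doubled byte, because U+2082 encodes as E2 82 82.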

Another possibility for this kind of example, which I have used and which has the least boilerplate, is to just build a .so file with your native function in C and call it from ctypes. There's almost no boilerplate at all.
> with your native function in C

Works well this way with Rust too.
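A minimal sketch of what the Rust side of that might look like, assuming the function counts adjacent equal characters in a NUL-terminated UTF-8 string. The name `count_doubles` and the choice to return 0 on bad input are assumptions for illustration. On toolchains older than Rust 1.82, the attribute is spelled `#[no_mangle]`.

```rust
use std::ffi::{c_char, CStr};

/// Counts adjacent equal characters in a NUL-terminated UTF-8 string.
/// Returns 0 for a null pointer or for input that is not valid UTF-8.
///
/// # Safety
/// `s` must be null or point to a valid NUL-terminated string.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn count_doubles(s: *const c_char) -> u64 {
    if s.is_null() {
        return 0;
    }
    // Borrow the caller's buffer; nothing is copied or freed here.
    let val = match unsafe { CStr::from_ptr(s) }.to_str() {
        Ok(v) => v,
        Err(_) => return 0,
    };
    val.chars()
        .zip(val.chars().skip(1))
        .filter(|(c1, c2)| c1 == c2)
        .count() as u64
}
```

Built as a shared library (for instance with `crate-type = ["cdylib"]` in Cargo.toml), this should be loadable with `ctypes.CDLL`. On the Python side you would set `restype` to `ctypes.c_uint64` and pass the text as UTF-8 encoded bytes.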

pytest-benchmark really needs to default to a narrower width for its stats. Those stats are really just meant to be used in a terminal.