Hacker News new | ask | show | jobs
by rasz 207 days ago
At the very moment person in charge says "ok this works, now make it not slow". Python is modern age BASIC. Easy to write and good for prototypes, scripting, gluing together libraries, fast iterations. If you want performance and heavy data processing anything else will be better. PHP, Java, even JavaScript.

For example Python is struggling to reach real time performance decoding RLL/MFM data off of ancient 40 year old hard drives (https://github.com/raszpl/sigrok-disk). 4GHz CPU and I cant break 500KB/s in a simple loop:

    for i in range(len(data)):
      decoder.shift = ((decoder.shift << data[i]) + 1) & 0xffffffffff
      decoder.shift_index += data[i]
      if decoder.shift_index >= 16:
       decoder.shift_index -= 16
       decoder.shift_byte = (decoder.shift >> decoder.shift_index) & 0x5555
       decoder.shift_byte = (decoder.shift_byte + (decoder.shift_byte >> 1)) & 0x3333
       decoder.shift_byte = (decoder.shift_byte + (decoder.shift_byte >> 2)) & 0x0F0F
       decoder.shift_byte = (decoder.shift_byte + (decoder.shift_byte >> 4)) & 0x00FF
1 comments

To optimize that code snippet, use temporary variables instead of member lookups to avoid slow getattr and setattr calls. It still won’t beat a compiled language, number crunching is the worst sport for Python.
Which is why in Python in practice you pay the cost of moving your data to a native module (numpy/pandas/polars) and do all your number crunching over there and then pull the result back.

Not saying it's ideal but it's a solved problem and Python is eating good in terms of quality dataframe libraries.

All those class variables are already in __slots__ so in theory it shouldnt matter. Your advice is good

     self.shift_index -= 16
     shift_byte = (self.shift >> self.shift_index) & 0x5555
     shift_byte = (shift_byte + (shift_byte >> 1)) & 0x3333
     shift_byte = (shift_byte + (shift_byte >> 2)) & 0x0F0F
     self.shift_byte = (shift_byte + (shift_byte >> 4)) & 0x00FF


 but only for exactly 2-4 milliseconds per 1 million pulses :) Declaring local variable in a tight loop forces Python into a cycle of memory allocations and garbage collection negative potential gains :(

    SWAR                           :     0.288 seconds  ->    0.33 MiB/s
    SWAR local                     :     0.284 seconds  ->    0.33 MiB/s
This whole snipped is maybe what 50-100 x86 opcodes? Native code runs at >100MB/s while Python 3.14 struggles around 300KB/s. Python 3.4 (Sigrok hardcoded requirement) is even worse:

    SWAR                           :     0.691 seconds  ->    0.14 MiB/s
    SWAR local                     :     0.648 seconds  ->    0.14 MiB/s
You can try your luck https://github.com/raszpl/sigrok-disk/tree/main/benchmarks I will appreciate Pull requests if anyone manages to speed this up. I give up at ~2 seconds per one RLL HDD track.

This is what I get right now decoding single tracks on i7-4790 platform:

    fdd_fm.sr 0.9385 seconds
    fdd_mfm.sr 1.4774 seconds
    fdd_fm.sr 0.8711 seconds
    fdd_mfm.sr 1.2547 seconds
    hdd_mfm_RQDX3.sr 1.9737 seconds
    hdd_mfm_RQDX3.sr 1.9749 seconds
    hdd_mfm_AMS1100M4.sr 1.4681 seconds
    hdd_mfm_WD1003V-MM2.sr 1.8142 seconds
    hdd_mfm_WD1003V-MM2_int.sr 1.8067 seconds
    hdd_mfm_EV346.sr 1.8215 seconds
    hdd_rll_ST21R.sr 1.9353 seconds
    hdd_rll_WD1003V-SR1.sr 2.1984 seconds
    hdd_rll_WD1003V-SR1.sr 2.2085 seconds
    hdd_rll_WD1003V-SR1.sr 2.2186 seconds
    hdd_rll_WD1003V-SR1.sr 2.1830 seconds
    hdd_rll_WD1003V-SR1.sr 2.2213 seconds
    HDD_11tracks.sr 17.4245 seconds <- 11 tracks, 6 RLL + 5 MFM interpreted as RLL
    HDD_11tracks.sr 12.3864 seconds <- 11 tracks, 6 RLL + 5 MFM interpreted as MFM