Hacker News new | ask | show | jobs
by jiggawatts 1860 days ago
> 126msec for 100k cycles.

Or to put it another way: 1,200 nanoseconds. That's about 3,000-5,000 instructions on a modern CPU. Believe it or not, that's actually pretty bad.

After jumping through some hoops to ensure that rustc doesn't just compile the whole thing down to a constant, I benchmarked my version as taking 15-20 nanoseconds per iteration. About 45-80 instructions!

I actually couldn't quite believe it myself, so I jumped through more hoops to ensure that it wasn't being optimised away, wasn't getting inlined too aggressively, etc... No change.

Ran it through Godbolt to inspect the assembly, and then I realised that, yes, modern languages, compilers, and CPUs really are this good!

Think about it: For the specific input example with 4 strings the algorithm boils down to: compare 7 bytes with 7 bytes, replace a pointer with another pointer, then compare 2 bytes with 2 bytes three times. That's about a hundred assembly instructions, or thereabouts.

Godbolt output:

    push   rbp
    push   r15
    push   r14
    push   r13
    push   r12
    push   rbx
    push   rax
    mov    r12,rdi
    mov    ebx,0x8
    xor    ebp,ebp
    lea    r14,[rip+0x3cb3c]        # 444e8 <anon.6527da4acb4810bb73692fa85a2e25ef.0.llvm.5506334730328533235+0x20>
    mov    r15,QWORD PTR [rip+0x3f3b5]        # 46d68 <bcmp@GLIBC_2.2.5>
    cs nop WORD PTR [rax+rax*1+0x0]
    nop    DWORD PTR [rax]
    cmp    rbp,0x2
    je     79f2 <example::testabc+0x62>
    mov    r13,rbp
    mov    rdx,QWORD PTR [rbx+r14*1]
    cmp    rdx,QWORD PTR [r12+rbx*1]
    jne    79ec <example::testabc+0x5c>
    lea    rbp,[r13+0x1]
    mov    rsi,QWORD PTR [r12+rbx*1-0x8]
    mov    rdi,QWORD PTR [rbx+r14*1-0x8]
    call   r15
    add    rbx,0x10
    test   eax,eax
    je     79c0 <example::testabc+0x30>
    cmp    r13,0x2
    jb     7a07 <example::testabc+0x77>
    lea    rax,[rip+0x3060e]        # 38007 <_fini+0xd13>
    mov    QWORD PTR [r12+0x10],rax
    mov    QWORD PTR [r12+0x18],0x1
    xor    ebx,ebx
    xor    ebp,ebp
    nop    DWORD PTR [rax+rax*1+0x0]
    cmp    rbp,0x2
    je     7a43 <example::testabc+0xb3>
    mov    r13,rbp
    mov    rdx,QWORD PTR [rbx+r14*1+0x8]
    cmp    rdx,QWORD PTR [r12+rbx*1+0x18]
    jne    7a3d <example::testabc+0xad>
    lea    rbp,[r13+0x1]
    mov    rsi,QWORD PTR [r12+rbx*1+0x10]
    mov    rdi,QWORD PTR [rbx+r14*1]
    call   r15
    add    rbx,0x10
    test   eax,eax
    je     7a10 <example::testabc+0x80>
    cmp    r13,0x2
    jb     7a58 <example::testabc+0xc8>
    lea    rax,[rip+0x305bd]        # 38007 <_fini+0xd13>
    mov    QWORD PTR [r12+0x20],rax
    mov    QWORD PTR [r12+0x28],0x1
    mov    ebx,0x8
    xor    ebp,ebp
    nop
    cmp    rbp,0x2
    je     7a93 <example::testabc+0x103>
    mov    r13,rbp
    mov    rdx,QWORD PTR [rbx+r14*1]
    cmp    rdx,QWORD PTR [r12+rbx*1+0x20]
    jne    7a8d <example::testabc+0xfd>
    lea    rbp,[r13+0x1]
    mov    rsi,QWORD PTR [r12+rbx*1+0x18]
    mov    rdi,QWORD PTR [rbx+r14*1-0x8]
    call   r15
    add    rbx,0x10
    test   eax,eax
    je     7a60 <example::testabc+0xd0>
    cmp    r13,0x2
    jb     7aa8 <example::testabc+0x118>
    lea    rax,[rip+0x3056d]        # 38007 <_fini+0xd13>
    mov    QWORD PTR [r12+0x30],rax
    mov    QWORD PTR [r12+0x38],0x1
    add    rsp,0x8
    pop    rbx
    pop    r12
    pop    r13
    pop    r14
    pop    r15
    pop    rbp
    ret    
    nop    WORD PTR [rax+rax*1+0x0]
Rust? It simply impresses.
1 comments

I'm not sure getting the wrong answer fast is something to be proud of.

Your rust code has a bug in it, is longer, and you spent more time on it.

Please elaborate on that bug. The rust function I provided replaces "select, *" correctly with "select, _". Is that not what it was supposed to do? I checked that it works with empty arrays, arrays with one entry, etc...

Thinking about it, this version boils down the logic to its barest essence:

    fn test(input: &mut [&str]) {
    
        let mut was_select = false;
        for i in input.iter_mut() {
            if was_select && *i == "*" { *i = "_"; was_select = false; }
            else { was_select = *i == "select"; }
        }
    }
It runs in 5.5 nanoseconds per iteration, which is just absurdly fast. That's about 200x faster than your K version!

What I like about this Rust version is that it reflects the approach that for a computer processor is optimal, yet it is high-level and readable. I could convert that "test" function easily enough to a generic "replace" function similar to a string search & replace, but one that can operate on any mutable iterator with the maximum possible efficiency, not just arrays.

Then, your K code that requires "careful reading" to slowly tease apart the meaning could be converted to a form that is practically prose:

    let mut input = ["select", "*", "bar", "potato"];
    replace( &mut input, &["select", "*"], &["select", "_"] );*
> The rust function I provided replaces "select, *" correctly with "select, _". Is that not what it was supposed to do?

No. It is not supposed to modify its input but return a copy.

Granted, if replacing arbitrary sequences with arbitrary sequences, then copying is necessary, as this could result in the output length increasing.

However, for replacing scalars only, the in-place mutable version is in some sense superior: it doesn't force a memory allocation. Moreover it can be trivially converted into a copying version by simply wrapping it in a function that first copies the input, and then mutates it in-place. The reverse is not true: the copying version cannot be wrapped to create a non-allocating version.

To be honest, I wish more languages put this kind of effort into their standard libraries, but most have about 10% of what I would like to see. For example, this kind of "string matching" is really "sequence matching" and ought to be fully generic for any underlying comparable and copyable type, not just arrays of characters. String search algorithms like Boyer-Moore ought to be directly applicable to arrays of integers or enums, e.g.: for recognition of code patterns in lists of parsed tokens.

At least one person has done something like this for Rust: https://github.com/peterjoel/rust-iter-replace/blob/6c575eeb...

Similarly, there's the new InPlaceIterable, which is interesting but not quite enough to suit my taste: https://doc.rust-lang.org/std/iter/trait.InPlaceIterable.htm...

> However, for replacing scalars only, the in-place mutable version is in some sense superior

I mean, we are talking about someone else's code and someone else's decisions. If we can change the rules, you're absolutely right we can do much much better.

But beware microbenchmarking too much: Giving the treatment you gave your rust to your entire program can be more than exhausting, it can actually end you up with a slower program simply because your program gets too big!

k makes a lot of compromises to stay small enough to keep both the interpreter and the application in L1, but whole-program speeds benefit from this treatment sometimes by factors of 1000x or more, and that's hard to show with these microbenchmarks as well.

> To be honest, I wish more languages put this kind of effort into their standard libraries, but most have about 10% of what I would like to see. For example, this kind of "string matching" is really "sequence matching" and ought to be fully generic for any underlying comparable and copyable type, not just arrays of characters

In APL, this is called ⍷ (pronounced "find") and sometimes even APL-ers momentarily forget it exists[1], but in k you always have to make it yourself, usually (as we did today) with ⍸ (where) and ≡ (match), but sometimes some other way[2], and this works on all the different data types k supports (including integers, enums, dates, times, symbols) and across multiple cores as well. You are right to predict it would be useful: It is useful :)

[1]: https://news.ycombinator.com/item?id=27229994

[2]: https://news.ycombinator.com/item?id=16851862