by KapKap55 1256 days ago
The 100x-plus speedup is no exaggeration, either; I've noticed some incredulity from others at the numbers.

For a very simple comparison, I recently tested a (poorly) custom-built data-oriented Entity-Component-System for use in games against a more typical "componentized" object approach. No multithreading or anything complicated.

On my system, the typical approach could generate about 1000 new objects and attach a single component in about 1 millisecond.

The data-oriented approach could generate about 100,000 new "objects" and attach a single component in about 0.5 milliseconds.

Same result in the end, but the data-oriented version has roughly 200x the throughput. It's pretty stunning when you see stuff like this in benchmarks.
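
For anyone checking the arithmetic, here is a minimal sketch that turns the two quoted measurements into throughput and takes the ratio. The variable names are illustrative; only the four numbers come from the comment above.

```python
# Figures quoted in the comment above; the names are made up for illustration.
typical_objects, typical_ms = 1_000, 1.0   # "componentized" object approach
dod_objects, dod_ms = 100_000, 0.5         # data-oriented ECS

typical_rate = typical_objects / typical_ms  # objects per millisecond
dod_rate = dod_objects / dod_ms              # objects per millisecond

speedup = dod_rate / typical_rate

print(f"typical:       {typical_rate:,.0f} objects/ms")
print(f"data-oriented: {dod_rate:,.0f} objects/ms")
print(f"speedup:       {speedup:.0f}x")
```

The 100x comes from the object count and the extra 2x from the shorter run time.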


    -----s.-ms.-us.-ns|----------------------------------------------------------
                     0.1 ns - NOP
                     0.3 ns - XOR, ADD, SUB
                     0.5 ns - CPU L1 dCACHE reference           (1st introduced in late 80-ies )
                     0.9 ns - JMP SHORT
                     1   ns - speed-of-light
        ?~~~~~~~~~~~ 1   ns - MUL ( i**2 = MUL i, i )
                   3~4   ns - CPU L2  CACHE reference           (2020/Q1)
                     5   ns - CPU L1 iCACHE Branch mispredict
                     7   ns - CPU L2  CACHE reference
                    10   ns - DIV
                    19   ns - CPU L3  CACHE reference           (2020/Q1 considered slow on 28c Skylake)
                    71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
                   100   ns - MUTEX lock/unlock
                   100   ns - own DDR MEMORY reference
                   135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
                   202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
                   325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
        |Q>~~~~~ 5,000   ns - QPU on-chip QUBO ( quantum annealer minimiser 1 Qop )
                10,000   ns - Compress 1K bytes with a Zippy PROCESS
                20,000   ns - Send     2K bytes over 1 Gbps  NETWORK
               250,000   ns - Read   1 MB sequentially from  MEMORY
               500,000   ns - Round trip within a same DataCenter
        ?~~~ 2,500,000   ns - Read  10 MB sequentially from  MEMORY~~
            10,000,000   ns - DISK seek
            10,000,000   ns - Read   1 MB sequentially from  NETWORK
        ?~~ 25,000,000   ns - Read 100 MB sequentially from  MEMORY~~
            30,000,000   ns - Read 1 MB sequentially from a  DISK
           150,000,000   ns - Send a NETWORK packet CA -> Netherlands
        1s:   |   |   |
          .   |   | ns|
          .   | us|
          . ms|

(https://stackoverflow.com/a/33065382)

However,

    0.001 ns light transfer in Gemmatimonas phototrophica bacteriae
biology has much more performant/optimized machines, therefore, yes, plenty of room for improvement in silico.
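
To put the latency table quoted above in relative terms, a small sketch that expresses a few of its rows as multiples of an L1 data-cache hit. The values are copied from the table; the table lists L2 twice (3~4 ns and 7 ns), and picking the 7 ns row here is an arbitrary choice.

```python
# Latencies in nanoseconds, copied from the table quoted above.
# Assumption: the 7 ns row is used for L2 (the table also lists 3~4 ns).
latency_ns = {
    "L1 dCACHE reference": 0.5,
    "L2 CACHE reference": 7,
    "L3 CACHE reference": 19,
    "own DDR MEMORY reference": 100,
}

l1 = latency_ns["L1 dCACHE reference"]

# Cost of each access expressed as a multiple of an L1 hit.
ratios = {name: ns / l1 for name, ns in latency_ns.items()}

for name, ratio in ratios.items():
    print(f"{name:26s} {ratio:6.0f}x an L1 hit")
```

By these numbers a main-memory reference costs about 200x an L1 hit. Whether any particular benchmark is actually dominated by such misses is a separate question that the table alone cannot answer.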

The problem with the "plenty of space down there" remark is that practical computers have to have programmable input states, i.e., there must exist some causal, deterministic process to set the state of the input.

It's not clear that organic solutions at that level can do programmable computational work, nor that their work is at all deterministic.

At best, it would seem the organic direction for computing will be about building robots rather than CPUs.

Oh, but the "organic" solutions do highly deterministic, extremely programmable computational work: 99.99999+% of newborns have 2 hands, 2 legs, and 1 head, and they all started development from a single cell [1]. It's just that the "organic" solutions are written in a 4+ billion year-old highly redundant, distributed, resilient, evolved language whereas our CPUs are not on the same phylogenetic tree.

The quotation marks around organic are just there to point out that there is something wrong with the dichotomy organic (various pro/eu-karyotes from bacteria to humans)/inorganic (from thermostats to CPUs).

[1] Michael Levin: Anatomical decision-making by cellular collectives https://www.youtube.com/watch?v=Z-9rLlFgcm0

> extremely programmable computational work: 99.99999+% of newborns have 2 hands, 2 legs, and 1 head, and they all started development from a single cell [1].

Just on the risk of early miscarriage from the wrong number of chromosomes, I'd say your numbers are way off.

> Miscarriage is the most common complication of early pregnancy.[21] Among women who know they are pregnant, the miscarriage rate is roughly 10% to 20%, while rates among all fertilisation is around 30% to 50%.

https://en.m.wikipedia.org/wiki/Miscarriage

So 30-50% failure rate.

Don't forget that the infant mortality rate (post birth) is 0.5%.

https://www.cdc.gov/nchs/fastats/infant-health.htm

Number of infant deaths: 19,582

Deaths per 100,000 live births: 541.9

Leading causes of infant deaths:

– Congenital malformations, deformations and chromosomal abnormalities

– Disorders related to short gestation and low birthweight: not elsewhere classified

– Sudden infant death syndrome

You are nitpicking. Nevertheless: newborn, noun, a baby that was born recently [1]. Hence the 99.99999+% figure applies to full-term pregnancies, once the fetus is decoupled from the mother and has been born as a, well, newborn. Furthermore, the point is not whether they live or die, but that they have 2 hands, 2 legs, and 1 head after developing from one single cell through deterministic computation in the morphospace.

[1] https://dictionary.cambridge.org/dictionary/english/newborn#...

I'm not nitpicking. I'm saying the reason most babies are born with 2 legs, 2 hands, and a head is that embryos with genetic defects die off before birth (plus the screening we have for early termination nowadays), and the failure rate starting from a single cell is huge.

> 99.99999+% of newborns have 2 hands, 2 legs, and 1 head

This number is far too high. The rate of conjoined twins (violating "1 head") is about 1 in 50,000 [1], and the rate of "limb reduction defects" (violating "2 hands and 2 legs") is about 1 in 1,900 [2].

Those correspond to 99.998% and roughly 99.95% respectively. 3-4 nines is still impressive for such a complex system, but let's not claim it's 7+ nines.

[1] https://www.chop.edu/conditions-diseases/conjoined-twins

[2] https://www.cdc.gov/ncbddd/birthdefects/ul-limbreductiondefe...
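
The conversion from "1 in N" to a percentage and a count of nines can be sketched as below. The helper names `success_percent` and `nines` are made up for this example; the rates are the ones quoted in the thread, plus the 1 in 10,000,000 that a 99.99999% claim would imply.

```python
import math

def success_percent(one_in: int) -> float:
    """Percentage of cases without the defect, given a rate of 1 in `one_in`."""
    return 100.0 * (1.0 - 1.0 / one_in)

def nines(one_in: int) -> float:
    """Number of nines, i.e. -log10 of the failure rate (1 / one_in)."""
    return math.log10(one_in)

rates = {
    "conjoined twins (1 in 50,000)": 50_000,
    "limb reduction defects (1 in 1,900)": 1_900,
    "implied by 99.99999% (1 in 10,000,000)": 10_000_000,
}

for label, one_in in rates.items():
    print(f"{label}: {success_percent(one_in):.5f}%, {nines(one_in):.1f} nines")
```

Seven nines would need a failure rate about 200 times lower than the conjoined-twins figure, and over 5,000 times lower than the limb-reduction one.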

"The occurrence of conjoined twins is rare. Its actual prevalence is unknown, but it is estimated to range from 1:50,000 to 1:200,000" [1]. 1 in 200,000 would raise it to 99.9995%. But as pointed out again and again in the other comments, the pointless, hyperbolic figure is irrelevant. When cutting the planarian worm head, the regeneration is always, 100% a head, if no change in the bioelectrical gradients. The argument was about the deterministic computation done by biology in the morphospace.

[1] Importance of Angiographic Study in Preoperative Planning of Conjoined Twins Case Report, https://www.sciencedirect.com/science/article/pii/S180759322...

> pointless, hyperbolic figure is irrelevant

Then why not simply give the correct, still impressive, figure, as I suggested?

> the regeneration is always, 100% a head, if no change in the bioelectrical gradients

This is also a meaningless statement. It's correct 100% of the time, except when something goes wrong and it's not.

Can you quantify the likelihood of something going wrong with the "bioelectrical gradient"? I'm not familiar with this organism but I suspect it's several nines, but less than 7.

In general, probabilities less than a certain amount stop being meaningful, because it's more likely that the model used to generate the probability fails to reflect reality. See https://www.lesswrong.com/posts/AJ9dX59QXokZb35fk/when-not-t...

I don't get this. You're talking about generating objects, and I'm assuming you mean allocating new objects. What's your bottleneck? Is it the allocator, or do you have some poor data structure to update? Where is the speedup coming from? The kind of speedup you're talking about sounds like moving from allocating each object separately to some arena and avoiding allocator overhead.

I could be wrong in my assumptions, but OP is talking about fitting stuff in CPU cache, and I don't really see how that translates to your scenario.
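
To make the two layouts in question concrete, here is a rough sketch: one allocation per object and per component, versus appending to dense, contiguous per-component arrays. This is not the original poster's code. Every name (`GameObject`, `Position`, `PositionStore`, `spawn_objects`) is hypothetical, and Python will not reproduce the cache or allocator behavior of a native implementation; it only shows the shape of the data.

```python
from array import array

class Position:
    """Hypothetical component, allocated as its own heap object."""
    def __init__(self, x: float, y: float):
        self.x = x
        self.y = y

class GameObject:
    """'Componentized' object: each entity owns a dict of components."""
    def __init__(self):
        self.components = {}

def spawn_objects(n: int) -> list:
    """Per-object approach: several separate allocations per entity."""
    objs = []
    for i in range(n):
        obj = GameObject()
        obj.components["position"] = Position(float(i), 0.0)
        objs.append(obj)
    return objs

class PositionStore:
    """Data-oriented approach: an entity id is just an index into
    contiguous arrays, one array per field."""
    def __init__(self):
        self.x = array("d")
        self.y = array("d")

    def spawn(self, n: int) -> range:
        """Append n entities in bulk and return their ids."""
        first = len(self.x)
        self.x.extend(float(i) for i in range(first, first + n))
        self.y.extend([0.0] * n)
        return range(first, first + n)

objs = spawn_objects(3)
store = PositionStore()
ids = store.spawn(3)
print(len(objs), list(ids), store.x[2])
```

In a native language the second layout avoids per-object allocator calls and keeps each field contiguous in memory, which would cover both the arena explanation and the cache explanation. Which one dominates in the benchmark described above is not something the thread establishes.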