by data-cat 4573 days ago
This makes me wonder when 128-bit processing will come around.
4 comments

We already have 128-bit registers for vector computations (for example, the SSE registers in x86).

As for addressing 128 bits of memory: that's more than a century off even if memory continues to double every two years (which doesn't seem likely to begin with). It's actually plausible that the step to 64-bit addressing was the last one, ever.
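A back-of-the-envelope check of that estimate, as a sketch only. It assumes one doubling of memory every two years (the rate mentioned above) and, generously, a starting point where a full 64-bit address space is already in use. The function name is made up for illustration.

```python
def years_until_address_bits_needed(current_bits, target_bits, years_per_doubling=2.0):
    """Years until memory grows from 2**current_bits to 2**target_bits bytes.

    Assumes capacity doubles once every `years_per_doubling` years, so each
    doubling consumes exactly one more address bit.
    """
    doublings = target_bits - current_bits
    return doublings * years_per_doubling


# Even starting from a machine that already fills a 64-bit address space:
print(years_until_address_bits_needed(64, 128))  # 128.0
```

Since no current machine is anywhere near 2^64 bytes, the real figure under the same assumption would be longer still.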

Don't know why you were downvoted. Modern x86 chips not only have 128-bit vector registers, but also 256-bit ones: http://en.wikipedia.org/wiki/Advanced_Vector_Extensions

(And while one could argue that a vector register isn't a "real" general-purpose register, the primary difference is that you can't perform full-width arithmetic, which is of very little use beyond 64 bits anyway.)
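To illustrate the full-width point, here is a small Python model (not real SIMD code) of a 128-bit register treated as four 32-bit lanes. The lane layout and helper names are assumptions chosen for the sketch; the only thing it shows is that a lane-wise add drops the carry at lane boundaries while a full-width add propagates it.

```python
LANE_BITS = 32
LANES = 4  # 4 x 32 bits = one 128-bit register
LANE_MASK = (1 << LANE_BITS) - 1


def to_lanes(value):
    """Split a 128-bit integer into four 32-bit lanes, lowest lane first."""
    return [(value >> (LANE_BITS * i)) & LANE_MASK for i in range(LANES)]


def from_lanes(lanes):
    """Reassemble lanes (lowest first) into a single 128-bit integer."""
    return sum(lane << (LANE_BITS * i) for i, lane in enumerate(lanes))


def lanewise_add(a, b):
    """Vector-style add: each lane wraps on its own, carries are dropped."""
    return from_lanes([(x + y) & LANE_MASK for x, y in zip(to_lanes(a), to_lanes(b))])


def fullwidth_add(a, b):
    """Scalar-style add: one 128-bit integer, carries cross lane boundaries."""
    return (a + b) & ((1 << (LANE_BITS * LANES)) - 1)


a, b = 0xFFFFFFFF, 1
print(hex(lanewise_add(a, b)))   # 0x0: the carry out of lane 0 is lost
print(hex(fullwidth_add(a, b)))  # 0x100000000: the carry reaches lane 1
```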

64-bit ARM means passing around 64-bit pointers by default.

In most code, this pollutes the cache (you can only hold half as many pointers) and leads to slightly worse performance.
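The "half as many pointers" claim is simple arithmetic. The sketch below assumes a 64-byte cache line, which is a common size but is an assumption here, not something stated in the thread; the function name is illustrative.

```python
def pointers_per_cache_line(pointer_bits, line_bytes=64):
    """How many pointers of the given width fit in one cache line."""
    return line_bytes // (pointer_bits // 8)


print(pointers_per_cache_line(32))  # 16
print(pointers_per_cache_line(64))  # 8
```

Whether that translates into a measurable slowdown depends on how pointer-heavy the data structures are.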

64-bit computing is NOT an advantage in the phone world. It is a massive advantage for database applications or large web services... but certainly not for phone apps in the near future.

I know what 64-bit computing is. And I didn't say anything about phone chips, or ARM. It seems like you're posting this rant to every subthread, regardless of content?

That is an interesting conclusion. I did not realize x86 already had 128-bit registers. That was really what I was referring to. We might not need to worry about addressing more memory yet but I feel like larger registers will still be useful.

Actually, x86 already has 256-bit vector registers (AVX), and will have 512-bit vector registers in a few years (AVX-512). If you include the exotic-but-still-x86-based Xeon Phi, it already has 512-bit vector registers.

As soon as you can put 2^64 bytes of RAM in a PC, I guess.

IBM System i (AS/400) servers have supported 128-bit pointers for decades.

Nice. I suppose this is for memory-mapped file I/O? Or do those beasts really have access to terabytes of core? (In which case I will become jealous of my dad, who works with AS/400s.)

The former. They don't support any more physical memory than other high-end servers.

Your comment made me lol. +1, easily.

No seriously, there is little to no speed difference in 32-bit vs 64-bit computing. The REAL benefit is the ability to address memory beyond 4 GB. But even the most high-end phones are stuck with 2 GB... hell, a number of laptops are still shipping with 2 GB of RAM. Let alone phones. The 64-bit "advantage" is almost entirely a marketing gimmick.

That said, it is a known fact that Apple's iPhone chip is leagues ahead of its competitors in terms of performance / watt. Qualcomm has a value buy with Snapdragon and their integrated LTE chip + FCC conformance. But Apple has full integration, and controls the software / hardware from the bottom up. That is a real advantage that is leading to improved battery life and faster performance.

(This is less an "Apple Advantage" and more of an "Android disadvantage". Android should be able to catch up if they got their act together... but the reliance on Dalvik-VM code and unoptimized APIs leads to noticeably worse battery life)

Beyond that, Apple is beginning to invest in state-of-the-art foundries, possibly to create chips in-house by 2016 or 2017. They've also bought out the high-end 20nm wafers through 2014, forcing their competitors to lag behind on the older 28nm process node. (Hell, even AMD / NVidia are feeling the sting. All AMD / NVidia roadmaps to better GPUs are at 28nm... Only Intel, which has reached 14nm in its in-house labs, remains unaffected by Apple's purchasing power.)

When your company has $100 Billion in cash, you can afford to have a process node advantage over your opponents.

> No seriously, there is little to no speed difference in 32-bit vs 64-bit computing. The REAL benefit is the ability to address memory beyond 4 GB.

No, there really is a huge performance gain for 64-bit processing for certain algorithms when coded correctly. Basically anything that works with vector-like data can easily benefit. I'm sure there are lots of mobile multimedia developers who relish the change to 64-bit.

(Of course it's possible the 32-bit predecessor to this chip special-cased certain 64-bit operations, e.g. double float arithmetic, in which case even fewer algorithms would benefit from widening registers across the board. I'm not familiar enough with ARM architecture to comment on this.)

Mobile multimedia developers rely on hardware decode for codecs, Fourier transforms, and the like. Such code can only slow down if moved to a vector processing unit similar to Intel's SSE. Embedded web video is standardizing upon H264, and accelerated audio is everywhere as well.

Game programmers will prefer a faster GPU, since none of that stuff is actually calculated on CPUs nowadays. (In fact, Apple's superior GPU is one of the reasons why it "feels" so much faster than many Android devices.)

So unless you're gonna be doing software decode of H265 (or some other future codec), or something... my bet is that multimedia processing will remain the same. It will go to the dedicated multimedia DSP that is on every phone, and be translated extremely efficiently (power-wise).

> Mobile multimedia developers rely on hardware decode for codecs, Fourier transforms, and the like

Uh, no? Yes, video decode for common formats is hardware-accelerated, but I've never seen dedicated Fourier transform hardware in consumer devices, and I can't think of any other "and the like" algorithms that are hardware-accelerated anywhere other than at the CPU register level.

> Game programmers will prefer a faster GPU, since none of that stuff is actually calculated on CPUs nowadays

Mm, I think this is dubious. I agree, GPUs are better than CPUs for many multimedia applications, but getting data to and from GPUs is not fast. And of all the multimedia applications I run on my desktop (mplayer, Audacity, the Gimp, Inkscape), none currently use the GPU except for maybe mplayer for certain videos.

DxVA passes tasks like iDCT (Inverse Discrete Cosine Transform) to the GPU on Windows. If you are running ANY DxVA codec on any Windows computer, the process happens exactly as I've described.

In fact, Intel's DxVA implementation explicitly has an iDCT accelerator. See this paper for details: http://download-software.intel.com/sites/default/files/artic...

I assume a lot of people watch Youtube on Windows computers, amirite? The iDCT is basically a Fourier Transform as far as the math is concerned. Other portions of the H264 codec (such as motion compensation) are similarly increasingly hardware-accelerated... even on crappy integrated GPUs like the old GMA950.
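For readers who want to see the "basically a Fourier Transform" remark concretely, here is a minimal sketch of the textbook, unnormalized DCT-II and its inverse in plain Python. It is the direct O(n^2) definition, written only to show that the transform is a sum over cosines just as a Fourier transform is a sum over complex exponentials; it is not what any hardware decoder implements, and the function names are made up for illustration.

```python
import math


def dct_ii(x):
    """Unnormalized DCT-II: project the signal onto cosines of rising frequency."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
            for k in range(n)]


def idct_ii(coeffs):
    """Inverse of dct_ii (a scaled DCT-III): rebuild the signal from its cosines."""
    n = len(coeffs)
    return [coeffs[0] / n
            + (2.0 / n) * sum(coeffs[k] * math.cos(math.pi / n * (i + 0.5) * k)
                              for k in range(1, n))
            for i in range(n)]


samples = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]
restored = idct_ii(dct_ii(samples))
print(all(abs(a - b) < 1e-9 for a, b in zip(samples, restored)))  # True
```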

Phone hardware, on the other hand, is basically state-of-the-art. I wouldn't be surprised if phones of today had better hardware decoders than the crap that Intel churned out for the bottom-barrel consumers back in 2009.

Congratulations, you have proved the tautology that hardware-accelerated frequency-domain codecs use hardware-accelerated frequency-domain transforms.

Unfortunately you entirely missed my point about everything other than video decoding. Bandwidth between the CPU and GPU quickly becomes the bottleneck, unless you're able to move most of your processing onto the GPU, which I granted you was the right thing to do. But also as I stated, none of the popular software I use actually does this. It is all optimized for CPU processing.

DxVA suffers from this same issue, i.e. you have to be very careful around moving data to & from the GPU: http://en.wikipedia.org/wiki/DXVA#DXVA2_implementations:_nat...

EDIT: And in case you think I'm talking out of my ass, I work on a high-performance embedded product. We recently switched from a 32-bit to a 64-bit version of the (ARM-like) processor we use. Nearly every single one of our major algorithms benefited from the increased register width (although we did have to slightly modify some of them to do so). And we don't even use multimedia operations. A lot of the gains come from simply moving less stuff around, which, when you have to process a packet every 40 cycles, really adds up.

> "Beyond that, Apple is beginning to invest in state-of-the-art foundries, possibly to create chips in-house by 2016 or 2017. They've also bought out the high-end 20nm wafers through 2014, forcing their competitors to lag behind on the older 28nm process node. (Hell, even AMD / NVidia are feeling the sting. All AMD / NVidia roadmaps to better GPUs are at 28nm... Only Intel, which has reached 14nm in its in-house labs, remains unaffected by Apple's purchasing power.)"

Very interesting. Would you have a source? I know Apple did the same "trick" with touch screens when the iPhone was introduced. I think these actions are typical of Cook - from what I understand he was earlier (before he became CEO) responsible for the whole inventory chain management and it's likely an area he's still involved in.

On Apple buying a fab:

http://semiaccurate.com/2013/07/12/apple-has-their-own-fab/
http://www.tomshardware.com/news/TSMC-Samsung-Apple-UMC-A-Se...
http://appleinsider.com/articles/13/07/12/rumor-apple-buys-i...

On Apple eating up wafer supply: it's mostly speculation from even more illegitimate rumor sites. But those keeping up with the current roadmaps note that Apple and Samsung are reaching 20nm far before their competitors... and even AMD / NVidia have their 20nm plans pushed out to 2015 or later.

"Proof" is nonexistent, but the product roadmaps I've seen seem to match the rumors.

Wouldn't a 64-bit processor be able to perform more accurate floating point operations faster? I definitely don't believe the only benefit to a 64-bit architecture is being able to address more memory.

Depends highly on the architecture. Both floating-point and vector operations are often special-cased in the pipeline (e.g. x86), so 64-bit floating point operations on a particular 32-bit processor may not perform any worse than if that processor had 64-bit registers.

I'm not familiar enough with the particulars of ARM to answer confidently for floating point operations, but to take an example that's not usually special-cased, say bit vector arithmetic, yes, those operations will execute twice as quickly if they are vectored.
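The bit vector case can be sketched by counting word-sized operations. This is an idealized model in Python, not a benchmark: it assumes one AND per machine word and ignores memory bandwidth, loop overhead, and everything else a real processor does. The function name and the test patterns are made up for illustration.

```python
def bitvector_and(a, b, total_bits, word_bits):
    """AND two bit vectors one machine word at a time.

    Returns (result, word_ops), where word_ops counts the word-sized AND
    operations a register of `word_bits` bits would need.
    """
    mask = (1 << word_bits) - 1
    result = 0
    word_ops = 0
    for shift in range(0, total_bits, word_bits):
        word = ((a >> shift) & mask) & ((b >> shift) & mask)
        result |= word << shift
        word_ops += 1
    return result, word_ops


a = int("10" * 512, 2)    # two 1024-bit patterns
b = int("1100" * 256, 2)
r32, ops32 = bitvector_and(a, b, 1024, 32)
r64, ops64 = bitvector_and(a, b, 1024, 64)
print(r32 == r64 == (a & b), ops32, ops64)  # True 32 16
```

Same result, half the word operations, which is the "twice as quickly" figure under this simplified model.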

ARM 32-bit did not have SIMD (vector) double precision, while ARM 64-bit does, so here it's definitely a win.

On x86 though, both 32-bit and 64-bit did double precision vectors just fine, so it didn't really apply there (except that the fp register count was doubled).

Only if your code were SIMD aligned. But that is not code that your typical compiler outputs.

Most SIMD code is heavy number-crunching stuff like multimedia or GPU shaders. But much of that low-level handling is handled off CPU on phone platforms. It is simply more power efficient to have a hardware decoder of multimedia.

> Only if your code were SIMD aligned. But that is not code that your typical compiler outputs.

GCC has supported autovectorization for a while now.

"Unoptimized code will be slow" isn't a great argument anyway. There's not much a processor can do to help that.