| HN Mirror


by derefr 1539 days ago

Well, sure; but the problem of font rendering specifically is an "embarrassingly parallel" one, isn't it? If you've got 1000 glyphs at a specific visual size to pre-cache into alpha-mask textures; and you've got 1000 GPU shader cores to compute those glyphs on; then each shader core only needs to compute one glyph once.

Can a CPU really be so much faster than these cores that it can run this Turing-complete font rendering program (which, to be clear, is already an abstract machine run through an interpreter either way, whether implemented on the CPU or the GPU) consisting of O(N) interpreted instructions, O(N) times, for a total of O(N^2) serial CPU computation steps; in less than the time it takes the O(N) GPU cores to run only O(N) serial computation steps each? Especially on a modern low-power system (e.g. a cheap phone), where you might only have 2-4 slow CPU cores, but still have a bounty of (equally slow) GPU cores sitting there doing mostly nothing? If so, CPUs are pretty amazing.

But even if it were true that it'd be faster in some sense (time to first pixel, where the first rendered glyph becomes available?) to render on the CPU — accelerators don't just exist to make things faster, they also exist to offload problems so the CPU can focus on things that are its comparative advantage.

Analogies:

- An apprentice tradesperson doesn't have to be better at a delegated task than their mentor is; they only need to be good enough at the task to free up some time for the mentor to focus on getting something higher-priority done, that the mentor can do and the apprentice (currently) cannot. For example, the apprentices working for master oil painters did the backgrounds, so the master could focus on portrait details + anatomy. The master could have done the backgrounds faster! But then that time would be time not spent working on the foreground.

- Ethernet cards. CPUs are fast enough to "bit bang" even 10GbE down a wire just fine; but except under very specific situations (e.g. dedicated network switches where the CPU wants to process every packet synchronously as it comes in), it's better that they don't, leaving the (slower!) Ethernet MCU to parse Ethernet frames, discard L2-misdirected ones, and DMA the rest into kernel ring-buffer memory.

- Audio processors in old game consoles like the SNES's S-SMP and the C64's SID — yes, the CPU could do everything these could do, and faster; but if the CPU had to keep music samples playing in realtime, it wouldn't have much time to do things like gameplay (which usually goes together with playing music samples!)

Offloading font (or generalized implicit-shape) rendering to the GPU might not make sense if you're just computing letterforms for billboard textures in a static 3D scene (rather the opposite!) but in a game that wants to do things like physics and AI on the CPU, load times can likely be shorter with the GPU tasked with the font rendering, no? Especially since the rendered glyph-textures then don't have to be loaded into VRAM, because they're already there.


Jasper_ 1539 days ago

Having a queue of 1,000 independent work items to do doesn't mean something is "embarrassingly parallel". Operating systems are a classic example of something that's hard to parallelize, and they have 1,000 independent processes they need to schedule and manage. Heterogeneous tasks make parallelism hard!

Cores in GPUs do not operate independently, they have hierarchies of memory and command structure. They are good at sharing some parts and terrible at sharing other parts.

Exploiting the parallelism of a GPU in the context of curve rasterization is still an active research problem (Raph Levien, who has posted elsewhere in this thread, is one of the people doing the research), and it's not easy.

I refrained from commenting on the specifics of how curves are rasterized, but if you want to imagine it, think about a letter, maybe a large "g", think about the points that make it up, and then come up with an algorithm to find out whether a specific point is inside or outside that outline. What you'll quickly realize is that there's no local solution, there are only global solutions. You have to test the intersection of all curves to know whether a given pixel is inside or outside the outline, and that sort of problem is serial.
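To make the "global, not local" point concrete, here is a minimal sketch of an inside/outside test. It assumes the curves have already been flattened into straight edges and that the even-odd fill rule is used (real font rasterizers typically use the nonzero rule and work on the curves directly); `point_in_outline` and the `ring` data are made up for illustration. Note that every query walks every edge of every contour.

```python
def point_in_outline(px, py, contours):
    """Even-odd test: cast a ray from (px, py) toward +x and count edge crossings."""
    inside = False
    for contour in contours:
        n = len(contour)
        for i in range(n):
            x0, y0 = contour[i]
            x1, y1 = contour[(i + 1) % n]
            # Half-open rule: the edge counts only if it straddles the ray's y.
            if (y0 > py) != (y1 > py):
                # x coordinate where the edge meets the horizontal line y = py
                x_cross = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
                if x_cross > px:
                    inside = not inside
    return inside


# A square ring (outer square with a square hole), like a crude letter "O".
outer = [(0, 0), (10, 0), (10, 10), (0, 10)]
hole = [(3, 3), (7, 3), (7, 7), (3, 7)]
ring = [outer, hole]

print(point_in_outline(1, 5, ring))   # in the stroke of the "O"
print(point_in_outline(5, 5, ring))   # in the counter (the hole)
```

Nothing about a single pixel's neighbourhood tells you the answer; the hole edge on the far side of the glyph matters just as much as the nearest one.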

The work division you want (do a bit of work for each curve), is exactly backwards from the work division a normal GPU might give you (do a bit of work for each pixel), pushing you towards things like compute shaders.
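One way to picture the per-curve work division is a scatter-then-scan rasterizer: each edge does a small amount of work (writing winding deltas into the rows it crosses), and a per-row prefix sum turns those deltas into per-pixel coverage. This is only a sketch under loud assumptions: curves are pre-flattened to line segments, the nonzero fill rule is used, there is one sample at each pixel centre, and there is no anti-aliasing. `rasterize_by_edge` is a hypothetical name, not any library's API.

```python
import math


def rasterize_by_edge(contours, width, height):
    # deltas[y][x] = change in winding number when stepping into pixel x on row y
    deltas = [[0] * (width + 1) for _ in range(height)]
    for contour in contours:
        n = len(contour)
        for i in range(n):
            (x0, y0), (x1, y1) = contour[i], contour[(i + 1) % n]
            if y0 == y1:
                continue  # horizontal edges never cross a sample row
            direction = 1 if y1 > y0 else -1
            ylo, yhi = min(y0, y1), max(y0, y1)
            # Rows whose sample centre (row + 0.5) lies in [ylo, yhi)
            row_start = max(0, math.ceil(ylo - 0.5))
            row_end = min(height, math.ceil(yhi - 0.5))
            for row in range(row_start, row_end):
                sy = row + 0.5
                x_cross = x0 + (sy - y0) * (x1 - x0) / (y1 - y0)
                # First pixel whose centre is at or to the right of the crossing
                col = min(width, max(0, math.ceil(x_cross - 0.5)))
                deltas[row][col] += direction
    # Serial prefix sum per row turns deltas into winding numbers.
    mask = []
    for row in deltas:
        winding, out = 0, []
        for x in range(width):
            winding += row[x]
            out.append(1 if winding != 0 else 0)
        mask.append(out)
    return mask


square = [(1, 1), (4, 1), (4, 4), (1, 4)]
for line in rasterize_by_edge([square], 5, 5):
    print(line)
```

The scatter step is the "bit of work for each curve"; the scan is where the serial dependency hides, which is part of why this maps more naturally onto compute shaders than onto a plain pixel shader.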

I could go on, but this comment thread is already too deep.


derefr 1538 days ago

That's super interesting, actually!

> The work division you want (do a bit of work for each curve), is exactly backwards from the work division a normal GPU might give you (do a bit of work for each pixel)

Doesn't this mean that you could:

1. entirely "offline", at typeface creation time:

1a. break glyphs into their component "convex curved region tiles" (where each region is either full, empty, or defined by a curve with zero inflection points)

1b. deduplicate those tiles (anneal glyph boundaries to minimize distinct tiles; take advantage of symmetries), to form a minimal set of such curve-tiles, and assign those sequence numbers, forming a "distinct curves table" for the typeface;

1c. restate each glyph as a grid of paint-by-numbers references (a "name table", to borrow the term from tile-based consoles) where each grid position references its tile + any applied rotation+reflection+inversion

2. Then, at scene-load time,

2a. take each distinct curve from the typeface's distinct-curves table, at the chosen size;

2b. generate a (rather large, but helpfully at most 8bpp) texture as so: for all distinct-curve tiles (U pos), for all potential angled-vector-line intersections (V pos), copy the distinct-curve tile, and serialize the intersection data into pixels beside it

2c. run a compute shader to operate concurrently over the workload tiles in this texture to generate an output texture of the same dimensions, that encodes, for each workload, the alpha-mask for the painted curve for the specified angle, iff the intersection test was good (otherwise generating a blank alpha-mask output);

2d. (this is the part I don't know whether GPUs can do) parallel-reduce the UxV tilemap into a Ux1 tilemap, by taking each horizontal strip, and running a pixel-shader that ORs the tiles together (where, if step 2c is done correctly, at most one tile should be non-zero per strip!)

2e. treat this Ux1 output texture as a texture atlas, and each typeface nametable as a UV map for said texture atlas, and render the glyphs.
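For a rough feel of steps 1b, 1c and 2e, here is a small sketch of tile deduplication and a name table. It simplifies heavily: it works on an already-rasterized bitmap rather than on curve regions, it only exploits horizontal mirroring (no rotation or inversion), and it assumes the glyph dimensions are multiples of the tile size. `build_name_table` and `render` are hypothetical names.

```python
def build_name_table(glyph, tile):
    """Split a bitmap into tile x tile blocks; return (distinct_tiles, name_table)."""
    tiles = []        # distinct tiles, in first-seen order
    index = {}        # tile contents -> position in `tiles`
    name_table = []   # grid of (tile_index, mirrored) references
    for ty in range(0, len(glyph), tile):
        row_refs = []
        for tx in range(0, len(glyph[0]), tile):
            block = tuple(tuple(r[tx:tx + tile]) for r in glyph[ty:ty + tile])
            mirrored = tuple(r[::-1] for r in block)
            if block in index:
                ref = (index[block], False)
            elif mirrored in index:
                ref = (index[mirrored], True)
            else:
                index[block] = len(tiles)
                tiles.append(block)
                ref = (index[block], False)
            row_refs.append(ref)
        name_table.append(row_refs)
    return tiles, name_table


def render(tiles, name_table, tile):
    """Rebuild the bitmap by looking each name-table entry up in the tile set."""
    out = []
    for row_refs in name_table:
        for y in range(tile):
            line = []
            for idx, flip in row_refs:
                row = tiles[idx][y]
                line.extend(row[::-1] if flip else row)
            out.append(line)
    return out


# A left-right symmetric 4x4 "glyph", split into 2x2 tiles.
glyph = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
]
tiles, names = build_name_table(glyph, 2)
print(len(tiles), names)
```

The symmetric glyph needs only two stored tiles for its four grid cells; how well this holds up for real outlines at real sizes is exactly the open question in the thread.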

To be clear, I'm not expecting that I came up with an off-the-cuff solution to an "active research problem" here; I'm just curious why it doesn't work :)


Jasper_ 1538 days ago

If you allow yourself to do this work offline, that's one thing, but keep in mind that 2D realtime graphics are a requirement. People still need to render SVGs, HTML5 canvas, the CSS drawing model, etc. Grid fitting might eventually go out of favor for fonts, but as long as it's around, it means you need different outlines for different font sizes. See Behdad's excellent document on the difficulties of text subpixel rendering and layout [0]. Also, there are things like variable fonts which we might want to support.

Breaking glyphs into region tiles such that each tile has at most one region might be too fine-grained (think about tiger.svg), and is probably equivalent in work to rasterizing on the CPU, so not much of a gain there. That said, tiled options are very popular, so you're definitely on to something, though tiles often contain multiple elements.

Going down this way lies ideas like Pathfinder 3, Massively Parallel Vector Graphics (Ganacim et al.), and my personal favorite, the work of adamjsimmons. I have to read this comment [1] a bit between the lines, but I think it's basically that a quadtree or other form of BVH is computed on the CPU containing which curves are in which parts of the glyph, and then the pixel shader only evaluates the curves it knows are necessary for that pixel. Similar in a lot of ways to Behdad's GLyphy.

I have my own ideas I eventually want to try on top of this as well, but I think using a BVH is my preferred way to solve this problem.

[0] https://docs.google.com/document/d/1wpzgGMqXgit6FBVaO76epnnF...

[1] https://news.ycombinator.com/item?id=18260138

EDIT: You changed this comment between when I was writing and when I posted it, so it's not a reply to the new scheme. The new scheme doesn't seem particularly helpful for me. If you want to talk about this further to learn why, contact information is in my HN profile.


kllrnohj 1539 days ago

> If you've got 1000 glyphs at a specific visual size to pre-cache into alpha-mask textures;

How often does that happen? There are definitely languages where that is a plausible scenario (eg, Chinese), but for the majority of written languages you have well under 100 glyphs of commonality for any given font style.

And then, as you noted, you cache these to an alpha texture. So for the parallelism to matter, you'd even need all 1000 of those glyphs to show up for the first time in the same frame.

> Especially on a modern low-power system (e.g. a cheap phone), where you might only have 2-4 slow CPU cores, but still have a bounty of (equally slow) GPU cores sitting there doing mostly nothing?

But the GPU isn't doing nothing. It's already doing all the things it's actually good at like texturing from that alpha texture glyph cache to the hundreds of quads across the screen, filling solid colors, and blitting images.

Rather, typically it's the CPU that is consistently under-utilized. Low end phones still tend to have 6 cores (even up to 10 cores), and apps are still generally bad at utilizing them. You could throw an entire CPU core at doing nothing but font rendering and you probably wouldn't even miss it.

The places where GPU rendering of fonts becomes interesting are when glyphs get huge, or for things like smoothly animating across font sizes (especially with things like variable width fonts). High end hero features, basically. For the simple task of text as used on e.g. this site? Simple CPU rendered glyphs to an alpha texture is easily implemented and plenty fast.
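As a sketch of what "CPU rendered glyphs to an alpha texture" can look like: a cache keyed by (glyph, size) that rasterizes on a miss and packs the result into one 8-bit atlas using simple shelf packing. Everything here is an assumption for illustration: `GlyphAtlas` is a made-up class, the `rasterize` callback is a stand-in for a real CPU rasterizer, and eviction and upload to the GPU are left out.

```python
class GlyphAtlas:
    def __init__(self, width, height, rasterize):
        self.width, self.height = width, height
        # Hypothetical callback: (glyph, size) -> list of rows of alpha values 0..255
        self.rasterize = rasterize
        self.pixels = [bytearray(width) for _ in range(height)]
        self.slots = {}  # (glyph, size) -> (x, y, w, h) inside the atlas
        self.pen_x = self.pen_y = self.shelf_h = 0

    def get(self, glyph, size):
        key = (glyph, size)
        if key in self.slots:
            return self.slots[key]          # cache hit: no rasterization
        mask = self.rasterize(glyph, size)
        h, w = len(mask), len(mask[0])
        if self.pen_x + w > self.width:     # row is full: start a new shelf
            self.pen_x = 0
            self.pen_y += self.shelf_h
            self.shelf_h = 0
        if w > self.width or self.pen_y + h > self.height:
            raise RuntimeError("atlas full")
        for dy, row in enumerate(mask):
            self.pixels[self.pen_y + dy][self.pen_x:self.pen_x + w] = bytes(row)
        slot = (self.pen_x, self.pen_y, w, h)
        self.slots[key] = slot
        self.pen_x += w
        self.shelf_h = max(self.shelf_h, h)
        return slot


calls = []

def fake_rasterize(glyph, size):
    """Stand-in rasterizer: a solid size x size block, and a log of calls."""
    calls.append((glyph, size))
    return [[255] * size for _ in range(size)]


atlas = GlyphAtlas(8, 8, fake_rasterize)
for ch in "abca":
    print(ch, atlas.get(ch, 3))
```

The second lookup of "a" returns the stored rectangle without calling the rasterizer again, which is why the steady-state cost for ordinary text is mostly just drawing quads from the atlas.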
