|
|
|
|
|
by Jack000
982 days ago
|
|
I think this approach isn't ideal because you're representing pixels as 150x150 unique bins. With only 71k fonts it's likely a lot of these bins are never used, especially at the corners. Since you're quantizing anyways, you might as well use a convnet then trace the output, which would better take advantage of the 2d nature of the pixel data. This kind of reminds me of dalle-1 where the image is represented as 256 image tokens then generated one token at a time. That approach is the most direct way to adapt a causal-LM architecture but it clearly didn't make a lot of sense because images don't have a natural top-down-left-right order. For vector graphics, the closest analogous concept to pixel-wise convolution would be the Minkowski sum. I wonder if a Minkowski sum-based diffusion model would work for svg images. |
|