|
|
|
|
|
by ralferoo
319 days ago
|
|
That'd be one way of doing it. You don't technically need 4 reads per pixel either, for instance you can process a 7x7 group with a 64-count thread group. Each thread does 1 read, and then fetches the other 3 values from its neighbours and calculates the average. Then the 7x7 subset of the 8x8 write their values. You could integrate this into the first pass too, but then there would be duplication on the overlapped areas of each block. Depending on the complexity of the first pass, it still might be more efficient to do that than an extra pass. Knowing that it's only the edges that are shared between threads, you could expand the work of each thread to do multiple pixels so that each thread group covers more pixels the reduce the number of pixels sampled multiple times. How much you do this by depends on register pressure, it's probably not worth doing more than 4 pixels per thread but YMMV. |
|