| HN Mirror



	by heyitsguay 1647 days ago
	That's how I felt at first, but getting deeper into the Swin transformer paper, it actually makes a fair bit of sense - convolutions can be likened to self-attention ops that can only attend to local neighborhoods around pixels. That's a fairly sensible assumption for image data, but it also makes sense that more general attention would better capture complex spatial relationships if you can find a way to make it computationally feasible. Swin transformers certainly go through some contortions to get there, and I bet we'll see cleaner hierarchical architectures in the future, but the results speak for themselves.
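	To make the "convolution as local attention" analogy concrete, here is a minimal pure-Python sketch on a 1D sequence of scalars. The function name `attention_1d` and the `radius` parameter are made up for illustration and are not from the Swin paper. With all scores equal, attention limited to a radius of 1 behaves like a width-3 box filter (averaging over whatever neighbors exist at the edges); real attention differs in that the weights depend on the content, and a real convolution learns fixed weights instead.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_1d(values, scores, radius=None):
    """Single-head attention over a 1D sequence of scalar values.

    scores[i][j] is a precomputed similarity between query i and key j.
    radius=None gives global attention. radius=r (a hypothetical knob for
    this sketch) lets position i see only keys j with |i - j| <= r,
    which mirrors the receptive field of a conv kernel of width 2r + 1.
    """
    n = len(values)
    out = []
    for i in range(n):
        visible = [j for j in range(n) if radius is None or abs(i - j) <= radius]
        weights = softmax([scores[i][j] for j in visible])
        out.append(sum(w * values[j] for w, j in zip(weights, visible)))
    return out


values = [1.0, 2.0, 3.0, 4.0, 5.0]
uniform = [[0.0] * len(values) for _ in values]

local_out = attention_1d(values, uniform, radius=1)  # neighborhood average
global_out = attention_1d(values, uniform)           # average of everything
```

	In the local case, changing `values[4]` cannot affect the output at position 0, which is the locality assumption a CNN bakes in; the global case drops that assumption and pays for it with a score for every pair of positions.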

2 comments

robbedpeter 1647 days ago

The Transformer in Transformer (TNT) model looks promising - you can set up multiple overlapping domains of attention, at arbitrary scales over the input.


algo_trader 1647 days ago

But you have to pay the price for losing the inductive bias of CNNs.

Swin models are still CPU/memory (and data) intensive compared to CNNs, right?


heyitsguay 1647 days ago

Not as much as you'd think. The original paper sets up its models so that Swin-T ~ ResNet-50 and Swin-S ~ ResNet-101 in compute and memory usage. They're still a bit higher in my experience, but I can also do drop-in replacements for ResNets and get better results on the same tasks and datasets, even when the datasets aren't huge.
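For a rough sense of why the cost stays in CNN territory, here is a back-of-the-envelope sketch using the attention complexity estimates as I recall them from the Swin paper: global multi-head self-attention on an h x w feature map with C channels costs about 4hwC^2 + 2(hw)^2C, while window attention with M x M windows costs about 4hwC^2 + 2M^2hwC (softmax cost ignored). The function names are mine, and the example sizes (56 x 56 map, C = 96, M = 7) assume the first stage of Swin-T on a 224 x 224 input.

```python
def msa_flops(h, w, c):
    """Approximate cost of global self-attention on an h x w map with c channels."""
    tokens = h * w
    projections = 4 * tokens * c ** 2   # Q, K, V and output projections
    attention = 2 * tokens ** 2 * c     # QK^T and attention-weighted sum of V
    return projections + attention


def wmsa_flops(h, w, c, m=7):
    """Approximate cost of window attention with non-overlapping m x m windows."""
    tokens = h * w
    projections = 4 * tokens * c ** 2
    attention = 2 * m ** 2 * tokens * c  # each token attends to m*m keys only
    return projections + attention


global_cost = msa_flops(56, 56, 96)
window_cost = wmsa_flops(56, 56, 96, m=7)
ratio = global_cost / window_cost
```

The global term grows quadratically with the number of tokens, while the windowed term grows linearly for a fixed M, so at this resolution the windowed version comes out more than ten times cheaper for the attention layer alone. This says nothing about the rest of the model or about memory traffic, which is probably where the "still a bit higher" comes from.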
