But perhaps you should be counting the horrifically expensive operations (cache misses) in your asymptotic analysis, not the virtually-free operations (writes).
Only if those operations become more expensive as the problem size grows. The point of asymptotic analysis is the relationship between problem size and expense, not the expense of individual operations.
To put it another way, imagine an strange implementation of quicksort that caused a cache miss on every comparison and every swap, and an implementation of bubble sort that hardly ever caused a cache miss. Which would you use in your code?
Hm, glib answers aside, you're right. I think the real point is that counting writes makes no sense because it is both not the most expensive and not the most frequent operation involved. Really, the whole thing comes from looking at lists wrong. Find-and-delete is not O(1), and converting half of an O(n) operation to O(1) doesn't actually make an asymptotic difference. If you are genuinely just performing inserts and removals of known elements things will look different.
In a real-world app that was actually subject to performance requirements? If the bubble sort meets the goal where quick sort doesn't, hell yes that's what'll get picked.
Not all optimization is premature optimization. Some apps care. And the point of this article is that some operations which people are trained to think of as "fast" (like the "O(1)" pointer operations in a list traversal) actually aren't.
I suspect that the real-world app using bubble sort would be fine...until the day it was not fine. By that point, you'll have a bunch of people who depend on the software, while you scramble to figure out why things became so slow.
Constant factors should not be ignored, but neither should scalability. That is the reason that production-grade sorting algorithms switch from quicksort to bubblesort/selection sort when the sequence is short enough. That is also the reason we typically use quicksort rather than heapsort -- quicksort usually has a lower constant factor on modern architectures, which is what matters when the asymptotics are the same.
The problem with focusing on the performance of your system for particular input sizes is that you are usually not guaranteed that the input size will not increase. It is not premature optimization if you have a good reason to believe that the input size will not grow, or that it will not grow enough to matter. Such is the case with matrix multiplication. If you have no evidence of that, though, you are almost certainly better off choosing the asymptotically better algorithm first, and coming back to tune that / switch to difference algorithms on smaller inputs / etc. later on.
To put it another way, imagine an strange implementation of quicksort that caused a cache miss on every comparison and every swap, and an implementation of bubble sort that hardly ever caused a cache miss. Which would you use in your code?