That would be colossally inefficient - the sheer size of the chip means signals would take multiple cycles to get from one side to the other. The solution would be to localize processing into distinct processing units on the one die. At that point you’ve reinvented multiple cores, and it starts becoming cost effective to split them into separate chips to improve yields :)
The problem is the increased power usage of the additional caches that become necessary - modern CPUs already need a bunch of physically local caches in addition to the large L1/2/3/n caches, because of the time it takes a signal to get from A to B. At some point the benefit of a larger single die becomes minimal. The moment that happens, you benefit from making separate chips because of the increased yield.
Most modern chips already use numerous clocks (aside from anything else, propagation delay for the clock signal is already a problem).
The problem is not simply "because clock cycle", it is "if a signal takes X ns to get from one execution unit to the next, then that's X ns of functionally idle time". At best that means additional latency. The more latency involved in computing a result, the more predictive logic you need, and for dependent operations the latency matters directly.
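As a rough back-of-envelope sketch of that idle time: the crossing time is distance divided by signal speed, and multiplying by the clock frequency gives the number of cycles lost. The die widths, the 4 GHz clock, and the effective signal speed of half the speed of light are all illustrative assumptions, not measurements; long on-chip wires are RC-limited and usually a good deal slower, which only makes things worse.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def crossing_time_ns(distance_m, signal_speed_m_per_s):
    """Time in nanoseconds for a signal to travel distance_m."""
    return distance_m / signal_speed_m_per_s * 1e9


def cycles_lost(distance_m, signal_speed_m_per_s, clock_hz):
    """Number of clock cycles that elapse while the signal is in flight."""
    return distance_m / signal_speed_m_per_s * clock_hz


if __name__ == "__main__":
    # Assumed values for illustration only.
    speed = 0.5 * SPEED_OF_LIGHT_M_PER_S
    clock_hz = 4e9
    for width_mm in (10, 30, 200):
        distance_m = width_mm / 1000.0
        print(
            f"{width_mm:>4} mm: "
            f"{crossing_time_ns(distance_m, speed):.3f} ns, "
            f"{cycles_lost(distance_m, speed, clock_hz):.2f} cycles"
        )
```

Even with that optimistic signal speed, the lost cycles scale linearly with both die width and clock frequency, so a much bigger die at the same clock pays proportionally more.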
An asynchronous chip does not avoid the same problems encountered by a multistage pipelined processor; it's purely a different way to manage varying instruction execution times.
But this doesn't answer the killer problem of yield. The larger a single chip is, the more likely any given chip is to have defects, and therefore the fewer working chips you get out of a given wafer after the multiple weeks/months that wafer has been trundling through a fab. Modern chips put in a lot of redundancy to maximize the chance that sufficient parts of a given core survive manufacture to allow a complete chip to function, e.g. more cache and execution units are fabricated than necessary, and at the end of manufacture any components that have defects are in effect lasered out. If at that point a chip doesn't have enough remaining cache/execution units, or a defect lands somewhere that has no redundancy, the entire chip is dead.
The larger a given die is the greater the chance that the entire die will be written off.
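To put rough numbers on that, here is a minimal sketch using the simple Poisson yield model, where the chance of a die having zero defects is exp(-area * defect density). The defect density and wafer area are made-up illustrative values, edge losses are ignored, and real fabs use more refined models, so treat the output as a trend rather than a prediction.

```python
import math


def die_yield(die_area_cm2, defects_per_cm2):
    """Poisson model: probability that a die of this area has zero defects."""
    return math.exp(-die_area_cm2 * defects_per_cm2)


def good_dies_per_wafer(wafer_area_cm2, die_area_cm2, defects_per_cm2):
    """Expected defect-free dies per wafer, ignoring edge losses."""
    candidates = math.floor(wafer_area_cm2 / die_area_cm2)
    return candidates * die_yield(die_area_cm2, defects_per_cm2)


if __name__ == "__main__":
    # Assumed values for illustration only.
    wafer_area_cm2 = 700.0
    defects_per_cm2 = 0.1
    for die_area_cm2 in (1.0, 2.0, 4.0, 8.0):
        y = die_yield(die_area_cm2, defects_per_cm2)
        good = good_dies_per_wafer(wafer_area_cm2, die_area_cm2, defects_per_cm2)
        print(
            f"die {die_area_cm2:>4.1f} cm^2: yield {y:.1%}, "
            f"good dies {good:.0f}, good silicon {good * die_area_cm2:.0f} cm^2"
        )
```

Under this model, yield falls off exponentially with die area, so the amount of sellable silicon per wafer shrinks as dies grow, not just the die count.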
That massive ML chip from a few days ago worked by massively over-provisioning execution units. I suspect they end up with much more lost area on a given wafer than they would with many small chips, which directly contributes to actual cost.
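The over-provisioning trade-off can be sketched with a simple binomial model: if each unit independently survives manufacture with some probability, the chip works when at least the required number of units survive. Independence and the survival probability used here are assumptions for illustration; real defects cluster, and this ignores defects in logic that has no spare.

```python
from math import comb


def chip_survival_probability(fabricated, required, unit_survival):
    """Probability that at least `required` of `fabricated` units are defect-free.

    Assumes each unit survives independently with probability `unit_survival`.
    """
    return sum(
        comb(fabricated, k)
        * unit_survival**k
        * (1.0 - unit_survival) ** (fabricated - k)
        for k in range(required, fabricated + 1)
    )


if __name__ == "__main__":
    # Assumed values for illustration only.
    required = 64
    unit_survival = 0.95
    for spares in (0, 2, 4, 8):
        fabricated = required + spares
        p = chip_survival_probability(fabricated, required, unit_survival)
        overhead = spares / required
        print(f"{spares} spares ({overhead:.1%} extra area): chip works {p:.1%}")
```

A handful of spares rescues most chips, but every spare is area paid for on every die whether or not it ends up being needed, which is where the lost wafer area comes from.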