The CPU itself has PCIe lanes, some of which are routed to the slots (PCI Express Graphics/"PEG lanes") and some of which are routed to the chipset. The chipset then multiplexes that uplink to provide additional "chipset lanes" (which obviously have to share the bandwidth available over the trunk).
You could potentially put a 4.0 chipset on a 3.0 CPU and it should nominally work: the chipset would provide 4.0 lanes to devices, but all the traffic would be multiplexed at 3.0 on the way back to the CPU, so there wouldn't be a whole lot of point.
You could also put a 3.0 chipset on a 4.0 CPU, which works fine and is even sensible for budget motherboards (this is how the AMD B550 chipset will work).
The things on the chipset tend to be slower devices that just need to be attached, not necessarily run fast. The chipset usually only gets 4 lanes total (so like, one NVMe drive saturates it) and hopping to the chipset adds latency which reduces IOPS on fast networking or Optane drives, reduces graphics performance for chipset-attached GPUs, etc.
On low-end desktop (1151 and friends), there are 16 PCI-E lanes directly from the CPU (typically used by the GPU), and the rest are connected to the PCH, which connects to the CPU over DMI. The main problem with updating the PCH alone here is that the link between the CPU and the PCH is only ~4 GB/s, so PCI-E 4.0 speeds for the lanes extending from the PCH are rather pointless.
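To put rough numbers on that bottleneck, here is a minimal sketch. It assumes the simplest possible model: everything hanging off the PCH shares one uplink, and the aggregate throughput to the CPU is capped by whichever is smaller, the combined device demand or the uplink. The function name and the example figures (~4 GB/s uplink, ~8 GB/s for a 4.0 x4 drive) are illustrative, not measured.

```python
def chipset_throughput(device_links_gb_s, uplink_gb_s):
    """Aggregate throughput (GB/s) to the CPU when all devices are busy at once.

    Simplified model (an assumption): no protocol overhead, no latency effects,
    just the smaller of total device demand and the shared uplink.
    """
    return min(sum(device_links_gb_s), uplink_gb_s)


# One PCIe 4.0 x4 NVMe drive (~8 GB/s) behind a ~4 GB/s uplink:
print(chipset_throughput([8.0], 4.0))

# The same drive behind a hypothetical ~8 GB/s uplink:
print(chipset_throughput([8.0], 8.0))
```

In the first case the drive is held to the uplink speed, which is the "rather pointless" situation described above; only a faster CPU-to-PCH link changes that.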
On HEDT and modern server (2066 and friends), only about 8 of the PCI-E lanes come from the chipset, and the rest come directly from the CPU. QPI and its successor UPI are used to connect CPUs to each other, not to a chipset.
It's apparently UPI now[1], but they'll still need to do a lot of updating to the CPU and UPI to get the bandwidth of PCI-E 4.0. Best info I can find[1] puts PCI-E 4.0 x16 at around 64GB/s, and UPI at a max of 28GB/s on their best Xeon CPUs and FPGAs. Having that kind of bottleneck at the chipset would make it nearly pointless, since even PCI-E 3.0 with an x16 link is 32GB/s.
Chipset doesn’t run at x16 bandwidth, it gets 4 lanes of bandwidth. At least for consumer platform, but I doubt server runs anything significant off chipset. QPI/UPI may run faster between sockets but I don’t think chipset is a full speed link by any means.
Nominally QPI/UPI but those protocols are not hugely distinct from PCIe in general. AFAIK it’s basically PCIe but encrypted, so that nobody else can replicate their chipsets (like used to happen in the old days with nForce/etc).
Also your numbers are off. An easy rule of thumb is that one PCIe 3.0 lane is about 1 GB/s of bandwidth in each direction. So 3.0 x16 is about 16 GB/s of bandwidth.
So the chipset link just needs about 8 GB/s of bandwidth to run at 4.0 x4 speeds.
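For anyone who wants to check the rule of thumb, here is a rough sketch of the arithmetic. It uses the raw transfer rates and line encodings for each PCIe generation (8b/10b for 1.0 and 2.0, 128b/130b for 3.0 and 4.0) and ignores packet/protocol overhead, so real-world throughput will be somewhat lower. The function and table names are just for illustration.

```python
# Raw transfer rate (GT/s) and line-encoding efficiency per PCIe generation.
PCIE_GENERATIONS = {
    "1.0": (2.5, 8 / 10),
    "2.0": (5.0, 8 / 10),
    "3.0": (8.0, 128 / 130),
    "4.0": (16.0, 128 / 130),
}


def link_bandwidth_gb_per_s(gen, lanes):
    """Approximate one-direction link bandwidth in GB/s, before protocol overhead."""
    rate_gt, efficiency = PCIE_GENERATIONS[gen]
    # Each transfer carries one bit per lane; divide by 8 to get bytes.
    return rate_gt * efficiency / 8 * lanes


for gen, lanes in [("3.0", 1), ("3.0", 16), ("4.0", 4), ("4.0", 16)]:
    print(f"PCIe {gen} x{lanes}: {link_bandwidth_gb_per_s(gen, lanes):.2f} GB/s")
```

This gives just under 1 GB/s for a 3.0 lane, just under 16 GB/s for 3.0 x16 and just under 8 GB/s for 4.0 x4. The 32 and 64 GB/s figures you sometimes see for x16 links count both directions added together.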