I can think of ways to approximate FQ on an input-queued switch while maintaining performance, but that doesn't really help: as long as the total number of queues is a function of the traffic, you are still open to resource exhaustion attacks.
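To make the trade-off concrete, here is a minimal sketch of one well-known way to stop the queue count from depending on traffic: hash flows into a fixed number of buckets and run deficit round robin over the buckets (roughly the SFQ idea). This is not a description of any particular switch; `HashedDRR`, the bucket count, and the quantum are all hypothetical names and values chosen for illustration. The cost is that colliding flows share a bucket, so it only approximates per-flow fairness.

```python
from collections import deque
import zlib

NUM_QUEUES = 8   # hypothetical: fixed, so queue state cannot grow with flow count
QUANTUM = 1500   # hypothetical: bytes of credit each bucket earns per round


class HashedDRR:
    """Deficit round robin over a fixed set of hash buckets (illustrative only)."""

    def __init__(self, num_queues=NUM_QUEUES, quantum=QUANTUM):
        self.queues = [deque() for _ in range(num_queues)]
        self.deficit = [0] * num_queues
        self.quantum = quantum

    def enqueue(self, flow_id, size):
        # Any number of flows maps onto the same fixed set of buckets.
        idx = zlib.crc32(flow_id.encode()) % len(self.queues)
        self.queues[idx].append((flow_id, size))

    def dequeue_round(self):
        """Serve every bucket once and return the packets sent, in order."""
        sent = []
        for i, q in enumerate(self.queues):
            if not q:
                self.deficit[i] = 0
                continue
            self.deficit[i] += self.quantum
            # Send packets while the head packet fits in the bucket's credit.
            while q and q[0][1] <= self.deficit[i]:
                pkt = q.popleft()
                self.deficit[i] -= pkt[1]
                sent.append(pkt)
            if not q:
                # An emptied bucket does not carry credit into later rounds.
                self.deficit[i] = 0
        return sent


sched = HashedDRR()
for i in range(1000):
    sched.enqueue(f"flow-{i}", 1500)
first_round = sched.dequeue_round()
print(len(sched.queues), len(first_round))
```

A thousand flows still occupy only `NUM_QUEUES` buckets, and one round sends at most one full-size packet per bucket. Per-bucket packet memory still has to be capped separately, which this sketch does not do.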
I think FQ helps everywhere, and FQ + AQM is needed at every fast->slow transition on the network. Ideally the core is overprovisioned... except where it is not.
The bottleneck is mostly on the read path, even leveraging XDP heavily. DPDK/VPP would be faster. For an ISP, however, 10k subscribers per network segment at 25Gbit is more than enough, and that load busies 40% of 20 cores on LibreQoS. Others are making boxes capable of more using smarter Ethernet cards. If only more ISPs cared about QoE, we could cause a run on eBay for slightly obsolete Xeons and finish up fixing the bufferbloat problem right quick!
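As a back-of-envelope reading of those numbers, the sketch below works out what they imply per core and per subscriber. It assumes the 40% figure was measured with the full 25Gbit in flight and that load divides evenly across subscribers; neither is stated above, so treat the results as rough.

```python
LINK_GBIT = 25            # segment capacity quoted above
SUBSCRIBERS = 10_000      # subscribers per segment quoted above
CORES = 20
UTILIZATION_PERCENT = 40  # assumed to be measured at full line rate

# Core-equivalents kept busy by the shaper.
busy_cores = CORES * UTILIZATION_PERCENT / 100

# Throughput handled per busy core-equivalent, in Gbit/s.
gbit_per_busy_core = LINK_GBIT / busy_cores

# Average share per subscriber if everyone were active at once, in Mbit/s.
mbit_per_subscriber = LINK_GBIT * 1000 / SUBSCRIBERS

print(busy_cores, gbit_per_busy_core, mbit_per_subscriber)
```

Under those assumptions that is 8 cores' worth of work, a bit over 3 Gbit/s per busy core, and 2.5 Mbit/s per subscriber on average.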