Perhaps start with a video activity recognition model like X-CLIP [1].
It assigns activity labels to segments of your video by scoring each segment against candidate labels or free-text descriptions that you supply.
Then use text-matching rules or, if the activities are diverse, semantic matching on those labels to decide which fragments of the video to keep and which to remove.
Try a model demo first with one of your sports videos to see what it outputs [2].
Alternative demos are available at [3].
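
To make the label-matching step concrete, here is a minimal sketch of the rule-based variant. It assumes you already have one predicted label per fixed-length segment. The `Segment` class, the helper names, and the sample labels are all made up for illustration and are not part of X-CLIP or any library. It keeps segments whose label contains a keyword and merges adjacent ones into time ranges you could then cut with your video tool of choice.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    label: str    # label predicted for this segment (hypothetical data)


def select_segments(segments, keywords, keep=True):
    """Return segments whose label contains any keyword (keep=True),
    or the segments that contain none of them (keep=False)."""
    lowered = [k.lower() for k in keywords]
    selected = []
    for seg in segments:
        matched = any(k in seg.label.lower() for k in lowered)
        if matched == keep:
            selected.append(seg)
    return selected


def merge_intervals(segments, gap=0.0):
    """Merge segments that touch or lie within `gap` seconds of each other
    into (start, end) tuples sorted by start time."""
    merged = []
    for seg in sorted(segments, key=lambda s: s.start):
        if merged and seg.start - merged[-1][1] <= gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], seg.end))
        else:
            merged.append((seg.start, seg.end))
    return merged


# Hypothetical per-segment predictions for a 20-second clip.
segments = [
    Segment(0.0, 4.0, "warming up"),
    Segment(4.0, 8.0, "dribbling basketball"),
    Segment(8.0, 12.0, "shooting basketball"),
    Segment(12.0, 16.0, "crowd cheering"),
    Segment(16.0, 20.0, "dunking basketball"),
]

kept = select_segments(segments, ["basketball"])
print(merge_intervals(kept))  # [(4.0, 12.0), (16.0, 20.0)]
```

Passing `keep=False` inverts the rule, so the same keywords describe what to cut out instead. For diverse activities, the substring check could be swapped for a similarity score between text embeddings, which is the semantic-matching option mentioned above.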