There's no single answer to that, since different models are, well, different. Beyond just modalities (text input and image output? image input and video output?), they are built with different underlying tools. And then, of course, what do you mean by API? How do you want to interact with it?
In general, you'd take a request that requires an inference step, invoke the model with some parameters and input, and return the output. Beyond that, you'd need to give more detail.
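To make that general flow concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: `dummy_model` is a hypothetical stand-in for a real model, and the `input` / `parameters` / `output` field names are made up, not any particular product's API.

```python
import json
from typing import Callable


def dummy_model(text: str, max_length: int = 20) -> str:
    # Hypothetical stand-in for a real model: upper-cases the input
    # and truncates it to max_length characters.
    return text.upper()[:max_length]


def handle_request(body: str, model: Callable[..., str] = dummy_model) -> str:
    """Parse a JSON request, run one inference step, return a JSON response."""
    request = json.loads(body)
    model_input = request["input"]            # what the model should process
    params = request.get("parameters", {})    # optional inference parameters
    output = model(model_input, **params)     # the inference step
    return json.dumps({"output": output})


if __name__ == "__main__":
    print(handle_request('{"input": "hello", "parameters": {"max_length": 3}}'))
```

In a real service the same function body would sit behind whatever transport you pick (HTTP, gRPC, a queue), which is exactly the "how do you want to interact with it?" part of the question.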
I specialize in this area and build a product for self-hosted inference.
The main challenge in supporting a new model architecture is coding the preprocessing for inputs (such as tokenization, or image resizing and color feature extraction) and the postprocessing for outputs (for example, entity recognition needs to look up the entities and align them with the text).
Once the pre/post processing for an architecture is coded, serving a new model with that architecture is easy!
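Here is a rough sketch of that idea in Python, not how any particular product does it. All names (`Architecture`, `ARCHITECTURES`, `serve`, the two toy models) are hypothetical, and the "models" are simple functions standing in for real ones. The point is that the pre/post processing is written once per architecture, and any model of that architecture plugs in unchanged.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Tokens = List[str]
Labels = List[str]
Entities = List[Tuple[str, str]]


@dataclass
class Architecture:
    # Written once per architecture, reused by every model of that architecture.
    preprocess: Callable[[str], Tokens]
    postprocess: Callable[[Tokens, Labels], Entities]


def tokenize(text: str) -> Tokens:
    # Toy preprocessing: whitespace tokenization.
    return text.split()


def align_entities(tokens: Tokens, labels: Labels) -> Entities:
    # Toy postprocessing: pair each labeled token with its text, dropping "O" (no entity).
    return [(tok, lab) for tok, lab in zip(tokens, labels) if lab != "O"]


ARCHITECTURES: Dict[str, Architecture] = {
    "token-classification": Architecture(tokenize, align_entities),
}


def serve(arch_name: str, model: Callable[[Tokens], Labels], text: str) -> Entities:
    arch = ARCHITECTURES[arch_name]
    tokens = arch.preprocess(text)
    labels = model(tokens)  # the only model-specific step
    return arch.postprocess(tokens, labels)


# Two toy "models" that share the same architecture.
def capitalized_model(tokens: Tokens) -> Labels:
    return ["ENT" if t[0].isupper() else "O" for t in tokens]


def digit_model(tokens: Tokens) -> Labels:
    return ["NUM" if t.isdigit() else "O" for t in tokens]


if __name__ == "__main__":
    text = "Alice met Bob in 2020"
    print(serve("token-classification", capitalized_model, text))
    print(serve("token-classification", digit_model, text))
```

Swapping `capitalized_model` for `digit_model` needs no new pre/post processing code, which is the "easy" part. Adding a new architecture means writing a new `Architecture` entry.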