by gschoeni 914 days ago

Hey all, I ran some experiments benchmarking fine-tuned ViT and ResNet50 against zero-shot CLIP on a Facial Emotion Recognition dataset. I had read the original papers over the past few weeks, but wanted some practical, hands-on experience with the models themselves.

https://blog.oxen.ai/practical-ml-dive-how-to-customize-a-vi...

TLDR: ViT worked best in this small experiment, with minimal code. The task was classifying 7 different facial emotions such as "happy", "sad", and "angry".

Model accuracy:

* ViT - 69%
* ResNet50 - 64%
* Zero-Shot CLIP - 53%

I was honestly most impressed by CLIP's ability to do zero-shot transfer, even though it had the worst accuracy. You can give it a freeform list of prompts or labels and it will classify each image into that set without any training, which feels like the future of prototyping products and models. Once you have defined your use case, you can move to something more performant like a ViT.
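For anyone wondering what zero-shot classification boils down to, here is a rough sketch of just the scoring step, not the code from the blog post. It assumes you already have one embedding for the image and one per label prompt from CLIP's image and text encoders; the 3-dimensional vectors below are made-up toy values, and `zero_shot_classify` and the `temperature` value of 100 are illustrative choices rather than anything taken from the experiment.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, label_embs, temperature=100.0):
    """Turn image-to-label similarities into a probability per label.

    label_embs maps each label to its (hypothetical) text embedding.
    temperature is an assumed scale applied before the softmax.
    """
    logits = {
        label: temperature * cosine(image_emb, emb)
        for label, emb in label_embs.items()
    }
    # Subtract the max logit for a numerically stable softmax.
    top = max(logits.values())
    exps = {label: math.exp(v - top) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy stand-ins for real encoder outputs (assumption, not real CLIP vectors).
label_embs = {
    "happy": [1.0, 0.1, 0.0],
    "sad": [0.0, 1.0, 0.1],
    "angry": [0.1, 0.0, 1.0],
}
image_emb = [0.9, 0.2, 0.1]

probs = zero_shot_classify(image_emb, label_embs)
prediction = max(probs, key=probs.get)
print(prediction)
```

Swapping in a different label list only means embedding new prompts; nothing gets retrained, which is why it is so quick to prototype with.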

Anyways, I had fun writing the code and running the experiments, so I thought I would share!