| Oh cool! I've been doing similar experiments lately (using ViTs) to do card recognition, and so far it's been working really well for me. If you want to compare notes, I've open-sourced my code / weights [0] and written some blog posts about how mine works [1]. I'd love to see if we can collaborate! > Push the inference to the client-side (WebGPU / Web Workers). I have an example of this working in WebGPU / WASM here [2] along with a playground environment (demonstrated here [3]). I'm currently training a new version that uses a different ViT backbone more optimized for WASM inference -- it's currently converging, and I hope to have it finish training (or at least reach parity with the previous model) in about a week (it took ~200 epochs for my last one to reach its current level, at about an hour per epoch on my current setup). You mentioned WebGPU -- I've run into issues with the MobileViT-XXS backbone producing bad results in WebGPU on Android, so YMMV on whether WebGPU is stable enough for this. I don't know if it's my problem or a true bug in the platform, but I've fallen back to WASM and things have been working much better since. [0] - https://github.com/HanClinto/CollectorVision [1] - https://blog.hanclin.to/posts/gh-19/ [2] - https://hanclinto.github.io/CollectorVision/ [3] - https://youtu.be/MHieOcmC7Dw |