by ashvardanian 1203 days ago
Thank you! Founder here :) You are right, those are the base papers, but we have extended the set of objectives quite significantly, tapping into modalities that haven’t been publicly CLIP-ed :)

It is probably worth writing a paper about, but we are just too busy building tons of open-source stuff. Check out the GitHub org here: https://github.com/unum-cloud

It is not just about the transformers, but also about databases, networking, and improving the modern data stack for very large-scale retrieval-based AI. A lot of the pieces may be pre-production, but I believe the amazing HN community may still enjoy the ways we use io_uring, SIMD, and a few other less-than-popular technologies.


Are the pretraining and training pipelines available anywhere under a FOSS license? I'd love to take a swing at training a mid-fusion model on data other than text and images (e.g., sound, neuron spike trains, etc.)
Not yet, but you can ping our team on Discord or Twitter. They are soft like marshmallows, a couple of compliments and they will be leaking scripts left and right :)
> may still enjoy the ways we use io_uring, SIMD, and a few other less-than-popular technologies

Fairly standard for any perf-minded shop but good to see more people discovering them.

man I just looked at ukv, it looks too good to be true, 30x RocksDB, wtf! Hoping it's true
He-hey! Yes, we are fast, but I don’t think we ever claimed 30x. We are faster in almost every workload (we lose on range scans for some reason), but at best by 7x (batch reads) and 5x (batch writes). Still, this should be plenty for all intents and purposes! I can post some updates on that tomorrow :)
If you are curious about how it works, here is a pretty good explanation: https://youtube.com/watch?v=ybWeUf_hC7o

For some reason the conference hasn’t made last year’s talks public or searchable, but you should be able to access it with the link.

where is the udisk? Repo is just a readme on configuration.
Yes, we decided to keep UDisk closed source for now. That repo is just a tiny description of the expected configuration files. At this point UDisk powers our soon-to-be-public cloud offering and is piloting in a few FAANG-scale companies. Our human resources are very limited for now, but we can probably run a couple more such pilots concurrently. Reach out to info [at] unum.cloud or join our Discord if you are from one of those large companies and want to battle-test our secret sauce on a few petabytes of your data :)