Hacker News new | ask | show | jobs
by steinvakt2 52 days ago
This is not a new model. Also, it hallucinates a lot. Also, it's very heavy and slow in inference. It's also bad in multilingual.

Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.

8 comments

It has some perks, is a bit more expressive in some cases, but overall is trained on really noisy data, uses more memory, and isn't that fast - I'm talking about the (7b?) version that they released then removed quickly (vibevoice-community on github) - I still use chatterbox turbo and sometimes qwen TTS.
Yeah, I don't get why it is suddenly getting so much attention today, it is all over twitter too
there is so much more subversive marketing out there than any of us can really fathom. i try not to be too paranoid but it's getting a lot harder every day.

i know someone who worked in what we might call the 'astroturfing' space within the entertainment industry. after having a few discussions with him and with things like this[0] becoming more known, it's really difficult to afford any assumption of organic intent when money is on the line - especially at the scale that microsoft works at compared to something as comparatively quaint as the music industry.

[0] https://www.wired.com/story/geese-chaotic-good-marketing-ind...

Simonw (who has a bit of a Midas touch for posts here) just posted about it https://simonwillison.net/2026/Apr/27/vibevoice/
To be fair, his Midas touch is a result of consistency and a lot of hard work.

It's like the gardener at one of the Oxford colleges said - it's really easy to create these perfect lawns, just turn up every day and trim and water it - for a couple hundred years.

I thought they rolled it as well?
As always with people: listen to what they say, not to what they do...

After all, they rarely do what they say themselves, so it's surely not entirely made up nonsense!

well duh, they updated the news section

https://github.com/microsoft/VibeVoice/commit/e73d1e17c3754f...

which is microsoft for "we removed two dead links". AI innovation knows no limits!

Interestingly that seems to be in response to [1], which might indeed be the trigger for this.

[1] https://doublepulsar.com/microsoft-vibing-capturing-screensh...

Yes, the SOTA is currently much more advanced.
What do you consider to be SOTA?
It is not good for text to speech (TTS) as well. I am trying it for few days. First of all 1.5B model documentation is not there. 0.5B realtime is shit model. I was converting text, line by line and it was randomly adding music and couldn't handle special characters like "…".

I really disappointed with this model to say the least.

> ...it was randomly adding music...

I've been noticing this with the Mistral Voxtral TTS models too. I have my AI record a morning briefing podcast for myself, and occasionally there are sounds like music at the start (the british voice had a musical tone underneath that sounded a little like the end of the BBC News theme). I don't think I've ever encountered that with the OpenAI TTS models, so they're now my default go-to again.

yep, it seems this was trained on large amount of podcasts with ad jingles or phone call queues with elevator music. I was also pretty disappointed to run the TTS last week.
The 7B parameter Vibevoice TTS model is still the most impressive local TTS model i've tried. It was pulled by Microsoft a few days after its release due to "abuse potential" but it can be found in various community maintained huggingface repos.
you saved us a lot of time here.... i unstarred the repo

moving on....

I don't really pay attention to stars. Do people use them as bookmarks? Why would you star a repo if you knew so little about it?
Stars for me are basically "this might be interesting but I don't have time to look at it now, hopefully I'll think about it later and give it a second look".
I exclusively use stars as bookmarks which is why I always found it strange when people talked about lots of stars meaning high quality or trustworthy…I’ve learned since then that I’m probably in the minority (both in using stars as bookmarks and not caring about how many stars a repo has).
Judging by how many people apparently are paying bots to give their lazily vibe-coded repos thousands of stars, it seems like people both simultaneously take stars seriously while not taking them seriously at all. It breaks my brain.
You just saved me an afternoon.
Saved a lot of my time thanks!
I'm shocked, shocked to find that Microsoft takes credit for a slow, unoriginal product that doesn't actually do what it advertises.
Imagine the balls it took to willingly attach the Microsoft label to the front of the product that is Teams.
I mean the same can be said about most versions of Windows as well. People act like Windows 11 is where it all went sour, but I've personally kind of hated it since Windows XP.

I feel like a recurring pattern with Microsoft is to create something quickly, market it aggressively and push for everyone to use it immediately, and only once it is installed everywhere do people suddenly realize how terrible it is, but it's too late to change.

I'm surprised you picked XP as the falling point. I didn't enjoy the days of reinstalling 95/98/ME every 6 months to avoid driver weirdness and seemingly random failures. XP was built on the foundation of 2000, which tended to make it more robust vs. its predecessors.

Vista on the other hand...

I mean, part of it is that I really hated the Fisher Price look to it, but it was also the first time I ever felt like I had to "hack" things to make stuff work. I had to muck with registry keys. Oh, and it was the first time that I noticed that Windows repair tools do not work.

I suspect I might have hated 9x more but I was pretty young when they came out and I didn't really "get into" computers until XP, and I disliked it enough to dual-boot Linux as a twelve year old.