by MyFirstSass 557 days ago
Wow, this is bad. And by bad I mean worse than the leading open source and existing alternatives.

Is it me or does it seem like OpenAI revolutionized with both ChatGPT and Sora, but they've completely hit the ceiling?

Honestly a bit surprised it happened so fast!

10 comments

I think we're in the Snapdragon age of AI for the next little bit, if you were around for early smartphones.

Each company would either rush to get a phone out with the new Snapdragon chip, or take their time to polish a release and have a better phone late in the cycle. But the real improvements were just the chip.

Nvidia chips and larger data centers are the chips. The models are the plethora of Android phones each generation.

That kept going until progress stabilized. Then the best user experience & vertical integration won over chasing chip performance (Apple).

Same goes for DALL-E. It was cool to try it the first week or so, but now the output is so much worse than Midjourney and Stable Diffusion. For me it can't even generate straight lines, and everything looks comic-ish.
DALL-E 3 image quality has always been subpar, but its prompt adherence is on par with FLUX. Midjourney has some of the worst prompt adherence, but some of the best image quality.
DALL-E 3 image quality was absolutely amazing... for about 3 days. Then they must have panicked, because after that, everything it emitted included that ridiculous telltale orange/blue tint.
To me this is just a simple artifact of size & attention.

Another example of this is stuff like Bluesky. There's a lot of reasons to hate Twitter/X, but people going "Wow, Bluesky is so amazing, there's no ads and it's so much less toxic!" aren't complimenting Bluesky, they're just noting that it's smaller, has less attention, and so they don't have ads or the toxic masses YET.

GenAI image generation is an obvious vector for all sorts of problems, from copyrighted material, to real life people, to porn, and so on. OpenAI and Google have to be extraordinarily strict about this due to all the attention on them, and so end up locking down artistic expression dramatically.

Midjourney and Stable Diffusion may have equal stature amongst tech people, but in the public sphere they're unknowns. So they can get away with more risk.

>OpenAI and Google have to be extraordinarily strict

Why? Did the inventors of VHS tapes "have to be extraordinarily strict" and bake in safeguards because people might violate copyright laws, make porn, or tape something illegal?

Enforcing laws is the responsibility of the legal system. It sets a concerning precedent when companies like OAI would rather lobotomize their flagship products than risk them generating any Wrongthink.

If you're going to say something like this, you need to back it up with specific alternatives that provide a better result.

Besides just citing your sources, I'm genuinely curious what the best ones are for this so I can see the competition :)

HunYuan released by Tencent [1] is much better than Sora. It's 100% open source, is compatible with fine tuning, ComfyUI, control nets, and is receiving lots of active development.

That's not the only open video model, either. Lightricks' LTX, Genmo's Mochi, and Black Forest Labs' upcoming models will all be open source video foundation models.

Sora is commoditized like DALL-E at this point.

Video will be dominated by players like Flux and Stable Diffusion.

[1] https://github.com/Tencent/HunyuanVideo/

Something being available OSS is very different from a turnkey product solution, not to mention that Tencent's 60 GiB requirement requires a setup with like at least 3-4 GPUs which is quite rare & fairly expensive vs something time-sharing like Sora where you pay a relatively small amount per video.

I think the important thing is task quality and I haven't seen any evaluations of that yet.
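
As a back-of-the-envelope check on the "3-4 GPUs" figure above, here is a minimal sketch. It assumes the 60 GiB has to fit entirely in GPU memory and can be split evenly across identical cards, which is a simplification: real setups also need headroom for activations and lose some memory to sharding overhead. The `gpus_needed` helper and the 16/24/80 GiB card sizes are illustrative assumptions, not taken from Tencent's documentation.

```python
import math


def gpus_needed(model_gib: float, gpu_gib: float) -> int:
    """Smallest number of identical cards whose combined memory covers model_gib.

    Hypothetical helper: ignores activation memory and sharding overhead.
    """
    return math.ceil(model_gib / gpu_gib)


# 60 GiB is the requirement quoted in the thread; card sizes are assumptions.
for gpu_gib in (16, 24, 80):
    print(f"{gpu_gib} GiB cards: {gpus_needed(60, gpu_gib)}")
```

With 24 GiB consumer cards this gives 3, and with 16 GiB cards it gives 4, which matches the range quoted above. A single 80 GiB data-center card would cover it alone.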

> Something being available OSS is very different from a turnkey product solution, not to mention that Tencent's 60 GiB requirement requires a setup with like at least 3-4 GPUs which is quite rare & fairly expensive vs something time-sharing like Sora where you pay a relatively small amount per video.

It took two weeks to go from Mochi running on 8xH100s to running on 3090s. I don't think you appreciate the rapidity at which open source moves in this space.

HunYuan landed less than one week ago with just one modality (text-to-video), and it's already got LoRA training and fine tuning code, Comfy nodes, and control nets. Their roadmap is technically impressive and has many more control levers in scope.

I don't think you realize how "commodity" these models are and how quickly closed-off "turnkey solutions" get out-innovated by the wider ecosystem: nobody talks about or uses DALL-E to any extent anymore. It's all about open models like Flux and Stable Diffusion.

{Text/Image/Video}-to-Video is an inadequate modality for creative work anyway, and OpenAI is already behind on pairing other types of input with their models. This is something that the open ecosystem is excelling at. We have perfect syncing to dance choreography, music reactive textures, and character consistency. Sora has none of that and will likely never have those things.

> something time-sharing like Sora where you pay a relatively small amount per video.

Creators would prefer to run all of this on their own machines rather than pay for hosted SaaS that costs them thousands of dollars.

And for those that do prefer SaaS, there are abundant solutions for running hosted Comfy and a constellation of other models on demand.

If you've got a 4090 and ComfyUI can you run HunYuan?
There are already Hunyuan fp8 examples running on a 4090 on r/stablediffusion.
RunwayML too, though I'm not sure they won't also get commoditized by OSS video generation.
What are the leading alternatives? (Open source or otherwise)
You have to be specific. What's more important to you?

- Uncensored output (SD + LoRA)

- Overall speed of generation (Midjourney)

- Image quality (probably Midjourney, or an SDXL checkpoint + upscaler)

- Prompt adherence (FLUX, DALL-E 3)

EDIT: This is strictly around image generation. The main video competitors are Kling, Hailuo, and Runway.

SD does not generate video, does it?
It does as of recently.
Minimax and Kling 1.5, both from China. Recently Tencent launched its own.

You can see more model samples here: https://youtu.be/bCAV_9O1ioc

Those look... far worse? What am I missing?
Exactly, I don't know how people are saying Sora is bad. I know there are restrictions with humans. But with the storyboard and other customisations, it's definitely up there!
FLUX
MidJourney (commercial), Standard Diffusion XL
> Standard Diffusion XL

You probably meant Stable Diffusion XL (autocorrect victim).

Sora was not really that big of a revolution, it was just catching up with competitors. I would even say in gen video they are behind right now.
Sora had some sweet cherry-picked initial hype videos. That was more impressive than anything we could do at the time. Now, yea, it's questionable if it's on par, let alone better.
Wasn't just cherry picked. The balloon kid video had a VFX team cleaning up the output. They've said that now.
What is the best model in your opinion right now?
There are a lot of them, but Runway seems to have good controls and they are aligned with people who will actually use it - filmmakers and content creators.

In terms of image quality, Runway, Luma, and a few of the Chinese models all give "ok" results. I haven't seen anything from Sora to convince me they have made any kind of significant leap.

The issue there is alignment. It's cheap for Runway or Luma to continue on this path since it's their only path to profitability; they do nothing else.

But for OpenAI, I don't think this is near the top of their list of priorities. I doubt that they will be able to keep adding features like their competitors. Seems to me like this is the equivalent of a side project for them.

Edit: after watching direct comparison videos, I've changed my mind. Sora is ahead.

UPDATE: After watching direct comparison videos between prompts, I do think now that Sora is ahead. It's not a huge leap but it seems much better at keeping fine details roughly aligned.

For anyone who is curious where to find tons of Sora videos, go to r/aivideo on Reddit.

HunYuan by Tencent. It's 100% open source too.
RunwayML
Bad also in the sense that once you get over the "boy, it's amazing they can do that", you immediately think "boy, they really shouldn't do that".
My working theory is that OpenAI is the 'moonshot' kind of company, full of super smart researchers who like tackling hard problems but have no time or energy for things like 'how do we create a UX people actually want to use', which actually requires a ton of painful back-and-forth and thoughtful design work.

This is not a problem as long as they do the ChatGPT thing, sell an API, and let others figure out how to build a UX around it, but here they seem to be gunning for a boxed product.

Yeah… they have defined the UX that everyone else is copying thus far. So I feel like you are pretty far off the mark.
No doubt. I was waiting so long for Sora but Runway already burned me out on AI video.

It was fun for a few days but far more limited than I would have ever expected.

Maybe Sora 5.0 will be something special. Right now though all these video models are basically shit.

What are some of the open source video models?
Could it be that text sources are plentiful, and denser than the training data available for videos and images?