Hacker News
Reverse-Engineering YouTube: Revisited (tyrrrz.me)
208 points by tyrrrz 1232 days ago
12 comments

> There is one thing that developers like more than building things — and that is breaking things built by other people.

Haha. This is not as universal as the author thinks. Every time I need to reverse-engineer something obscured on purpose, I wish we could just get along.

Every time I have to reverse-engineer something obscured by accident, I call it debugging.

But even if I solve the puzzle, it's like solving crosswords: I just defeated a human mind, the victory is transient, and will soon be forgotten. I'd prefer my victories to be against the frontier of knowledge, and to win universal truths. That means building things rather than tearing down what other humans built.

I just wish there were more mathematical certainty and fewer human vices in programming.

Unfortunately the halting problem takes all your mathematical certainty and throws it out the window. It's very easy to turn an application that halts within a finite amount of time into one that does not. You'll find most programmers and companies are not going to spend the massive amount of time needed to ensure their logic is correct; instead they throw the application out there quickly and fix it based on crashes and feedback.
Mathematical certainty is what to leverage, not what to fight. You'd use it before you run into the halting problem, not after. Just like mathematics was used to discover the halting problem in the first place.

And what you're describing as happening in practice is precisely the disappointing part of programming.

Yeah, the fix-it-later approach is an excuse that software engineers get to enjoy. Civil engineers are liable for their mistakes, and face fines/sanctions for their work. Meanwhile, software engineers can get away with half-assed logic or mishandling of data and nothing comes of it.

In South Korea, a company with known software vulnerabilities is fined every day until they fix it. That gives an incentive to make sure software does the right thing before it gets shipped.

> I'd prefer my victories to be against the frontier of knowledge, and to win universal truths.

You wouldn't need to tear down barriers if the people that built them thought the same in the first place. Nonetheless, keep up that attitude.

I don’t particularly have anything to add about the article. But I do enjoy using your desktop YouTube downloader, as well as a couple of your .NET libraries. Especially CliWrap. Amazing work. Just wanted to say thanks!
Glad to hear that :)
This is also a good place to learn about it: https://github.com/yt-dlp/yt-dlp/blob/master/yt_dlp/extracto...
I've said it before on GitHub, I don't think `TVHTML5_SIMPLY_EMBEDDED_PLAYER` is the great solution everyone thinks it is. Yeah, you can get the age-restricted videos anonymously. However you can also get those videos by logging in, which the author doesn't mention:

    POST /youtubei/v1/player HTTP/1.1
    Host: www.youtube.com
    Authorization: Bearer ya29.a0AVvZVsqRwNWFI3R0MSxnugyNlxbqIOXcwXkeA6NMOcpv_...
    
    {
     "contentCheckOk": true,
     "context": {
      "client": {
       "clientName": "ANDROID",
       "clientVersion": "18.04.35"
      }
     },
     "racyCheckOk": true,
     "videoId": "Cr381pDsSsA"
    }
and `TVHTML5_SIMPLY_EMBEDDED_PLAYER` comes with strong drawbacks. Some videos under that client require a JavaScript signature for BOTH downloading and unthrottling. Each person is welcome to their own opinion, but I just don't think it's worth the complexity of parsing some arbitrary JavaScript with Python when you can just log in (programmatically as above). Personally I use the ANDROID client, which avoids all JavaScript signatures.

Also not mentioned in the article is that you can actually take the throttled URLs as is, and download pieces concurrently for a pretty good result. Each piece still downloads slowly, but if you use on the order of 99 connections, you get decent speed. You would think you'd get IP blocked or something for this, but I downloaded quite a bit using this method as a test and the YouTube server allowed it. The combined speed was only something like 2 MB/s, so big picture it doesn't seem like abuse.

My YouTube OAuth code is here for anyone interested:

http://2a.pages.dev/mech
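
For readers who want to picture the "download pieces concurrently" idea, here is a minimal Python sketch. It is not the commenter's code. The names `split_ranges`, `fetch_with_range_header` and `download_concurrently` are hypothetical, and fetching each piece with an HTTP `Range` header is an assumption; the comment does not say how the pieces were requested. The default of 99 connections is taken from the comment.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def split_ranges(total_size, piece_count):
    """Split [0, total_size) into at most piece_count inclusive (start, end) ranges."""
    if total_size <= 0 or piece_count <= 0:
        return []
    piece_size = -(-total_size // piece_count)  # ceiling division
    return [
        (start, min(start + piece_size, total_size) - 1)
        for start in range(0, total_size, piece_size)
    ]


def fetch_with_range_header(url, start, end):
    """Fetch one piece with an HTTP Range header (assumed transport, not from the thread)."""
    request = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(request) as response:
        return response.read()


def download_concurrently(url, total_size, connections=99, fetch=fetch_with_range_header):
    """Download all pieces in parallel and join them in their original order."""
    ranges = split_ranges(total_size, connections)
    if not ranges:
        return b""
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        # pool.map yields results in input order, so the pieces join correctly.
        pieces = pool.map(lambda r: fetch(url, r[0], r[1]), ranges)
        return b"".join(pieces)
```

The `fetch` parameter is pluggable so the same splitting logic can be tried against an in-memory buffer before pointing it at a real URL.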

> However you can also get those videos by logging in, which the author doesn't mention.

> Also not mentioned in the article is that you can actually take the throttled URLs as is, and download pieces concurrently for a pretty good result.

The author mentioned both, the login option as well as the chunking mechanism. Sorry, but did you actually read the blog post?

they mention cookies. that's not the correct method for authenticating to the API, OAuth is.
If you can afford to always be logged in, then sure, but it's not always an option. Especially if you need a general solution.
who said anything about always? you log in as needed. most videos are open.
I mean, if you're not the sole user of the tool, you can't guarantee that everyone can log in (or would want to)
I doubt anyone WANTS to log in. 90% of videos are open. Yes, logging in is a burden, but coding-wise it's literally one extra line in the HTTP request. I think that's a fair tradeoff compared to parsing arbitrary JavaScript.

each person can weigh the pros and cons and make their own decision, but I don't think it's so black and white that TVHTML5_SIMPLY_EMBEDDED_PLAYER is the best option, and the article doesn't even discuss OAuth, so I don't think it's presenting a balanced take on the different approaches.
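
To make the "one extra line" point concrete, here is a minimal Python sketch that builds the player request quoted earlier in the thread, with and without an `Authorization` header. The endpoint and JSON body are copied from that request. The function name `build_player_request`, the `Content-Type` header, and the placeholder token are assumptions for illustration, and nothing is sent over the network.

```python
import json
import urllib.request

# Endpoint taken from the request quoted in the thread.
PLAYER_URL = "https://www.youtube.com/youtubei/v1/player"


def build_player_request(video_id, access_token=None):
    """Build (but do not send) the player request; the token is optional."""
    payload = {
        "contentCheckOk": True,
        "context": {
            "client": {"clientName": "ANDROID", "clientVersion": "18.04.35"}
        },
        "racyCheckOk": True,
        "videoId": video_id,
    }
    # Content-Type is an assumption; the quoted request does not show it.
    headers = {"Content-Type": "application/json"}
    if access_token is not None:
        # The single extra line that logging in adds to the request.
        headers["Authorization"] = f"Bearer {access_token}"
    return urllib.request.Request(
        PLAYER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


anonymous = build_player_request("Cr381pDsSsA")
logged_in = build_player_request("Cr381pDsSsA", access_token="PLACEHOLDER_TOKEN")
```

The two requests carry identical bodies; only the header differs. Obtaining a real OAuth token is a separate step that this sketch does not cover.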

If you really want to understand how streaming video works then it definitely takes you down a couple rabbit holes - but it’s worth it. I think more people and companies should try to stream their own video content rather than be at the mercy of Google, their algorithms, and their censorship. You don’t have to “be” YouTube and host other users’ content but you should be able to host your content without YouTube’s approval.
There are probably many paid video hosting platforms. You can't save that much by hosting it yourself.

Anyone who is hosting on YouTube is looking for a free service.

> Anyone who is hosting on YouTube is looking for a free service.

https://archive.org/help/video.php is also “free”.

Interestingly I found that YouTube's web UI actually requests range URLs rather than range HTTP headers, allowing it to seek around the video faster than mpv with yt-dlp (and conveniently avoiding throttling as well). I suspect this may be related to DASH: https://github.com/mpv-player/mpv/issues/10601

Unfortunately mpv and ffmpeg do not currently have mature DASH support and cannot benefit from fast seeks: https://github.com/mpv-player/mpv/issues/7033 (didn't look deeply)

> and conveniently avoiding throttling as well

throttling is not avoided. the YouTube web client generates a JavaScript signature that disables the throttling, same as what the code in the article does.

Signature is old news (a couple of years); generating a proper one (or straight up copying it from YT using devtools) won't get you unthrottled access.
&n= is not "same as what the code in the article does". Article talks about old signatureCipher/sp/s/sig code. Without signatureCipher urls return 403, with signatureCipher but without decoded &n= urls return fine, but start throttling after just over ~1MB. My comment from Oct 2021:

https://github.com/ytdl-org/youtube-dl/issues/29326#issuecom...

"server heavily throttles any request to same URL after initial 2-4MB regardless of retries.

&n is only part of the puzzle. While bad or no &n will indeed trigger 50KB/s throttling, even correct &n only lets you download at most couple megabytes at good speed. Try any video in official YT client and you will see repeated URL request with different &range= parameters all use same &n, but trying to download that URL all at once will always throttle after initial ~2-4MB.

The correct solution (after generating correct &n) is to start using custom URL &range= parameters instead of normal HTTP range headers and default to downloading in 2MB chunks."
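
As a rough illustration of the quoted fix, here is a minimal Python sketch that turns one stream URL into a list of chunk URLs carrying a `range=` query parameter. The helper names are hypothetical. Treating "2MB" as 2 MiB, treating the range as inclusive `start-end` byte offsets, and assuming the URL has no `range` parameter yet are all assumptions, not something the comment confirms.

```python
CHUNK_SIZE = 2 * 1024 * 1024  # "2MB" from the comment, read here as 2 MiB


def with_range_param(url, start, end):
    """Append range=start-end to a URL that has no range parameter yet."""
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}range={start}-{end}"


def chunked_urls(url, total_size, chunk_size=CHUNK_SIZE):
    """Return one URL per chunk, covering bytes 0..total_size-1 in order."""
    return [
        with_range_param(url, start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]
```

Each resulting URL would be fetched as an ordinary request without a `Range` header, and the bodies concatenated in order.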

That was the case in 2021. I just checked the newest "fixed" yt-dlp, and if you change the chunk size from the default 10MB to 100MB you will quickly notice the throttling is STILL there, kicking in right around that 2-4MB mark, but instead of a brutal 50KB/s it's somewhere around 1MB/s. The default chunk size of 10MB somewhat helps to mask it by smoothing out the transfer rate jumping up and down. youtube-dl (I'm shocked it's still updated; can't download an .exe, have to download a zip and run "python.exe __main__.py"?) just silently ignores "--http-chunk-size 100000000" altogether and keeps downloading in 10MB chunks to hide the problem. "--print-traffic" shows 10MB chunks.

This is for all stream types other than 22 (mp4 1280x720 avc1.64001F, 30fps, mp4a.40.2). 22 seems to be special and with proper &n= you can slurp whole file with one connection without additional throttling, probably for backward compatibility with older clients?

TLDR: You can still download YT videos IF you chop them up into small chunks. Playing back without chopping up into chunks somewhat "works" because 1MB/s≈8Mbit/s is still above thickest juiciest bitrate YT would ever serve, but problems become obvious when you start fast forwarding/skipping around the video (1-3 second pauses in mplayer). Playing type 22 works great and seeking is instant.

"Personally I use mplayer to stream YT and am currently on a lookout for a simple proxy server I could modify to do the above (divide into chunks, rewrite HTTP range header into URL parameter) for me transparently in the background."
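
The core transformation such a proxy would need can be sketched in a few lines of Python: parse the player's `Range` header, cap it to one chunk, and move it into the URL. This is only a sketch under assumptions. The function name `rewrite_range` is hypothetical, the 2 MiB cap and the inclusive `range=start-end` format are assumed, and suffix ranges such as `bytes=-500` are deliberately rejected to keep it short.

```python
import re

# Matches "bytes=START-" and "bytes=START-END" only.
_RANGE_RE = re.compile(r"^bytes=(\d+)-(\d*)$")


def rewrite_range(url, range_header, chunk_size=2 * 1024 * 1024):
    """Turn a client Range header into a range= URL parameter, capped to one chunk."""
    match = _RANGE_RE.match(range_header.strip())
    if match is None:
        raise ValueError(f"unsupported Range header: {range_header!r}")
    start = int(match.group(1))
    chunk_end = start + chunk_size - 1
    # An open-ended request ("bytes=100-") is limited to a single chunk.
    end = min(int(match.group(2)), chunk_end) if match.group(2) else chunk_end
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}range={start}-{end}"
```

A real proxy would also have to loop over successive chunks and stitch the responses together for the player, which is left out here.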

It's too bad such an important resource (YouTube) has a secret API - that changes all the time.
> secret API

In their client-side code, they provide a worked example of how to use their API. That's hardly the way to keep a secret.

Why? YouTube has a proper public API, that doesn't change all the time.
A few years ago, however, they started forcing API users to authenticate, so when I had to spend months in bed after a bad road accident and later a heart attack, I could no longer watch my favorite electronics channels using the Kodi YT extension unless I authenticated. I guess they still allow anonymous use with a browser only because by doing that they can profile more people.
The proper public API notably does not provide access to the raw video stream, making it useless for many use cases.
Can you retrieve videos with it?
Furthermore, the economics of video hosting sites like YouTube are such that you have truly incredible storage, server, and bandwidth growth, basically forever. I don’t think it’s feasible for there to be a “free” API that lets people use YouTube as they please, build clones of the site with no ads, etc.
Well that link to the introduction of Prolog video is not a really good starting point.
Why do the comment counts almost never match the actual number of comments? I know the answer is censorship but why doesn’t YouTube shadow-ban the comment count when they shadow-ban comments?
Honestly I don’t know what type of comment moderation they are doing but it’s pretty horrible. I constantly see obvious spam links or scammers as first level nested comments, often pretending to be the video author doing “giveaways” or trying to siphon off information. It’s incredibly widespread and has been happening for months at least.
Ok that’s an example of seeing spam comments, I’m asking about the comments you CAN’T see. For example the link says 3 replies but when you click it there are only 2 comments listed. Is there a technical reason why the reply count is not updated when comments are removed?
Does anyone know if YouTube runs FFmpeg internally?
Circumstantial evidence that they used to: https://multimedia.cx/eggs/googles-youtube-uses-ffmpeg/

These days they use special hardware accelerators: https://gwern.net/doc/cs/hardware/2021-ranganathan.pdf

I built software to efficiently run a large number of GPUs (>120k) in data centers. That second link is fantastic, but it really gives me PTSD. =)
While not fully related to the code itself, my daughter has a school-provided Chromebook that blocks almost all YouTube video content. You can browse the YT site, but the thumbnails and videos won't load. I'm assuming there is some kind of content block occurring here based on some part of the URL.

Well, kids being clever figured out the Chromebook browser shows a preview video if you hit the 'share' button and go to embed video. This is not content-blocked. I didn't dig in to see if it would play age-restricted content as I assume all access is being logged somewhere and want to minimize future fallout.

And now many of these bypasses and tricks will stop working.