by 1vuio0pswjnm7 1039 days ago
"Have you ever tried to download videos from YouTube? I mean manually without relying on software like youtube-dl, yt-dlp or one of "these" websites. It's much more complicated than you might think."

This reminds me of some sort of fizzbuzz test. This is not complicated at all. There is no need to use the Range header or run Javascript.

The short script below does not download anything because there is no need. It does not use Range headers, it does not run Javascript and it makes only one TCP connection. With the JSON it fetches, one can simply extract the videoplayback URLs and put them in a locally-hosted HTML page with no Javascript.

    #!/bin/sh
    # usage: echo videoId | $0 <-- this will indicate len to use    
    # usage: echo videoId | $0 len | openssl s_client -connect www.youtube.com:443 -ign_eof
    # usage: $0 len < videoId-list | openssl s_client -connect www.youtube.com:443 -ign_eof
    
    (
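    while read x;do
    # skip any input line that is not an 11-character videoId
    test ${#x} -eq 11||continue
    # no len argument given: print the Content-Length to pass as the argument
    # (byte count of the JSON line without "$x", newline included, plus the videoId length) and exit
    if test $# -ne 1;then len=${#x};x=$(grep -m1 ^\{ $0|sed 's/\$x//'|wc -c);exec echo usage: ${0##*/} $((x+len));fi
    
    # append a CR to each header line and to the blank line so they end in CRLF
    cr=$(printf '\r');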
    sed "/^[a-zA-Z].*: /s/$/$cr/;s/^$/$cr/" << eof 
    POST /youtubei/v1/player?key=AIzaSyA8eiZmM1FaDVjRy-df2KTyQ_vz_yYM39w HTTP/1.1
    Host: www.youtube.com
    Content-Type: application/json
    Content-Length: $1
    Connection: keep-alive
    
    {"context": {"client": {"clientName": "IOS", "clientVersion": "17.33.2" }}, "videoId": "$x", "params": "CgIQBg==", "playbackContext": {"contentPlaybackContext": {"html5Preference": "HTML5_PREF_WANTS"}}, "contentCheckOk": true, "racyCheckOk": true}
    eof
    done
    printf '\r\n'
    printf 'GET /robots.txt HTTP/1.0\r\nHost: www.youtube.com\r\nConnection: close\r\n\r\n';
    )
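
The script above takes Content-Length as an argument. As a minimal sketch of the same request with the length computed from the body instead, one could write something like the following. The key and JSON fields are copied from the script; `make_body` and `build_request` are hypothetical helper names, and this variant omits the trailing newline after the JSON.

```shell
#!/bin/sh
# Sketch only: emit one HTTP/1.1 POST for a videoId with a computed Content-Length.

# Print the JSON body for the videoId given as $1 (no trailing newline).
make_body() {
    printf '{"context": {"client": {"clientName": "IOS", "clientVersion": "17.33.2" }}, "videoId": "%s", "params": "CgIQBg==", "playbackContext": {"contentPlaybackContext": {"html5Preference": "HTML5_PREF_WANTS"}}, "contentCheckOk": true, "racyCheckOk": true}' "$1"
}

# Print request line, headers and body, each header line ending in CRLF.
build_request() {
    body=$(make_body "$1")
    # ${#body} counts characters; the body is pure ASCII, so this equals the byte count.
    printf 'POST /youtubei/v1/player?key=%s HTTP/1.1\r\n' "AIzaSyA8eiZmM1FaDVjRy-df2KTyQ_vz_yYM39w"
    printf 'Host: www.youtube.com\r\n'
    printf 'Content-Type: application/json\r\n'
    printf 'Content-Length: %d\r\n' "${#body}"
    printf 'Connection: keep-alive\r\n\r\n'
    printf '%s' "$body"
}
```

Usage would mirror the original, for example `build_request aqz-KE-bpKQ | openssl s_client -connect www.youtube.com:443 -ign_eof`. With `Connection: keep-alive` the connection stays open after the response, which is why the original script ends with a final `Connection: close` request.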
    
For processing the JSON I wrote custom utilities in C that (a) extract videoIds and other useful strings, (b) generate HTTP similar to above, and (c) filter the returned JSON into CSV, SQL or HTML. For me, these run faster than Python and jq and are easier to edit. Using these utilities I can also do full searches that return hundreds to thousands of results and I can easily exclude all "suggested" or "recommended" videos.

CSV output

1666520150,23 Oct 2022 10:15:50 UTC,22,aqz-KE-bpKQ,"Big Buck Bunny 60fps 4K - Official Blender Foundation Short Film",00:10:35,635,UCSMOQeBJ2RAnuFungnQOxLg,19211597,"Blender"

SQL output

INSERT INTO t1(ts,utc,itag,vid,title,dur,len,cid,views,author) VALUES(1666520150,'23 Oct 2022 10:15:50 UTC',22,'aqz-KE-bpKQ','Big Buck Bunny 60fps 4K - Official Blender Foundation Short Film','00:10:35',635,'UCSMOQeBJ2RAnuFungnQOxLg',19211597,'Blender') ON CONFLICT(vid) DO UPDATE SET views=excluded.views;
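
Titles can contain single quotes, so anything that emits INSERT statements like the one above has to escape them. A minimal shell sketch of that step follows. It is not the C utility described here; `sql_quote` and `upsert_views` are hypothetical helper names, and the table `t1` is assumed to have a UNIQUE or PRIMARY KEY constraint on `vid`, which `ON CONFLICT(vid)` needs.

```shell
#!/bin/sh
# Wrap $1 in single quotes, doubling any embedded single quote (standard SQL escaping).
sql_quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/''/g")"
}

# Print an upsert for one video: $1 = vid, $2 = title, $3 = views (integer).
upsert_views() {
    printf 'INSERT INTO t1(vid,title,views) VALUES(%s,%s,%d) ON CONFLICT(vid) DO UPDATE SET views=excluded.views;\n' \
        "$(sql_quote "$1")" "$(sql_quote "$2")" "$3"
}
```

For example, `upsert_views aqz-KE-bpKQ "Blender's Bunny" 19211597` prints a statement whose title is written as `'Blender''s Bunny'`.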

HTML output

Looks just like CSV except vid is a hyperlink
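
As a rough illustration of that HTML form, the CSV line can be turned into it with awk. This is a sketch, not the C filter described above. It assumes the first three fields never contain commas, that vid is field 4, and that the link should point at the /watch?v= page (the real target could just as well be a videoplayback URL). `csv_to_html` is a hypothetical name, and titles are not HTML-escaped here.

```shell
#!/bin/sh
# Read CSV lines on stdin, wrap field 4 (vid) in a hyperlink, end each line with <br>.
csv_to_html() {
    awk -F, -v OFS=, '{
        # Assigning to $4 rebuilds the line with OFS, so later commas are preserved.
        $4 = "<a href=\"https://www.youtube.com/watch?v=" $4 "\">" $4 "</a>"
        print $0 "<br>"
    }'
}
```

Usage: `csv_to_html < results.csv > results.html`, where `results.csv` is a placeholder file name.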

3 comments

I think your definition of "not complicated at all" differs from most people's.

This is a demonstration of W3C Ethical Web Principles 6.11 and 6.12

https://www.w3.org/TR/ethical-web-principles/

Looks interesting. In the post, the author does

    echo -n '{"videoId":"aqz-KE-bpKQ","context":{"client":{"clientName":"WEB","clientVersion":"2.20230810.05.00"}}}' | 
      http post 'https://www.youtube.com/youtubei/v1/player' |
      jq -r '.streamingData.adaptiveFormats[0].url'
which is very similar to what you do, but runs into an issue of throttling to ~70Kbps. Is the difference just the "key" parameter? Do you get no throttling?
There is no throttling when using the JSON returned by the HTTP request in the shell script or generated by the utilities I wrote.

The difference is not the key the author is using; it's the post-data.

Moreover, to get throttled videoplayback URLs with the "WEB" key and client info like the author is using, one does not need to make POST requests to /youtubei/v1/player. There are throttled videoplayback URLs in the HTML of the /watch?v= page. For example,

    curl -A "" -40s 'https://www.youtube.com/watch?v=aqz-KE-bpKQ'|grep -o 'https://rr[^"]*'|sed -n 's/\\u0026/\&/g;/itag=22/p'
It's ironic how the author is claiming this is complicated. That's his own doing.
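
The extraction step in that pipeline can be separated from the fetch so it works on a saved page too. A small sketch follows; `videoplayback_urls` is a hypothetical name, and it assumes, as in the one-liner, that the URLs start with `https://rr` and use `\u0026` in place of `&`. Unlike the one-liner it matches the itag exactly, so asking for 22 does not also match 221.

```shell
#!/bin/sh
# Read watch-page HTML on stdin and print the videoplayback URLs whose itag equals $1.
videoplayback_urls() {
    grep -o 'https://rr[^"]*' |
        sed 's/\\u0026/\&/g' |
        grep -E "[?&]itag=$1(&|\$)"
}
```

Usage: `videoplayback_urls 22 < watch.html`, where `watch.html` is a placeholder for a saved /watch?v= page.
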
Interesting. I ran the script to extract the JSON (that part was almost instant), then I used the first `url` field of `streamingData.adaptiveFormats`. I then ran

    curl 'https://...googlevideo.com...' --output video.mp4
For me the download is throttled to "768k". I assume that's in bits per second and not bytes, which is very low: the random video I tried would take 8 minutes.

On the other hand,

    yt-dlp videoIdHere
does its processing then downloads the whole thing in about 5 seconds.

Does that curl command run much faster for you? Or do you do something else?
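
A note on the unit: curl's progress meter reports speed in bytes per second with 1024-based suffixes, so "768k" may mean 768 KiB/s rather than 768 kbit/s. A quick sketch of what each reading implies for an 8-minute transfer follows (`transfer_bytes` is a hypothetical helper):

```shell
#!/bin/sh
# Bytes transferred at $1 bytes per second over $2 seconds.
transfer_bytes() {
    echo $(( $1 * $2 ))
}

# "768k" read as 768 KiB/s, for 8 minutes (480 s): about 360 MiB
transfer_bytes $((768 * 1024)) 480      # prints 377487360
# "768k" read as 768 kbit/s, for 8 minutes: about 46 MB
transfer_bytes $((768 * 1000 / 8)) 480  # prints 46080000
```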

Is 768k too slow to watch the video from the URL? If not, then I would not call that "throttled". Don't need 500 MB/s to watch a video. From what I've seen, when people discuss YouTube throttling online they are referring to max speeds of 60-70k. That's too slow to watch the video from the URL. Not too slow to download, though. And that's why this idea that YouTube is "preventing" downloads doesn't make any sense. There are download URLs in every /watch?v= YouTube page. Those are throttled. Max speed 60-70k.

Use this post-data and you should get the same speed as yt-dlp.

    {"context": {"client": {"clientName": "ANDROID", "clientVersion": "17.31.35", "androidSdkVersion": 30 }}, "videoId": "$x", "params": "CgIQBg==", "playbackContext": {"contentPlaybackContext": {"html5Preference": "HTML5_PREF_WANTS"}}, "contentCheckOk": true, "racyCheckOk": true}
I do not use curl, except in HN examples. I generally do not download from YouTube. I use the URLs in the JSON to watch the video.
I should add that, with respect to the download URLs in the HTML of every /watch?v= page, some will not work at all, namely in the case of heavily commercialised videos, videos using DASH and some other uncommon cases. But I have always found these to be a minority of the linked YouTube videos one encounters on the web.
iOS and Android clients do not yet have URLs with the "n" parameter. This is why specifying the clientName as "IOS," along with the specific YouTube key, currently yields URLs that remain unthrottled.

However, acquiring this key requires decompiling the mobile application, monitoring requests through a proxy, or relying on values discovered by others. It's not necessarily straightforward.

I do agree that the code is simpler this way.

I also find it interesting that, by default, yt-dlp calls the YouTube API three times, initially as an Android client, then as an iOS client, and finally as a Web client. Depending on the video and certain other parameters, YouTube provides different formats to different clients.
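
To make the per-client difference concrete, here is a sketch that prints one request body per client. The client names and versions are the ones quoted in this thread; the helper names `client_context` and `player_body` are hypothetical, the bodies are reduced to a few fields for illustration, and this is not a description of how yt-dlp itself builds its requests.

```shell
#!/bin/sh
# Print the "client" object for a given clientName, using values quoted in this thread.
client_context() {
    case $1 in
        IOS)     printf '{"clientName": "IOS", "clientVersion": "17.33.2"}' ;;
        ANDROID) printf '{"clientName": "ANDROID", "clientVersion": "17.31.35", "androidSdkVersion": 30}' ;;
        WEB)     printf '{"clientName": "WEB", "clientVersion": "2.20230810.05.00"}' ;;
        *)       echo "unknown client: $1" >&2; return 1 ;;
    esac
}

# Print a JSON body for /youtubei/v1/player: $1 = clientName, $2 = videoId.
player_body() {
    ctx=$(client_context "$1") || return 1
    printf '{"context": {"client": %s}, "videoId": "%s", "contentCheckOk": true, "racyCheckOk": true}\n' "$ctx" "$2"
}
```

Usage: `for c in ANDROID IOS WEB; do player_body "$c" aqz-KE-bpKQ; done` prints three bodies in the order described above.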

"However, acquiring this key requires decompiling the mobile application, monitoring requests through a proxy, or relying on values discovered by others."

This is again not true. The key is in the HTML of every /watch?v= YouTube page. It's a public key; it's not hidden in any way.

Further, it's possible, up until today at least, to use the "WEB" key with clientName "ANDROID" or "IOS" and receive unthrottled URLs. The key in the shell script is in fact the WEB key. The key for IOS is different.

    curl -40s 'https://www.youtube.com/watch?v=aqz-KE-bpKQ' \
    |grep -o '"INNERTUBE_API_KEY...[^"]*"'
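
To get just the key value out of that match, the quoted pair can be split on double quotes. A small sketch follows, assuming the page contains the key in the form `"INNERTUBE_API_KEY":"..."`, which is what the grep pattern above implies; `innertube_key` is a hypothetical name.

```shell
#!/bin/sh
# Read watch-page HTML on stdin and print the first INNERTUBE_API_KEY value found.
innertube_key() {
    grep -o '"INNERTUBE_API_KEY":"[^"]*"' |
        head -n 1 |
        cut -d'"' -f4
}
```

Usage: `curl -40s 'https://www.youtube.com/watch?v=aqz-KE-bpKQ' | innertube_key`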