That's not an accurate assessment of what's going on. The post linked states "no contact with the maintainers", not just inactivity, and if you follow the GitHub issues linked, they're about people wondering what's going on and whether anyone else can approve pull requests, because there are lots of pull requests waiting. There's 843 pull requests at this time, and I just looked and over 50 are from just the last month.
It's not that there's repo inactivity, it seems to be that this is an extremely active repo which saw everything grind to a halt when the admins went dark. That's quite a bit different than just "inactivity".
> There's 843 pull requests at this time, and I just looked and over 50 are from just the last month
That's kinda overwhelming though ... imagine that if the maintainer pops up somewhere, suddenly 100 motivated people may chime in "hey please review this important pull request that's been sitting over here for a while".
There are some kinds of open source projects that are prone to this ... some are really not so bad to maintain if you have the right kind of discipline, because they converge on a stable set of functionality and platform compatibility evolves slowly, but some just naturally have endless room for variations and special cases, and as users increase, PRs increase linearly (instead of sub-linearly as you'd hope). I'm thinking in particular of https://github.com/oauth2-proxy/oauth2-proxy (I contributed to an older fork of it).
Months before the lawsuit, youtube-dl's maintainers frequently closed issues that reported ongoing breakage without giving a reason. Here is one especially illuminating example:
All code in this project is licensed solely with the condition that any portion of it is not permitted to be used in the main youtube-dl fork, either directly or indirectly. It is also not permitted to be used in any project that contains contributions from either remitamine or dstftw.
The two users mentioned are or were previously major contributors to youtube-dl.
It seems that youtube-dl was already a dysfunctionally managed project at the time of the lawsuit and happened to ride out on the good PR for a couple of months, before returning to stagnation once again.
To me it sounds like a plugin system would have prevented centralization and the need for forks, but would have made distribution harder for average users.
Indeed, this has been years in the making. Maintainer activity has been slowly dwindling while would-be contributors were driven away by the maintainers’ lack of communication and abrasiveness. I myself have had pull requests languishing there for years with nobody bothering to review them. Other people had their issues closed with no explanation. It was just a matter of time. Good thing that the forks have sprung up some time before upstream development halted entirely.
2 months is a long time in youtube-dl world. It's not really a "software project" in the traditional sense, where you can stop working on it once it's "done". It's more of a "social project", a focal point for the required ongoing activity necessary to keep sites working. Youtube-dl without daily commits is useless.
I'm currently using it to download a youtube video. If it still works for its main function, maybe it's just not a high priority project until something breaks?
It's actually already broken. If you try to download more than 3-16 videos (the limit is not clear), you start to be rate limited to 300 kbps or so. According to Reddit this is fixed in a fork called yt-dlp
To be honest that sounds fine. You're presumably downloading it to watch it offline, the download speed isn't really material as long as it finishes eventually
Primary function, perhaps, but it's used for lots of websites that aren't YouTube, and some of those websites have broken.
There are open pull requests and/or bugs for many websites that aren't being approved. This is rather unusual for youtube-dl, it normally had a release every two weeks or so.
For a project like youtube-dl it is a long time, because they use unofficial APIs (fancy word for scraping) of video sites that can shift even on a daily basis. If you look at their GitHub issues, it is just people endlessly complaining that some websites are broken again.
That used to be true, but today, with so many websites operating as SPAs against undocumented APIs, I think it's reasonable to redefine "scraping" to mean extracting data from unofficial APIs in addition to extracting it by parsing HTML.
After all, what is a scrapeable HTML page if not a grotesquely convoluted undocumented API with an unstable output format?
Scraping refers specifically to extracting data from a format designed to be read by humans instead of machines.
The gross inefficiency and low data-to-layout ratio are the key things being expressed through connotations of the word "scrape". To scrape is to extract a small amount of something from a much larger substrate.
To call every query a scrape is to diminish the specificity and utility of the term.
If an unofficial API returns JSON that looks like this:
{
"id": 3422,
"title": "My essay about cheese",
"published": "13th August 2021 at 3:45pm",
"abstract": "<p>In which I write about cheese!</p>"
}
And I write code against that which includes stripping the HTML tags from "abstract" and converting the date format in "published" into an ISO datetime... am I writing scraping code?
I would argue that I am, even though it started out as a JSON wrapper.
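To make that concrete, here's a rough sketch of what that cleanup code might look like in Python, standard library only. The function names (strip_tags, parse_published, normalize) are made up for illustration, and the date parsing assumes the "13th August 2021 at 3:45pm" shape from the example above with English month names; a real unofficial API could change that format whenever it likes, which is rather the point.

```python
import json
import re
from datetime import datetime
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def parse_published(text):
    """Parse a date like '13th August 2021 at 3:45pm' (assumed format)."""
    # Drop the ordinal suffix (st/nd/rd/th) so strptime can read the day.
    cleaned = re.sub(r"^(\d{1,2})(st|nd|rd|th)\b", r"\1", text.strip())
    return datetime.strptime(cleaned, "%d %B %Y at %I:%M%p")


def normalize(raw):
    """Turn the raw JSON payload into a cleaned-up dict."""
    record = json.loads(raw)
    return {
        "id": record["id"],
        "title": record["title"],
        "published": parse_published(record["published"]).isoformat(),
        "abstract": strip_tags(record["abstract"]),
    }


raw = """
{
  "id": 3422,
  "title": "My essay about cheese",
  "published": "13th August 2021 at 3:45pm",
  "abstract": "<p>In which I write about cheese!</p>"
}
"""
print(normalize(raw))
```

Even though the input is JSON, two of the four fields still need the kind of brittle, format-guessing cleanup you'd normally associate with parsing a page.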
"To call every query a scrape is to diminish the specificity and utility of the term."
Absolutely disagree with you there. I interpret the term "scraping" as "writing code that gathers data from a source that has not deliberately published that data in a usable format". Gathering data from any kind of API fits that criterion for me, since most APIs only give you a subset of the data at a time.
I think the reason I care so much about this is that I coined the term "git scraping" to cover a variant of scraping that uses Git repositories to store the data and track changes over time - and git scraping applies equally to data sourced from APIs as it does to data sourced from HTML pages. https://simonwillison.net/2020/Oct/9/git-scraping/
If everyone insists that is what it means for long enough, then that is what it will mean.
The term was coined to differentiate how difficult it is to extract data from a format that was patently not intended to efficiently spread raw data to other machines. If that meaning erodes, and it's just yet another way to say an API query, it will be a great loss for the precision of our terminology.
Sounds like an exhausting thing to maintain. It's not like writing scraping code (or even just chasing slight variations in an API) is terribly interesting.
True, but it’s also the kind of product that’s instantly useful and itch-scratchy. Youtube-dl not working on the video you’re downloading today? Well if you’re a maintainer you can just patch it yourself. (Non-maintainers can too of course, but I imagine the maintainers have the know-how to actually fix things)
Having contributed to youtube-dl in the past, long turnaround times from the maintainers were pretty normal. I've had (and still have some) open PRs that have been ready to merge for going on a year. The two months is really not that big of a deal.
That being said, the project probably could use some reorganization. It requires a lot of community contributions to keep all the extractors maintained, so long turnaround times for reviews aren't ideal.
Also, in common with other massive online properties (Amazon, for instance), not consistently: changes sometimes roll out a bit at a time, so users in different areas get different versions, either because global roll-out is a staged process or because a UI experiment is being performed. It usually doesn't matter to a well written scraper, as the core data is still accessible in the same way despite the UI sugar coat having changed, but sometimes there are significant enough changes under the hood too, which the scraper needs to deal with while still supporting the older format(s).
I guess it depends on if sub-2-month feature development is needed to keep up with YouTube's changes or not.
I maintain a GCC code coverage tool on GitHub, and since GCC doesn't change very often and the feature set of the tool is fairly complete, I sometimes go 6+ months without commits. Usually I don't touch it unless someone opens an issue.
Dark thought, but I've had multiple instances in the last year+ where someone hasn't posted or tweeted in a while and I wonder for a second, maybe they died? The pandemic is real. And random ppl disappearing unexpectedly is part of it.
I hope all is well.