I created a bot that would scan for private SSH keys used to connect to AWS and other services; it also warned about leaked software licenses for SublimeText and other popular programs at the time. While many people appreciated the initiative, others did not take it well. Ultimately, GitHub suspended my account and I had to explain what it was all about.
One year later, through my employer, I created another bot to scan for security vulnerabilities in projects written in Ruby, Python, PHP and Node.js; this time I knew I needed to contact GitHub beforehand to find out what the limits of the "automation" were. They simply stated that, at the time, no automation was allowed, which was quite surprising because CI is automation. Travis and other services are allowed to do things there, so I didn't understand why my bot was different.
I reported to my employer that we would need to shut down that project and move on to something different. One year later, I found that GitHub had implemented a (semi) vulnerability scanner for a selected group of programming languages, warning repository owners about problems with their software dependencies. I cannot be mad about this since it's their service, but it still made me a bit angry.
Assuming you're talking about a bot similar to the OP's that scans random projects you don't own, that's very different from CI which is explicitly configured by projects.
This really depends on how you define automation. CI on GitHub depends on webhooks, which are an officially supported part of the API. So there isn't anything unsolicited happening.
I just don't think there's any meaningful claim that CI explicitly configured by maintainers using officially supported channels should be lumped in the same category as automated scraping of repositories for creating PRs.
Sure, there are many definitions of automation that would include both of these things, but I think it's obvious what GitHub intends in practice.
If it's defined in the UI, it doesn't make it unsolicited. I can go to any repository for someone I've never heard of and make a PR as a human. That is still unsolicited. I think the concern is the combination of unsolicited and automated.
IMHO there's a world of difference between automation that touches your projects and automation that touches projects of many other users, and that seems to be a key part of their criteria.
I just got one of those warnings yesterday and found it very useful. Shame you were not permitted to continue the work you had planned to do on that! It was a worthwhile idea.
GitHub does allow bots (they are even marked as [bot]), but the user must opt in to them before they can do PRs or other interactions with repositories.
I assume the whole point of the rule was to stop people spamming people right? It sucks because you were trying to do the right thing but I can maybe see where they were coming from.
Build your pipeline so that a human has to approve each automated action being taken. The difference between a bot making 5000 network requests in a day and a human making 100-200 semi-automated requests isn't a whole lot in terms of throughput, but makes an enormous difference in terms of quality control and not stepping on toes.
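To make the "human approves each automated action" idea concrete, here is a minimal sketch of such a gate. The names (ProposedAction, review_queue, daily_limit) are made up for illustration and are not from any real bot; the approve callback stands in for whatever UI a person would actually click through.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Callable, Iterable, List


@dataclass
class ProposedAction:
    """One action the bot would like to take (hypothetical structure)."""
    repo: str
    description: str


def review_queue(
    actions: Iterable[ProposedAction],
    approve: Callable[[ProposedAction], bool],
    daily_limit: int = 150,
) -> List[ProposedAction]:
    """Show at most daily_limit proposals to a human and keep the approved ones.

    Nothing is sent anywhere here; the caller decides what to do with the
    approved list, so a rejected or unreviewed proposal can never go out.
    """
    approved = []
    for action in islice(actions, daily_limit):
        if approve(action):
            approved.append(action)
    return approved


if __name__ == "__main__":
    queue = [
        ProposedAction("alice/project", "fix broken badge URL"),
        ProposedAction("bob/library", "fix broken badge URL"),
    ]
    # Stand-in for a person at a keyboard: approve everything except bob's repo.
    for action in review_queue(queue, approve=lambda a: a.repo != "bob/library"):
        print("approved:", action.repo, "-", action.description)
```

The cap is the "human rate limit": throughput stays in the low hundreds per day, but every outgoing action has been looked at by someone.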
I really wish more companies would do this. Fully automated business procedures that make demands on human attention are just plain awful. Humans should be interacting with humans, machines should interact with machines. Machines can help the human, and the human can help the machines, but interaction points should only be between two of the same types of entity.
Every time I've ever suggested human rate limiting though, I get looked at like I'm a moron. Even when I build the entire workflow myself and tune it so it only takes a relatively tiny amount of time to clean massive amounts of data, people just don't want to do it. It's beneath them. Even when it creates a massive difference in the quality of the product / service you're offering.
That's an interesting point. I'm not sure I would trust myself to do a better job than automated tests after a (short) while. It's not really a matter of being beneath me; for example, I don't mind doing repetitive manual labor (once in a while).
Just out of curiosity, what's the biggest job of this type you've personally handled?
I made a brief, aborted attempt at a restaurant recommendation service. We wanted to hydrate our data with existing pictures of dishes from the restaurants sourced from Yelp and/or Google Image Search. After looking at that data, I realized that a human touch to picking the right images would make a huge difference in the service.
We're talking thousands of restaurants that we wanted pictures of food from, and each restaurant had dozens of images we could pull. So tens of thousands of images needed to be sifted through. I figured that with the right tooling, my cofounder and I could put together something really nice that would only need an hour or so of maintenance a day to keep up.
So I built a pipeline that used very basic and easy to build and maintain 'dumb' Rails asset pipeline pages to present data for sifting. Go to the endpoint, it shows you the name of the restaurant and a bunch of images, you select one, type in a name for the dish, and it saves it to the database and puts up another page of images.
It took me bitching up a storm to get him to even look at it. He complained about how long he thought it would take, while I just got to work. Took maybe three weeks to prototype our app. One thing I learned in the process is that if you're looking at a bunch of Southern food, for some reason the picture of shrimp and grits always looks the most appetizing.
I was well on my way to classifying and figuring out novel ways to present the data when I had to make the determination that there wasn't good cofounder fit. So now I work with CNN.
But now all my side projects revolve around ways to get human attention to improve automated tasks. I suppose one of these days I'll get the right idea and/or the right cofounder and I'll give it another go.
There's a wealth of usable information out there on the web that one can build businesses on top of if one only wants to apply a little elbow grease to clean it and turn it into data. It's far easier to scrape data with a regular web browser with a custom browser extension than to try to build out headless infrastructure. But no one wants to do it.
Your story reminds me of a tool I wrote for helping lawyers classify a feed of texts as a test set for a project.
Our main initiative was creating a heuristic based classifier (think lots of regex). At my own initiative, I trained ML classifiers while we worked on it. As development went on, the ML classifiers were rapidly catching up with the heuristic based one. Unfortunately it was kind of a one off data processing task, and when time ran out the regex machine was still in the lead.
I was modestly proud of the legalese DSL generator I wrote up. The lawyers didn't even know they were writing coffeescript as they typed out what documents were, what key dates were, etc. :D
That coffeescript formed the basis of our accuracy testing suite. It was as fundamental as it was huge. That team ended up creating a couple thousand tests in less than a month.
A project I was working on was a much more useful bot. The number of affected repos for that problem was over 35,000 (although 95% of the value from fixing would be gained by only fixing the top 3,000 repos, when weighted by popularity)
The task of verifying correctness of a pull was much more time consuming than this one. Even if it only took 30 seconds to verify (optimistic), that's 290 hours. Which isn't necessarily all that much for an organization of verifiers (or Amazon Turk), but it is a lot for an individual.
Maybe that should be the cost. But perhaps some things you might be fine with letting a bot do (after manual verification of a statistical sampling, and thorough testing).
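For anyone who wants to check the numbers, here is a rough sketch of the arithmetic, plus the "verify a statistical sample" idea. The 30-second figure is the optimistic guess from above; the function names and the sample size are made up for illustration.

```python
import random
from typing import List, Sequence


def review_hours(repo_count: int, seconds_per_review: float = 30) -> float:
    """Total human review time in hours for one review per repo."""
    return repo_count * seconds_per_review / 3600


def sample_for_manual_check(
    repos: Sequence[str], sample_size: int, seed: int = 0
) -> List[str]:
    """Pick a random subset of repos to verify by hand before trusting the bot."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-checked
    return rng.sample(list(repos), min(sample_size, len(repos)))


if __name__ == "__main__":
    print(f"all 35,000 repos: {review_hours(35_000):.0f} hours")
    print(f"top 3,000 repos:  {review_hours(3_000):.0f} hours")
    repos = [f"owner/repo-{i}" for i in range(35_000)]
    print(len(sample_for_manual_check(repos, 200)), "repos to check by hand")
```

At 30 seconds each, the full set comes out to a bit over 290 hours, while the top 3,000 repos alone are about 25 hours, which is within reach of one person.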
I've been working in the business automation field for over 10 years. I strongly disagree with what you wrote.
The entire purpose of automation (meaning, automation in general, not just business automation) is to take humans out of the equation for mundane and repetitive tasks[1], and have them deal only with exception scenarios and edge cases that cannot be properly handled by the machine (either by design or due to system limitation).
Guess what happens when you have humans approve each and every automated action like you suggest? You defeat the purpose of automation, and users end up hating the system because mundaneness and repetition are reintroduced, except in a different context.
[1]The reason you want to do this is because the more mundane and repetitive a task is, the more likely people are to make mistakes, and mistakes can be costly. In fact they are often more costly than the labor itself.
That very much depends on the work you are automating away. For much of the business automation I agree. Invoice processing? Automate. Drip campaigns? Automate. Measuring performance of something? Automate.
But - picking the right photo for some restaurant, as GP stated in another comment? Make good UI and perform manually. Alternatively, train NN in background - but it might not make sense in a startup world where the effort for this would be prohibitively high. In the end it's the photos that matter to your business, not that fancy photo picker algorithm that took ages to develop and that you can't sell to anyone else.
>>But - picking the right photo for some restaurant, as GP stated in another comment?
That’s not a good example though because it is unlikely to be repetitive and mundane. It’s a decision most restaurateurs make at most a few times for each restaurant.
Not in all cases, but in this particular case I think the "stepping on toes" part applies. Assuming that some percentage of repo authors do not want your proposed changes, your automated work is now creating manual work for them to close the PRs. You either need to signal to the repo author that you have put at least as much effort into opening it as they would have to in order to evaluate and close it, or your proposed changes need to be _really_ beneficial.
I actually really like this idea. There are so many random things on github that get broken over time. The implementation though clearly is problematic and github has no choice but to block this behaviour.
I could imagine, though, a system where there was some sort of community-managed GitHub bot. Developers could submit pull requests to the community service to fix common issues. GitHub would then run the service nightly themselves. Developers could opt out of the service if they wanted. Something like this could be very handy for many things: security issues, typos, broken links, etc.
Github Applications exist for that. It's basically a bot you specifically opt into.
The main issue is one of discovery.
Though I imagine you could build an application to notify users of new fixer applications. Maintainers would opt into that for their accounts/repositories, it would then match repositories & applications submitted to it and ping submitters when an application looks… applicable.
> Github would then run the service nightly themselves. Developers could opt-out of the service if they wanted.
Great write up. I am a maintainer of a project you opened a PR on, but the diff is large and seems to include unrelated changes. Maybe you based it off a fork and then opened the PR to upstream? If the bot had worked as intended I would have thought it was cool, but given that it opened a large PR with an incorrect description I found it harmful. I guess the problem with these bots is that it is easy for you to make a mistake which takes a lot longer for people to deal with than it took for you to make, so GitHub needs to ban these to prevent this.
Thanks for reporting this. Firstly, I am sorry for the inconvenience of receiving a broken pull request.
I have looked into the pull request and discovered that this is a variant of "Bug #4" from the blog post.
It happens when the third-party renames their forked repo. At this point, the names don't line up and my bot doesn't realize that the two repos fork to the same location.
I have manually fixed my merge request for your repo and will be writing a script to look for others that might have had a similar experience.
I can see why this would end in suspension - you even mention yourself that when launching the bot (Even on a limited number of repos) you had bugs which messed up README's....
This sounds like a terrible idea! I wouldn't want an automated bot trying to auto-correct my work in this manner
I would actually really appreciate a bot that does that. There are so many badges which are broken, which is just annoying, and many maintainers do not care about the readme once they have written it. They do accept pull requests but don’t update it manually.
And since the bot is only creating pull requests, I don’t see any harm: worst case for my repo, it would break the readme, but I double check it just like any other pull request, realize that it messed up, and fix it myself (but I would be thankful the bot noticed the broken link and I would have a motivation to fix it).
What would be a bit problematic about this bot is if, due to a bug, it started spamming (creating thousands of pull requests, flagging false positives, etc.).
Furthermore, it is also important where to draw the line. A bot that notices something is broken and offers me a fix is OK. A bot that notices I use a working service X and offers a pull request to use service Y could be problematic because it might be useful but might also be annoying (because it is advertising, and service X might be good enough for me).
The issue with bots is never a particular bot. GitHub is just trying to avoid a situation where a large percentage of pull requests are coming from bots. This would drive people off of the platform even if every individual bot were reasonable and justified.
Yes, the bot made broken pull requests to some 35 repos before all the issues were ironed out. As a maintainer of those repos, I would be annoyed at these 'corrections'.
For my part, I manually corrected all of them and apologized to the maintainers for any inconvenience. The corrected pull requests were accepted, and the bot went on to submit correct fixes to several hundred other repos.
There's always the opportunity for bugs, but once they were ironed out it was able to happily submit correct fixes for hundreds more. I think that makes the idea worth something.
> I can see why this would end in suspension - you even mention yourself that when launching the bot (Even on a limited number of repos) you had bugs which messed up README's....
It just opens a PR though, it's not like it actually breaks anything.
The one big issue is that it's spammy.
At the same time, I wonder how you could contact maintainers to see if they're interested, I feel opening an issue would be just as spammy, to say nothing of DM-ing maintainers, or tagging them on issues in your own repository.
You'd want this sort of behaviour to be opt-in, but at the same time you'd be limited by awareness. It's not like this is a big/complex change so chances are the maintainers just don't know about the issue or how to fix it.
If you still are interested in solving the badge problem, you could also register the domain pypip.in again and redirect the URLs to the other service you found. (Looks like it's available.)
Maybe the other service would do it if you emailed them?
Well, if they are not benevolent, at least there will be a strong motivation to finally solve those broken badges then. ;)
I must say this made my day... It's one of those occurrences where one spends many, many hours on a project that in the end could be solved with a small amount of money ($10) and a few e-mails. Must admit I didn't think of it either when I was reading the blog post. :)
Removing already-merged pull requests from display seems adjacent to data integrity issues. GitHub should probably be doing more QA in this area. They've done a lot to improve their security and protect from DDOS so I'm sure they can. I hope someone there will notice it (through this comment or through the blog post) and this will be a wake up call.
We let Greenkeeper update our JavaScript repos. We still have to accept and merge its pull requests. So there are ways to do this where GitHub won't shut you down (if you are Greenkeeper)...
Maybe a solution would be for someone to create an app like Greenkeeper, but which promised to start by doing five things, but to add more over time, informing you of each new thing and letting you opt out at any time: a list of checkboxes.
This is exactly why GitHub will never be the monorepo (as described in Google’s paper https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...). The world needs an ability to do code repairs and refactoring at GitHub-scale and yet probably some people just blocked this person’s bot for spam.
GitHub doesn't seem to have any ambition to be the world’s monorepo either; the features they’ve been building are not usually in line with that. I think GitHub should consider creating a team of people that think about the next 10-20 years of open source development and how THEY can carry the flag in terms of innovation in being a world-scale code repository.
The guy running the show at the Appimage organization routinely runs a bot that automatically creates thousands of issues asking maintainers of random repositories to create Appimage builds, or if there already is an Appimage build it asks them to comply with good Appimage practices, or to fix paths or icons, and so on. Why can they do that with no problems while this less known but more useful bot can't? According to the description here, the bot was useful because it was fixing minor issues in an automated, easy to integrate way with no reasonable downsides to maintainers (other than having to eyeball the pull request and merge it).
Honest and noble intentions to be sure. But the author should have foreseen the consequences, or at least investigated similar services to see how they behave (and what GitHub and users typically consider acceptable). Greenkeeper.io, for example, provides a very similar service for Node.js package dependencies, but it's opt-in, as GitHub support was quoted as mentioning in the article.
All one needs to do is take a step back and take stock of how many real-life situations we find unsolicited anything acceptable, and the real potential for pitfall would've been clear.
This is a great opportunity to remind everyone that http://shields.io/ has been providing high quality SVG badges for all your repo metadata needs for 5 years now.
Even if Shields doesn't support the specific third-party service integration you're looking for you can generate a badge using the incredibly simple image API:
Would it also violate the terms of service to automatically create an issue explaining what is broken, and include a link in the issue text that has the bot create a PR? Then it would be an opt-in situation. Or is it frowned upon to even do that?
Hey Michael, your blog post isn't finished. What is the "much more complicated (and useful) bot that I was working on" supposed to be? If it's not a secret, will you let me know? It's like watching a movie with no ending.
You absolutely deserve to be banned, regardless of ToS. You're wasting thousands of other people's man-hours to demand, however politely, that they fix a minuscule error that likely only bothers you. Even if you provide the fix with the request, it will still take a colossal amount of time to read and review the PRs. This is mechanised pedantry on a despicable scale.
The total amount of time to review one of these pull requests can be estimated at < 30 seconds per repo. 30s * 1000 repos = 8h20m. Hardly thousands of hours.
But yes, this was a particularly trivial problem to tackle. It was meant to be a stepping stone to a truly useful bot that I was working on. However, that has been put on hold.
FWIW, the maintainers themselves gave lots of positive feedback on the project.
I don't mind people doing PRs for these little things but absolutely hate when bots do it. We have a ton of public repos, many of which are old and abandoned (although not as bad now that you can archive them). Getting emails for each of them is annoying to say the least.
I think that the author is taking away the wrong lesson, because I think they started with some wrong ideas about communication. If you read the GitHub ToS, the relevant policy statement says "excessive automated bulk activity" not just "automated bulk activity". I bet someone complained about them, and I think it happened because they thought it would be bad to make the bot act like a person.
If your bot has output, always make your bot act like a person. That means messaging, and that means timing. Even in the best case, if your bot uses few resources and always perfectly does the right thing, people don't like bots.
> There are four very important things that any automated message needs to do in order to help avoid aggravating people: Be Accurate and Useful, Be Honest/Open about being a bot, Have a mechanism for feedback, Be Friendly
No. God no. There are two important things that you need in a PR: Don't act entitled (op did a great job there) and don't waste my time (op failed hard at this). Everything else is bad. Telling someone that your pull request is coming from a bot only hurts your goal. In the absolute best case, they treat your PR like any other. In many cases, though, knowing that a message was automated will get you instantly reported for spamming regardless of how helpful you were.
> Automated messages should describe themselves as such.
This is off topic and therefore violates the "don't waste my time" principle. It also has a tendency to engage the gag reflex.
> It should be the opening line.
Having multiple lines for something so small violates the "don't waste my time" principle. And definitely don't start your message with something that is off topic.
> announcing it as automated helps explain why they are receiving the pull request
They are receiving the pull request because something is broken and you are fixing it for them.
> Have a mechanism for feedback
They can put feedback on the PR. This violates the "don't waste my time" principle.
> I ended up settling on the following message for the pull requests:...
Holy crapballs that's verbose. This definitely violates the "don't waste my time" principle.
"Fix broken badge by pointing to working URL foo [see: bug_report_link]". Boom. Done. It's easy to read, easy to understand, and easy to approve.
> Note that the last paragraph is only included in the message if the README includes the “download count” badges. I debated working out a system to delete these badges automatically
You should have either skipped them or maybe filed an issue instead. "The download count badge in README is broken because the foo API no longer exists". Not a whole paragraph.
> Do not make automatic unsolicited pull requests.
Most pull requests are unsolicited, and GitHub has an automation API for pull requests, and their ToS doesn't prohibit unsolicited automation, just "excessive" such, so this is probably the wrong takeaway.