HN Mirror



	by lourot 2839 days ago
	OP again, we're way overloaded. We got 200 profile requests already in an hour and we didn't see it coming. We're extremely thankful and sorry to disappoint you at the same time. We're closing the "Get your profile" feature for now. Spinning up many EC2 instances to process the 200 profile requests faster in parallel as we speak. And we'll come back soon with a system that can handle this load. Many many thanks!

4 comments

KenanSulayman 2839 days ago

Isn't 200 requests in an hour ... just three requests per minute? What is this service doing?


brillout 2839 days ago

GitHub's API doesn't provide the exhaustive list of all your contributions. Instead, we crawl GitHub's website, which is slow.

Details at https://github.com/AurelienLourot/github-contribs#how-does-i...

We are in talks with GitHub and they know that we are crawling GitHub.


ChristianBundy 2839 days ago

> Instead we crawl GitHub's website which is computationally expensive

I'm surprised to hear that you're bottlenecking on CPU time. Could you verify that my understanding is correct? I would've thought your bottleneck would be networking and connectivity as you have to wait for GitHub to process all of the requests.


brillout 2839 days ago

You're right. I edited my answer.

The unofficial GitHub API we are using is slow and rate limited per IP. We spin up several servers to get several IPs and circumvent the rate limit.

(GitHub knows that we do that and we are in contact with them.)


exikyut 2838 days ago

If GitHub are okay with you using multiple IPs to get that data then it's not inherently expensive on their side for you to be using this.

Surely a rate-limit exception could be in order, then?

And perhaps you could help them alpha-test a new API endpoint that just so happens to include all the info rolled up as one URL :D

(Hmmmmm.... GraphQL....)


rjacksonm1 2839 days ago

Looks like they're hitting a GitHub rate limit, rather than bottlenecking on CPU.


lourot 2839 days ago

this is the bottleneck: https://github.com/AurelienLourot/github-contribs

It uses a very slow unofficial GitHub API, so the initial crawl of a single profile takes several hours. It is also rate limited per IP, so you need several machines/instances in order to parallelize. We plan to use AWS Fargate in the future. (We thought this future would be much farther away.)
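
To make the numbers above concrete, here is a rough back-of-envelope sketch. The thread only says "several hours" per profile, so the 3-hour crawl time and the 24-hour target below are assumptions picked for illustration, and `instances_needed` is a hypothetical helper, not part of github-contribs. It also assumes one profile at a time per instance, since the rate limit is per IP.

```python
import math


def instances_needed(pending_profiles: int,
                     hours_per_profile: float,
                     target_hours: float) -> int:
    """Instances (one IP each) needed to clear a backlog within target_hours.

    Assumes each instance crawls one profile at a time, because the rate
    limit applies per IP.
    """
    if target_hours < hours_per_profile:
        raise ValueError("one profile cannot finish faster than its own crawl time")
    # Profiles a single instance can complete back to back in the window.
    profiles_per_instance = math.floor(target_hours / hours_per_profile)
    return math.ceil(pending_profiles / profiles_per_instance)


# 200 requests in one hour is an arrival rate of only about 3.3 per minute ...
print(round(200 / 60, 1))                # 3.3

# ... but at an ASSUMED 3 hours per crawl, a single instance would need
# 600 hours, and clearing the backlog within 24 hours takes 25 instances.
print(200 * 3.0)                         # 600.0
print(instances_needed(200, 3.0, 24.0))  # 25
```

Under these assumptions the cost is dominated by crawl time per profile, not by the request rate, which is consistent with the explanation above.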


iokanuon 2839 days ago

Maybe it's cloning all GitHub repos of a user to compute activity?


brillout 2839 days ago

We don't clone repos. Instead we use GitHub's API to get activity info.

E.g. https://api.github.com/repos/aurelienlourot/ghuser.io/contri...


sam0x17 2839 days ago

Would it be practical to use Puppeteer / headless Chrome from within a Google Cloud Function to do this? Then you could do literally millions of profiles per hour. The only requirement is that any one profile never takes more than 540 seconds, and you could work around that by adding the ability to export the execution state and send it off to a new function invocation when you are close to the deadline. This seems like a much easier problem if you can get out of VPS land and into lambda/cloud functions, since it's something like $0.00001 per invocation.

https://cloud.google.com/blog/products/gcp/introducing-headl...
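
The "export the execution state and hand it to a new invocation" idea could look roughly like the sketch below. It is plain Python with no cloud or Puppeteer calls. `crawl_step`, `run_invocation`, the 30-second safety margin, and the JSON state layout are all hypothetical names and values chosen for illustration. Only the 540-second limit comes from the comment above.

```python
import json
import time
from typing import Callable, Tuple

DEADLINE_SECONDS = 540.0  # per-invocation limit mentioned above
SAFETY_MARGIN = 30.0      # ASSUMED headroom left for exporting state


def crawl_step(repo: str) -> dict:
    """Hypothetical placeholder for fetching one repo's contribution data."""
    return {"repo": repo, "done": True}


def run_invocation(state_json: str,
                   clock: Callable[[], float] = time.monotonic,
                   step: Callable[[str], dict] = crawl_step) -> Tuple[str, bool]:
    """Process pending repos until finished or close to the deadline.

    Returns (new_state_json, finished). When finished is False, the caller
    passes new_state_json to a fresh invocation, which resumes the work.
    """
    state = json.loads(state_json)
    started = clock()
    while state["pending"]:
        # Stop early and export the state instead of being killed mid-step.
        if clock() - started > DEADLINE_SECONDS - SAFETY_MARGIN:
            return json.dumps(state), False
        repo = state["pending"].pop(0)
        state["results"].append(step(repo))
    return json.dumps(state), True


# Driver loop standing in for "send it off to a new function invocation".
state = json.dumps({"pending": ["a/x", "a/y", "a/z"], "results": []})
finished = False
while not finished:
    state, finished = run_invocation(state)
print(len(json.loads(state)["results"]))  # 3
```

Because the state is a plain JSON string, it could in principle be passed through a queue or an HTTP call to the next invocation. This does not address the per-IP rate limit raised elsewhere in the thread.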


gitgud 2838 days ago

That's amazingly cheap, but would each puppeteer instance have its own IP address? They're being rate limited on each IP.


sam0x17 2838 days ago

I would think you get a random IP as they have millions of worker nodes, but I don't know.


sethish 2839 days ago

You may want to delete my profile request, I have somewhere between 50k and 60k repositories and it tends to crash services. (user: sethwoodworth)


lytedev 2839 days ago

Haha! That is amazing! How, exactly, do you have so many?!


ishanjain28 2839 days ago

Can we clone the repo and run it locally on my own profile? I arrived late..


brillout 2839 days ago

Not currently but we are thinking about it, see https://github.com/AurelienLourot/ghuser.io/issues/103
