by supz_k 883 days ago
This is cool! I recently worked on integrating something like this into our blogging platform [0] to help bloggers monitor their links automatically. One main problem is popular websites that have pretty aggressive bot-prevention mechanisms. They often return 5xx codes even for HEAD requests. How do you combat that?

[0] https://blogs.hyvor.com

1 comment

I had to write a piece of logic that sends a GET when HEAD fails, because of servers that are not HTTP-compliant. HEAD should in theory return the same status as a GET, just without a body. In practice, many web servers return 404 on HEAD, sometimes 500. HN itself returns 405 Method Not Allowed, which makes some sense, I guess.

The code in question, simplified:

    defp try_get_request?(%Link{method: :head}, %StatusResponse{} = response) do
      # Per the HTTP specification, the response to a HEAD request should be equivalent
      # to the GET response, but must not contain a body.
      # Not all servers respect this, so the status on HEAD may differ from the status on GET.
      #
      # We assume that some HTTP status codes are suspicious and worth retrying with a GET.
      #
      # HTTP 520 is used by Cloudflare to mean "Web Server Returned an Unknown Error", possibly
      # in response to a HEAD request.

      response.code in [403, 404, 405, 520]
    end
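
For anyone who wants to see the whole round trip, here is a rough sketch of how a predicate like this could drive the fallback. It is not the actual code: the `HeadFallback` module, the `check/2` function and the injected `request_fun` (a two-argument function returning `{:ok, status_code}` or `{:error, reason}`) are all hypothetical names made up for illustration, and only the list of status codes comes from the snippet above.

```elixir
defmodule HeadFallback do
  # Hypothetical module, for illustration only.
  # Status codes treated as suspicious on HEAD (same list as the snippet above).
  @suspicious_codes [403, 404, 405, 520]

  # `request_fun` is an assumed 2-arity function: (method, url) -> {:ok, code} | {:error, reason}.
  # Injecting it keeps the sketch independent of any particular HTTP client.
  def check(url, request_fun) when is_function(request_fun, 2) do
    case request_fun.(:head, url) do
      # HEAD gave a suspicious status: retry once with GET and trust that result.
      {:ok, code} when code in @suspicious_codes ->
        request_fun.(:get, url)

      # Any other HEAD result (success, other status, transport error) is returned as-is.
      other ->
        other
    end
  end
end
```

With a fake `request_fun` that answers 405 to HEAD and 200 to GET, `HeadFallback.check("https://example.com", fake)` would return `{:ok, 200}`. The trade-off is one extra request (with a body) for every link whose server mishandles HEAD.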