curl https://clickhouse.com/ | sh
./clickhouse client --host play.clickhouse.com --user play --secure --query "SELECT * FROM hackernews WHERE by = 'thyrox' ORDER BY time" --format JSON
Here's a small, crude, Scrapy spider, with hardcoded values and all. You can set the value of `DOWNLOAD_DELAY` in `settings.py` for courtesy. It puts the comments in a `posts` directory as `html` files.
It doesn't do upvotes nor stories/links submitted (they have the type `story` in the response, as opposed to `text` for comments). You can easily tweak it.
from pathlib import Path
import scrapy
import requests
import html
import json
import os
USER = 'Jugurtha'
LINKS = f'https://hacker-news.firebaseio.com/v0/user/{USER}.json?print=pretty'
BASE_URL = 'https://hacker-news.firebaseio.com/v0/item/'
class HNSpider(scrapy.Spider):
name = "hn"
def start_requests(self):
submitted = requests.get(LINKS).json()['submitted']
urls = [f'{BASE_URL}{sub}.json?print=pretty' for sub in submitted]
for url in urls:
item = url.split('/item/')[1].split('.json')[0]
filename = f'{item}.html'
filepath = Path(f'posts/{filename}')
if not os.path.exists(filepath):
yield scrapy.Request(url=url, callback=self.parse)
else:
self.log(f'Skipping already downloaded {url}')
def parse(self, response):
item = response.url.split('/item/')[1].split('.json')[0]
filename = f"{item}.html"
content = json.loads(response.text).get('text')
if content is not None:
text = html.unescape(content)
filepath = Path(f'posts/{filename}')
with open(Path(f'posts/{filename}'), 'w') as f:
f.write(text)
self.log(f"Saved file {filename}")
I cleaned up the code a little bit, but I didn't test it. This will have the same limitation as the Python I posted earlier in that you're not authenticated.
from pathlib import Path
import scrapy
import requests
import html
import json
import os
# Set this:
USER = 'Jugurtha'
BASE_URL = 'https://hacker-news.firebaseio.com/v0' # https://github.com/HackerNews/API
LINKS = f'${BASE_URL}/user/{USER}.json'
class HNSpider(scrapy.Spider):
name = 'hn'
def start_requests(self):
items = requests.get(LINKS).json()['submitted']
for item in items:
url = f'{BASE_URL}/item/{item}.json'
filepath = Path(f'posts/{item}.html')
if os.path.exists(filepath):
self.log(f'Skipping already downloaded {url}')
else:
yield scrapy.Request(url=url, callback=self.parse)
def parse(self, response):
item = response.url.split('/item/')[1].split('.json')[0]
filename = f'{item}.html'
content = json.loads(response.text).get('text')
if content:
text = html.unescape(content)
with open(Path(f'posts/{filename}'), 'w') as f:
f.write(text)
self.log(f'Saved file {filename}')
Edit: I see I added a sleep on line 83 a few years ago.
Edit 2: I just fixed a big bug, I'm not sure if it was there before.
Edit 3: I wrote a Python one, too, but I haven't tested it and it most likely needs to be throttled. It's also not currently authenticated so only useful for certain pages unless you add authentication.
There might be an HN API now. I know.theyve wanted one and I thought I might have seen posts more recently that made me think it now exists, but I haven't looked for it myself
Or query one of the preloaded datasets: https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...