Ask HN: Is there a Hacker News takeout to export my comments / upvotes, etc.?


	45 points by thyrox 924 days ago
	Like the title says, I'm wondering if there is an equivalent of Google Takeout for HN. Or how are you guys doing it? Thanks.

8 comments

zX41ZdbW 924 days ago

You can export the whole dataset as described here: https://github.com/ClickHouse/ClickHouse/issues/29693

Or query one of the preloaded datasets: https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...

    curl https://clickhouse.com/ | sh

    ./clickhouse client --host play.clickhouse.com --user play --secure --query "SELECT * FROM hackernews WHERE by = 'thyrox' ORDER BY time" --format JSON


verdverm 923 days ago

This does not include the user's private data, which looks like what OP is after as well


Jugurtha 923 days ago

Here's a small, crude Scrapy spider, with hardcoded values and all. You can set the value of `DOWNLOAD_DELAY` in `settings.py` for courtesy. It puts the comments in a `posts` directory as `html` files.

It doesn't handle upvotes or submitted stories/links (those have the type `story` in the response, as opposed to `comment` for comments). You can easily tweak it.

  from pathlib import Path
  
  import scrapy
  import requests
  import html
  import json
  import os

  USER = 'Jugurtha'  
  LINKS = f'https://hacker-news.firebaseio.com/v0/user/{USER}.json?print=pretty'
  BASE_URL = 'https://hacker-news.firebaseio.com/v0/item/'

  class HNSpider(scrapy.Spider):
      name = "hn"
  
      def start_requests(self):
          submitted = requests.get(LINKS).json()['submitted']
          urls = [f'{BASE_URL}{sub}.json?print=pretty' for sub in submitted]
          for url in urls:
              item = url.split('/item/')[1].split('.json')[0]
              filename = f'{item}.html'
              filepath = Path(f'posts/{filename}')
              if not os.path.exists(filepath):
                  yield scrapy.Request(url=url, callback=self.parse)
              else:
                  self.log(f'Skipping already downloaded {url}')
  
      def parse(self, response):
          item = response.url.split('/item/')[1].split('.json')[0]
  
          filename = f"{item}.html"
          content = json.loads(response.text).get('text')
          if content is not None:
              text = html.unescape(content)
              filepath = Path(f'posts/{filename}')
              # Make sure the output directory exists before writing.
              filepath.parent.mkdir(exist_ok=True)
  
              with open(filepath, 'w') as f:
                  f.write(text)
                  self.log(f"Saved file {filename}")
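
The tweak for stories could look roughly like the sketch below. It assumes each item's JSON carries a `type` field (`story`, `comment`, and so on) as described in the HackerNews/API docs; `group_by_type` and the sample dicts are made-up names for illustration and are not part of the spider above.

```python
from collections import defaultdict


def group_by_type(items):
    """Group HN item dicts by their `type` field.

    Missing items (None) and items without a `type` go under 'unknown'.
    """
    groups = defaultdict(list)
    for item in items:
        if not item:  # a fetch may yield None for an id with no data
            groups['unknown'].append(item)
            continue
        groups[item.get('type', 'unknown')].append(item)
    return dict(groups)


if __name__ == '__main__':
    # Hypothetical sample data shaped like item responses.
    sample = [
        {'id': 1, 'type': 'story', 'title': 'Ask HN: example'},
        {'id': 2, 'type': 'comment', 'text': 'An example comment'},
        None,
    ]
    for kind, found in group_by_type(sample).items():
        print(kind, len(found))
```

With something like this, `parse` could write stories and comments to separate directories instead of dropping everything that has no `text`.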


gabrielsroka 923 days ago

I cleaned up the code a little bit, but I didn't test it. This will have the same limitation as the Python I posted earlier in that you're not authenticated.

  from pathlib import Path
  
  import scrapy
  import requests
  import html
  import json
  import os
 
  # Set this:
  USER = 'Jugurtha'  
  
  BASE_URL = 'https://hacker-news.firebaseio.com/v0' # https://github.com/HackerNews/API
  LINKS = f'{BASE_URL}/user/{USER}.json'
 
  class HNSpider(scrapy.Spider):
      name = 'hn'
  
      def start_requests(self):
          items = requests.get(LINKS).json()['submitted']
          for item in items:
              url = f'{BASE_URL}/item/{item}.json'
              filepath = Path(f'posts/{item}.html')
              if os.path.exists(filepath):
                  self.log(f'Skipping already downloaded {url}')
              else:
                  yield scrapy.Request(url=url, callback=self.parse)
  
      def parse(self, response):
          item = response.url.split('/item/')[1].split('.json')[0]
  
          filename = f'{item}.html'
          content = json.loads(response.text).get('text')
          if content:
              text = html.unescape(content)
  
              # Make sure the output directory exists before writing.
              Path('posts').mkdir(exist_ok=True)
              with open(Path(f'posts/{filename}'), 'w') as f:
                  f.write(text)
                  self.log(f'Saved file {filename}')


gabrielsroka 923 days ago

I wrote a JS one years ago. It still seems to work but it might need some more throttling.

https://news.ycombinator.com/item?id=34110624

Edit: I see I added a sleep on line 83 a few years ago.

Edit 2: I just fixed a big bug, I'm not sure if it was there before.

Edit 3: I wrote a Python one, too, but I haven't tested it and it most likely needs to be throttled. It's also not currently authenticated so only useful for certain pages unless you add authentication.

https://github.com/gabrielsroka/gabrielsroka.github.io/blob/...
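
On throttling, here is a stdlib-only sketch of the idea, not the code from the link above. It assumes the public item endpoint mentioned elsewhere in this thread; `fetch_json`, `fetch_items`, and the one-second default delay are illustrative choices, and the `fetch`/`sleep` parameters exist only so the loop can be tried without touching the network.

```python
import json
import time
from urllib.request import urlopen

BASE_URL = 'https://hacker-news.firebaseio.com/v0'


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def fetch_items(item_ids, delay=1.0, fetch=fetch_json, sleep=time.sleep):
    """Fetch items one at a time, pausing `delay` seconds between requests."""
    items = []
    for n, item_id in enumerate(item_ids):
        if n:  # no pause before the first request
            sleep(delay)
        items.append(fetch(f'{BASE_URL}/item/{item_id}.json'))
    return items
```

Calling `fetch_items(submitted_ids, delay=2.0)` would then issue one request every two seconds or so; the delay value is a courtesy guess, not a documented rate limit.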


westurner 924 days ago

There are a few tests for this script, which isn't packaged: https://github.com/westurner/dlhn/ https://github.com/westurner/dlhn/tree/master/tests https://github.com/westurner/hnlog/blob/master/Makefile

Ctrl-F over the single document in a browser tab works, but it isn't a regex search (or `grep -i -C`) without a browser extension.
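
For a regex search with context outside the browser, something along these lines could stand in for `grep -i -C`. This is a rough sketch and not part of dlhn; `grep_context` is a hypothetical helper that works on any list of text lines.

```python
import re


def grep_context(lines, pattern, context=2):
    """Return blocks of lines around case-insensitive regex matches.

    Roughly what `grep -i -C <context>` prints; overlapping or adjacent
    blocks are merged into one.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = [i for i, line in enumerate(lines) if regex.search(line)]
    blocks = []
    for i in hits:
        start = max(0, i - context)
        end = min(len(lines), i + context + 1)
        if blocks and start <= blocks[-1][1]:
            # Overlaps or touches the previous block: extend it.
            blocks[-1][1] = max(blocks[-1][1], end)
        else:
            blocks.append([start, end])
    return [lines[start:end] for start, end in blocks]
```

Reading the exported document with `Path(...).read_text().splitlines()` and passing the result in would give the matching passages with their surrounding lines.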

Dogsheep / datasette has a SQLite query Web UI

HackerNews/API: https://github.com/HackerNews/API


verdverm 924 days ago

https://gist.github.com/verdverm/23aefb64ee981e17452e95dd5c4...

Fetches pages and then converts to json

There might be an HN API now. I know they've wanted one, and I thought I might have seen posts more recently that made me think it now exists, but I haven't looked for it myself.


krapp 923 days ago

Hacker News has had an API since 2014[0]. It can be found via the "API" link at the bottom of the page[1].

[0]https://www.ycombinator.com/blog/hacker-news-api

[1]https://github.com/HackerNews/API


verdverm 923 days ago

That's a read-only, unauthenticated API, correct?

In other words, it does not show how to get a user's upvotes, which are only visible to that user.


mooreds 924 days ago

Nothing out of the box.