| HN Mirror


	by me_again 1442 days ago
	What I would really like is a little bit like this but not quite the same: full text search over everything I have ever seen on the computer. It would read and index the emails, web pages, word docs, etc as I open them, then later when I think "I know I saw a doc about cache oblivious algorithms", I can search for it without being distracted by 100K documents I haven't seen. Or I can find that email I read, without finding the same phrase in a bunch of junk mail I never opened. Does anything remotely similar exist?

14 comments

bmn__ 1442 days ago

The pieces exist; you can string them together with a Perl one-liner. You are interested in the set intersection of the following two topics:

Full indexing: <https://lesbonscomptes.com/recoll>, <https://userbase.kde.org/Akonadi>, <https://addons.mozilla.org/firefox/addon/falcon_extension> (If you're not content with a piece, then research substitutes on <https://alternativeto.net>.)

Recent: `~/.local/share/recently-used.xbel`

This does not help with the email part because email programs do not register opened messages in recently used. Work-around: install a DBus or AT-SPI hook and write your own database of recently opened messages.

Happy hacking!
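A rough sketch of the "set intersection" idea, in Python rather than Perl: read the recently-used list and keep only the search hits that appear in it. The function names, the sample XBEL snippet, and the `hits` list are made up for illustration; a real setup would take hits from an indexer such as Recoll and read the XBEL file from disk.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote, urlparse


def recently_opened_paths(xbel_text: str) -> set[str]:
    """Return local file paths listed as bookmarks in an XBEL document."""
    root = ET.fromstring(xbel_text)
    paths = set()
    for bookmark in root.iter("bookmark"):
        parsed = urlparse(bookmark.get("href", ""))
        if parsed.scheme == "file":
            # href values are percent-encoded file:// URIs
            paths.add(unquote(parsed.path))
    return paths


def filter_to_seen(search_hits: list[str], seen: set[str]) -> list[str]:
    """Keep only the hits the user has actually opened, preserving order."""
    return [hit for hit in search_hits if hit in seen]


# Hypothetical sample standing in for ~/.local/share/recently-used.xbel
SAMPLE_XBEL = """<xbel version="1.0">
  <bookmark href="file:///home/me/docs/cache%20oblivious.pdf" visited="2021-08-01T10:00:00Z"/>
  <bookmark href="file:///home/me/notes.txt" visited="2021-08-02T09:30:00Z"/>
</xbel>"""

seen = recently_opened_paths(SAMPLE_XBEL)
# Hypothetical full-text search results from some indexer
hits = ["/home/me/docs/cache oblivious.pdf", "/home/me/never-opened.pdf"]
print(filter_to_seen(hits, seen))
```

As the comment notes, this covers files only; opened emails would need their own record of what was read.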

shishironline 1442 days ago

Thank you for sharing this

isaacimagine 1442 days ago

I've seen this called a 'personal search engine' before. One person who is well known for their personal search setup is thesephist[0]. My friend is also working on an extension that uses NLP to semantically index your browsing history so that any text on the internet can be turned into a hyperlink to something else you've read[1].

[0]: https://thesephist.com/posts/monocle/

[1]: (WIP) http://espial.uzpg.me

suby 1442 days ago

I've read comments (don't remember the forum, perhaps HN) where people have said that they did this. No idea if there's a public project for this that you can use, but people have definitely done it. I agree that it'd be useful to have, though you probably need a good way to filter out irrelevant stuff.

AB1908 1442 days ago

Try looking at karlicoss's Promnesia and its background for similar ideas and tools.

ryanfox 1441 days ago

I’ve been working on exactly that! [0]

My info is in my hn profile, if you (or anyone reading) would like to chat about it.

[0] https://apse.io

DocTomoe 1442 days ago

For Windows, Google used to have something like that. Because it's Google, it has since been discontinued [1].

Mac's Finder is close to what you have described, and works reasonably well for me.

On Unix, this sounds like something a grep one-liner (maybe with some document depacking/packing pipe for Office documents) would do.

[1] https://en.wikipedia.org/wiki/Google_Desktop
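For the "document depacking" part, a minimal sketch in Python, assuming the Office documents are `.docx` files: those are zip archives whose main text lives in `word/document.xml`. The helper names and the in-memory sample are hypothetical, and the tag stripping is deliberately crude.

```python
import io
import re
import zipfile
from html import unescape


def docx_text(data: bytes) -> str:
    """Extract rough plain text from a .docx file's main document part."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # End each paragraph with a newline, then drop all remaining tags.
    xml = xml.replace("</w:p>", "\n")
    return unescape(re.sub(r"<[^>]+>", "", xml))


def docx_contains(data: bytes, phrase: str) -> bool:
    """Case-insensitive phrase search over the extracted text."""
    return phrase.lower() in docx_text(data).lower()


def make_sample_docx(text: str) -> bytes:
    """Build a stripped-down stand-in for a .docx (only the part we read)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "word/document.xml",
            f"<w:document><w:body><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:body></w:document>",
        )
    return buffer.getvalue()


sample = make_sample_docx("Notes on cache oblivious algorithms")
print(docx_contains(sample, "Cache Oblivious"))
print(docx_contains(sample, "junk mail"))
```

Like a plain grep, this searches every file rather than only the ones that were opened, so it would still need a filter on what the user has actually seen.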

applgo443 1442 days ago

I considered doing this: take screenshots of your screen constantly, OCR them, and index them. It's fairly simple. However, there are some problems:

- OCR constantly running in the background is power-hungry.
- At what granularity do you take your screenshots? Imagine each screenshot is 500 KB and you take one each second. That'd result in roughly 40 GB of data per day. How are we gonna store it? How many days of data do you want to keep?
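The storage estimate can be checked with a few lines of Python. The 500 KB size and the one-per-second rate come from the comment; the 10-second interval is only an extra illustration of how the capture rate changes the total.

```python
SCREENSHOT_BYTES = 500 * 1000      # 500 KB per screenshot (figure from the comment)
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400 seconds


def daily_storage_gb(bytes_per_shot: int, interval_seconds: float) -> float:
    """Gigabytes written per day when capturing one screenshot per interval."""
    shots_per_day = SECONDS_PER_DAY / interval_seconds
    return bytes_per_shot * shots_per_day / 1e9


print(daily_storage_gb(SCREENSHOT_BYTES, 1))    # one screenshot per second
print(daily_storage_gb(SCREENSHOT_BYTES, 10))   # one screenshot every 10 seconds
```

At one screenshot per second this comes to about 43 GB per day, in line with the rough "40 gigs" figure.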

billwashere 1442 days ago

That's Apse – A Personal Search Engine https://news.ycombinator.com/item?id=27965979

nly 1442 days ago

Privacy?

capableweb 1442 days ago

Since parent is taking power consumption and disk storage into consideration, it's fair to assume they are considering a local approach, meaning privacy is as good/bad as any other local data you have on disk today.

aastronaut 1442 days ago

There was once a thread here on Hacker News about missing features of operating systems, since such a thing could only be achieved at the OS level... can't find it anymore, unfortunately. It was mentioned that passwords etc. could be a nightmare. A feature like that would be a big dream of mine: some kind of individualized semantic archiving process plus a vector search engine to search through it.

bitL 1442 days ago

Open-text question answering. Just make your own: index all paragraphs of all documents using TF-IDF as you access them, then when trying to search for something, use this index to get a set of candidate paragraphs and run them through BERT-QA trained on SQuAD v2. You can extend this to the content of images: first run image captioning using a CNN and transformers, then index the resulting paragraphs the same way (in both cases, include a link to the original in the metadata). You might need to write a browser plugin or system driver to do it automatically as you access documents/images.
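A minimal sketch of the candidate-retrieval half of that pipeline, in plain Python with no external libraries. The `ParagraphIndex` class, its scoring details (raw term count times a smoothed IDF), and the sample paragraphs are assumptions for illustration; the question-answering step is left out, and the returned candidates are what would be handed to a QA model.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class ParagraphIndex:
    """Tiny in-memory TF-IDF index over paragraphs, each tagged with a source link."""

    def __init__(self):
        self.paragraphs = []        # (source_link, text, term counts)
        self.doc_freq = Counter()   # number of paragraphs containing each term

    def add(self, source: str, text: str) -> None:
        counts = Counter(tokenize(text))
        self.paragraphs.append((source, text, counts))
        self.doc_freq.update(counts.keys())

    def search(self, query: str, top_k: int = 3) -> list[tuple[str, str]]:
        n = len(self.paragraphs)
        terms = tokenize(query)
        scored = []
        for source, text, counts in self.paragraphs:
            score = 0.0
            for term in terms:
                if term in counts:
                    # Smoothed inverse document frequency
                    idf = math.log((1 + n) / (1 + self.doc_freq[term])) + 1
                    score += counts[term] * idf
            if score > 0:
                scored.append((score, source, text))
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(source, text) for _, source, text in scored[:top_k]]


index = ParagraphIndex()
index.add("file:///notes/algos.md", "Cache oblivious algorithms perform well without knowing the cache size.")
index.add("file:///mail/123", "Lunch is at noon on Friday.")
index.add("file:///notes/db.md", "A B-tree keeps the cache warm for range scans.")

candidates = index.search("cache oblivious algorithms")
for source, text in candidates:
    print(source, "->", text)
```

Calling `add` at the moment a document is opened, rather than crawling the disk, is what would restrict the index to things actually seen.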

dirkc 1442 days ago

There used to be a project that kind of did this: https://beagle-project.org/. It's long since defunct, and I'm always surprised that nothing emerged to fill the gap.

EDIT: I did a bit of Wikipedia rabbit-holing only to discover that Tracker [1] is currently running on my computer and indexing my files.

[1]: https://en.wikipedia.org/wiki/Tracker_(search_software)

solardev 1442 days ago

I think Windows and Mac both do this by default, no? Just disable the web search and your local full text is what you're left with.

sneak 1442 days ago

It doesn't index text in local image files, and it doesn't index over the full text of all the webpages and epubs I've read.

The5thElephant 1442 days ago

I believe they meant they want something that searches only content the user has personally accessed, not ALL local content.

ricardobeat 1442 days ago

Spotlight on Mac will show you recently opened, or frequently opened, files first.

invalidusernam3 1442 days ago

Ordering by "Date Last Opened" or "Date Modified" does a fairly good job in some cases

theK 1442 days ago

This does introduce sorting contention, though: what do you sort by first, relevance or date accessed? Ideally you would want to make date accessed an aspect of relevance itself.
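One way to fold date accessed into relevance, sketched in Python: multiply the text-match score by a boost that decays with time since the file was last opened. The formula, the 30-day half-life, and the sample hits are arbitrary assumptions, not something any of the tools mentioned here are known to use.

```python
def blended_score(text_relevance: float, days_since_opened: float,
                  half_life_days: float = 30.0) -> float:
    """Scale a text-match score by a recency boost that halves every half_life_days."""
    recency = 0.5 ** (days_since_opened / half_life_days)   # 1.0 today, 0.5 after one half-life
    return text_relevance * (1.0 + recency)


# Hypothetical hits: (name, text relevance in [0, 1], days since last opened)
hits = [
    ("old_but_exact.pdf", 0.9, 400),
    ("recent_partial.txt", 0.7, 2),
]

ranked = sorted(hits, key=lambda hit: blended_score(hit[1], hit[2]), reverse=True)
for name, relevance, age in ranked:
    print(name, round(blended_score(relevance, age), 3))
```

With these numbers the recently opened partial match outranks the older exact match; a longer half-life or a smaller boost would tilt the ordering back toward pure text relevance.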

fnord123 1442 days ago

I thought we all disabled access time (mounting with noatime) to avoid trashing SSDs so quickly.

rhn_mk1 1442 days ago

For the web part, there's a tool called Recoll, and a browser plugin Recoll-we.

joshu 1442 days ago

this has been built before. the problem is that it also needs attention for ranking

2Gkashmiri 1442 days ago

you know, there was an April Fools' Day announcement on TorrentFreak years ago, maybe a decade, describing this behaviour. that was nice
