
I'm a CS grad student, by the way.

Looking for relevant papers involves sifting through a LOT of chaff. I want dense, focused search results, and I want to do as little work as possible to get them. What I want from each result:

  * Scannable

  * Enough context to establish possible relevance

  * An easy way to obtain the fulltext of the paper and a .bib entry

As far as scannability goes...

  * I'd rather scroll than click.

  * I'd rather not scroll than scroll.

The more info I can easily read on each screen, the better, and I want action links with the search result itself. Clicks that go to other pages or sites require leaving a trail of tabs open in the browser to avoid losing the search context. So don't assume that someone who clicks on a link wants it to open in the same window. I want to whip through all the garbage as fast as possible, and every click and animated expanding box makes that harder.

Part of the issue is that search results are only a small part of the paper-finding process, and the poor quality of most results (as well as text buried in PDFs) means that a lot of additional steps are required to assess relevance. I've tried Zotero but don't like being trapped inside it, so I've developed my own workflow for capturing and assessing papers:

First, every paper I download gets a unique identifier that is easy to recreate from the paper's metadata, so I can figure out what it is just from a printed hardcopy. The code is similar to the one that Google Scholar used to generate, slightly extended to improve uniqueness. It's not perfect, but I think I've had only three collisions during the time I've been using it.
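A rough sketch of what such a code generator might look like, in Python. The exact scheme is not spelled out above, so everything here is an assumption: the base is taken to be surname + year + first significant title word, and the "slight extension" is guessed to be the initials of the next two significant words. The function name `paper_code` and the stop-word list are hypothetical.

```python
import re
import unicodedata

# Hypothetical stop-word list; the real scheme's rules are not known.
STOP_WORDS = {"a", "an", "the", "on", "of", "for", "in", "and", "to", "with"}


def _ascii_lower(text: str) -> str:
    """Strip accents and non-letters, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^A-Za-z]", "", ascii_only).lower()


def paper_code(first_author_surname: str, year: int, title: str) -> str:
    """Build a code that can be recreated by hand from a printed first page."""
    words = [w for w in (_ascii_lower(t) for t in title.split())
             if w and w not in STOP_WORDS]
    first = words[0] if words else "untitled"
    # Assumed extension for uniqueness: initials of the next two significant words.
    extra = "".join(w[0] for w in words[1:3])
    return f"{_ascii_lower(first_author_surname)}{year}{first}{extra}"


print(paper_code("Knuth", 1974, "Structured Programming with go to Statements"))
# prints knuth1974structuredpg
```

Because the code depends only on surname, year, and title, it can be rebuilt from a hardcopy without looking anything up, at the cost of occasional collisions.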

Second, the paper is saved as CODENAME.pdf in a papers directory, and possibly symlinked to a project directory. I've got a greasemonkey script to automatically route appropriate sites through my university's ezproxy, but the slight differences between IEEE, ACM, and Springer are constantly annoying.
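The filing step could be scripted along these lines. This is only a sketch: the function name `file_paper`, the directory layout, and the refusal to overwrite an existing file are all assumptions, and it does not cover the ezproxy routing.

```python
import shutil
from pathlib import Path
from typing import Optional


def file_paper(downloaded: Path, code: str, papers_dir: Path,
               project_dir: Optional[Path] = None) -> Path:
    """Move a downloaded PDF to papers_dir/CODE.pdf and optionally symlink it."""
    papers_dir.mkdir(parents=True, exist_ok=True)
    target = papers_dir / f"{code}.pdf"
    if target.exists():
        # A clash means either a re-download or a code collision; stop and look.
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(downloaded), str(target))

    if project_dir is not None:
        project_dir.mkdir(parents=True, exist_ok=True)
        link = project_dir / target.name
        if not link.is_symlink() and not link.exists():
            # Link to the absolute path so the link survives a change of cwd.
            link.symlink_to(target.resolve())
    return target
```

Raising on an existing file, rather than overwriting, is one way to surface the rare code collisions mentioned above.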

Third, a BibTeX entry (with the code as the identifier) is appended to a master .bib file. Google Scholar's BibTeX entries are often incomplete, so getting them from the publisher's site is much preferable. Bad entries still creep in, and have to be cleaned up later if the paper ends up being used as a reference.
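Swapping the publisher's citation key for the code and appending the entry might look like the sketch below. The helper names `rekey_bibtex` and `append_entry` are hypothetical, and the regex only handles the common `@type{key,` header form; it makes no attempt to clean up incomplete fields.

```python
import re
from pathlib import Path

# Matches the start of an entry, e.g. "@article{Knuth:1974:SPG,".
_ENTRY_HEAD = re.compile(r"(@\w+\s*\{\s*)([^,\s]*)(\s*,)")


def rekey_bibtex(entry: str, code: str) -> str:
    """Replace the citation key of a single BibTeX entry with `code`."""
    new_entry, count = _ENTRY_HEAD.subn(
        lambda m: m.group(1) + code + m.group(3), entry, count=1)
    if count == 0:
        raise ValueError("no BibTeX entry header found")
    return new_entry


def append_entry(master_bib: Path, entry: str, code: str) -> None:
    """Append the re-keyed entry to the master .bib file."""
    with master_bib.open("a", encoding="utf-8") as fh:
        fh.write("\n" + rekey_bibtex(entry, code).strip() + "\n")
```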

Fourth, an entry for the paper is created in an appropriate .org file, keyed with the code. Notes will later be transcribed, and keywords appended.
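For illustration, a stub entry could be generated like this. The layout is an assumption, not the format actually used: a top-level heading starting with the code, keywords as Org tags, and the code repeated in a `CUSTOM_ID` property so it can be linked to. The function name `org_entry` is hypothetical.

```python
import re
from typing import Iterable


def org_entry(code: str, title: str, keywords: Iterable[str] = ()) -> str:
    """Return an Org heading for a paper, keyed by its code."""
    # Org tags may not contain spaces or hyphens, so normalise to underscores.
    tags = [re.sub(r"\W+", "_", k.strip()) for k in keywords if k.strip()]
    tag_suffix = f" :{':'.join(tags)}:" if tags else ""
    return (
        f"* {code} {title}{tag_suffix}\n"
        ":PROPERTIES:\n"
        f":CUSTOM_ID: {code}\n"
        ":END:\n\n"
    )
```

Notes transcribed later from the hardcopy would go under the heading, and further keywords can be added as tags.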

That's the trawling process. Later, I'll go back and actually sort through all the papers pulled in, to determine whether or not they're really relevant, or might be relevant to another project. This can be done either in a PDF reader (which is painful) or with a large pile of actual hardcopy printouts (which is painful). On Linux, I've yet to find a good way to annotate PDFs, so hardcopy is actually the most useful. As each paper is assessed, I use different colored highlighters to mark the most relevant bits, particularly references that I want to chase (which, for example, get marked with red highlighter). A quick assessment of the value of the paper is scrawled across the front page, along with its code. If it's determined to be irrelevant, a paper can be discarded at any point in the process.

Highlighted references are chased during another trawl. Each reference has to be entered by hand into Google Scholar, since it doesn't let you surf the reference chain directly. (MSR's fancy bits are Silverlight-based, so I've never used it much.) By this point, I'll know enough to guess whether other works by the same author might be relevant, so I'll do author-specific searches, or search for later papers by other authors that cite an interesting one.

Good surveys are of particular interest, if they can be found, as they're likely to have a high density of good references as well as to be cited by other researchers working in the same area. Often, I'll want to chase down a large proportion of the cited papers in a good survey. If particular conferences or journals are found that are highly relevant, slogging through the ToCs on the publisher's website is often another way to find useful connections.

I prefer an assembly-line approach: I don't want to actually read papers while trawling; I don't want to chase references while reading.

If I click on a paper title in the search results, the most important thing I want to see on the next screen is not the paper title; it's everything else about the paper that will let me figure out how much additional attention it's worth to me. If I've deliberately looked up the paper, that's when I want to surf a citation graph, or explore other works by the same author.

The process is very messy and only partially automatable. But any new search site would have to provide a lot of value relative to Google Scholar in order to result in a real improvement to the overall workflow.