Hacker News new | ask | show | jobs
by throwaway12357 4031 days ago
> that’s disappointing, since a computer running the existing algorithm would take 1,000 years to exhaustively compare two human genomes.

I did a quick googling [1] and

> Our algorithm divides the problem into independent `quadrants' ...

> Our results show that our GPU implementation is up to 8x faster when operating on a large number of sequences.

It's still soul crushing. Why did our genome have to be that long :(

BTW, do you have numbers for setups with hundreds of GPUs?

I'm also left wondering about results using stochastic solutions. On how accuracy and problem size relate.

[1] http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=tru...

2 comments

I'm confused by this. 1,000 years seems a bit steep to me.

Suppose you have two files, `bob.genome` and `mary.genome`. Let's say they are 1gb each [1].

I think I can diff two 1gb files in less than 1,000 years.

diff(1) shows "deletions, insertions, and substitutions".

Therefor, I don't believe it. Yet. What did I miss?

1. http://stackoverflow.com/questions/8954571/how-much-memory-w... (Rounded up because Fermi estimation [2].)

2. https://what-if.xkcd.com/84/

> diff(1) shows "deletions, insertions, and substitutions".

diff(1) doesn't give you a _minimal_ set of edits to apply to go from one file to the other, just _a_ set of edits.

Also, I think he's picturing two almost-equal files. In that case the average running time should be way lower, no? (I believe the quadratic time is worst case)
> two almost-equal files. In that case the average running time should be way lower, no?

Yes, indeed! What you're thinking of is "output-sensitive" algorithms. There are some output-sensitive algorithms for edit distance. The fastest one I found is in "Improved Algorithms for Approximate String Matching" by Dimitris and Georgios Papamichail. They note:

"We designed an output sensitive algorithm solving the edit distance problem between two strings of lengths n and m respectively in time O((s-|n-m|)min(m,n,s)+m+n) and linear space, where s is the edit distance between the two strings."

http://arxiv.org/abs/0807.4368

Cool. The naive approach yields O(exp(s)*n), this looks like a nice improvement.
Yes, but complexity theorists are mostly interested in worst case analysis I believe (which I think is rather unfortunate) -- so the quoted numbers are probably what you get from plugging in 100,000^2 * dt or so.
It's also got a lower quadratic bound, it has to fill in the whole matrix of m*n cells, at least for the simple implementation. There may well be some optimisation for almost identical strings.
1,000 years? Really?

Isn't it more likely that somebody misquoted "slow, like a half an hour" as "slow, like a THOUSAND YEARS"?

  >>> 1000 / ((1024. ** 6) / 3e9 / 86400 / 365)
  82.0593593076069
The algorithm is quadratic in the input size. For a Gigabyte of data, that's 1024^6 operations. Dividing that by 3 * 10^9 operations/second (assuming a 3GHz CPU), 86400 (the number of seconds in a day), and 365 (the number of days in a year), we obtain the runtime (in years) assuming that comparing a single byte takes exactly one operation. Dividing 1000 by that number, we get ~82 operations to compare a single byte, and that doesn't look unreasonable.
They're quoting exponential (2^N), not quadratic (N^2) time.

If on some machine a quadratic-time algorithm took, say, a hundredth of a second to process 100 elements, an exponential-time algorithm would take about 100 quintillion years.

That is yet another section of the article, the thousand years clearly reference the edit distance, which is quadratic.
from the description of "a grid ... flooding diagonally" I think it's not comparing two 1gb strings it is comparing every substring of those strings

  for i := 1..bob'length loop
    for j := i..mary'length loop
      editdistance(substring(bob,1,i), substring(mary,i,j));
    end loop
  end loop
>BTW, do you have numbers for setups with hundreds of GPUs?

I saw a talk on it a while ago, I can only remember they were using CUDAlign and Smith-Waterman (the basic idea is the same). Doing some googling too this seems to be a reasonably recent work with GPUs and CUDAlign (DOI 10.1109/CCGrid.2014.18).

>I'm also left wondering about results using stochastic solutions.

Another talk, I think they were running Smith-Waterman too. The speculative part was during the traversal of the matrix to get the edit distance. It's not the most time consuming part of the algorithm. I got in late for the talk and I didn't get to hear what they did about filling the matrix in the first place, but I imagine they might have done something similar. I'm not very familiar with Smith-Waterman so I can't go into details.