by deepnotderp 3397 days ago
Out of curiosity, I thought that genetics was the domain of gpus?
4 comments

I did sequence-based bioinformatics back around 2006 or so.

Very few of the operations used GPU. Things may have changed since I was working there, but the work at the time wasn't suited for a GPU architecture.

The initial step was sequence cleanup, which is a hidden Markov model executed over a collection of sequences of varying length, so hard to parallelize. Sequence annotation is embarrassingly parallel on a per-library basis (each sequence can be annotated independently of the others), but the computational work is fuzzy string matching, which is once again hard to GPU-ize. Another major computational job was contig assembly, which is somewhat parallelizable (pairwise sequence comparisons), but once again involves fuzzy string matching, so not GPU-izable.
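
To make "fuzzy string matching" concrete, here is a minimal sketch using plain Levenshtein edit distance. This is an illustrative stand-in, not necessarily the algorithm those pipelines used; the function name is made up for the example. Note that each cell of the table depends on its left, upper, and upper-left neighbours, and the table size changes with every pair of sequences.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row.

    Illustrative only: real aligners use scoring matrices, gap
    penalties, and heuristics that are not shown here.
    """
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("ACGT", "AGT")` is 1 (one deleted base).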

So that's just sequence genetics. Don't know if GPUs are used in other areas.

Lots of cores, lots of threads, and lots of main memory. That was the key.

"Lots of cores, lots of threads, and lots of main memory. That was the key."

Very much this. Which is why I ended up theorycrafting that the AMD many-core CPUs would be so useful.

And still is ;) Partly because some key workloads just did not run well on GPUs due to lack of addressable memory. Lots of Amdahl's law getting in the way. Some of the key use cases required stupendously large memory machines (genome assembly using only short reads).
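
The "amdahls" remark presumably refers to Amdahl's law: if only part of a job can be accelerated, the untouched part caps the overall speedup. A minimal sketch follows; the function name and the example fractions are assumptions for illustration, not measurements from any real pipeline.

```python
def amdahl_speedup(parallel_fraction: float, accel: float) -> float:
    """Overall speedup when `parallel_fraction` of the runtime is made
    `accel` times faster and the rest is left unchanged (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if accel <= 0:
        raise ValueError("accel must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / accel)
```

With a hypothetical job where half the runtime is GPU-friendly, even a 100x faster kernel gives `amdahl_speedup(0.5, 100)`, a little under 2x overall.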

Then a lot of code is very branchy but massively parallel, which makes clusters of pure CPUs more flexible (important in research settings) and gives them higher utilization than mixed CPU/GPU clusters.

GPU code takes longer to get to market and requires more specialized skills than standard CPU-oriented programming. Late to market means you miss a whole wave of experimental methods from the lab, e.g. GPU short read aligners arrived when long reads started to come out of the sequencing lab, leading people to stop doing short reads, or at least stop doing pure short reads.

Secondly, quite a few of the key staff at the large research institutes had been burned by previous hardware acceleration attempts and were not going to throw money at it until it was market proven.

Bioinformatics tends to be cutting edge (the hemorrhaging kind) on the bio/lab tech side, yet production IT tends to balance that by doing the things we know, as we already have enough risks, i.e. focus on the algorithms and robustness, not on pure power.

Hmmm, isn't deep learning starting to pick up for genetics? No idea if it actually is, but everyone in DL seems to be talking about it, so I thought I'd ask someone actually in bioinformatics :)
I wish I could say, but it's been a good two+ years since I left the genetics company, so I've been in a different industry for a while. I would say there's probably plenty of room if people start taking more novel approaches that use more data, e.g. full microbiome analysis. Also, I was just a sysadmin, so I don't really know anything other than keeping systems running, so take what I say with a grain of salt.
I suspect DL will have a limited to modest role in the actual search / alignment part, and a lot more to do with the analysis part. This includes medical diagnosis, identifying regulatory patterns based on high throughput expression data, such stuff.

Not necessarily in comp. genetics / sequencing.. / the DNA stuff..

Xeon Phi tried to crack this nut and seems to have mostly failed so far.
I think there is plenty of room for GPU usage in bioinformatics, but there are some barriers that prevent it from gaining prevalence, such as cost vs. CPUs and lack of updates (for example, gpu-blast is still 1.1, blast is 1.2).
For most of the really time-consuming steps the speedup isn't spectacular; 1.6x is not worth the effort.

http://ce-publications.et.tudelft.nl/publications/1520_gpuac...

It's quite rare to find GPUs being used in genetics.
Is that because the workloads are fundamentally unsuitable for current GPU architectures or because no one has taken a good stab at it yet?

I know very little about computational genetics/biology but it sounds interesting.

AFAIK probably a bit of both. A majority of genetics/biology workloads are I/O bound (mapping, blast, etc.) and/or require a lot of memory (e.g. de novo assembly of a genome).

On the other hand, much bioinformatics software solves a specific scientific question and is usually written by people with a mostly non-computational background. They use higher level languages such as Python/Perl/R and often don't have the expertise or time to implement things for GPUs.

However, now that machine learning and deep neural network approaches are being picked up by the field, the workloads might change, and there are also frameworks that make it easier to leverage GPUs (Tensorflow, etc.).

> They use higher level languages such as Python/Perl/R and people often don't have the expertise or time to implement them for GPUs.

That's an interesting thought, has anyone ever attempted to get 'regular' programmers interested in this stuff as a 'game'/code golf kind of thing?

(Too many) years ago one of the programming channels I was active in got distracted for 3 weeks while everyone tried to come up with the fastest way to search a 10Mb random string for a substring, not in the theoretical sense but in the actual "how fast can this be done" sense. That was the point I found out that Delphi (which was my tool of choice at the time) had relatively slow string functions in its 'standard' library, and I ended up writing KMP in assembly or something equally insane. I got my ass handed to me by someone who'd written a bunch of books on C, but eh, it was damn good fun. It was also one of the first realizations I had of just how fast machines (back then) had gotten and just how slow 'standard' (but very flexible) libraries could be.
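
For anyone who hasn't met KMP (Knuth-Morris-Pratt), here is a minimal Python sketch of the idea, nothing like the assembly version mentioned above. The function names are made up for the example. The point of the algorithm is that the text index never moves backwards: on a mismatch, a precomputed table says how much of the pattern is still matched.

```python
def build_failure_table(pattern: str) -> list[int]:
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to a shorter prefix
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find(text: str, pattern: str) -> int:
    """Index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    table = build_failure_table(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

In practice Python's built-in `str.find` will almost certainly be faster than this pure-Python loop, which is rather the lesson of the story in reverse: here the 'standard' library is the fast one.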

Obviously the total scope of rewriting researchers' code would probably be far, far beyond that, but if they could define the parts they know are slow, along with their code and some sample data, I know a few programmers who would find that an interesting challenge.

Thanks for the response.

I don't think it is that no one has tried so much as that the workloads need the CPU architecture / are not easily parallelizable (as far as I understand). Comp bio in genetics is largely sequence alignment & search, which is still largely CPU / memory bound; but I don't understand programming enough to speculate whether developments in algorithms will allow GPUs to be used, because the problem itself is not parallelizable. I think of it as the difference between a supercomputer & a cluster..

(More than a decade ago, I struggled to / barely succeeded in building a Beowulf cluster; I am just amazed at how far both the hardware & the software tools have come..)

In other areas of comp bio though, GPUs I think are finding use. Protein folding, molecular dynamics. Also, with STORM & such: super resolution microscopy? I think increasingly, gpus will become important.

Also, whole cell simulations?

What you wrote about supercomputer vs cluster is quite right. Recently I attended an HPC meeting where we were the only DevOps of an HPC for a biological institute, and most of the other people were from physics & chemistry. They usually don't consider biology workloads to be High Performance Computing but rather big resource/data computing. The physics & chemistry guys run simulations using hundreds of thousands of cores and are mostly CPU bound. They use MPI, their nodes typically have no more than 64 GB, and they consider 120 GB of memory usage a lot. Biologists, on the other hand, hardly use MPI because they can just parallelize the workload at the data level (i.e. by sample or chromosome) and run the pieces independently on each node. For that reason high memory NUMA machines from SGI can also be found relatively often.
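
A toy sketch of what "parallelize at the data level" looks like in code: each chromosome is handed to an independent worker and nothing is communicated between workers, so no MPI is needed. Everything here is hypothetical (the function names, the GC-content "analysis", the toy sequences); on a real cluster the same split is usually done by a job scheduler, one job per sample or chromosome.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def gc_fraction(sequence: str) -> float:
    """Fraction of bases that are G or C (stand-in for real analysis)."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return gc / len(sequence)


def analyze_chromosome(item: tuple[str, str]) -> tuple[str, float]:
    """Process one chromosome with no reference to any other."""
    name, sequence = item
    return name, gc_fraction(sequence)


def analyze_all(chromosomes: dict[str, str], workers: int = 4,
                use_processes: bool = True) -> dict[str, float]:
    """Map the per-chromosome function over independent workers.

    Processes suit CPU-bound work; threads are offered only as a
    lightweight option for quick checks.
    """
    pool_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with pool_cls(max_workers=workers) as pool:
        return dict(pool.map(analyze_chromosome, chromosomes.items()))
```

Calling `analyze_all({"chr1": "GGCC", "chr2": "ATAT"})` from a script should sit under an `if __name__ == "__main__":` guard, since process pools may re-import the main module on some platforms.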

You are also right that some of the comp bio areas (CryoEM, protein folding, molecular dynamics) are well suited for GPUs

Thank you for your response, it was extremely interesting.

One of the nice things about HN is you get to look outside your own bubble (I mostly do Line of Business/SME stuff so this stuff isn't just outside my wheelhouse it's on the other side of the ocean).

FPGA type applications will probably pay way bigger dividends than GPU acceleration ever will.

GPUs excel at problems where you can apply exactly the same logic to lots of data in parallel. CPUs can handle branching cases, where each operation requires a lot of decisions, a lot better.

Sufficiently large FPGA chips could accelerate certain parts of the workflow, if not the whole thing, since they're extremely good at branching in parallel. This is why early FPGA Bitcoin implementations blew the doors off of any GPU solution, each round of the SHA hashing process can be run in parallel on sequentially ordered data if you organize it correctly.

I've heard that annually for a decade or so.

FPGAs run hot, don't have many transistors, have limited clock rates, and are a pain to program.

So yeah a "Sufficiently large" chip, a "sufficiently fast clock", and a "sufficiently well written app" could theoretically do well. Problem is in the real world they aren't and developers aren't targeting them.

CPUs and GPUs are a pain to program if you don't have the right tools. If it's tooling that's the huge impediment then maybe Intel's acquisition and (hopefully) tool realignment will help.

That the FPGAs use this proprietary and for all intents opaque binary format is not very helpful and is probably the biggest barrier.

In the past several years, quite a few developers have tried GPUs for analyzing bio-sequences, but found the speedup is modest. Good GPUs are expensive. It is usually better to put that money on CPUs or RAM.
//OT::

Your user name: a fan of the cre-lox system, or the enzyme itself?

Cool uid!

In my past life, I've used flp/frt & cre/lox; and studied mismatch repair enzymes. And topoisomerases.. :)