Hacker News
by zubyak 2594 days ago
Maybe it's off topic, but anyway:

I'm a CS student, and for my thesis I'll be working on an NGS C++ application. I need at least a brief introduction to "basic" sequencing, but I'm struggling to find something accessible. Every book I find seems superspecialized. Right now I'm reading "Insect Molecular Genetics: An Introduction to Principles and Applications", but I'd like to read just a book chapter that is a little more advanced than the contents shown in this video: https://youtu.be/ONGdehkB8jU

Any suggestions?

4 comments

I studied Biochemistry/Comp Sci, and the foundational biochemistry book, imo, is Lehninger Principles of Biochemistry. It goes over the basic biochemistry, and once you understand that, things just start to “make sense.” Once you have those basics you can read the Wikipedia article and things start to click.

On the other hand, as a person who’s worked on sequencing software I’ve found the biochemistry knowledge to only be incidentally useful - though I may be underestimating some of the “basic” assumptions that were used day to day.

>as a person who’s worked on sequencing software I’ve found the biochemistry knowledge to only be incidentally useful

I have the same feeling but I'm uncomfortable working on something knowing so little about it. I'll check out the book, thanks!

On the practical side, if you're working at a low level with NGS data, htslib[1] may be worth looking into. It is a C library for reading, writing, and manipulating data structures that are commonly used in NGS (BAM, VCF, etc.). I have used it and can attest to its quality. However, as is the issue with all software related to genomics, its only documentation is its header files and example programs. Here is the very example I used to get started[2]. The comments in the header files are usually good enough.

The reason I'm recommending it is the quality of its interfaces. It can seamlessly handle (input or output) virtually any kind of file you throw at it (SAM, BAM, CRAM). I can't say the same for a lot of other software I have run into in this space.

[1]: https://github.com/samtools/htslib

[2]: https://gist.github.com/PoisonAlien/350677acc03b2fbf98aa

That video describes the process used before NGS was around. These days, using anything with plasmids would be pretty unusual.

There are several next generation sequencing technologies:

1) short read - Illumina - dominates most next-generation sequencing
2) long read - nanopore or PacBio

These have very different analysis methods, very different measurement errors, and even different file formats.

Short read is far more common, so you're probably in the "Data Analysis" part of this video:

https://www.youtube.com/watch?v=fCd6B5HRaZ8

But you need to know about the adapters and indices (how multiple samples can be sequenced at the same time).
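To make the index idea concrete, here is a minimal C++ sketch of assigning a read to a sample by its index sequence. Everything in it is an assumption for illustration: the function names, the index-to-sample map, and the mismatch tolerance are hypothetical, and real demultiplexing tools handle more cases (dual indices, quality filtering, and so on).

```cpp
#include <cstddef>
#include <map>
#include <string>

// Number of positions at which two equal-length sequences differ.
std::size_t hamming_distance(const std::string& a, const std::string& b) {
    std::size_t d = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) ++d;
    }
    return d;
}

// Hypothetical helper: returns the sample whose index is closest to the
// observed index, allowing up to max_mismatches sequencing errors.
// Returns an empty string when nothing is close enough or when two
// samples are equally close (ambiguous read).
std::string assign_sample(const std::string& observed_index,
                          const std::map<std::string, std::string>& index_to_sample,
                          std::size_t max_mismatches) {
    std::string best;
    std::size_t best_dist = max_mismatches + 1;
    bool tie = false;
    for (const auto& entry : index_to_sample) {
        if (entry.first.size() != observed_index.size()) continue;
        const std::size_t d = hamming_distance(entry.first, observed_index);
        if (d < best_dist) {
            best = entry.second;
            best_dist = d;
            tie = false;
        } else if (d == best_dist && d <= max_mismatches) {
            tie = true;
        }
    }
    return (best_dist <= max_mismatches && !tie) ? best : std::string();
}
```

Calling assign_sample with the index read for a fragment and a one-mismatch tolerance returns the sample name, or an empty string when the read cannot be assigned unambiguously.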

But as another commenter mentions, knowing some particulars about the project would really help us know what sort of tutorial would be appropriate. You'll also need to know about the biology of the application, in addition to understanding the sequencing technology.

The program will work on FASTQ files. The sequencing technology produces long reads.

As another commenter said, I don't need superdeep sequencing knowledge because my work will mostly be on the programming side (enhancing performance, not adding new functionality), but it could still be useful to have a clear picture of the process.

Thanks for your help

Unfortunately I don't have many long-read resources to share, but here's a short video about the process for the MinION nanopore sequencer for long reads:

https://www.youtube.com/watch?v=Wq35ZXyayuU

At about 1:30 there's a cartoon of the data signals that get processed into sequencing data.

It's been a while since I looked at long read data, but last time I did, the individual base calls in FASTQ files (A, C, G, T) had a fairly high error rate, and there were systematic biases in the errors, which made them harder to correct. Most of the processing of these data is trying to correct these errors, either by comparing against a known reference sequence or by sequencing many times.
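For a sense of how those per-base error rates show up in the file, here is a rough C++ sketch that reads a FASTQ record and converts its quality string into error probabilities. It assumes the common unwrapped four-line record layout and Phred+33 quality encoding; the struct and function names are made up for illustration and do not come from any particular library.

```cpp
#include <cassert>
#include <cmath>
#include <istream>
#include <sstream>
#include <string>

// One FASTQ record: header, bases, '+' separator, qualities (four lines).
struct FastqRecord {
    std::string id;       // header line without the leading '@'
    std::string bases;    // called bases (A, C, G, T, N)
    std::string quality;  // one ASCII-encoded quality character per base
};

// Reads the next record into rec. Returns false at end of input or if
// the record does not follow the assumed four-line layout.
bool read_fastq_record(std::istream& in, FastqRecord& rec) {
    std::string header, bases, separator, quality;
    if (!std::getline(in, header) || !std::getline(in, bases) ||
        !std::getline(in, separator) || !std::getline(in, quality)) {
        return false;
    }
    if (header.empty() || header[0] != '@' ||
        separator.empty() || separator[0] != '+' ||
        bases.size() != quality.size()) {
        return false;
    }
    rec.id = header.substr(1);
    rec.bases = bases;
    rec.quality = quality;
    return true;
}

// Assumes Phred+33: Q = ASCII code - 33, and P(error) = 10^(-Q/10).
double error_probability(char quality_char) {
    assert(quality_char >= 33);  // '!' is the lowest valid Phred+33 character
    const int q = static_cast<int>(quality_char) - 33;
    return std::pow(10.0, -q / 10.0);
}

// Expected number of miscalled bases in a read: the sum of the
// per-base error probabilities.
double expected_errors(const FastqRecord& rec) {
    double total = 0.0;
    for (char c : rec.quality) {
        total += error_probability(c);
    }
    return total;
}
```

Looping over read_fastq_record on a std::ifstream and printing expected_errors divided by the read length gives a quick per-read estimate of the error rate the base caller itself reports. It says nothing about the systematic biases mentioned above.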

What sort of sequencing data are you planning to process? Are you planning to re-implement algorithms used by bwa/samtools or come up with something on your own? NGS is a very specialized field, so it's very easy to get stuck in the weeds.