by BioGeek 556 days ago
> Also can we train this same model on regular language data so we can converse about the genomes?

Yes! That is what has been done in ChatNT [1], where you can ask natural language questions like "Determine the degradation rate of the human RNA sequence @myseq.fna on a scale from -5 to 5." and ChatNT will answer with "The degradation rate for this sequence is 1.83."

> My biggest point of confusion is what type of practical things these models can do.

See for example this notebook [2], where the Nucleotide Transformer is fine-tuned to classify genomic sequences into two of the most basic types of genomic motifs: promoters and enhancers.

Disclaimer: I work at InstaDeep but was not involved in either of the above projects.

[1] https://www.biorxiv.org/content/10.1101/2024.04.30.591835v2

[2] https://github.com/huggingface/notebooks/blob/main/examples/...

1 comment

Possibly a dumb question - but are these models useful for homology finding? If you have two homologous genes, do they have similar embeddings?

The reason I ask is that I have a bunch of genes where I can’t get much better than a 1:many orthology mapping. If this method can capture related promoters, intronic regions, etc. per gene and tell me whether they are related, that would be a huge help (assuming it works on eukaryotic genomes).
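
To make the question concrete, here is a minimal sketch of the comparison I have in mind: embed each sequence, then rank the candidate orthologs by cosine similarity to the query. Everything in it is an assumption for illustration. In particular, `toy_embed` is a hypothetical stand-in that uses normalized 3-mer frequencies, not the Nucleotide Transformer; a real test would swap in mean-pooled hidden states from the model. `rank_candidates` and the sequence names are made up as well.

```python
from collections import Counter
from itertools import product
from math import sqrt

K = 3  # k-mer length for the toy stand-in embedding (an assumption)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def toy_embed(seq):
    """Hypothetical stand-in for a model embedding: normalized k-mer frequencies.

    This is NOT the Nucleotide Transformer. Replace it with the model's
    pooled hidden states to test the actual question.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts[k] for k in KMERS) or 1  # avoid division by zero
    return [counts[k] / total for k in KMERS]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Rank candidate sequences (name -> sequence) by similarity to the query, best first."""
    q = toy_embed(query)
    scored = [(name, cosine(q, toy_embed(seq))) for name, seq in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    query = "ATGGCGTACGTTAGC"
    candidates = {
        "near": "ATGGCGTACGTTAGG",  # one substitution relative to the query
        "far": "TTTTTTTTTTTTTTT",   # shares no 3-mers with the query
    }
    for name, score in rank_candidates(query, candidates):
        print(name, round(score, 3))
```

If homologs really do sit close together in embedding space, the top-ranked candidate in a 1:many set would be the one to look at first. Whether that holds for promoters and intronic regions is exactly what I'm unsure about.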