| HN Mirror


by ryohkyo 1716 days ago

Thank you for the reply. Most of my repos are currently in private, so it's difficult for me to point you to the code I wrote. However, I can list a simple example here.

Genetic algorithms usually deal with chromosomes. I have used binary and hexadecimal chromosomes in the past and found that binary chromosomes are more flexible, especially with bitwise operations.
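To make the point about bitwise operations concrete, here is a minimal sketch, assuming a 16-bit chromosome stored as a plain Python integer. The helper names (`single_point_crossover`, `flip_bit`) and the choice of crossover and mutation as the operations are illustrative assumptions, not something taken from the private repos.

```python
CHROMOSOME_BITS = 16  # assumed chromosome length for this sketch


def single_point_crossover(a: int, b: int, point: int) -> int:
    """Child keeps the high bits of `a` and the low `point` bits of `b`."""
    full_mask = (1 << CHROMOSOME_BITS) - 1
    low_mask = (1 << point) - 1
    high_mask = full_mask ^ low_mask
    return (a & high_mask) | (b & low_mask)


def flip_bit(chromosome: int, position: int) -> int:
    """Mutation: toggle one true/false dimension (position 0 is the lowest bit)."""
    return chromosome ^ (1 << position)


parent_a = 0b1011110000100101
parent_b = 0b0000111111110000
child = single_point_crossover(parent_a, parent_b, point=8)
mutant = flip_bit(parent_a, 0)
print(format(child, "016b"), format(mutant, "016b"))
```

With hexadecimal strings the same operations would need parsing and re-encoding, which is probably where the flexibility difference shows up.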

Let's say a chromosome has four chunks of 4-digit binary, with each digit as a true/false dimension; we end up with something like 1011 1100 0010 0101. I then stored these four chunks in four documents in a NoSQL database. Each document also holds records of the other 16-digit chromosomes, so that I can look up the 16-digit chromosomes that contain an exact or similar 4-digit chunk. This was the fastest method I could come up with the last time I worked on it; I am sure that there are more efficient methods out there.
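A rough sketch of that chunk lookup, with an in-memory dict standing in for the NoSQL documents. Everything here is an assumption made for illustration: the `ChunkIndex` class, keying each document by chunk position plus chunk value, and treating "similar" as a small Hamming distance between 4-bit chunks.

```python
from collections import defaultdict

CHUNK_BITS = 4   # digits per chunk
NUM_CHUNKS = 4   # chunks per 16-digit chromosome


def split_chunks(chromosome: int) -> list:
    """Split a 16-bit chromosome into four 4-bit chunks, leftmost chunk first."""
    mask = (1 << CHUNK_BITS) - 1
    return [(chromosome >> (CHUNK_BITS * i)) & mask
            for i in reversed(range(NUM_CHUNKS))]


class ChunkIndex:
    """Hypothetical stand-in for the document store described above."""

    def __init__(self):
        # One "document" per (chunk position, chunk value); it lists every
        # chromosome that contains that chunk at that position.
        self.documents = defaultdict(set)

    def add(self, chromosome: int) -> None:
        # One write per chunk, so four document writes per chromosome.
        for position, chunk in enumerate(split_chunks(chromosome)):
            self.documents[(position, chunk)].add(chromosome)

    def candidates(self, chromosome: int, max_chunk_distance: int = 0) -> set:
        """Chromosomes sharing an exact (0) or similar (>0) chunk with the query."""
        found = set()
        for position, chunk in enumerate(split_chunks(chromosome)):
            for other in range(1 << CHUNK_BITS):
                if bin(chunk ^ other).count("1") <= max_chunk_distance:
                    found |= self.documents.get((position, other), set())
        found.discard(chromosome)
        return found


index = ChunkIndex()
for c in (0b1011110000100101, 0b1011000000000000, 0b0000000000000000):
    index.add(c)
print(index.candidates(0b1011110000100101))
```

Each lookup only touches a handful of small documents instead of scanning every chromosome, at the cost of writing every chromosome into four places.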

Hopefully, this can shed some light on how the genetic algorithm works.

1 comments

dmitrykan 1716 days ago

Thanks for the extra detail, this sounds quite interesting. So basically the 16-digit chromosome records you added to each document would work as nearest neighbors? Did this satisfy all your search needs for the algorithm?


ryohkyo 1716 days ago

In concept, yes: the 16-digit chromosomes would work as the nearest neighbors, and finding “a range” of the nearest neighbors would be very fast.

From a data scientist’s perspective, this may seem like a hack. I would very much appreciate learning more from you about this. I can tell you that the weakness of this method is the multiple writes to the database. I assume a vector database can implement this with fewer writes.

In the past, I have used this for supervised training and it yielded very good results. However, I think it would be inefficient in large-scale networks. I am planning to use Go + ClickHouse to improve performance in the next project.


dmitrykan 1716 days ago

Yes, for larger datasets of vectors, having the search do a linear scan will likely be slow. So you could take a look at KNN/ANN (k-nearest neighbors / approximate nearest neighbors) libraries, like https://faiss.ai/. But if you prefer to offload this complexity to a database, you can pick and evaluate one from my blog post.
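For reference, this is roughly what the linear-scan baseline looks like for binary chromosomes: a minimal sketch, assuming Hamming distance as the similarity measure and a plain Python list as the dataset. The function names are made up for illustration, and this is exact brute-force k-NN, not the ANN indexing that a library like Faiss provides.

```python
import heapq


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where the two chromosomes differ."""
    return bin(a ^ b).count("1")


def knn_linear_scan(query: int, population: list, k: int) -> list:
    """Exact k nearest neighbors; compares the query against every chromosome."""
    return heapq.nsmallest(
        k, population, key=lambda c: hamming_distance(query, c)
    )


population = [
    0b1011110000100101,
    0b1011110000100100,
    0b0000000000000000,
    0b1111111111111111,
]
nearest = knn_linear_scan(0b1011110000100101, population, k=2)
print([format(c, "016b") for c in nearest])
```

Every query costs one distance computation per stored chromosome, so the cost grows linearly with the dataset, which is the part an ANN index or a vector database tries to avoid.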

4 of the 6 DBs are open source, and the 2 commercial ones give you a managed service. All 6 can scale to quite large numbers of vectors.

My goal is to systematically study each DB through the lens of a specific search task, which will not be (only) text based.

If you think we could collaborate in some way on your dataset, that would be fantastic and probably a learning experience for both sides.


ryohkyo 1716 days ago

That would be great. I am wrapping up my current project and will start the next project soon. Is it OK for me to contact you on Twitter?


dmitrykan 1715 days ago

sure!
