Hacker News
by ryohkyo 1716 days ago
Thank you for the reply. Most of my repos are currently in private, so it's difficult for me to point you to the code I wrote. However, I can list a simple example here.

Genetic algorithms usually deal with chromosomes. I have used both binary and hexadecimal chromosomes in the past and found binary chromosomes more flexible, especially when combined with bitwise operations.
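To illustrate why binary chromosomes pair well with bitwise operations, here is a minimal sketch of two standard genetic-algorithm operators, single-point crossover and bit-flip mutation, on 16-bit chromosomes stored as integers. The function names, the 16-bit width, and the choice of operators are assumptions for illustration; the comment above does not say which operators were actually used.

```python
import random

BITS = 16  # assumed chromosome width, matching the 16-digit example in this thread


def crossover(a: int, b: int, point: int) -> tuple[int, int]:
    """Single-point crossover: the two children swap their low `point` bits."""
    low = (1 << point) - 1            # mask for the low `point` bits
    high = ((1 << BITS) - 1) ^ low    # mask for the remaining high bits
    return (a & high) | (b & low), (b & high) | (a & low)


def mutate(chromosome: int, rate: float, rng: random.Random) -> int:
    """Flip each bit independently with probability `rate`."""
    for bit in range(BITS):
        if rng.random() < rate:
            chromosome ^= 1 << bit    # XOR toggles exactly one bit
    return chromosome
```

With hexadecimal strings the same operators would need parsing and re-encoding; on integers each one is a couple of mask and XOR operations.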

Take a chromosome made of four chunks of 4 binary digits, with each digit representing one true/false dimension; we end up with something like 1011 1100 0010 0101. I stored these four chunks in four documents in a NoSQL database. Each document also records the other 16-digit chromosomes, so that I can look up the 16-digit chromosomes that contain an exact or similar 4-digit chunk. This was the fastest method I could come up with the last time I worked on it; I am sure there are more efficient methods out there.
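The chunk lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: an in-memory dict stands in for the NoSQL documents, chunks are keyed by their position as well as their value, and only exact chunk matches are handled. `ChunkIndex`, `split_chunks`, and `candidates` are hypothetical names, not taken from the original code.

```python
from collections import defaultdict

CHUNK_BITS = 4   # digits per chunk
NUM_CHUNKS = 4   # chunks per 16-digit chromosome


def split_chunks(chromosome: int) -> list[int]:
    """Split a 16-bit chromosome into four 4-bit chunks, most significant first."""
    mask = (1 << CHUNK_BITS) - 1
    return [
        (chromosome >> (CHUNK_BITS * (NUM_CHUNKS - 1 - i))) & mask
        for i in range(NUM_CHUNKS)
    ]


class ChunkIndex:
    """Maps (position, chunk) to the chromosomes containing that chunk.

    Each dict entry plays the role of one database document (an assumption).
    """

    def __init__(self) -> None:
        self.docs: dict[tuple[int, int], set[int]] = defaultdict(set)

    def add(self, chromosome: int) -> None:
        # One write per chunk, so storing one chromosome costs four writes.
        for pos, chunk in enumerate(split_chunks(chromosome)):
            self.docs[(pos, chunk)].add(chromosome)

    def candidates(self, chromosome: int) -> set[int]:
        """Return stored chromosomes sharing at least one exact chunk at the same position."""
        found: set[int] = set()
        for pos, chunk in enumerate(split_chunks(chromosome)):
            found |= self.docs.get((pos, chunk), set())
        found.discard(chromosome)
        return found
```

A lookup touches at most four entries no matter how many chromosomes are stored, which is where the speed comes from; the cost is the four writes per insert.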

Hopefully, this can shed some light on how the genetic algorithm works.


Thanks for the extra detail; this sounds quite interesting. So basically the 16-digit chromosome records you added to each document would work as nearest neighbors? Did this satisfy all your search needs for an algorithm?
In concept, each 16-digit chromosome would work as a nearest neighbor, and finding “a range” of the nearest neighbors would be very fast.

From a data scientist’s perspective, this may seem like a hack. I would appreciate learning more from you about this. I can tell you that the weakness of this method is the multiple writes to the database. I assume a vector database could implement this with fewer writes.

In the past, I used this for supervised training and it yielded very good results. However, I think it would be inefficient in large-scale networks. I am planning to use Go + ClickHouse to improve performance in the next project.

Yes, for larger datasets of vectors, a search that does a linear scan will likely be slow. So you could take a look at KNN/ANN (k-nearest neighbors / approximate nearest neighbors) libraries like https://faiss.ai/. But if you prefer to offload this complexity to a database, you can pick and evaluate one from my blog post.
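For reference, the linear-scan baseline mentioned here could look like the sketch below: an exact k-nearest-neighbor search over binary chromosomes that uses Hamming distance and compares the query against every stored item. The function names and the choice of Hamming distance are assumptions for illustration. ANN libraries and vector databases exist to avoid this full pass over the data.

```python
import heapq


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two chromosomes differ."""
    return bin(a ^ b).count("1")


def knn_linear(query: int, population: list[int], k: int) -> list[int]:
    """Exact k-nearest neighbors by scanning the whole population.

    Every query computes one distance per stored chromosome, so the cost
    grows linearly with the dataset size.
    """
    return heapq.nsmallest(k, population, key=lambda c: hamming(query, c))
```

For example, `knn_linear(0b0000, [0b0000, 0b0001, 0b0111, 0b1111], 2)` returns the two chromosomes at distances 0 and 1 from the query.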

Four of the six DBs are open source, and the two commercial ones offer a managed service. All six can scale to quite large numbers of vectors.

My goal is to systematically study each DB through the lens of a specific search task, which will not be (only) text based.

If you think we could collaborate in some way on your dataset, that would be fantastic and probably a learning experience for both sides.

That would be great. I am wrapping up my current project and will start the next one soon. Is it OK for me to contact you on Twitter?
sure!