by oersted 777 days ago

This is a surprisingly big endeavour for what looks like an exploratory hobby project. Not to minimize the achievement, very cool, I'm just surprised by how much was invested into it.

They used 150 GPUs and developed two custom systems (db-rpc and queued) for inter-server communication, and this was just to compute the embeddings; there's a lot of other work and computation surrounding it.

I'm curious about the context of the project, and how someone gets this kind of funding and time for such research.

PS: Having done a lot of similar work professionally (mapping academic paper and patent landscapes), I'm not sure if 150 GPUs were really needed. If you end up just projecting to 2D and clustering, I think that traditional methods like bag-of-words and/or topic modelling would be much easier and cheaper, and the difference in quality would be unnoticeable. You can also use author and comment-thread graphs for similar results.
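For a rough sense of the bag-of-words baseline suggested above, here is a minimal sketch in pure Python: TF-IDF vectors over a few made-up post titles, compared by cosine similarity. The sample titles, the crude tokenizer, and the function names are all illustrative assumptions, not anything from the original project. A real pipeline would likely use a library and follow this step with a 2D projection and clustering.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        # Term frequency times a smoothed inverse document frequency.
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up titles, for illustration only.
docs = [
    "Show HN: A fast vector database written in Rust",
    "Benchmarking vector database performance in Rust",
    "Ask HN: How do you brew coffee at home?",
]
vectors = tfidf_vectors(docs)
sim_related = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The two database titles share several terms and so score higher than the unrelated pair, which only shares "HN". No GPU is involved at any point.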

3 comments

wilsonzlin 777 days ago

Hey, thanks for the kind words. I wasn't able to mention the costs in the post (might follow up in the future), but they were in the hundreds of dollars, so it was reasonably accessible as a hobby project. The GPUs were surprisingly cheap, and I scaled up mostly because I was impatient :) The entire cluster only ran for a few hours.

Do you have any links to your work? It sounds interesting and I'd like to read more about it.

oersted 777 days ago

"Hundreds of dollars" sounds a bit painful as an EU engineer and entrepreneur :), but I guess it's all relative. We would think twice about investing this much manpower and compute for such an exploratory project even in a commercial setting if it was not directly funded by a client.

But your technical skill is obvious and very impressive.

If you want to read more, my old bachelor's thesis is somewhat related, from when we only had word embeddings and document embeddings were quite experimental still: https://ad-publications.cs.uni-freiburg.de/theses/Bachelor_J...

I've done a lot of follow-up work in my startup Scitodate, which includes large-scale graph and embedding analysis, but we haven't published most of it for now.

gardenhedge 777 days ago

A golf membership can cost thousands of euros. Any hobby costs money.

wilsonzlin 777 days ago

Thanks for sharing, I'll have a read, looks very relevant and interesting!

b800h 777 days ago

As an EU-based engineer, you wouldn't do this: it's a massive GDPR violation (failure to notify data subjects of data processing), and the GDPR does actually have extraterritorial reach, although I somehow doubt that the information commissioners are going to come after OP.

stavros 776 days ago

Processing comments on a forum being a violation of the GDPR? That's crazy: the OP is neither the data controller (HN is) nor a data processor on behalf of the controller. If you post your data in public, it's not a GDPR violation for people to use it for things.

alchemist1e9 777 days ago

The author is definitely very skilled. I find it interesting they submit posts on HN but haven’t commented since 2018! And then embarked on this project.

As for funding and time, one possibility is that they are between jobs or ventures and it's self-funded, since they have had a financially successful career or business. They were very efficient with GPU utilization, so it probably didn't cost that much.

wilsonzlin 777 days ago

Thanks! Haha yeah I'm trying to get into the habit of writing about and sharing the random projects I do more often. And yeah the cost was surprisingly low (in the hundreds of dollars), so it was pretty accessible as a hobby project.

PaulHoule 777 days ago

(1) You could definitely use a cheaper embedding and still get pretty good results.

(2) I apply classical ML (say, a probability-calibrated SVM) to embeddings like that and get good results for classification and clustering, over 100x faster than fine-tuning an LLM.

Karrot_Kream 777 days ago

I didn't think the OP used LLMs? They did use a BERT-based sentiment classifier, but that's not an LLM.

My HN recommender works fine just using decision trees and XGBoost FWIW. I'll bet SVM would work great too.

PaulHoule 776 days ago

Some of the SBERT models now are based on T5 and newer architectures so there's not. The FlagEmbedding model that the author uses

https://huggingface.co/BAAI/bge-base-en-v1.5

is described as an "LLM" by the people who created it. It can be used in the SBERT framework.

I tried quite a few models for my RSS feed recommender (applied after taking the embedding) and SVM came out ahead of everything else. Maybe with parameter tuning XGBoost would do better but it was not a winner for me.

If you look at the literature

https://arxiv.org/abs/2405.00704

you find that the fashionable LLMs are not world-beating at many tasks and actually you can do very well at sentiment analysis applying the LSTM to unpooled BERT output.

Karrot_Kream 776 days ago

> Some of the SBERT models now are based on T5 and newer architectures so there's not. The FlagEmbedding model that the author uses

Oh thanks! Right, I had heard about T5-based embeddings but didn't realize they were basically LLMs.

> I tried quite a few models for my RSS feed recommender (applied after taking the embedding) and SVM came out ahead of everything else. Maybe with parameter tuning XGBoost would do better but it was not a winner for me.

XGBoost worked the best for me but maybe I should retry with other techniques.

> you find that the fashionable LLMs are not world-beating at many tasks and actually you can do very well at sentiment analysis applying the LSTM to unpooled BERT output.

Definitely. Use the right tool for the right job. LLMs are probably massive overkill here. My non-LLM based embeddings work just fine for my own recommender so shrug.

PaulHoule 776 days ago

Are you applying an embedding to titles on HN, comment full-text or something else?

When it comes to titles I have a model that gets an AUC around 0.62 predicting if an article will get >10 votes and a much better one (AUC 0.72 or so) that predicts if an article that got > 10 votes will get a comment/vote ratio > 0.5, which is roughly the median. Both of these are bag-of-words and didn't improve when using an embedding. If I go back to that problem I'm expecting to try some kind of stacking (e.g. there are enough New York Times articles submitted to HN that I can train a model just for NYT articles.)
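For readers unfamiliar with the AUC figures quoted above: AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.5 corresponds to random ranking. A minimal sketch of that pairwise definition, using made-up labels and scores (not the commenter's data or model; `roc_auc` is an illustrative helper name):

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 = the article got more than 10 votes, 0 = it did not.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.5, 0.2, 0.7]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

This pairwise loop is quadratic in the number of examples, which is fine for illustration; on a real dataset a rank-based implementation from a library would be the usual choice.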

Also, I have heard the sentiment that "BERT is not an LLM" a lot from commenters on HN, but every expert source I've seen seems to treat BERT as an LLM. It is in this category on Wikipedia, for instance

https://en.wikipedia.org/wiki/Category:Large_language_models

and

https://www.google.com/search?client=firefox-b-1-e&q=is+bert...

gives an affirmative answer in 8 cases out of 10, one of which denies it is a language model at all on a technicality that has since been overthrown.
