Hacker News
by berto4 1507 days ago
yeah exactly my question. LDA is probabilistic and performs very well if you clean up the documents well. The approach using BERT seems pretty powerful given that you can now cluster based on semantics, not just word occurrences/frequencies as in LDA (though n-grams help). However, using a clustering approach would mean that each document belongs to a single topic, rather than being made up of multiple topics. But this is a cool idea nonetheless. [EDIT] quickly checked it out; it seems to use some kind of soft clustering, so documents can occur in many clusters (topics).
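
To make the hard vs. soft distinction concrete, here is a minimal sketch. It assumes each document is described by its distances to a few cluster centroids, and it uses a softmax over negative distances as a stand-in for soft clustering. The function names, the `temperature` parameter, and the example distances are all hypothetical, and this is not claimed to be what the linked method does internally.

```python
import math

def hard_assignment(distances):
    """Return the index of the nearest centroid: exactly one topic per document."""
    return min(range(len(distances)), key=lambda k: distances[k])

def soft_assignment(distances, temperature=1.0):
    """Softmax over negative distances: one weight per topic, summing to 1.

    This is an illustrative choice of soft membership, not a specific
    library's algorithm.
    """
    scores = [-d / temperature for d in distances]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical distances from one document to three topic centroids.
distances = [0.2, 0.3, 1.5]
print(hard_assignment(distances))   # picks the single closest topic
print(soft_assignment(distances))   # spreads weight over all topics
```

With hard assignment the near-tie between the first two topics is lost, while the soft weights keep both, which is closer in spirit to the per-document topic mixtures LDA produces.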

would it make sense to preprocess with a transformer-style model to produce per-document semantic vectors, which can then be piped into LDA to find topic mixtures of those vectors?
Is that not exactly what's happening in TFA?
if TFA means "the aforementioned article", i don't think so. i'm not convinced that the clusters found, and the frequencies in those clusters, would be the same as what LDA would find with gibbs sampling or the variational method. but i must admit it's been a while since i've played with this stuff.

if TFA is some other method, i am unfamiliar and would like to know more.

in my experience, while it's true that it's hard to score and verify these sorts of models, the hierarchical multinomial nature of LDA topic models makes it easy to generate data and then verify the fitting process by recovering the generative model parameters that were used to produce the test data. obviously this makes no sense for the BERT frontend, but a comparison of the differing backend clustering methods could be interesting.
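
A rough sketch of the data-generation half of that check, using only the standard library. It follows the usual LDA generative story (Dirichlet draws for topic-word and document-topic distributions, then a topic and a word per token). The function names, the hyperparameter defaults, and the corpus sizes are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_lda_corpus(n_docs, doc_len, n_topics, vocab_size,
                        alpha=0.5, beta=0.1, seed=0):
    """Generate word-id documents plus the true parameters that produced them."""
    rng = random.Random(seed)
    # One word distribution per topic: phi_k ~ Dirichlet(beta).
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(n_topics)]
    thetas, docs = [], []
    for _ in range(n_docs):
        # Per-document topic mixture: theta_d ~ Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]   # topic for this token
            w = rng.choices(range(vocab_size), weights=phi[z])[0]  # word from that topic
            words.append(w)
        thetas.append(theta)
        docs.append(words)
    return phi, thetas, docs

phi, thetas, docs = generate_lda_corpus(n_docs=100, doc_len=50,
                                        n_topics=3, vocab_size=20)
```

Fitting any LDA implementation on `docs` and comparing its estimated topic-word distributions against `phi` (up to a permutation of the topic labels) would then be the recovery check described above.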

Well, they're not supposed to be the same clusters. The reason people develop new methods is to surpass the old ones.

I'm just saying that the method described in the link seems to be exactly what you are describing: using document embedding vectors as input to soft clustering mechanisms akin to LDA. Of course it does not interface perfectly with the theoretical underpinnings of LDA because those are quite constrained to tf-idf (generally count-based) inputs.

As an aside, "TFA" translates to "the fucking article" and is a reference to the classic Internet acronym "RTFM" standing for "read the fucking manual". Both are passive-aggressive-cum-colloquial ways to imply that answers are in places you would expect to find them, if only you go to read the source.

i'm pretty sure the method mentioned in the article finds single topic assignments where LDA finds mixtures of topics.

hierarchical in LDA refers to the stacked multinomial nature of the model over word counts, documents and topics.

hierarchical in BERTopic means assuming and finding a hierarchical relationship between the topics themselves at cluster time.

they use the same word, but they appear to be very different things, at least to me.

I think he might mean term frequency analysis?
Never mind. I should have read what was said.