Hacker News
by uoaei 1507 days ago
Is that not exactly what's happening in TFA?

if TFA means "the aforementioned article", i don't think so. i'm not convinced that the clusters found, and the frequencies in those clusters, would be the same as what LDA fit with gibbs sampling or variational inference would find. but i must admit it's been a while since i've played with this stuff.

if TFA is some other method, i am unfamiliar and would like to know more.

in my experience, while it's true that it's hard to score and verify these sorts of models, the hierarchical multinomial nature of LDA topic models makes it easy to generate data and then verify the fitting process by recovering the generative model parameters that were used to produce the test data. obviously this makes no sense for the bert frontend, but a comparison of the differing backend clustering methods could be interesting.
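
a minimal sketch of that kind of recovery check, using only the standard library. everything here is an assumption for illustration: the two toy topics in `true_phi`, the hyperparameters, and the small collapsed gibbs sampler, which is a textbook version and not any particular library's implementation.

```python
import random

def dirichlet(alpha, rng):
    # sample a Dirichlet vector by normalizing independent gamma draws
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(phi, n_docs, doc_len, alpha, rng):
    # LDA generative process: a topic mixture per document,
    # then a topic and a word for every token
    n_topics, vocab = len(phi), len(phi[0])
    docs = []
    for _ in range(n_docs):
        theta = dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]
            doc.append(rng.choices(range(vocab), weights=phi[z])[0])
        docs.append(doc)
    return docs

def gibbs_lda(docs, n_topics, vocab, alpha, beta, n_iter, rng):
    # collapsed Gibbs sampler; returns estimated topic-word distributions
    ndk = [[0] * n_topics for _ in docs]          # doc-topic counts
    nkw = [[0] * vocab for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                           # tokens per topic
    z = []
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)           # random initial assignment
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the token, then resample its topic from the conditional
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    return [
        [(nkw[k][w] + beta) / (nk[k] + vocab * beta) for w in range(vocab)]
        for k in range(n_topics)
    ]

def best_match_error(true_phi, est_phi):
    # L1 distance from each true topic to its closest estimated topic,
    # which sidesteps the arbitrary ordering of fitted topics
    return [
        min(sum(abs(a - b) for a, b in zip(t, e)) for e in est_phi)
        for t in true_phi
    ]

rng = random.Random(0)
# hypothetical ground truth: two topics over a six-word vocabulary
true_phi = [
    [0.30, 0.30, 0.30, 0.03, 0.03, 0.04],
    [0.03, 0.03, 0.04, 0.30, 0.30, 0.30],
]
docs = generate_corpus(true_phi, n_docs=50, doc_len=30, alpha=0.1, rng=rng)
est_phi = gibbs_lda(docs, n_topics=2, vocab=6, alpha=0.1, beta=0.01, n_iter=50, rng=rng)
errors = best_match_error(true_phi, est_phi)
print(errors)
```

small errors would suggest the fit recovered the generating topics; how small counts as "recovered" is a judgment call that depends on corpus size and how well separated the topics are.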

Well, they're not supposed to be the same clusters. The reason people develop new methods is to surpass the old ones.

I'm just saying that the method described in the link seems to be exactly what you are describing: using document embedding vectors as input to soft clustering mechanisms akin to LDA. Of course it does not interface perfectly with the theoretical underpinnings of LDA, because those are quite constrained to count-based (bag-of-words) inputs.

As an aside, "TFA" translates to "the fucking article" and is a reference to the classic Internet acronym "RTFM", standing for "read the fucking manual". Both are passive-aggressive-cum-colloquial ways to imply that answers are in places you would expect to find them, if only you would go and read the source.

i'm pretty sure the method mentioned in the article finds single topic assignments where LDA finds mixtures of topics.
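
a toy illustration of that difference, not either library's api. the scores are made up, and the softmax is only a stand-in for "some mixture over topics"; LDA itself gets its mixture from the per-document topic distribution, not from a softmax.

```python
import math

def soft_assignment(scores):
    # turn per-topic scores into a mixture that sums to 1 (softmax)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_assignment(scores):
    # keep only the index of the best-scoring topic
    return max(range(len(scores)), key=lambda k: scores[k])

# hypothetical scores of one document against three topics
scores = [2.0, 1.5, -1.0]
mixture = soft_assignment(scores)   # weight spread over several topics
label = hard_assignment(scores)     # a single topic id
print(mixture, label)
```

the hard label throws away the fact that the second topic scored nearly as well as the first, which is the information a mixture keeps.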

hierarchical in LDA refers to the stacked multinomial nature of the model over word counts, documents and topics.
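
for reference, the usual way to write that stacked structure down (the smoothed LDA generative model). the symbols are my own choice here, not something from the article: K topics, D documents, N_d tokens in document d, and Dirichlet hyperparameters alpha and beta.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

each level feeds the next: topic-word distributions and document-topic mixtures at the top, a topic per token in the middle, the observed word at the bottom.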

hierarchical in bertopic means assuming and finding a hierarchical relationship between the topics themselves at cluster time.

they use the same word, but they appear to be very different things, at least to me.

I think he might mean term frequency analysis?
Never mind. I should have read what was said.