
https://douglas-fraser.com/datadata/

This will be the blog (at some point). The overall idea of the dissertation was to see whether combining different ways of processing the text of the reviews (classifiers using features from analysis of the grammar, vocabulary, etc.) into a custom heterogeneous ensemble works better than using one classifier with the traditional ensemble creation methods (AdaBoost, bagging, etc.). I figured creating a more holistic view of the text would be better; other studies have done this, but not to the extent I did. And I analyzed exactly why things did or did not work.
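
To make the heterogeneous-ensemble idea concrete, here is a minimal sketch, not the dissertation's actual code. It assumes three hypothetical "views" of a review (vocabulary, punctuation style, sentence length), trains one small classifier per view, and combines them by majority vote. The class names, features, and toy reviews are all made up for illustration.

```python
import math
import re
from collections import Counter


def words(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesVocab:
    """Vocabulary view: multinomial naive Bayes over word counts."""

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.priors = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(words(text))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def predict(self, text):
        n_docs = sum(self.priors.values())
        scores = {}
        for label in (0, 1):
            total = sum(self.counts[label].values())
            score = math.log(self.priors[label] / n_docs)
            for w in words(text):
                # Laplace smoothing so unseen words do not zero out a class.
                score += math.log((self.counts[label][w] + 1) / (total + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


class NearestMean:
    """Single-feature view: predict the class whose mean feature value is closest."""

    def __init__(self, feature):
        self.feature = feature

    def fit(self, texts, labels):
        self.means = {}
        for label in set(labels):
            vals = [self.feature(t) for t, y in zip(texts, labels) if y == label]
            self.means[label] = sum(vals) / len(vals)
        return self

    def predict(self, text):
        x = self.feature(text)
        return min(self.means, key=lambda label: abs(x - self.means[label]))


def exclamation_rate(text):
    """Style view (hypothetical feature): exclamation marks per word."""
    return text.count("!") / max(len(words(text)), 1)


def mean_sentence_length(text):
    """Rough grammar proxy (hypothetical feature): words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(words(text)) / max(len(sentences), 1)


class MajorityVote:
    """Heterogeneous ensemble: each member sees the text through its own view."""

    def __init__(self, members):
        self.members = members

    def fit(self, texts, labels):
        for member in self.members:
            member.fit(texts, labels)
        return self

    def predict(self, text):
        votes = Counter(member.predict(text) for member in self.members)
        return votes.most_common(1)[0][0]


# Toy data, invented for this sketch: 1 = fake review, 0 = genuine review.
TRAIN = [
    ("Amazing amazing product!!! Best purchase ever!!! Buy now!!!", 1),
    ("Best best best!!! Perfect!!! Everyone must buy this!!!", 1),
    ("Incredible!!! Life changing!!! Five stars!!!", 1),
    ("The battery lasts about six hours, which is less than advertised, but the screen is sharp.", 0),
    ("Arrived two days late and the box was dented, though the blender itself works fine.", 0),
    ("Decent headphones for the price; the cable feels flimsy and the bass is a little muddy.", 0),
]
texts, labels = zip(*TRAIN)

ensemble = MajorityVote([
    NaiveBayesVocab(),
    NearestMean(exclamation_rate),
    NearestMean(mean_sentence_length),
]).fit(texts, labels)

print(ensemble.predict("Wow!!! Best thing ever!!! Buy it!!!"))
print(ensemble.predict("The strap broke after a month, but customer service replaced it quickly."))
```

The contrast with AdaBoost or bagging is that those methods build many copies of one kind of learner on reweighted or resampled data, whereas here each member looks at a different aspect of the same text.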

So it was fundamentally just an exercise in NLP; I did not use other signals like the # of reviews submitted in one day. My gut says this general idea (a more holistic view) would apply to classifying other text, like fake news. But proving that is yet another project.

I still have a couple more angles (dependency and constituency parsing, framing) to add to the mix, so I'm not totally done. It will be a long series of blog articles. And I ended up having to deal with the problem of diversity vs. accuracy, so the dissertation went down a side road. My supervisor said it could be two potential papers for publication instead of one... At least I won't be bored for the next year.
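
On the diversity vs. accuracy point, a small sketch may help. It uses the pairwise disagreement measure, a common diversity statistic; that this is the measure used in the dissertation is an assumption. The predictions are invented toy values, chosen so that every classifier has the same individual accuracy.

```python
from collections import Counter


def accuracy(preds, truth):
    """Fraction of samples labelled correctly."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


def disagreement(preds_a, preds_b):
    """Fraction of samples on which two classifiers give different labels."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def majority_vote(*all_preds):
    """Per-sample majority label across classifiers (odd count, so no ties)."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*all_preds)]


# Invented toy labels and predictions; each classifier is 75% accurate.
truth      = [1, 1, 1, 0, 0, 0, 1, 0]
vocabulary = [1, 1, 0, 0, 0, 1, 1, 0]
grammar    = [1, 0, 1, 0, 1, 0, 1, 0]
clone      = list(vocabulary)           # makes the same mistakes as vocabulary
style      = [0, 1, 1, 0, 0, 0, 1, 1]   # makes different mistakes

redundant = majority_vote(vocabulary, grammar, clone)
diverse = majority_vote(vocabulary, grammar, style)

print(disagreement(vocabulary, clone), accuracy(redundant, truth))
print(disagreement(vocabulary, style), accuracy(diverse, truth))
```

With these toy numbers, the ensemble containing a redundant member is no better than its parts, while the ensemble whose members err on different reviews corrects every mistake. Real classifiers rarely separate this cleanly, which is where the trade-off comes from.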

Thanks for your interest! If you send me your email (dfraser@... is mine), I can send you the PDF, or pointers to other info about the research into fake reviews in general (e.g. using other signals like # of reviews/day); I'm not going to get the blog up soon - already dealing with an ML project for Network Rail here in the UK.