| HN Mirror



	by ramraj07 2575 days ago
	Never heard anyone around me talk about self-similarity in feature sets for ML models. Could you explain some more?


SubiculumCode 2575 days ago

I think the poster might be indicating the case where training data are highly redundant...


claytonjy 2575 days ago

Right, it could be simple multicollinearity, or more complex relationships. Because RF is such a good first-try model, I often want to use it on feature sets I haven't carefully pruned, which can be dangerous if you're measuring the same underlying thing in multiple ways.
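A quick way to spot the simple multicollinearity case before fitting anything is to flag feature pairs with a very high pairwise correlation. The sketch below is only an illustration: the helper names, the toy data, and the 0.9 threshold are all assumptions, and it will not catch the "more complex relationships" mentioned above.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def redundant_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical toy data: the same height recorded in two units,
# plus an unrelated measurement.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "reaction_ms": [310.0, 250.0, 330.0, 270.0, 300.0],
}

pairs = redundant_pairs(features)
for name_a, name_b, r in pairs:
    print(f"{name_a} ~ {name_b}: r = {r:.3f}")
```

Only the two height columns get flagged here. Which member of a flagged pair to drop is still a judgment call that the correlation itself can't make.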


SubiculumCode 2574 days ago

Since it seems you know a bit of data science, may I ask you a quick question?

In my line of research I am frequently trying to use high-dimensional data, but with few examples (<100 per class). Thus methods like SVM are used. I've been thinking about how I might leverage my sample to artificially simulate new training examples via pairwise warping of images within each class, with the assumption that informative features will be preserved with warping. The training examples within each class are already quite variable, so I don't think a little increase in redundancy will hurt me much, but I am not sure.

Without knowing more concretely, do you have thoughts on such a strategy?

Data are 3D brain images and classes are disorder groups.


claytonjy 2574 days ago

this can be tricky, because it varies so much by domain. I imagine you have a good handle on the domain, so you can hopefully do a good job defining reasonable noise on each measure.

You can also try more generic upsampling techniques, like SMOTE, which should be easy from Python or R. It's never actually helped me, but I assume it's useful somewhere.
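For reference, the core idea of SMOTE is to pick a sample, pick one of its nearest neighbours from the same class, and create a new point somewhere on the line segment between them. Below is a hand-rolled sketch of that idea using only the standard library. The function name, parameters, and toy data are assumptions for illustration, and this is not the API of any particular package (imbalanced-learn, for example, ships a maintained implementation).

```python
import random
from math import dist  # Python 3.8+


def smote_like(samples, n_new, k=3, seed=0):
    """Make n_new synthetic points from one class's samples.

    Each new point is a random interpolation between a randomly chosen
    sample and one of its k nearest neighbours (Euclidean distance).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank every other sample by distance to the chosen one.
        others = [s for s in samples if s is not base]
        neighbours = sorted(others, key=lambda s: dist(base, s))[:k]
        partner = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> partner
        synthetic.append([b + t * (p - b) for b, p in zip(base, partner)])
    return synthetic


# Hypothetical 2-feature examples from a single class.
one_class = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_points = smote_like(one_class, n_new=6)
```

Because every synthetic point is a blend of two real ones, nothing falls outside the range already covered by the class, so this adds density rather than new information. Any synthetic points should also be generated only from the training split, so that copies of test examples don't leak into training.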

I suspect at some point you're going to need to take an axe to some of your inputs, preferably based on human priors rather than a sketchy feature-selection process.

SVMs are great, but once you get past linear boundaries there's enough tuning complexity that I'd rather spend that effort tuning a GBM. That's largely because of tooling though; I know there are modern SVM libs, but I haven't used them. Definitely try a random forest if you haven't!


SubiculumCode 2573 days ago

Thanks for your input.
