| HN Mirror



	by npp 5581 days ago
	In an earlier comment, you said that much of the time was spent constructing the features (e.g. you had to implement CSS). Did you mean implementation time, or training/classification time? Your latest comment makes it sound like most of the time goes to downloading the page, while feature extraction is relatively fast. In any case, if feature extraction is taking too much time, one common approach is to dynamically select which features to extract for a test example based on both their expected predictive value (e.g. via mutual information or some other feature selection method) and the time it takes to actually compute them. The cost can be measured by, say, average computation time per feature on the training set. This can speed things up a fair bit, since you only compute the features you really need and are biased towards the ones that are quick to compute. This may not translate to your particular application, but if I remember correctly, I've seen it used a while back for image spam classification.
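To make the idea concrete, here is a rough sketch of cost-aware feature selection. It is only an illustration under some assumptions: features are discrete, the score is simply mutual information divided by average compute time, and selection is greedy under a time budget. The function names and the `budget` parameter are hypothetical, not taken from any particular system.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def select_features(values, labels, avg_cost, budget):
    """Greedily pick features by information per second until the budget is spent.

    values:   feature name -> list of discrete values on the training set
    labels:   list of class labels for the same training examples
    avg_cost: feature name -> average seconds to compute on the training set
    budget:   total seconds we are willing to spend per test example (assumed)
    """
    mi = {f: mutual_information(values[f], labels) for f in values}
    # Rank by predictive value per unit of compute time, best first.
    ranked = sorted(values, key=lambda f: mi[f] / avg_cost[f], reverse=True)
    chosen, spent = [], 0.0
    for f in ranked:
        if mi[f] <= 0.0:
            continue  # carries no information about the label; never worth computing
        if spent + avg_cost[f] <= budget:
            chosen.append(f)
            spent += avg_cost[f]
    return chosen
```

With a tight budget this keeps cheap, informative features and skips slow ones even when they are equally predictive. A more careful version would account for redundancy between features rather than scoring each one independently.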

1 comment

bravura 5581 days ago

Feature selection is an option, but it does not help much if all the features require the same preprocessing step.

My guess is that they need to render the page so they can determine the visual layout. So regardless of which visual features they use, the rendering step probably cannot be avoided.
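A toy cost model shows why. This is only a sketch with made-up names (`needs_render`, `render_cost`); it assumes the render is paid once whenever at least one selected feature depends on it.

```python
def extraction_cost(selected, per_feature_cost, needs_render, render_cost):
    """Total time to extract `selected`, paying the shared render at most once."""
    # The shared preprocessing step is charged if any chosen feature depends on it.
    shared = render_cost if any(needs_render[f] for f in selected) else 0.0
    return shared + sum(per_feature_cost[f] for f in selected)
```

Under this model, dropping one visual feature saves only its own small cost. The render cost disappears only when every feature that depends on it is dropped.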
