Hacker News
by thisduck 5348 days ago
Wouldn't the regex for "mac" also match "macintosh" and "imac" and inflate the numbers?
3 comments

You are all 100% correct about the regex. Before converting the product-identifying matching code to Python, I did it in bash using grep -iw to match whole words.

  for i in newton macintosh macbook ibook iie mac iphone ipod imac ipad II+ iigs LaserWriter osx 'apple ?tv' itunes '\]\[' imovie
  do
    # -w matches whole words only, -i ignores case; "s?" allows an optional plural
    echo "$i: $(egrep -wi "${i}s?" "$INPUTFILE" | wc -l)"
  done

But this was difficult to maintain. I wanted the ability to print a 'friendly' looking product name (the dict's key) and maintain the counts in a variable.

When I made the move from bash to Python, I knew there would be some overlap when I pushed this code (in the name of shipping!). I need to split the sentences into proper tokens and then check each token for a product match. I'm already splitting the sentences into tokens for part-of-speech tagging, so it shouldn't be difficult to do.
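A minimal sketch of the token-based approach described above, not the actual script: the PRODUCTS dict, its entries, and the count_products helper are hypothetical, and a simple regex stands in for whatever tokenizer the part-of-speech tagging step uses.

```python
import re
from collections import Counter

# Hypothetical mapping: friendly product name -> lowercase tokens that count as a mention.
PRODUCTS = {
    "Mac": {"mac", "macs"},
    "Macintosh": {"macintosh", "macintoshes"},
    "iMac": {"imac", "imacs"},
    "iPhone": {"iphone", "iphones"},
}

def count_products(text):
    """Count product mentions by comparing whole tokens, never substrings."""
    # Assumption: a crude tokenizer (runs of letters, digits, and '+') is good enough here.
    tokens = re.findall(r"[a-z0-9+]+", text.lower())
    counts = Counter()
    for token in tokens:
        for name, variants in PRODUCTS.items():
            if token in variants:
                counts[name] += 1
    return counts

sample = "My first Mac was a Macintosh Plus; the iMac came later. Macs changed my diplomacy."
print(count_products(sample))
```

Because each token is compared as a whole, "Macintosh", "iMac", and "diplomacy" no longer add to the "Mac" count, and the dict key doubles as the friendly name to print.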

tl;dr: known issue with the Mac regex; I needed to publish it and get back to work!

You can use '\b' which matches "word boundaries", so the regex would be something like "\bmac\b".
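For illustration, here is roughly how that could look with Python's re module. The MAC_WORD and count_mac names and the sample text are made up for this sketch; the optional "s" mirrors the plural handling in the bash version.

```python
import re

# \b asserts a position between a word character and a non-word character,
# so "mac" only matches when it stands alone as a word (optionally plural).
MAC_WORD = re.compile(r"\bmacs?\b", re.IGNORECASE)

def count_mac(text):
    """Return the number of standalone mentions of "mac" or "macs"."""
    return len(MAC_WORD.findall(text))

text = "I love my Mac. The iMac and Macintosh are different; Macs rule. My stomach hurts."
print(count_mac(text))
```

Only "Mac" and "Macs" are counted in the sample text; "iMac", "Macintosh", and "stomach" are skipped because the letters on either side of "mac" are word characters.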
Thanks, I'm going to update the code and re-run the numbers.
True, he should probably change the regex. Something like "^mac" would still match "macintosh", so "^mac\s" should be fine. I would probably use: ^[Mm]ac[\s.]+
Wouldn't it also match "machine", "stomach", "diplomacy", etc?
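A quick way to check which words each kind of pattern picks up. This is only a sketch: the word list and the labels are invented for the comparison, and each word is tested as its own string, so "^" anchors to the start of that word.

```python
import re

# Words mentioned in this thread, each tested as a separate string.
words = ["mac", "Mac.", "macintosh", "imac", "machine", "stomach", "diplomacy"]

patterns = {
    "substring": re.compile(r"mac", re.IGNORECASE),         # matches anywhere in the string
    "line-anchored": re.compile(r"^mac", re.IGNORECASE),    # matches only at the start
    "word-bounded": re.compile(r"\bmac\b", re.IGNORECASE),  # matches "mac" as a whole word
}

def matching_words(pattern):
    """Return the words in which the pattern is found."""
    return [w for w in words if pattern.search(w)]

for label, pattern in patterns.items():
    print(label, matching_words(pattern))
```

The plain substring pattern matches all seven words, the anchored one still accepts "macintosh" and "machine", and only the word-bounded pattern is limited to "mac" and "Mac.". Note that in running text "^" matches only at the start of the string (or of each line with re.MULTILINE), so an anchored pattern would also miss most mentions in the middle of a sentence.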