You are all 100% correct about the regex. Before converting the product-matching code to Python, I did it in Bash, using grep -iw to match whole words:
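INPUTFILE=stevejobs_tribute.txt  # file to search
for i in newton macintosh macbook ibook iie mac iphone ipod imac ipad II+ iigs LaserWriter osx 'apple ?tv' itunes '\]\[' imovie
do
    # Count lines containing the product as a whole word, case-insensitively;
    # the optional trailing "s" also catches plurals.
    echo "$i: $(grep -Ewi "${i}s?" "$INPUTFILE" | wc -l)"
done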
But this was difficult to maintain. I wanted to be able to print a friendly-looking product name (the dict's key) and keep the counts in a variable.
When I made the move from Bash to Python, I knew there would be some overlap between matches when I pushed this code (in the name of shipping!). The proper fix is to split the sentences into tokens and then check each token for a product match. I'm already splitting each sentence into tokens for part-of-speech tagging, so it shouldn't be difficult to do.
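A minimal sketch of that token-based approach might look like the following. The dict contents, the crude `tokenize` helper, and `count_products` are made up for illustration; they are not the code from the actual script, and a real version would reuse the tokens already produced for part-of-speech tagging.

```python
import re
from collections import Counter

# Hypothetical mapping: friendly product name -> pattern for a single token.
# These entries are illustrative, not the dict from the original script.
PRODUCTS = {
    "Mac": r"macs?",
    "Macintosh": r"macintosh(es)?",
    "iPhone": r"iphones?",
    "iPod": r"ipods?",
}

# Compile once, case-insensitively.
COMPILED = {name: re.compile(pat, re.IGNORECASE) for name, pat in PRODUCTS.items()}


def tokenize(sentence):
    """Very rough word tokenizer (assumption: stands in for the POS-tagging tokenizer)."""
    return re.findall(r"[A-Za-z0-9]+", sentence)


def count_products(sentences):
    """Count product mentions, keyed by friendly name."""
    counts = Counter()
    for sentence in sentences:
        for token in tokenize(sentence):
            for name, rx in COMPILED.items():
                # fullmatch requires the whole token to match, so the
                # "Mac" pattern cannot fire on the token "Macintosh".
                if rx.fullmatch(token):
                    counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "My first Mac was a Macintosh Plus.",
        "I love my iPhone and both of my Macs.",
    ]
    for name, n in count_products(sample).items():
        print(f"{name}: {n}")
```

Because each token has to match a pattern in full, "Macintosh" is counted only under its own key, which removes the overlap without any anchoring tricks in the regex itself. Multi-word names such as "Apple TV" would need extra handling that this sketch leaves out.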
tl;dr: known issue with the Mac regex; I needed to publish it and get back to work!
True, he should probably change the regex. Something like "^mac" would also match macintosh, so "^mac\s" should be fine. I would probably use:
^[Mm]ac[\s.]+
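To make the difference between these variants concrete, here is a small sketch using Python's `re` module. The sample strings and the `matches` helper are invented for illustration, and the `\bmac\b` word-boundary pattern is an extra alternative added for comparison, not something proposed in the thread.

```python
import re

# Invented sample inputs for comparison.
SAMPLES = ["mac", "Mac mini", "macintosh", "my mac is old"]

prefix = re.compile(r"^mac", re.IGNORECASE)          # anchored prefix only
prefix_space = re.compile(r"^mac\s", re.IGNORECASE)  # prefix plus whitespace
whole_word = re.compile(r"\bmac\b", re.IGNORECASE)   # whole word, anywhere


def matches(rx, samples):
    """Return the samples in which the pattern is found."""
    return [s for s in samples if rx.search(s)]


if __name__ == "__main__":
    print(matches(prefix, SAMPLES))        # includes "macintosh"
    print(matches(prefix_space, SAMPLES))  # misses a bare "mac"
    print(matches(whole_word, SAMPLES))    # skips "macintosh", finds mid-sentence "mac"
```

Two things stand out: "^mac\s" drops macintosh but also misses a bare "mac" with nothing after it, and any "^" pattern only works if it is applied to individual tokens rather than whole sentences. Also note that inside a character class "|" is a literal character, so "[M|m]" would accept a pipe as well; "[Mm]" is the way to write "M or m".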