| HN Mirror


by gizmo686 4673 days ago

It's worth looking at the scale of the swear count. Linux has about 16629976 lines of code, and I'd estimate from the graph that it has about 370* swear words (excluding "penguin"). If you look at the second graph, that is less than 1 swear in 300000 lines.

I checked this on the source tree for 3.8.0. The numbers appear to be inflated by allowing the swear words to be part of other words.

For example, "shit" appears in 121 lines, but " shit " (with surrounding spaces) only appears in 10 lines. Looking at the offending lines, only one actual swear word is missed by requiring the surrounding spaces.
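A minimal sketch of why the two counts diverge, using made-up sample lines rather than the kernel tree (`sample_lines` and `count_lines_containing` are hypothetical names for illustration):

```python
# Hypothetical sample lines; none of these are taken from the kernel source.
sample_lines = [
    "/* this hardware is shit and we know it */",  # real swear, space on both sides
    "flushit(dev);",                               # substring inside an identifier
    "/* what a piece of shit. */",                 # real swear, followed by a period
    "int count = 0;",                              # no match at all
]


def count_lines_containing(lines, needle):
    """Count the lines that contain `needle` as a plain substring."""
    return sum(1 for line in lines if needle in line)


substring_hits = count_lines_containing(sample_lines, "shit")
spaced_hits = count_lines_containing(sample_lines, " shit ")
print(substring_hits, spaced_hits)
```

The plain substring search over-counts because of the identifier, while the space-delimited search drops the sentence-final swear, which is the kind of match that gets missed.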

"fuck" appears 29 times, all of which are some conjugation of the verb (and some lines have duplicates I'm not counting).

"crap" appears 161 times, 20 of which are part of "scrap".

"bastard" appears 17 times, 6 of which come from email addresses hosted at "lazybastard.org" and "you-bastards.com".

"penguin" appears 99 times, two of which are jokes.


kleiba 4673 days ago

If you want to check the various words in isolation, surrounding spaces might cost you some matches, e.g. at the end of a sentence ("It's a piece of shit.") or when followed by a comma. Also, did you ignore case ("Shit happens.")?

How about trying \b[Ss][Hh][Ii][Tt]\b and the like?
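Here is a rough sketch of that idea in Python, assuming its `re` module as a stand-in for whatever regex tool is actually used. The sample strings and the name `count_word_matches` are made up for illustration.

```python
import re

# \b is a word boundary. The explicit character classes and the
# re.IGNORECASE flag are two spellings of the same case-insensitive match.
EXPLICIT = re.compile(r"\b[Ss][Hh][Ii][Tt]\b")
IGNORE_CASE = re.compile(r"\bshit\b", re.IGNORECASE)

# Hypothetical sample text, not taken from the kernel source.
samples = [
    "It's a piece of shit.",   # end of sentence: matched
    "Shit happens.",           # capitalized: matched
    "well, shit, that broke",  # followed by a comma: matched
    "flushit(dev);",           # inside an identifier: not matched
]


def count_word_matches(pattern, lines):
    """Count every whole-word occurrence of the pattern across all lines."""
    return sum(len(pattern.findall(line)) for line in lines)


print(count_word_matches(EXPLICIT, samples))
print(count_word_matches(IGNORE_CASE, samples))
```

One caveat: \b treats the underscore as a word character, so a word glued to an identifier with underscores would not be counted.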


gizmo686 4673 days ago

There were few enough curse words that I manually checked the output of the search that did not require surrounding spaces. Regarding case sensitivity, it looks like I missed 12 instances of swearing because of that. Also, grep has a "-i" option, which makes it case insensitive.


archangel_one 4673 days ago

There is also a -w option, to match whole words only, which is generally better than adding spaces :)
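For anyone who wants to repeat the count without grep, here is a rough Python approximation of a recursive, case-insensitive, whole-word search that counts matching lines. The function name `count_matching_lines` is made up, and the lookarounds only approximate grep's whole-word rule (letters, digits and underscore count as word characters), so treat it as a sketch rather than an exact equivalent.

```python
import os
import re


def count_matching_lines(root, word):
    """Count lines under `root` that contain `word` as a whole word, ignoring case."""
    # The match must not be preceded or followed by a word character.
    pattern = re.compile(r"(?<!\w)" + re.escape(word) + r"(?!\w)", re.IGNORECASE)
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps odd encodings from aborting the scan.
                with open(path, encoding="utf-8", errors="replace") as handle:
                    total += sum(1 for line in handle if pattern.search(line))
            except OSError:
                # Skip files that cannot be read.
                continue
    return total
```

Calling `count_matching_lines("linux-3.8", "crap")` on an unpacked source tree (the directory name here is an assumption) would skip "scrap" but still count "Crap," and "crap.".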
