by georgemcbay 5443 days ago
"He found both reports were inaccurate (although NetAnalysis came up with the correct result), in part because it appears both types of software had failed to fully decode the entire file, due to its complexity. His more thorough analysis showed that the Web site sci-spot.com was visited only once — not 84 times."

How does that work? I mean, how do you examine what must basically be a log file (though perhaps in some binary format), come up with 84 hits but then realize it was only 1 hit and blame the problem on file complexity? Seems like such an issue would only result in underreporting, not overreporting. Where did the 84 number even come from?

2 comments

Here is an explanation from the maker of a competing tool[1]. It actually delves into the Mork file format with the data from the trial. There are a couple of 84s in the format and in the data, but what I think happened is that, because there is no "visitedcount" when you have only visited a site once, the tool took the value from a previous row (in this case, a MySpace page) and repeated it.

If that is truly what happened, the fix is to simply re-initialize the visitedcount to 1 between rows in case there isn't a visitedcount listed.
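To make that concrete, here is a minimal Python sketch of the suspected bug and the fix. Everything in it is an assumption for illustration: the `parse_visit_counts` function, the dict-per-row layout, and the row values are hypothetical and are not taken from either forensic tool or from the real history file.

```python
def parse_visit_counts(rows, reset_between_rows=True):
    """Return (url, visit_count) pairs from already-decoded history rows.

    Each row is a dict (hypothetical layout). The "visitedcount" key is
    missing when a site was visited only once, as described above.
    """
    results = []
    visit_count = 1
    for row in rows:
        if reset_between_rows:
            # The fix: start every row from the default of one visit.
            visit_count = 1
        if "visitedcount" in row:
            visit_count = int(row["visitedcount"])
        # Without the reset, a row lacking "visitedcount" silently
        # inherits whatever the previous row left behind.
        results.append((row["url"], visit_count))
    return results


# Illustrative rows only, not the actual trial data.
rows = [
    {"url": "a myspace page", "visitedcount": "84"},
    {"url": "sci-spot.com"},  # visited once, so no count is stored
]

buggy = parse_visit_counts(rows, reset_between_rows=False)
fixed = parse_visit_counts(rows, reset_between_rows=True)
print(buggy)
print(fixed)
```

With the reset turned off, the second row reports the count carried over from the first row. With the reset on, it falls back to 1.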

[1] http://wordpress.bladeforensics.com/?p=357

Mork, as in what a Netscape engineer once called "...the single most braindamaged file format that I have ever seen in my nineteen year career"?

http://en.wikipedia.org/wiki/Mork_(file_format)

Thanks for the link, the extra detail there is very helpful in understanding what the original newspaper article glossed over.

However (from your link):

"It is a plain text format which is not easily human readable and is not efficient in its storage structures. For example, a single Unicode character can take many bytes to store."

My faith in the competency of "digital detectives" is not fully restored...

Hopefully this is just another case of someone simplifying things to increase readability to a mainstream audience, but every time I read something like that related to CS/programming/IT I cringe in horror at all of the things I must have a horribly half-assed understanding of by not being an expert in that field and building what little knowledge I have on the subject from articles like these.

What is wrong with that quote? When discussing common criticisms of Mork, Wikipedia states: "The conflicting requirements gave Mork several suboptimal qualities. For example, despite the aim of efficiency, storing Unicode text takes three or six bytes per character."
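A rough Python sketch of where "three or six bytes" could come from, assuming Mork writes each byte outside printable ASCII as `$` followed by two hex digits. The `mork_escape` helper is hypothetical, and it ignores Mork's other escaping rules.

```python
def mork_escape(raw: bytes) -> str:
    """Escape bytes outside printable ASCII as $XX (assumed Mork-style)."""
    out = []
    for b in raw:
        if 0x20 <= b < 0x7F:
            out.append(chr(b))        # printable ASCII is stored as-is
        else:
            out.append(f"${b:02X}")   # one raw byte becomes three characters
    return "".join(out)


latin1 = mork_escape("é".encode("latin-1"))    # one raw byte
utf16 = mork_escape("é".encode("utf-16-be"))   # two raw bytes
print(latin1, len(latin1))
print(utf16, len(utf16))
```

Under that assumption, a character stored as a single byte costs three bytes on disk, and a two-byte character costs six.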

  $ grep 12.34.56.78 logfile | wc -l
  84

Maybe the complexity comes from there being 1 CSS file, 3 javascript includes, 58 images, and a number of AJAX calls on that HTML page?