There's no margin of error for the test they did, just potential sampling bias.
Given the 5000 questions they asked, the system will provide the wrong answer 2.6% of the time. Every time, until they improve it. There's a chance that they managed to ask the only 150 questions that it doesn't know the answer to, but not a very big one.
I agree -- given the sample space, there is no effective way to calculate the margin of error. Even so, I don't think that there are many examples of non-deterministic mechanisms that produce the correct answer 97% of the time.
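For a sense of scale only: if the 5000 questions had been a random draw from the questions people would actually ask (an assumption, and exactly the one in doubt here), the usual normal-approximation margin for a proportion would apply. A rough sketch, where the function name and the 95% z-value of 1.96 are my own choices rather than anything from the test itself:

```python
import math

def sampling_margin(error_rate, n, z=1.96):
    """Half-width of a normal-approximation interval for a proportion.

    Only meaningful if the n questions were sampled at random,
    which is an assumption and not something the test establishes.
    """
    return z * math.sqrt(error_rate * (1.0 - error_rate) / n)

# Figures quoted in the thread: 2.6% wrong out of 5000 questions.
margin = sampling_margin(0.026, 5000)
print(f"2.6% +/- {margin * 100:.2f} percentage points")
```

That comes out to under half a percentage point, so with a sample this size any real uncertainty would come from how the questions were chosen, not from how many there were.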
A parser (assuming you are talking about a programming language parser) has the luxury of highly structured, deterministic inputs, and of being able to refuse to give an "answer" when they are not.