A couple of things here.

First, you are conflating the underlying ML system that scores log relevance with the GPT-3 summarizing system. ML is a good fit for identifying relevant logs for the reasons you describe, although, based on the examples on their website, I don't think characterizing this software as root cause identification is very accurate. But the value of summarizing a log line into natural language is low, while the cost of misleadingly characterizing that log line is high. Whoever needs to debug this system and find the real root cause (e.g. why did the system go OOM?) probably needs certainty more than convenience, and in all likelihood they will summarize what the log line says more correctly than GPT-3 does. (Obviously we don't know, since there is no evidence, but I don't work with any engineers whose ability to summarize the contents of a log line would be described as "mostly not misleading".)

Second, I can't agree with this sentence:

> AI excels at long-tail problems where the cost of failure is high, precisely because human failure is such an expensive problem in those cases

Maybe it depends on domain and tech, but in my experience humans don't fail on out-of-sample data nearly as often as AI does. When they do fail, the failure is often more predictable to other humans, and humans inherently have the ability to assign confidence levels to their conclusions, which you don't see in many AI models such as GPT-3. Humans are also more effective at applying rules (e.g. common sense) to improve predictions on out-of-sample inputs. I think of "AI is worse than humans at generalizing to out-of-sample data" as a widely held, well-evidenced belief, but I would be interested to hear if you disagree.

For me, the quintessential example is something like traffic light identification, where models generally struggle to identify unseen variants correctly while humans rarely do. What examples are you thinking of where AI excels at long-tail problems?