Hacker News new | ask | show | jobs
by IIAOPSW 1202 days ago
This is at best going to slightly improve the output. I asked about something hyper specific and also extremely trivial: transit wayfinding type queries specific to the NYC subway. All the semantics are services and stations. A service is nothing more than a list of stations it stops at and a station is nothing more than a list of services which stop at it. Eg if I tell it "I'm on the F, how do I get to Time Sq", it should cross reference the service/station lists, find the commonality, and tell me "switch to the A at W 4th".

No context about anything in the outside world or even about the physical nature and purpose of commuting is needed to comprehend and answer these questions. There is no external context or sophisticated logic beyond just matching those sorts of lists to each other. Its a wholly closed system of highly standardized tokens.

I ask it what stops the F and A have in common. 125th St is on the list. I know that's wrong. I ask it to list all the services that go to 125th. The F is not on the list (correctly so). I point out the inconsistency between the two outputs. It says sorry and does nothing. I tell it to remake the list of stops the F and A have in common. It is now missing a few stations from before, and has added new incorrect results.

This is just a snippet. I went in circles with it for probably an hour playing wack a mole with its inability to correctly recall more than 10 true details in a row. These were not obtuse, esoteric, or even logically complicated queries. Nor was it ambiguous stuff open to interpretation. Nor was it even something that you'd need a real meatspace body to comprehend, like the feeling of the sun on a summer day. This should have been a language models bread and butter.

1 comments

You're right, it seems particularly bad in navigation. I lead with some general queries on the NYC system, asked it for the common stops for service lines F and A, and it also hallucinated. The wiki for the service lines have pretty complete representations here, so this data should be suitably represented in its training corpus.

Would be interested in how a multimodal LLM such as PaLM-E (trained on maps, etc) fair in these sorts of queries.

https://www.reddit.com/r/MachineLearning/comments/11krgp4/r_...

I'm not hopeful that "seeing the map" (multimodal training) will make the difference everyone is hoping for. The transit map is going to be the exact same information as the lists, but coded in the learned design language of colored lines and dots. The words-only version should work just as well or better because it shortcuts the implicit OCR problem of trying to make it learn off the map. Indeed transit maps other than NY are often abstracted and have nothing to do with the underlying geography. So abstract representations (such as lists of words) should be fit for purpose.

Here's another one that fails spectacularly. The digits 0-9 drawn as an ASCII 7 segment display. It gets it mostly correct, but it throws in a few erroneous non-numbers and repeated/disordered/forgotten numbers. Asking it for ASCII drawings of simple objects can really go off the rails quickly.

The fault mode is very consistent. When a prompt forces it to be specific and accurate on an unambiguous topic for 10 or more line items, it will virtually always hallucinate at least one or two. Especially if the topic is too simple to hide behind a complex answer. Even if its learned not to hallucinate 90% of the time, and even if that's good enough to pass at first glance, within a list of 10 things it only has a 35% chance of not hallucinating any of them.

For what its worth, it did very well on law questions. Try as I might, it refused to accept that there's a legal category known as "praiseworthy homicide". Though, I suspect this has less to do with the underlying model and more to do with openAI paying special attention to profitable classes of queries.

I'm sorry to say, I think the problem maybe more intrinsic to the current approach to AI. Frankly, LLM works unreasonably beyond my expectations, but its making up for a lack of a real first principles theory of cognition with absurd amounts of parameters and training.