| HN Mirror


by iudqnolq 1174 days ago

Every time I take a careful look at someone's handpicked example of GPT getting it right there's at least one serious mistake.

> Florence, Italy: Florence is a beautiful city with a rich history and culture. It is only a 90-minute train ride from Rome, making it a convenient stopover point. Florence also has plenty of cafes, co-working spaces, and other amenities that are conducive to a workathon.

Mistake: The purpose is to split up a long journey. Florence would split it into a 1.5h day and a 10.5h day.

> Marseille, France: Marseille is a bustling port city on the Mediterranean coast of France. It is about a 3.5-hour train ride from Rome and has a vibrant cultural scene and a variety of co-working spaces and cafes. You could spend the day exploring the city's many museums, parks, and markets.

Mistake: Factually wrong about the travel time. Per Google Maps, GPT is off by a factor of 3.

Mistake: The purpose is to split up a long journey. Marseille would split it into a 10.5h day and a 2h day.

Mistake: Forgot about the workathon requirement.
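To make the "split up a long journey" point concrete, here is a minimal sketch that measures how lopsided each suggested stopover is. The leg times are the ones quoted in this comment (Rome to stopover, stopover to Montpellier); the `imbalance` helper is a hypothetical name used only for illustration, and real travel times vary with timing and transfers.

```python
# Leg times in hours, taken from the figures quoted in this comment:
# (Rome -> stopover, stopover -> Montpellier). Not timetable data.
legs = {
    "Florence": (1.5, 10.5),
    "Marseille": (10.5, 2.0),
}


def imbalance(first_leg, second_leg):
    """Absolute difference between the two travel days, in hours."""
    return abs(first_leg - second_leg)


for city, (first, second) in legs.items():
    print(f"{city}: {first}h + {second}h, imbalance {imbalance(first, second)}h")
```

An evenly split trip would have an imbalance near zero. Both suggestions leave one travel day more than eight hours longer than the other.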

1 comments

vidarh 1174 days ago

With GPT4, for me it suggests Rome -> Genoa, stay, then Genoa -> either Nice or Marseille, and then on to Montpellier.

GPT3.5 suggested a list of Florence, Nice, Genoa and Marseille. When I asked it for a breakdown of the travel times, it got them pretty close ("Can you give me rough travel times for each of these options?") given the large variations in travel time depending on specific timing and transfers for several of these.

When I then asked it for somewhere closer to the midway point, it didn't come through. It suggested Pisa, which is about as bad as Florence; Turin, which is an alternative to Genoa but doesn't fix the imbalance; and Lyon, which is about as bad as the Pisa option but in the opposite direction (fast train to Montpellier). In aggregate, though, these options also aren't much worse.

The problem here is that there aren't as far as I can tell any very balanced options. You can try to do e.g. Ventimiglia or Sanremo (which GPT4 gave when I pressed it on splitting it more evenly), which will give you a more evenly distributed travel time, but because the overall travel time will be longer they're not at all obvious options.

You also seem to focus on catching it out rather than getting a result. If you want precision from the first query, then you need to ask far more precise questions.

You need to treat it as a conversation, not a brief to be answered with a report. If you insist on treating it as a brief you will not get good results out of it. Your loss, in that case.

For cases where a simple lookup will work, it's a waste. Just use Google. For cases where there is in fact not a single perfect option, and where you need to weigh pro and con, it works well with the caveat that you do indeed need to be careful and check specifics. The same way I'd check specifics if I had a conversation about this with a friend.

With plugins, so it can verify precise details, we can expect a significant leap in capability here.

But even for now, the answers I got to this were ones I was happy with, and another reason for me to use it more.


iudqnolq 1174 days ago

I plugged that into whatever chat.openai.com is.

> You also seem to focus on catching it out rather than getting a result.

Nope. I have nothing against GPT. I find Copilot useful enough to pay for. I'm just sick and tired of people promoting it with obviously wrong examples.

The story of Clever Hans shows that humans are very good at convincing themselves something is smart when really they're subconsciously feeding it answers. So I do think validating GPT requires thinking a little adversarially rather than aiming to help it.

> The problem here is that there aren't as far as I can tell any very balanced options

Then that's exactly what GPT should have said here.

> For cases where there is in fact not a single perfect option, and where you need to weigh pro and con, it works well with the caveat that you do indeed need to be careful and check specifics. The same way I'd check specifics if I had a conversation about this with a friend.

If someone had a track record of doing rote work decently but messing up anything requiring critical thinking then I'd entrust them only with work appropriate to their skillset until they proved otherwise. That's exactly what I'm doing with GPT: I use it to do my busywork, but I'm not asking it for anything like travel advice until it gets quite a bit better.


vidarh 1173 days ago

I didn't suggest you have anything against GPT, but that the way you're interacting with it is counterproductive if you want it to be useful to you like the person you first replied to rather than finding flaws in it. It's easy to find flaws in it. E.g. I just had a lengthy "argument" with it about the right directions between two places in Nice, out of morbid curiosity about why it got the first question so wildly wrong. Conclusion: it knows of lots of roads. It does not yet know how most of them are connected, which neighbourhoods they are in, which direction they go, or how to route. This should be unsurprising, as that's unlikely to be in its training data. And that's fine, just don't use it for that.

In the example you picked apart, it got it close enough to be useful even though the answers have plenty of issues.

> The story of Clever Hans shows that humans are very good at convincing themselves something is smart when really they're subconsciously feeding it answers. So I do think validating GPT requires thinking a little adversarially rather than aiming to help it.

If the goal was to validate GPT, sure. But the goal above was not to validate GPT. The discussion was over whether it could be useful. That doesn't require "validating it". It just needs a rough understanding, more right than wrong, of which types of queries are productive in producing results that save us time without doing harm.

Yes, that means there are lots of applications where it's not suitable. That's fine.

> Then that's exactly what GPT should have said here.

I disagree. The question explicitly did not ask for that. It said it was too far to travel in one go, without explaining what the longest number of hours acceptable to travel in one day was. That a human might implicitly interpret it that way based on personal preferences might well be the case. But responding that there were no evenly split options would indicate a failure to carefully read the question. Explaining why the options did not split it evenly would be good (but GPT really would not be up to the job in this case).

Note that GPT still gets this plenty wrong, so I'm not suggesting it's up to scratch in this area.

Adding a constraint of no more than 8 hours per day, GPT3.5turbo (free ChatGPT) messes up (still suggests Florence, and gives the nonsensical suggestion of Avignon). GPT4 (paid ChatGPT only) suggests Genoa and Nice.

Lowering the threshold to 6 hours (which AFAIK is not possible), GPT3.5turbo gives the same broken set of options. GPT4 still suggests Genoa and Nice, wrongly claiming no more than 6 hours per day, so that is definitely a problem. In this case it should have said there's no way of doing that.

Trying to be more explicit about this ("We want to travel no more than 6 hours from Rome to the stopover, and no more than 6 hours from the stopover to Montpellier") does not help, so this is indeed a strong indicator that it struggles with this particular type of constraint and you shouldn't trust it on this subject other than to give ideas.

When thinking of what would make it get this right, it's not surprising: There likely aren't that many travel descriptions containing distances and travel times in its training data, and it can't read maps yet.
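Once leg times are known from a route planner, the per-leg cap discussed above is easy to check mechanically. A minimal sketch, assuming the leg times quoted in the first comment of this thread (the thread gives no times for Genoa or Nice, so they are left out); `feasible_stopovers` is a hypothetical helper, not anything GPT or a plugin provides:

```python
def feasible_stopovers(legs, max_hours):
    """Return stopovers whose two legs both fit within max_hours, sorted by name."""
    return sorted(
        city
        for city, (first, second) in legs.items()
        if first <= max_hours and second <= max_hours
    )


# (Rome -> stopover, stopover -> Montpellier) in hours, as quoted earlier
# in the thread. Illustrative only; real times depend on the timetable.
legs = {"Florence": (1.5, 10.5), "Marseille": (10.5, 2.0)}

for cap in (8, 6):
    options = feasible_stopovers(legs, cap)
    print(f"cap {cap}h:", options if options else "no stopover satisfies the constraint")
```

An empty result is the "there's no way of doing that" answer. The check itself is trivial; the hard part is getting leg times that are right.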

> If someone had a track record of doing rote work decently but messing up anything requiring critical thinking then I'd entrust them only with work appropriate to their skillset until they proved otherwise. That's exactly what I'm doing with GPT: I use it to do my busywork, but I'm not asking it for anything like travel advice until it gets quite a bit better.

That's exactly what I'm suggesting. Maybe with the extension that asking it "tell me how to do X" questions often works well, and that sometimes even questions where you know it'll mess up the details will give you enough ideas to go on. E.g. in this case, at least GPT4 gives reasonable options even though the travel times are messed up (when the plugins are opened up, hopefully this will improve significantly). That might not matter for a region you know, but for a region you don't, getting a list of cities to plug into route planners might still be worthwhile as long as they're more right than wrong.


iudqnolq 1173 days ago

> but that the way you're interacting with it is counterproductive if you want it to be useful to you like the person you first replied to rather than finding flaws in it

I can't speak to the answer the person I replied to got, because they didn't post it. If they got the same answer I got when asking the same question, it wouldn't have been useful to them.

It's only counterproductive if I'm wrong. If I'm right that it's not yet useful for this sort of thing I'd only waste my time giving it more chances.

> in this case, at least GPT4 gives reasonable options even though the travel times are messed up

The reasonableness of the options fundamentally depended on specific facts in this case. Mixing up a 3-hour train ride and a 12-hour train ride ruined the answer. So the answer I got from ChatGPT was fundamentally broken.
