Hacker News
by Loughla 567 days ago
I just recently showed a group of college students how and why using AI in school is a bad idea. Telling them it's plagiarism doesn't have an impact, but showing them how it gets even simple things wrong had a HUGE impact.

The first problem was a simple numbers problem: two-digit numbers in a series of boxes. You have to add numbers together to make a trail from left to right, moving only horizontally or vertically. The numbers must add up to 1000 when you get to the exit. For people it takes about 5 minutes to figure out. The AI couldn't get it after all 50 students each spent a full 30 minutes changing the prompt to try to get it done. The AI would just randomly add numbers and either add extra at the end to make 1000, or just say the numbers added to 1000 even when they didn't.

The second problem was writing a basic one-paragraph essay with one citation. The humans got it done, even with researching a source, in about 10 minutes. After an additional 30 minutes none of the students could get AI to produce the paragraph without logic or citation errors. It would either make up fake sources, or would just flat out lie about what the sources said. My favorite was a citation related to dairy farming in an essay that was supposed to be about the dangers of smoking tobacco.

This isn't necessarily relevant to the article above, but if there are any teachers here, this is something to do with your students to teach them exactly why not to just use AI for their homework.

3 comments

My go-to to show people who don't understand its limitations used to be the old "how many Ms are there in the word 'minimum'" or something along those lines, but it looks like it's gotten a bit better at that. I just tried it with GPT-4o and it gave me the right number, but the wrong placement. In the past it's given it completely wrong:

>how many instances of the letter L are in the word parallel

The word parallel contains 3 instances of the letter "L":

    The first "L" appears as the fourth letter.
    The second "L" appears as the sixth letter.
    The third "L" appears as the seventh letter.
I wonder if that is simply a reflection of there being more training data out there about this "problem" and the model hoovering all that up and regurgitating it?
These kinds of letter tricks aren't great examples to use because they're fixable and might stop working at any moment, as you just discovered. Modern image generation models for instance not only present embeddings of the input but also have parameters for each individual letter in words, which is why the latest models are able to copy words into the images much better: they see the spellings encoded as input parameters, as well as the concept vectors.

Even things like citations aren't going to work long term. The models are just being trained not to emit URLs from memory but only from their context.

It's fine to show people the current state of things, but if you're trying to get students to understand why they shouldn't use AI the only technique that will work is this: "you will be examined in a controlled room where computers and phones aren't allowed". Nothing else will be effective over the medium to long term.

Here's QwQ-32 answering the same question. I'm including the entire answer here including the CoT:

> Let's see. The word is "parallel." I need to count how many times the letter "L" appears in it. Okay, spelling it out: P-A-R-A-L-L-E-L. So, starting from the beginning, first letter is P, not L. Then A, not L. R, not L. Another A, still not L. Then L, that's one L. Next letter is L again, so that's two L's. Then E, not L, and another L, making it three L's. Wait, but when I spelled it again, I saw P-A-R-A-L-L-E-L, and I counted three L's. But let me check again. Parallel: P-A-R-A-L-L-E-L. Yes, L appears three times. So, the answer is three.

> But, maybe there's a trick here. Is there any special consideration for the letter L in this word or in general? Sometimes questions like this might be testing for homophones or something, but I don't think so in this case. It's straightforward: count the L's in "parallel."

> Alternatively, perhaps it's a trick with the word itself. Is "parallel" spelled differently sometimes? I don't think so. It's always P-A-R-A-L-L-E-L. So, three L's.

> Wait a minute, is there a possibility that the word is misspelled in the question? If the word was "paralell," that would be four L's, but that's not the correct spelling. So, sticking with three L's.

> Another angle: is the question referring to uppercase or lowercase L? But in the word "parallel," all letters are lowercase unless it's at the beginning of a sentence, but the question just says "the word parallel," which is typically in lowercase. So, doesn't matter.

> Also, considering that "l" and "I" can sometimes be confused, but in this word, it's L's only, no I's.

> So, my final answer is three instances of the letter L in the word "parallel."

> Final Answer: 3

This AI is talking as if it is in the final stages of a multi-day interrogation by a very hush-hush and deniable branch of government. I give it another five minutes before it breaks down and will sell out its mother just to be allowed to sleep with the lights off.

I hope that this is not something I ever have to deal with. It's exhausting.

funniest thing I’ve ever read on HN
> If the word was "paralell," that would be four L's, but that's not the correct spelling.

Better but this is still a hallucination.

Shows nicely what's going on.

If you ask a human, they will answer 3. Sometimes they say 4. Or 2. That's it.

An LLM produces a text using an example it was trained on. They were trained with these elaborate responses, so that's what they produce.

Whenever ChatGPT gets something wrong, someone at OpenAI will analyse it, create a few correct examples, and put these on the pile for retraining. That's why it gets better: not because it is smarter, but because it's retrained on your specific test cases.

They probably have a letter counting tool added to it now that it just knows to call when asked to do this.

you ask it the number of letters and it sends those words off to another tool to count instances of L, but they didn't add a placement one so it's still guessing those.

edit: corrected some typos and phrasing.

Maybe we'll reach a point where the LLMs are just tool-calling models and don't really give their own reply.

There are only 5 tools it has available to call, and that isn't one of them. A GitHub repo (forgot the URL) stays up to date with the latest dumped system instructions.
I can't speak to all LLMs, but OpenAI has a built-in python interpreter. Assuming it recognizes the problem as "tokenization counting", it doesn't need a dedicated tool.
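
To make that concrete, here is a minimal sketch of the kind of snippet a code interpreter could run for the letter question. Nothing here is OpenAI's actual tooling; `letter_positions` is a hypothetical helper name, and the 1-indexed output is a choice made for human readers.

```python
from typing import List


def letter_positions(word: str, letter: str) -> List[int]:
    """Return the 1-indexed positions of `letter` in `word` (case-insensitive)."""
    target = letter.lower()
    # enumerate(..., start=1) gives human-friendly positions instead of 0-based ones
    return [i for i, ch in enumerate(word.lower(), start=1) if ch == target]


positions = letter_positions("parallel", "L")
print(len(positions), positions)  # 3 [5, 6, 8]
```

Run deterministically like this, both the count and the placement come out right: the L's are the fifth, sixth, and eighth letters, not the fourth, sixth, and seventh.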
How do we know they’re the real system instructions? If they’re determined by interrogating the LLM hallucination is a very real possibility.
they probably just forgot to tell it humans are 1 indexed and to do the friendly conversion for them.
Do you have a link to (or can you put here) that "numbers in boxes" problem?
I'm not them, but I think it's a variation of the subset-sum problem

If we modify the question to be "sum to 100" (to just seriously reduce the number of example boxes required) then given:

  | 50 | 20 | 24 |
  |  7 |  5 |  1 |
  | 51 | 51 | 51 |

the solution would be

  | [50] | [20] | [24] |
  |   7  | [ 5] | [ 1] |
  |  51  |  51  |  51  |

  | right | down  | win |
  | X     | right | up  |
  | X     | X     | X   |
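
A grid like this can also be checked mechanically. Below is a minimal brute-force sketch in Python, under assumptions that are not from the original lesson plan: the path may start in any box of the left column and exit from any box of the right column, no box is visited twice, and every number is positive (which is what makes the pruning valid). `find_path` is a hypothetical name.

```python
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]  # (row, column)


def find_path(grid: List[List[int]], target: int) -> Optional[List[Cell]]:
    """Depth-first search for a left-to-right path whose numbers sum to `target`.

    Moves are horizontal or vertical only and never revisit a box.
    Assumes all numbers are positive, so a running total above `target` is a dead end.
    """
    rows, cols = len(grid), len(grid[0])

    def dfs(r: int, c: int, total: int, path: List[Cell], seen: Set[Cell]) -> Optional[List[Cell]]:
        total += grid[r][c]
        if total > target:
            return None  # prune: the sum can only grow from here
        path.append((r, c))
        seen.add((r, c))
        if c == cols - 1 and total == target:
            return list(path)  # reached the right column with the exact sum
        for dr, dc in ((0, 1), (1, 0), (-1, 0), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in seen:
                found = dfs(nr, nc, total, path, seen)
                if found is not None:
                    return found
        # backtrack so other branches can use this box
        path.pop()
        seen.remove((r, c))
        return None

    for start_row in range(rows):
        result = dfs(start_row, 0, 0, [], set())
        if result is not None:
            return result
    return None


example = [
    [50, 20, 24],
    [7, 5, 1],
    [51, 51, 51],
]
path = find_path(example, 100)
print(path)
print(sum(example[r][c] for r, c in path))  # 100
```

The search is exponential in the worst case, which is fine for a classroom-sized grid but is also a reminder that this is a subset-sum flavoured problem rather than simple arithmetic.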
Correct. That but for 1000. You can build your own with any number of online tools.

I don't have a link because it's part of a lesson plan set behind a payment on teachers pay teachers.

Do you have a link to any of those online tools that can be used to generate a puzzle of equivalent complexity to the one that you've tested on?
The sum doesn’t even have to be 1000, just create a grid of random numbers, create a random path and find the sum. It’s an equivalent problem.
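
A minimal sketch of that recipe in Python, for anyone who wants to build their own. The function name `make_puzzle`, the 10 to 99 number range, and the simplification that the hidden path wanders up or down within a column and then steps right (never left) are all assumptions made for illustration, not how any particular online tool works.

```python
import random
from typing import List, Optional, Tuple

Cell = Tuple[int, int]  # (row, column)


def make_puzzle(rows: int, cols: int, seed: Optional[int] = None) -> Tuple[List[List[int]], List[Cell], int]:
    """Build a grid of random two-digit numbers plus one hidden left-to-right path.

    Returns (grid, path, target), where target is the sum along the path.
    """
    rng = random.Random(seed)
    grid = [[rng.randint(10, 99) for _ in range(cols)] for _ in range(rows)]

    r = rng.randrange(rows)  # random entry box in the left column
    path: List[Cell] = [(r, 0)]
    for c in range(cols):
        # wander vertically to a random row in this column...
        target_row = rng.randrange(rows)
        step = 1 if target_row > r else -1
        while r != target_row:
            r += step
            path.append((r, c))
        # ...then step right, unless this is already the exit column
        if c < cols - 1:
            path.append((r, c + 1))

    target = sum(grid[pr][pc] for pr, pc in path)
    return grid, path, target


grid, path, target = make_puzzle(4, 5, seed=1)
for row in grid:
    print(" | ".join(f"{n:2d}" for n in row))
print("Target sum:", target)
```

Nothing here guarantees the hidden path is the only one that reaches the target, so a solver (human or otherwise) may legitimately find a different valid trail.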
I refuse to believe that you did any of this with any of the latest models. Gemini and ChatGPT with search are both perfectly capable of producing decent essays with accurate citations. And the 4o model is extremely good at writing Python code that can accurately solve math and logic problems.

I asked 4o with search to write an essay about the dangers of smoking, along with citations and quotes from the relevant sources. NotebookLM is even better if you drop in your sources and don't rely on web search. Whatever you think you know about what AI is capable of, it's probably wrong.

---

Smoking remains a leading cause of preventable disease and death worldwide, adversely affecting nearly every organ in the human body. The National Cancer Institute (NCI) reports that "cigarette smoking and exposure to tobacco smoke cause about 480,000 premature deaths each year in the United States."

The respiratory system is particularly vulnerable to the detrimental effects of smoking. The American Lung Association (ALA) states that smoking is the primary cause of lung cancer and chronic obstructive pulmonary disease (COPD), which includes emphysema and chronic bronchitis. The inhalation of tobacco smoke introduces carcinogens and toxins that damage lung tissue, leading to reduced lung function and increased susceptibility to infections.

Cardiovascular health is also significantly compromised by smoking. The ALA notes that smoking "harms nearly every organ in the body" and is a major cause of coronary heart disease and stroke. The chemicals in tobacco smoke damage blood vessels and the heart, increasing the risk of atherosclerosis and other cardiovascular conditions.

Beyond respiratory and cardiovascular diseases, smoking is linked to various cancers, including those of the mouth, throat, esophagus, pancreas, bladder, kidney, cervix, and stomach. The American Cancer Society (ACS) emphasizes that smoking and the use of other tobacco products "harms nearly every organ in your body." The carcinogens in tobacco smoke cause DNA damage, leading to uncontrolled cell growth and tumor formation.

Reproductive health is adversely affected by smoking as well. In women, smoking can lead to reduced fertility, complications during pregnancy, and increased risks of preterm delivery and low birth weight. In men, it can cause erectile dysfunction and reduced sperm quality, affecting fertility.

The immune system is not spared from the harmful effects of smoking. The ACS notes that smoking can affect your health in many ways, including "lowered immune system function." A weakened immune system makes the body more susceptible to infections and diseases.

Secondhand smoke poses significant health risks to non-smokers. The ALA reports that secondhand smoke exposure causes more than 41,000 deaths each year. Children exposed to secondhand smoke are more likely to suffer from respiratory infections, asthma, and sudden infant death syndrome (SIDS).

Quitting smoking at any age can significantly reduce the risk of developing these diseases and improve overall health. The ACS highlights that "people who quit smoking can also add as much as 10 years to their life, compared to people who continue to smoke." Resources and support are available to assist individuals in their journey to quit smoking, leading to longer and healthier lives.

References

American Cancer Society. (n.d.). Health Risks of Smoking Tobacco. Retrieved from https://www.cancer.org/cancer/risk-prevention/tobacco/health...

National Cancer Institute. (n.d.). Harms of Cigarette Smoking and Health Benefits of Quitting. Retrieved from https://www.cancer.gov/about-cancer/causes-prevention/risk/t...

American Lung Association. (n.d.). Health Effects of Smoking. Retrieved from https://www.lung.org/quit-smoking/smoking-facts/health-effec...

Cleveland Clinic. (2023, April 28). Smoking: Effects, Risks, Diseases, Quitting & Solutions. Retrieved from https://my.clevelandclinic.org/health/articles/17488-smoking

American Cancer Society. (n.d.). Health Risks of Using Tobacco Products. Retrieved from https://www.cancer.org/cancer/risk-prevention/tobacco/health...

American Cancer Society. (n.d.). Health Benefits of Quitting Smoking Over Time. Retrieved from https://www.cancer.org/cancer/risk-prevention/tobacco/benefi...

You exemplify well a big problem with LLMs: people see accurate-enough output on some test question and take it as evidence that they can trust the output, to any extent, in areas they haven't mastered.
No, that's not really the case. I don't think you should trust LLM output at all, but I think in general it's closer to the level of reliability of wikipedia than it is to producing useless bullshit.

Which is to say that it's useful, but you shouldn't trust it without double checking.

It was ChatGPT, and it was two days ago. Believe it or not, that doesn't really change anything for me. I absolutely understand what AI is capable of, and a lot of it is really quite something.

But you can't trust it to be accurate. You just can't. Every model will absolutely make shit up at some point.

I liken it to working with a very bright 7 year old. It may sound like it knows what it's saying, and it may be able to spit out facts, but it's very ignorant about most of the world.

> Gemini and Chat GPT with search are both perfectly capable of producing decent essays with accurate citations

OK, but this quote from your essay:

> The National Cancer Institute (NCI) reports that "cigarette smoking and exposure to tobacco smoke cause about 480,000 premature deaths each year in the United States."

...that citation is wrong. It's not from the NCI at all; the NCI cited that figure, which came from another paper by the U.S. Department of Health and Human Services.

The essay doesn't have accurate citations. The model has regurgitated content and doesn't understand whether that content is from a primary source or has in turn come from a different citation.