Human vs AI
Sometimes a small challenge reveals more than expected. It starts as a simple puzzle, a way to stretch the mind, but then it shifts. It stops being just a game and begins to say something about how we think or how we don’t.
It began with one phrase: “artificial intelligence.” Twenty-two letters exactly, no extras or omissions. Could I rearrange those letters into a completely new, grammatically correct phrase that made sense? A clean anagram using only what was given.
This seemed like the kind of problem an AI should handle easily. Counting letters, solving constraints, generating patterns. This should be its strength. I tried it myself and found one solution. Then another, a third, a fourth, and later, a fifth. Each time, I double-checked letter counts and grammar. All were valid, using only the available letters.
Then I gave the same rules to a leading large language model, asking for one correct answer to the original puzzle. I expected it to find a solution quickly. These models generate novels, write code, draft legal documents. A 22-letter anagram should be straightforward.
But it failed. Not once or twice, but over a hundred times. Many attempts broke the basic rules in obvious ways. Letters appeared that didn’t belong. Others were repeated too often. Sometimes the AI repeated the exact same failed answer, ignoring corrections as if nothing had changed.
At first, I watched out of curiosity. Then with growing fascination. It wasn’t just that the AI was wrong. It was how confidently wrong it was. Each answer came with the same assured tone, presenting itself as correct even when it was clearly not. Unlike a human, the model didn’t pause to rethink or adjust. It guessed blindly, going in circles, even though the prompt explicitly told it to check its work before presenting an answer.
It might seem petty to take pleasure in this, but it revealed a real gap between human intelligence and what these systems do. We often hear AI is smarter than us: it writes faster, analyses better, processes more. In some ways, it’s true. It has vast stores of data and enormous speed. But it lacks something smaller, harder to measure. The ability to stop and say, “Something isn’t right.” The instinct to backtrack, rethink, and check the details, not because a rule demands it, but because it just feels off.
That instinct was key in this puzzle. I sensed when certain phrases felt heavy or thin. The best answers didn’t just meet the rules; they sounded like real language, phrases you might actually say. When one of the AI’s attempts sounded that fluent, it was usually a sign that it had been flexible with the letters it used.
Tight constraints force a kind of thinking you can’t brute-force. You hold possibilities in mind, not individually but together. Your brain runs the problem quietly in the background, and suddenly the right answer appears. It fits before you even check the letter count.
That’s not how AI works, not yet. It has no quiet inner voice saying, “We’ve tried that.” No memory that feels lived-in across a problem’s shape.
Though this was only a simple puzzle, it points to something bigger. Real-world problems have constraints, trade-offs, and ambiguity. They require more than calculation. They demand noticing when something’s wrong or almost right but not quite. Things we do naturally until we ask a machine to do them.
This isn’t a triumph for humans over machines, just a reminder that the way we think still matters. Not everything fits into data and logic. Sometimes the smallest puzzles reveal this most clearly. This time, the human solved it.
“Artificial intelligence” gives 22 letters to use, no more, no less. The rules are strict. Break them slightly and the whole thing falls apart.
Many of the AI’s attempts looked fluent at first glance. But counting revealed extra letters, overuse of some, or letters not in the phrase. Despite repeated corrections, the model didn’t adjust. It repeated failed phrases almost unchanged, as if forgetting they’d been ruled out moments before.
It wasn’t just wrong; it was persistently and confidently wrong. The mismatch between the assured tone and the flawed content was striking. The model presented each failure as a success, miscounting letters repeatedly. It showed no awareness and no revision.
The frustration wasn’t failure itself but repetition. The AI never paused to reconsider or learn. It churned out patterns, oblivious to feedback unless reminded every time. Every so often it suggested writing some Python code to brute-force an answer. I let it do this. It still got it wrong!
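For the record, a correct brute force is not hard to write. Below is a minimal sketch, in Python, of one way it could look. It assumes a plain-text word list at /usr/share/dict/words (an assumption; any file with one word per line would do), keeps only words whose letters fit inside the letter pool of “artificial intelligence”, and recursively searches for combinations that use every letter exactly once.

from collections import Counter
from pathlib import Path

TARGET = Counter("artificialintelligence")  # the 22-letter pool

# Assumed word list location; any plain-text list with one word per line will do.
words = {w.strip().lower() for w in Path("/usr/share/dict/words").read_text().splitlines()}

def fits(word, pool):
    # True if every letter of `word` is still available in `pool`.
    need = Counter(word)
    return all(pool[ch] >= n for ch, n in need.items())

# Crude pruning: alphabetic words of three or more letters that fit the pool.
# (This drops one-letter words such as "a", which some valid phrases need.)
candidates = sorted(w for w in words if w.isalpha() and len(w) >= 3 and fits(w, TARGET))

def search(pool, phrase, start=0):
    # Extend `phrase` one word at a time until the letter pool is used up exactly.
    if not pool:  # nothing left over: every letter used exactly once
        print(" ".join(phrase))
        return
    for i in range(start, len(candidates)):
        word = candidates[i]
        if fits(word, pool):
            search(pool - Counter(word), phrase + [word], i)

search(TARGET, [])

Even pruned like this, the search space is enormous, and almost everything it prints is grammatical nonsense. Deciding which combinations read like real language is the part you cannot brute-force.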
Working under the same rules, I found four valid solutions. Real phrases with real grammar. The first, “a tiger eel in a clinic lift,” felt odd but held together. Then “age lifter in a tile clinic,” more abstract, but all the words were correct. Third, “nice frantic agile Tillie,” looser and more descriptive. Fourth, “a legit fertile clinician,” possibly the cleanest in meaning and grammar.
Each used every letter exactly once. No cheating, no duplicates. I kept the letter counts in my head. I knew there were only two a’s, three l’s, one g. No spreadsheet needed.
That’s the difference. I held the structure in mind, sensed when something was off, and changed course. The AI couldn’t do that. It needed constant reminders. It forgot past mistakes and recycled errors.
It’s not about intelligence but navigating constraints with awareness. That still belongs to us.
The fifth anagram I found was “illicit large tie finance.” I quite like this one, picturing money launderers in kipper ties (I digress). This time, though, I did something different. I asked the language model to check if it was a proper anagram. The AI initially rejected it twice, stating it was not an exact anagram. It pointed to differences in letter counts and claimed the phrase didn’t match the original exactly.
I pressed again, asking it to carefully break down the letter counts. After going over it multiple times, the model eventually admitted that “illicit large tie finance” did indeed perfectly match the letter counts of “artificial intelligence.” It finally confirmed that the phrase was an exact anagram. I still had to check it myself, because along the way it had also claimed that “artificial intelligence” had four a’s!
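The check the model kept fumbling takes a few lines of Python. This is only a sketch of the obvious approach: strip the spaces, count the letters, compare the counts.

from collections import Counter

def same_letters(a, b):
    # True if the two phrases use exactly the same letters, ignoring spaces and case.
    return Counter(a.replace(" ", "").lower()) == Counter(b.replace(" ", "").lower())

print(same_letters("artificial intelligence", "illicit large tie finance"))   # True
print(same_letters("artificial intelligence", "illicit large ties finance"))  # False: one stray "s"

Nothing about this is hard for a computer. What is hard, for this kind of computer, is knowing that it needs to do it at all.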
This back-and-forth highlights a key point about AI’s limits. Even on a precise, countable fact, the model wavered, needed repeated prompting, and only reluctantly conceded. It wasn’t confident initially, even when the correct answer was right before it.
A model trained on trillions of words should solve a letter puzzle easily. The task involved no obscure facts, just rearranging letters and obeying counts.
It’s tempting to blame memory. Maybe the model lost track of instructions or past attempts. But the deeper problem is structural.
These models aren’t rule-followers. They generate words one after another based on patterns in data. They aim for plausible phrasing, not logical correctness.
For example, when asked how many letters were in “artificial intelligence,” answers varied: 19, 21, or 22. Sometimes the model used 24 letters despite the phrase having only 22. Letters like “g” disappeared in many answers. The letter “s” appeared regularly!
You see, LLMs don’t count or cross-check against fixed lists. They predict based on familiarity, even if it means breaking the problem’s rules. They don’t ask, “Have I used this letter already?” They just keep generating.
Even when I gave the correct letter counts clearly, mistakes returned repeatedly. The model didn’t hold onto rules. It imitates plausible output, not cumulative logic.
It recycled failed answers, reworded slightly but essentially the same. Humans avoid repeating mistakes immediately, but the model had no memory of errors unless reminded every time.
The fluency was deceptive. Some phrases sounded elegant, but a closer look revealed fundamental flaws. The model is rewarded for flow, not truth.
This reveals a fundamental limit. Large language models don’t reason. They guess. They are virtuosos of plausible phrasing but fail where tight internal logic and accuracy matter.
They simulate memory through recent input but don’t track what they’ve said across attempts. There is no real “last time” or “already tried.”
I solved the puzzle five times with no false starts, miscounts, or repetition. No notes or software, just a mental map of letter frequencies and a feeling for combinations that worked.
It was controlled creativity, bounded, aware, purposeful. I filtered bad options and tracked what had been tried. That’s something LLMs don’t do on their own.
Interestingly, when I asked the AI to rate my performance, it estimated I had an IQ of somewhere between 145 and 160. I mention this, not because I think it is true, but because the AI “recognised” the task’s cognitive demand. It “knew,” on some level, what it had failed to do. That it wasn’t really that “intelligent” after all.
This puzzle might seem trivial but reveals serious risks when AI is used in real-world settings.
AI repeats mistakes, forgets contradictions, and can’t hold constraints without repeated reminders. These failures become dangerous in contract review, financial audits, compliance, or medical analysis, where accuracy matters.
If AI can’t keep track of 22 letters, what happens with complex financial data? If it repeats errors in low-pressure tasks, how will it manage regulatory obligations? If it can’t cross-check outputs, how can we trust it for critical decisions?
Currently, AI shines in tasks with low stakes: drafting emails, summarising text, producing first drafts. It saves time when humans remain involved.
But where precision, structure, and traceability matter, AI alone isn’t enough. You need built-in guardrails, real-time checks, and clear escalation when errors occur. Human review at the end is not sufficient.
The problem isn’t memory alone. It’s reasoning. The model predicts next words by probability, not by logic or calculation. It doesn’t recognise contradictions or know when it strays.
When it fails, it doesn’t learn in the moment or across sessions. Each attempt is treated like the first. That works when stakes are low but not in domains where errors have costs.
This isn’t an argument against AI use but a call to use it wisely. Fluency doesn’t equal understanding. Confidence isn’t competence. Polished language can hide errors.
A 22-letter puzzle may be the clearest warning we get.
Using AI in important settings requires more than human oversight. Plugging in a model and trusting it to behave isn’t enough.
We need systems that verify as well as generate. Constraint-checking must run alongside output, not after. Escalation paths must be clear because errors will happen.
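In practice, that means a thin layer of ordinary code around the model. The sketch below is illustrative only: generate_candidate stands in for whatever model call you actually make (it is not a real API), and escalate_to_human stands in for whatever your escalation path is. The point is the shape of it: the verifier sits outside the model, runs on every output, and failure has a defined exit.

from collections import Counter
from typing import Optional

TARGET = Counter("artificialintelligence")
MAX_ATTEMPTS = 10

def is_valid(candidate: str) -> bool:
    # The hard constraint, enforced outside the model.
    return Counter(candidate.replace(" ", "").lower()) == TARGET

def generate_candidate(prompt: str) -> str:
    # Placeholder for a call to whichever language model you use.
    raise NotImplementedError

def escalate_to_human(reason: str) -> None:
    # Placeholder for your escalation path: a ticket, an alert, a review queue.
    print(f"Escalating: {reason}")

def solve(prompt: str) -> Optional[str]:
    for _ in range(MAX_ATTEMPTS):
        candidate = generate_candidate(prompt)
        if is_valid(candidate):  # the model's output is never trusted on its own
            return candidate
    escalate_to_human(f"no valid answer after {MAX_ATTEMPTS} attempts")
    return None

Prompting the model to check itself is not the same thing. The whole point is that the check does not depend on the model’s cooperation.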
We must stop confusing plausibility with precision. These models sound right but often are not.
The model’s memory problem is bigger than people think. It repeats mistakes without learning. Humans learn from missteps. AI needs external help to avoid falling into the same trap.
Constraints must be enforced outside the model, not hinted at in prompts. You cannot trust it to count or cross-check on its own. Many people are “selling” prompt-writing techniques that supposedly stop an AI from hallucinating, but this is, to some extent, snake oil. An AI can never really constrain itself.
This is a defence of human intelligence. Tasks demanding detail, memory, and logic remain better handled by people, at least for now.
AI cannot yet simulate disciplined, strategic reasoning. Until it can, humans hold the edge.
There’s much talk about AI copying human intelligence. In some ways, it’s true: pattern spotting, recall, mimicry. But in logic, memory, attention, and self-monitoring, the gap remains wide.
This was no contest, but if it were, the score would be clear: one human, five correct solutions, no errors, done in 10 minutes. The model: over 110 attempts, none correct, and more than 2 hours of repeated prompts and checks.
Language models don’t reason, don’t hold ideas, don’t re-evaluate. They produce plausible-sounding text but don’t think.
Use AI to assist and speed up, but don’t rely on it to replace thinking where rules must be followed and applied with care.
These tools are remarkable but lack one thing we take for granted: the ability to ask, “Is this right?”
They don’t count unless told, and then they are not accurate. They don’t remember unless reminded, and then they still forget. They don’t know when they’re wrong unless checked, and even then they will do the same thing again.
The human mind, despite its limits, still outperforms machines in reasoned, self-correcting thought.
So the next time someone tells you AI can do what your best people can, give them 22 letters. Ask them to get their AI to make a sentence. Then wait. Not because it proves anything final. But because it shows something real.
This wasn’t about beating a machine for sport. It was about seeing, with unusual clarity, the line between generation and cognition. A language model can echo brilliance. But it can’t simulate the kind of disciplined, strategic reasoning that people do all the time without thinking twice. And until it can, we still have the advantage.
That’s worth remembering.