No, LLMs Still Cannot Reason - Part II
In a previous article, I claimed LLMs cannot reason, and the Internet exploded. I was quickly reminded why I left Twitter early this year. Anyway, I received a lot of positive feedback and some pointed criticism. There were also many well-motivated but ultimately flawed or irrelevant counterarguments and some pretty good arguments that showed a lack of clarity in my explanation.
So, in the spirit of intellectual honesty, I want to write this short follow-up article to address the most relevant counterarguments and reiterate why, despite these, I still stand behind the basic claim that large language models, to this day, cannot truly reason. And so should you.
If you had some doubts about my previous article covering all the ground, I hope this article clarifies the arguments and rounds out the picture for you. But if you are already convinced of this claim, I still urge you to read this article because it may give you extra ammo to defend against some of the most common—and misplaced—counterarguments.
But first, let me get something straight. Whenever you argue about what LLMs can or cannot do online, you’ll get a flurry of LLM zealots who cannot, for the love of Turing, see past their infatuation with their latest toy. More often than not, these are the same people who, two years ago, wanted to fund their own private virtual country running entirely on memecoins.
This article is not for them. It is for you, the sensible, rational reader, who may or may not agree with me but who shares my goal of broadening our understanding and reaching common ground.
If you’ve read any article on this blog, you probably already know I’m a huge fan of Artificial Intelligence. I’m also a full-time researcher and scholar in this area. For this reason, I approach this topic with an absolute commitment to uncovering the truth about AI’s capabilities and limitations. While LLMs are definitely groundbreaking, they are not perfect. It is only by reasonable criticism that we can make them better.
Sadly, some prominent figures in the AI community also seem infatuated with this technology and are either ignorant of its limitations or simply hypocrites who want to sell you the last coke in the desert. And they steer the online narrative. But blindly buying into the glossy picture leads to an implicit acceptance of LLMs' intrinsic limitations rather than a critical assessment of where, when, and how LLMs should and, more importantly, should not be deployed.
Like all my previous articles, this one is not an attack on those who champion LLMs—I’m one of them! Instead, it invites open dialogue among readers eager to expand their understanding. Regardless of your feelings about AI, I aim to convince you that this technology has some fundamental limitations we need to address before letting it loose. There is a lot of work in that direction, but it is still insufficient. If this makes at least one of you interested in pursuing a career in making LLMs—and AI systems in general—more robust, trustworthy, and reliable, then we’ve all won!
Phew! That was a rant! With all of that out of my system, let’s get to the important part. I want to address three common misconceptions or fallacies: 1) the fallacy in comparing AI capabilities to humans, 2) a somewhat nuanced misconception about the role of randomness in AI, and 3) a bunch of related misconceptions about how easy it would be to make LLMs Turing-complete.
But first, let’s formalize what we mean when we say LLMs cannot reason.
What is reasoning (in AI)?
When we AI folks claim LLMs cannot reason, we are not talking about any abstract, philosophical sense of the word “reason”, nor any of the many psychological and sociological nuances it may entail. No, we have a very specific, quantifiable, simplified notion of reasoning that comes straight out of math.
Reasoning is, simply put, the capacity to draw logically sound conclusions from a given premise. In math, there are two main reasoning types or modes: deduction and induction. Induction is somewhat problematic because it involves generalizing claims from specific instances, and thus, it requires some pretty strong assumptions. In contrast, deduction is very straightforward. It is about applying a finite set of logical inference rules to obtain new provably true claims from existing true claims. It is the type of reasoning that mathematicians do all day long when proving new theorems.
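To make "applying a finite set of inference rules" concrete, here is a minimal sketch of deduction as forward chaining with modus ponens. This is my own toy illustration, not code from any particular reasoning system: everything it derives is guaranteed to be true whenever the premises are true.

```python
# Deduction as mechanical rule application: forward chaining with modus
# ponens. Anything derived is true whenever the premises are true.

def forward_chain(facts: set, rules: list) -> set:
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)  # sound: licensed by the rule
                changed = True
    return derived

# "Socrates is a man" plus "if Socrates is a man, then Socrates is mortal".
facts = {"man(socrates)"}
rules = [(frozenset({"man(socrates)"}), "mortal(socrates)")]
print(forward_chain(facts, rules))  # contains both the premise and the derived fact
```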
Thus, when I say LLMs cannot reason, I’m simply saying there are—sometimes pretty simple—deduction problems they inherently cannot solve. It is not a value judgement, or an opinion based on experience. It is a straightforward claim provable from the definition of reasoning—understood as deductive reasoning—and the inherent limitations of LLMs given their architecture and functionality.
If this is clear, let’s move on to the counterarguments to this claim.
Argument 1: Humans Also Have these Limitations
The most common criticism I received against the assertion that LLMs cannot reason is that, sure, LLMs cannot reason, but neither can humans, right? I mean, humans can be stupendously irrational. But this argument is flawed on many levels, so let’s unpack it.
First, while it is true that humans can make errors in reasoning, the human brain definitely possesses the capacity for open-ended reasoning, as evidenced by the more than 2000 years of solid math we have collectively built. Moreover, all college students—at least in quant fields—at some point have to solve structured problem-solving exercises that require them to apply logical reasoning to arrive at correct conclusions, such as proving theorems. So, while humans can be pretty stupid at times, we are certainly capable of the most rigorous reasoning when trained to do so.
But even more importantly, this assertion is a red herring. Why does the fact that humans can’t do something immediately make it OK for a piece of technology to suck at it? Imagine we did this with all our other tech. Sure, that airplane fell down and killed 300 people, but humans can’t fly, so there’s that. Or yes, that submarine imploded, but humans can’t breathe underwater. Or that nuclear power plant melted, but humans can’t stand 3000 degrees of heat, so what’s the big deal?
No, we don’t do that. We compare any new piece of technology with our current best solution, and only if the new thing improves upon the old—at least on some metrics—do we consider it worthwhile.
Granted, we often compare AI capabilities to human capabilities, but this is only because humans are the gold standard for the types of problems we often want AI systems to solve. So we compare LLMs’ capacity to generate creative stories with our best writers, and we compare LLMs’ capacity for open-ended dialogue or for empathetic customer assistance with humans because there is nothing out there better than humans at these tasks.
However, there are well-established systems—such as traditional SAT solvers—that excel in structured logical deduction and reasoning tasks. These systems are designed with rigorous validation mechanisms that ensure correctness and reliability in their outputs. They are basically flawless and incredibly fast. So, instead of comparing LLMs to humans in deductive reasoning, let’s compare them with the best solution we currently have for this problem. And there, LLMs definitely suck.
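For a concrete sense of what such a system does, here is a toy DPLL-style solver. It is a sketch of the textbook algorithm, nowhere near a production solver like the ones mentioned above: given a propositional formula in CNF, it returns a satisfying assignment, or None as a definitive verdict that no assignment exists, with no probability attached either way.

```python
# Toy DPLL SAT solver. Clauses are lists of non-zero ints: 3 means x3,
# -3 means NOT x3. The answer is exact, not probabilistic.

def dpll(clauses, assignment=None):
    assignment = dict(assignment or {})
    simplified = []
    for clause in clauses:
        if any(assignment.get(abs(l)) == (l > 0) for l in clause):
            continue                      # clause already satisfied
        rest = [l for l in clause if abs(l) not in assignment]
        if not rest:
            return None                   # clause falsified: backtrack
        simplified.append(rest)
    if not simplified:
        return assignment                 # every clause satisfied
    var = abs(simplified[0][0])           # real solvers pick branching variables smarter
    for value in (True, False):
        result = dpll(simplified, {**assignment, var: value})
        if result is not None:
            return result
    return None

# (x1 OR x2) AND (NOT x1 OR x3) AND (NOT x2 OR NOT x3)
print(dpll([[1, 2], [-1, 3], [-2, -3]]))  # {1: True, 3: True, 2: False}
```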
Argument 2: Randomness is Not a Limitation
The second most common criticism I received was regarding the stochastic nature of language models. To recap, I claim that since LLMs generate tokens in a probabilistic fashion—which is a fundamental feature of the paradigm—their output is inherently unreliable when you require absolute accuracy instead of versatility.
A lot of people correctly argued that, in fact, randomness is essential in problem-solving and a crucial feature of many of the same SAT solvers I propose comparing LLMs against. How hypocritical of me, they claim, to posit randomness as a limitation when the most effective deductive reasoning algorithms we have are essentially random. And this is true, but only partially, and it makes all the difference. So let me explain.
Randomness plays a vital role in many computational problem-solving techniques, particularly in search algorithms for hard (read NP-complete or NP-hard) problems. Modern SAT solvers, for example, often employ randomized search strategies to efficiently explore vast solution spaces. By introducing randomness into the search process, these solvers can escape local optima and discover satisfactory solutions more quickly than deterministic methods might allow. This ability to leverage randomness is a powerful tool in the arsenal of computational techniques, enabling systems to tackle complex problems that would otherwise be intractable.
However—and here comes the crucial difference—using randomness in the search process does not imply that the entire reasoning process is inherently unreliable. Randomness is confined to the search phase of problem-solving, where it helps identify potential solutions—potential reasoning paths. However, once a candidate solution is found, a deterministic validation phase kicks in that rigorously checks the correctness of the proposed reasoning path.
The distinction between the search and validation phases is paramount in understanding how randomness contributes to effective problem-solving in general. During the search phase, algorithms may employ random sampling or other stochastic methods to explore possibilities and generate potential solutions. This phase allows for flexibility and adaptability, enabling systems to navigate complex landscapes of potential answers.
However, once a potential solution has been identified, it must undergo a validation process that is grounded in deterministic logic. This validation phase involves applying established rules and principles to confirm that the proposed solution meets all necessary criteria for correctness. As a result, any solution that passes this validation step can be confidently accepted as valid, regardless of how it was generated in the first place.
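Here is a minimal sketch of this division of labour, with a made-up toy formula: a WalkSAT-flavoured randomized search proposes assignments, while a separate deterministic check is the only thing that decides whether a candidate is accepted.

```python
import random

# Deterministic validation: exact, rule-based, no probabilities involved.
def satisfies(assignment, clauses):
    return all(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses)

# Stochastic search (WalkSAT-flavoured): randomness is great for exploring,
# but a candidate only counts once the deterministic check accepts it.
def random_search(clauses, n_vars, max_flips=10_000, seed=0):
    rng = random.Random(seed)
    assignment = {v: rng.random() < 0.5 for v in range(1, n_vars + 1)}
    for _ in range(max_flips):
        if satisfies(assignment, clauses):
            return assignment                         # provably valid answer
        falsified = [c for c in clauses
                     if not any(assignment[abs(l)] == (l > 0) for l in c)]
        var = abs(rng.choice(rng.choice(falsified)))  # flip a variable from a failing clause
        assignment[var] = not assignment[var]
    return None  # gave up; this is *not* a proof of unsatisfiability

clauses = [[1, 2], [-1, 3], [-2, -3]]  # (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
model = random_search(clauses, n_vars=3)
print(model, model is not None and satisfies(model, clauses))
```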
You can have millions of monkeys typing away on typewriters, and at some point, one of them will randomly produce Romeo and Juliet, but only Shakespeare can separate the gold from the garbage and decide which manuscript is worth publishing.
That silly metaphor means that randomness is good for exploring hypotheses but not for deciding which one to accept. For that, you need a deterministic, provably correct method that doesn’t rely on probabilities—at least if you want to solve the problem exactly.
However, in stark contrast to traditional problem-solving systems like SAT solvers, LLMs lack a robust validation mechanism. While they can generate coherent and contextually relevant responses based on probabilistic reasoning, some of which may be correct reasoning chains, they do not possess a reliable method for verifying the accuracy of those outputs. The verification process is also stochastic and subject to hallucinations, rendering it utterly unreliable.
So, since LLMs evaluate their own outputs using the same probabilistic reasoning they employ for generating them in the first place, there is an unavoidable risk that incorrect conclusions will be propagated as valid responses. The monkeys are also the editors.
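As a toy illustration of why that matters, here is a small simulation with made-up error rates (placeholders for the sake of the argument, not measurements of any real model): when the verifier is the same noisy process as the generator, a noticeable fraction of wrong answers still gets accepted, whereas a provably correct checker accepts none.

```python
import random

# Toy numbers, purely hypothetical; not measurements of any real model.
P_WRONG = 0.20          # generator produces a wrong answer 20% of the time
P_VERIFIER_FLIP = 0.20  # the stochastic self-check gives the wrong verdict 20% of the time

rng = random.Random(42)
N = 100_000
accepted_wrong = 0
for _ in range(N):
    answer_is_wrong = rng.random() < P_WRONG
    verdict_flipped = rng.random() < P_VERIFIER_FLIP
    accepted = (not answer_is_wrong) != verdict_flipped  # noisy self-evaluation
    if accepted and answer_is_wrong:
        accepted_wrong += 1

print(f"wrong answers accepted by the stochastic self-check: {accepted_wrong / N:.1%}")
# Roughly 4% here. A deterministic, provably correct checker accepts exactly 0%.
```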
Argument 3: LLMs Can Be Turing-Complete
The final argument I want to address is the notion that LLMs can be made Turing-complete by duct-taping them with some Turing-complete gadget. Here’s a brief recap of what this means.
LLMs have a fixed computational budget—a fixed number of matrix multiplications they perform per input token. This means there are problems that are inherently outside the realm of what they can solve. These problems fall into two categories.
First, NP-complete problems—such as the very straightforward problem of determining whether a logical formula is satisfiable—are a class of decision problems for which no known polynomial-time solutions exist. Moreover, most experts believe no such algorithm can exist. Thus, these problems probably require an exponential amount of computation for sufficiently large instances. So, given the fixed computational budget of LLMs, no matter how big your stochastic parrot, there will always be a logical formula that is simply too large for it to solve.
On the other hand, we have semi-decidable problems, those for which an algorithm can confirm a solution if one exists but may run indefinitely if no solution is found. For these problems, we simply have no option but to keep searching for a potentially unbounded amount of time. And since LLMs are computationally bounded, there are solvable problem instances that simply would require more computing steps than the LLM can produce.
Now, all of the above is clear to anyone who even superficially understands how LLMs work. However, a common argument posited by critics is that LLMs can be rendered Turing-complete by integrating them with external tools, such as code generators or general-purpose inference engines, or, even more simply, by wrapping them in a recursive procedure that can call the LLM as many times as necessary.
And this is true. You can trivially make an LLM Turing-complete, in principle, by duct-taping it with something that is already Turing-complete. You can also build a flame thrower with a bamboo stick, some duct tape, and a fully working flame thrower.
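In code, the duct-tape construction looks roughly like the sketch below; `llm_propose_next_step` is a hypothetical placeholder for a model call, not any real API. The unbounded loop around it is what buys Turing-completeness in principle, and it does nothing to make the individual, stochastic steps more reliable.

```python
# Hypothetical sketch: the unbounded loop, not the model, supplies the
# unbounded computation that Turing-completeness requires.

def llm_propose_next_step(scratchpad: str) -> str:
    """Placeholder for a model call: a stochastic generator with a fixed
    per-call compute budget. Hypothetical, not a real API."""
    raise NotImplementedError

def solve_with_llm(problem: str, max_steps=None) -> str:
    scratchpad = problem
    steps = 0
    while True:                                    # in principle, runs forever
        step = llm_propose_next_step(scratchpad)   # each call may be wrong
        if step.startswith("HALT:"):
            return step[len("HALT:"):]             # nothing has verified this answer
        scratchpad += "\n" + step                  # external memory grows without bound
        steps += 1
        if max_steps is not None and steps >= max_steps:
            raise RuntimeError("budget exhausted")  # and we are back to bounded compute
```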
However, simply making LLMs Turing complete in principle does not guarantee that they will produce correct or reliable outputs. The integration of external tools introduces complexity and potential points of failure, particularly if the LLM does not effectively manage interactions with these tools.
The problem is, when you combine stochastic output—prone to hallucinations—with external tools that require precise inputs, you get LLMs that, in principle, have access to all the resources they may need but are incapable of using them reliably.
When relying on external systems for reasoning tasks—for example, having your LLM call a SAT solver when necessary—it is crucial that LLMs can consistently identify the appropriate tool to use and provide it with the correct arguments. However, due to their probabilistic nature and susceptibility to hallucinations, LLMs struggle to do so reliably. And even if they successfully invoke an external tool, there is no guarantee that they will interpret or apply the tool’s output correctly in their reasoning process.
So, Turing-incompleteness or bounded computation may not be a knockout argument on its own, but when combined with the other inherent limitations of LLMs—crucially, their unreliability—it is clear that there is no guarantee even the most advanced models won’t fail at some reasoning tasks.
And here is the final kicker: approximate reasoning is not good enough. If the LLM fails one out of every million times to produce the right deduction, that still means the LLM cannot reason. For all practical purposes, you may be happy with a model that gets it right 9 out of 10 or 99 out of 100, but in mission-critical tasks, nothing short of perfect reasoning is good enough.
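A back-of-the-envelope calculation (my own numbers, assuming for simplicity that step errors are independent) shows how quickly "almost always right" degrades once deductions are chained:

```python
# Assume (for illustration only) each deduction step is independently
# correct with probability p; an n-step chain is fully correct with p ** n.
for p in (0.999, 0.9999):
    for n in (10, 100, 1000):
        print(f"per-step accuracy {p}, chain length {n:>4}: {p ** n:.3f}")
# At 99.9% per step, a 1000-step derivation is fully correct only about 37% of the time.
```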
And that’s the claim: LLMs are incapable, by design, of perfect reasoning.
Conclusion
The purpose of this and the previous article is to convince you of two claims:
- Large Language Models currently lack the capability to perform a well-defined form of reasoning that is essential for many decision-making processes.
- We currently have absolutely no idea how to solve this in the near future.
This matters because there is a growing trend to promote LLMs as general-purpose reasoning engines. As more users begin to rely on LLMs for important decisions, the implications of their limitations become increasingly significant. At some point, someone will trust an LLM with a life-and-death decision, with catastrophic consequences.
More importantly, the primary challenges in making LLMs trustworthy for reasoning are immense. Despite ongoing research and experimentation, we have yet to discover solutions that effectively bridge the gap between LLM capabilities and the rigorous standards required for reliable reasoning. Currently, our best efforts in this area are nothing but duct tape—temporary fixes that do not address the underlying limitations of the stochastic language modeling paradigm.
Now, I want to stress that these limitations do not diminish the many other applications where LLMs excel as stochastic language generators. In creative writing, question answering, user assistance, translation, summarization, automatic documentation, and even coding, many of the limitations we have discussed here are actually features.
The thing is, this is what language models were designed for—to generate plausible, human-like, varied, not-necessarily-super-accurate language. The whole paradigm of stochastic language modeling is optimized for this task, and it excels at it. It is much better than anything else we’ve ever designed. But when we ask LLMs to step outside that range of tasks, they become brittle, unreliable, and, worse, opaquely so.
If LLMs are to fulfill even some of our highly unrealistic expectations for them, we must prioritize solving the challenge of provably correct reasoning. Until then, all we have is a stochastic parrot—a fun toy with some interesting use cases but not a truly transformative technology.
Sources:
English original: No, LLMs Still Cannot Reason - Part II
Chinese translation: OneFlow-再谈LLM逻辑推理的三大谬误
License: This post is released under CC BY-NC-ND 4.0; any repost must credit the author and link to this article.