Reasoning with Large Language Models

By: Ashish Kulkarni – Associate Research Director – Rakuten

“Cogito, ergo sum”, meaning “I think, therefore I am”, is a foundational philosophical principle about our existence as humans. Expanding on this, Descartes, the source of this maxim, counts our ability to ‘reason’ among the many aspects of thinking. Setting aside the fact that he calls into question the infallibility of our reasoning, in this article we will focus on understanding what reasoning is, its different dimensions, and its relevance to our quest for human-like AI, especially in the context of recent developments around large language models (LLMs).

What is reasoning?

“Reasoning is a robust source of hope and confidence in a world darkened by murky deeds.” (Amartya Sen, The Idea of Justice)

Let’s consider the following conversation snippet.

Customer: Hi! Could you help me plan a trek to the Everest base camp?

Agent: Oh, how I love the Himalayas! I will be happy to help. When are you planning?

Customer: I was thinking of next month.

Agent: Monsoons are typically not a good season for a trek in the Himalayas. Sometime in September might be better. Would you like to plan then?

While at first glance it seems like any other conversation, let’s break it down. The agent:

  1. receives a request from the customer;
  2. understands that they want to plan a trek to the ‘Everest base camp’ ‘next month’;
  3. knows that ‘next month’ is June and that Everest is in the Himalayas, recalls that June is the monsoon season and that it is not safe to trek in the Himalayas during the monsoons, and concludes that, since the monsoons last until September, that might be a better time;
  4. formulates the response.

Put simply, ‘reasoning’ is all that transpired in the agent’s mind (#3) between understanding the customer’s request and providing the response that they did. It is the stepwise thinking and analysis of information received from the environment (the customer inquiry) and from existing knowledge (the agent’s experience, commonsense or the Web) to arrive at a logical response. From the customer’s perspective, reasoning is what helps them understand the ‘why’ behind the agent’s response.

Reasoning is clearly an important ingredient that makes conversations personable. But beyond the conversational use case, it is relevant in several domains, such as medicine, law, education, public policy and business strategy, in which humans are known to excel.

Reasoning dimensions

There is extensive research on different types or dimensions of reasoning, based broadly on the source of knowledge and the nature of analysis involved in the process of reasoning. This characterization is important because, on one hand, it helps in the scientific study of the performance of LLMs (or AI tools in general) and, on the other, this understanding informs our design choices in the development of AI applications. While it is hard to say if there is an agreed-upon taxonomy, let us look at a few reasoning dimensions that, although not exhaustive, should suffice as an introduction for what follows.

Quantitative / Arithmetic reasoning – requires numerical or arithmetic knowledge to reason. Math word problems are classic examples that involve arithmetic reasoning. For instance:

Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
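The arithmetic behind this problem can be written out step by step. Here is a minimal Python sketch; the variable names are illustrative and not part of the original problem.

```python
# Quantities stated in the word problem
initial_balls = 5    # Roger starts with 5 tennis balls
cans_bought = 2      # he buys 2 more cans
balls_per_can = 3    # each can has 3 tennis balls

# Step 1: tennis balls gained from the new cans
new_balls = cans_bought * balls_per_can

# Step 2: total after the purchase
total_balls = initial_balls + new_balls

print(total_balls)  # 11
```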

Commonsense reasoning – is important in applications that are customer-facing or that interact with the world, as these often need to reason about physical and human interactions under the presumption of general background knowledge. Here’s an example from the BIG-bench benchmark:

Did Aristotle use a laptop?

Answering this requires commonsense reasoning over the following sub-questions:

  1. when did Aristotle live?
  2. when was the laptop invented?
  3. is #2 before #1?
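The three sub-questions above can be sketched as a small program. This is only an illustration: the function name is hypothetical, Aristotle's dates are the commonly cited 384–322 BCE, and the laptop year is a rough assumption (early 1980s), since the exact date depends on which machine one counts as the first laptop.

```python
# Negative years denote BCE.
ARISTOTLE_BORN = -384    # sub-question 1: when did Aristotle live?
ARISTOTLE_DIED = -322
LAPTOP_INVENTED = 1981   # sub-question 2: assumed approximate year

def could_have_used(invention_year: int, death_year: int) -> bool:
    """Sub-question 3: was the item invented before the person died?"""
    return invention_year < death_year

answer = could_have_used(LAPTOP_INVENTED, ARISTOTLE_DIED)
print("Did Aristotle use a laptop?", "Yes" if answer else "No")
```

The interesting part for an LLM is not the final comparison but retrieving the two facts and realising that they need to be compared at all.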

Symbolic reasoning – typically involves the use of symbols, as shorthand substitutes for longer expressions that are required frequently, and their manipulation using logic rules. For instance, consider the coin flip task, which requires one to answer whether a coin is still heads up after people either flip or don’t flip the coin. “A coin is heads up. Phoebe flips the coin. Osvaldo does not flip the coin. Is the coin still heads up?”

COIN_STATE = heads;

COIN_STATE = tails IF IS_ODD( COUNT_FLIPS([+1, 0]) )
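The symbolic rule above can also be expressed as runnable code. A minimal Python sketch, assuming each person is encoded as 1 if they flip the coin and 0 if they do not (the function name is illustrative):

```python
def coin_still_heads_up(flips: list[int]) -> bool:
    """Return True if a coin that starts heads up is still heads up.

    Each entry in `flips` is 1 if that person flipped the coin, 0 otherwise.
    An odd number of flips leaves the coin tails up.
    """
    return sum(flips) % 2 == 0

# Phoebe flips the coin (1), Osvaldo does not (0)
print(coin_still_heads_up([1, 0]))  # False
```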

Causal reasoning – is reasoning about the world using cause-and-effect relationships and is crucial to making correct decisions and building robust systems. Consider this example from the business advisor domain:

A toy shop owner in the western USA wants to decide whether the ad they bought in early December is really better than their previous ads. Here is their sales data: October: $10,200; November: $10,000; December: $13,000; January: $10,100. They now want to decide which ad to show in February. Can you help them decide whether the increase in sales in December was due to the ads? Note that the new ad costs $1000 more to produce, so the toy maker is interested in maximizing their profit.
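To see why this is a causal question rather than a purely arithmetic one, consider the naive calculation sketched below. It is only an illustration with hypothetical variable names: it compares December against the average of the other three months and attributes the whole difference to the new ad. The figures are sales, not profit, and December holiday shopping is an obvious confounder, so the result should not be read as a causal estimate.

```python
sales = {
    "October": 10_200,
    "November": 10_000,
    "December": 13_000,   # month in which the new ad ran
    "January": 10_100,
}
EXTRA_AD_COST = 1_000     # the new ad costs $1000 more to produce

# Naive baseline: average sales in the months without the new ad
baseline_months = [month for month in sales if month != "December"]
baseline = sum(sales[month] for month in baseline_months) / len(baseline_months)

# Naive uplift: assumes the ad alone explains the December increase
naive_uplift = sales["December"] - baseline
naive_uplift_minus_ad_cost = naive_uplift - EXTRA_AD_COST

print(baseline, naive_uplift, naive_uplift_minus_ad_cost)
```

Causal reasoning is what tells us that this comparison is not enough, for example by asking what December sales looked like in previous years with the old ads.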

Further, there are other types of reasoning, such as knowledge-based reasoning (a more general form of commonsense reasoning over an explicit representation of a knowledge base), logical reasoning, monotonic and non-monotonic reasoning, and so on. We will stop our discussion of this topic here and turn next to the reasoning ability of large language models.

Reasoning in Large Language Models

Large language models (LLMs) should need no introduction. Owing to the popularity of ChatGPT, LLMs are now common knowledge. Essentially, these are multi-billion parameter models, trained on massive corpora of text, that have shown promising performance on language tasks. Scaling these models to larger sizes has led to the emergence of abilities like few-shot learning from natural-language instructions and a few examples, a technique that is popularly referred to as ‘prompting’. The figure below shows an example of a prompt.
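A few-shot prompt of this kind can also be sketched in code. The helper below is hypothetical, and the instruction and reviews are made-up examples; it simply concatenates an instruction, a few <input, output> pairs, and the new input that the model is expected to complete.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, a few <input, output> examples, and a new input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The final output is left blank for the model to fill in
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    instruction="Classify the sentiment of each review as positive or negative.",
    examples=[
        ("The trek was breathtaking.", "positive"),
        ("The guide never showed up.", "negative"),
    ],
    query="The views were worth every step.",
)
print(prompt)
```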

What followed was a flurry of models being trained, hundreds of billions of parameters in size, each larger than its predecessor, with the hope of improving the few-shot in-context learning performance of LLMs. Unfortunately, their evaluation on hundreds of diverse language tasks revealed that, while these models achieve state-of-the-art performance on multiple tasks including reading comprehension, fact-checking and identification of toxic language, challenging tasks such as arithmetic, commonsense, and symbolic reasoning see less benefit from merely scaling up the model size. A recent study that probes LLMs for their performance on causal reasoning tasks also illustrates their mixed behaviour.

So, does this mean that LLMs are not yet ready to take on applications that require reasoning?

Well, for now, perhaps, but the following directions hold a lot of promise in raising the reasoning abilities of LLMs.

Chain-of-Thought Prompting

The human reasoning process typically follows stepwise reasoning: next month is June … so it’s monsoons … not safe to trek in the Himalayas … Everest base camp is in the Himalayas … so let’s consider planning after the monsoons. Inspired by this intuition, the chain-of-thought prompting technique proposes to augment the in-context prompts in LLMs from the traditional <input, output> to the triple <input, chain of thought, output>, where the chain of thought is a series of intermediate reasoning steps in natural language that lead to the final output. The figure below shows an example.

This apparently simple change to the prompts has shown remarkable improvements in the reasoning performance of LLMs on arithmetic, commonsense, and symbolic reasoning benchmarks.
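As a rough illustration of the <input, chain of thought, output> triple, the sketch below builds a chain-of-thought prompt from one exemplar based on the tennis ball problem and one new question. The helper name and the "Q:"/"A:" layout are assumptions made for this example, not a prescribed format.

```python
def build_cot_prompt(exemplars, question):
    """Build a prompt from <input, chain of thought, output> triples."""
    parts = []
    for exemplar_question, chain_of_thought, answer in exemplars:
        parts.append(
            f"Q: {exemplar_question}\n"
            f"A: {chain_of_thought} The answer is {answer}."
        )
    # The new question ends with an open "A:" for the model to continue
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

exemplars = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11.",
        "11",
    ),
]
prompt = build_cot_prompt(
    exemplars,
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?",
)
print(prompt)
```

The only difference from a standard few-shot prompt is the intermediate reasoning text placed before each exemplar's answer.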

Integration of LLMs + Reasoning tools + Humans

High-cognition tasks like business strategy, public policy and similar decision-making have relied on human expertise and causal analysis tools. While LLMs might lack robust, complete reasoning on these tasks, they have shown promising performance on sub-tasks such as causal discovery, thereby opening up the possibility of building causal reasoning pipelines that seamlessly transition control between LLMs, formal reasoning tools and human experts as appropriate. This is also the motivation behind approaches like Program-aided Language (PAL) models, which use LLMs to convert natural language problems into reasoning steps represented as programs, but then delegate the solution step to a runtime such as a Python interpreter.
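The division of labour in a PAL-style approach can be sketched as follows. No model is called here: the `generated_program` string is a hand-written stand-in for what an LLM might produce for the tennis ball problem, and the `solution()` convention and helper name are assumptions made for this illustration.

```python
# Stand-in for LLM output: reasoning steps expressed as a program
generated_program = """
def solution():
    initial_balls = 5
    cans_bought = 2
    balls_per_can = 3
    return initial_balls + cans_bought * balls_per_can
"""

def run_generated_program(program: str):
    """Execute the program text and return the result of its solution()."""
    namespace = {}
    # exec runs arbitrary code; only use it on trusted or sandboxed input
    exec(program, namespace)
    return namespace["solution"]()

result = run_generated_program(generated_program)
print(result)  # 11
```

The model is responsible only for translating the problem into steps; the arithmetic itself is done by the interpreter, which does not make calculation mistakes.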

This has also revived interest in symbolic AI research, which has long been developing techniques and tools for knowledge representation and reasoning. There is a recent body of research, for instance, that explores the integration of LLMs with symbolic solvers or studies the use of symbolic logic rules as the representation language for in-context exemplars.

In this article, we approached reasoning from the context of large language models – an artefact of the neural AI paradigm. While we did not cover the symbolic AI approaches to reasoning (topic for another discussion!), the coming together of the two can only mean one thing – there are exciting times ahead!


Srivastava, Aarohi, et al. “Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.” arXiv preprint arXiv:2206.04615 (2022).

Brown, Tom, et al. “Language models are few-shot learners.” Advances in neural information processing systems 33 (2020): 1877-1901.

Wei, Jason, et al. “Chain of thought prompting elicits reasoning in large language models.” arXiv preprint arXiv:2201.11903 (2022).

Kıcıman, Emre, et al. “Causal Reasoning and Large Language Models: Opening a New Frontier for Causality.” arXiv preprint arXiv:2305.00050 (2023).

Touvron, Hugo, et al. “Llama: Open and efficient foundation language models.” arXiv preprint arXiv:2302.13971 (2023).

Joshi, Pratik, et al. “TaxiNLI: Taking a ride up the NLU hill.” arXiv preprint arXiv:2009.14505 (2020).

Sen, Amartya. “The idea of justice.” Journal of human development 9.3 (2008): 331-342.

Geva, Mor, et al. “Did Aristotle use a laptop? a question answering benchmark with implicit reasoning strategies.” Transactions of the Association for Computational Linguistics 9 (2021): 346-361.

MacColl, Hugh. “Symbolic Reasoning.” Mind 6.24 (1897): 493-510.

Gao, Luyu, et al. “PAL: Program-aided Language Models.” arXiv preprint arXiv:2211.10435 (2022).

Zhang, Hanlin, et al. “The Impact of Symbolic Representations on In-context Learning for Few-shot Reasoning.” arXiv preprint arXiv:2212.08686 (2022).

He-Yueya, Joy, et al. “Solving Math Word Problems by Combining Language Models With Symbolic Solvers.” arXiv preprint arXiv:2304.09102 (2023).