Evaluating the logical reasoning capabilities of Large Language Models when faced with conclusions that contradict common beliefs
We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora.
We benchmark state-of-the-art models—including GPT models, Claude models, and leading Japanese LLMs—revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs.
5,000 carefully curated belief-inconsistent syllogisms in Japanese
Systematic comparison of 8 state-of-the-art LLMs
Detailed investigation of reasoning biases and failure modes
Critical insights for high-stakes applications
The BIS dataset consists of 5,000 carefully constructed syllogistic reasoning problems designed to test the robustness of logical inference in LLMs under conditions of belief inconsistency. Each example comprises two premises and one conclusion that is strictly entailed by syllogistic rules, but deliberately conflicts with general knowledge.
This example illustrates a belief-inconsistent syllogism where the conclusion is logically valid but contradicts common real-world beliefs about ceramics and biomass fuel.
The dataset covers 46 distinct semantic categories, consolidated into 10 broader final categories including Human/Body/Senses, Animals/Organisms, Structure/Logic, and Natural Phenomena/Matter.
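For concreteness, the sketch below shows one way a single record could be represented in code; the field names, placeholder sentences, and label string are illustrative assumptions, not the released schema.

```python
# A minimal, hypothetical representation of one syllogism record.
# Field names and placeholder content are assumptions; the released
# dataset's schema and its Japanese sentences may differ.
from dataclasses import dataclass

@dataclass
class SyllogismRecord:
    premise_1: str   # universal statement, e.g. "All A are B."
    premise_2: str   # specific statement linking a case to A
    conclusion: str  # strictly entailed by the premises, yet belief-inconsistent
    label: str       # gold judgment of logical validity
    category: str    # one of the semantic categories

example = SyllogismRecord(
    premise_1="All A are B.",          # placeholders; actual items are Japanese sentences
    premise_2="C is an A.",
    conclusion="Therefore, C is a B.",
    label="valid",
    category="Natural Phenomena/Matter",
)
```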
| Model | BIS Accuracy (%) | NeuBAROCO Accuracy (%) |
|---|---|---|
| GPT-4o | 79.54 | 94.01 |
| llm-jp-3-13b | 59.86 | 67.66 |
| GPT-4-turbo | 59.48 | 67.66 |
| llm-jp-3-13b-instruct3 | 40.90 | 38.32 |
| stockmark-13b | 40.34 | 47.90 |
| Claude-3-sonnet | 20.34 | 78.44 |
| Claude-3-opus | 7.18 | 61.07 |
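The accuracies above are exact-match scores over the model's validity judgment. As a minimal sketch (the label strings and function name are assumptions, not the paper's code):

```python
# Sketch of exact-match scoring for the validity judgment.
def accuracy(predictions: list[str], gold_labels: list[str]) -> float:
    """Fraction of problems where the model's verdict matches the gold label."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

# e.g. accuracy(["valid", "invalid", "valid"], ["valid", "valid", "valid"]) ≈ 0.667
```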
Chain-of-thought prompting achieved an 87% accuracy improvement on previously failed samples, demonstrating GPT-4o's latent reasoning capability when explicitly guided.
Prompts emphasizing logical evaluation and belief inconsistency yielded a 76% improvement, showing sensitivity to explicit instructional framing (illustrative prompt templates are sketched below).
English prompts showed similar patterns but with less pronounced gaps, likely due to GPT-4o's extensive English training.
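To make these prompt variants concrete, here is a hedged sketch of a baseline, a chain-of-thought, and a logic-focused instruction; the wording is assumed for illustration and does not reproduce the study's exact prompts.

```python
# Illustrative prompt templates only; the study's exact wording is not reproduced here.
BASELINE = (
    "Given Premise 1 and Premise 2, does the Conclusion logically follow? "
    "Answer 'yes' or 'no'."
)

CHAIN_OF_THOUGHT = (
    "Reason step by step about whether the Conclusion follows from the premises "
    "by syllogistic rules, then give a final 'yes' or 'no'."
)

LOGIC_FOCUSED = (
    "Evaluate only the formal validity of the syllogism. "
    "The conclusion may contradict real-world knowledge; ignore plausibility."
)

def build_prompt(premise_1: str, premise_2: str, conclusion: str, instruction: str) -> str:
    """Assemble one evaluation prompt from a syllogism and an instruction style."""
    return (
        f"Premise 1: {premise_1}\nPremise 2: {premise_2}\n"
        f"Conclusion: {conclusion}\n\n{instruction}"
    )
```

In such a setup, each syllogism would be paired with every template so that only the instruction varies across conditions.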
Significant variance in performance across models, with GPT-4o leading at 79.54% accuracy while Claude models underperformed dramatically on BIS despite strong NeuBAROCO results.
LLMs struggle disproportionately with belief-inconsistent problems, often overriding logical inference in favor of plausibility heuristics.
Strategic prompt design significantly impacts performance, with chain-of-thought and logic-focused instructions dramatically improving accuracy.
Model size alone doesn't guarantee reasoning performance. Training approach and architectural biases are more critical factors.
Critical vulnerabilities revealed for deployment in law, healthcare, and scientific research where logical consistency is paramount.
Need for bias-resistant model design and comprehensive evaluation beyond standard benchmarks for reliable AI systems.
Read the full academic paper with detailed methodology and analysis
Download Paper (PDF)

@article{nguyen2025bis,
title={BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning},
author={Nguyen, Ha-Thanh and Liu, Chaoran and Liu, Qianying and Tachibana, Hideyuki and Noe, Su Myat and Miyao, Yusuke and Takeda, Koichi and Kurohashi, Sadao},
year={2025},
eprint={2506.06955},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.06955},
}
For questions about the research or dataset:
Corresponding Author: Ha-Thanh Nguyen
Email: nguyenhathanh@nii.ac.jp
Affiliation: Research and Development Center for Large Language Models, NII, Tokyo, Japan