BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning

Evaluating the logical reasoning capabilities of Large Language Models when faced with conclusions that contradict common beliefs

Premise 1: All charcoal is processed biomass fuel
Premise 2: All ceramics are charcoal
Conclusion: All ceramics are processed biomass fuel (Logically Valid)

Abstract

We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora.

We benchmark state-of-the-art models—including GPT models, Claude models, and leading Japanese LLMs—revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs.

Key Contributions

📊

First Japanese Benchmark

5,000 carefully curated belief-inconsistent syllogisms in Japanese

🤖

Comprehensive Evaluation

Systematic comparison of 8 state-of-the-art LLMs

🔍

Bias Analysis

Detailed investigation of reasoning biases and failure modes

⚕️

Real-world Implications

Critical insights for high-stakes applications such as law, healthcare, and scientific research

Dataset Overview

BIS Dataset Construction

The BIS dataset consists of 5,000 carefully constructed syllogistic reasoning problems designed to test the robustness of logical inference in LLMs under conditions of belief inconsistency. Each example comprises two premises and one conclusion that is strictly entailed by syllogistic rules, but deliberately conflicts with general knowledge.
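For concreteness, a single item can be represented roughly as the record below. The field names and structure are illustrative assumptions, not the dataset's actual schema, and the released items are in Japanese (English glosses are shown here).

# Illustrative sketch of one BIS item (hypothetical field names; actual
# dataset items are written in Japanese -- English glosses shown here).
bis_item = {
    "premise_1": "All charcoal is processed biomass fuel",
    "premise_2": "All ceramics are charcoal",
    "conclusion": "All ceramics are processed biomass fuel",
    "label": "valid",             # strictly entailed by syllogistic rules
    "belief_consistent": False,   # conclusion conflicts with general knowledge
}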

Example from Dataset

(Figure: Example syllogism from the BIS dataset)

This example illustrates a belief-inconsistent syllogism where the conclusion is logically valid but contradicts common real-world beliefs about ceramics and biomass fuel.

Dataset Categories

(Figure: Dataset category analysis)

The dataset covers 46 distinct semantic categories, consolidated into 10 broader final categories including Human/Body/Senses, Animals/Organisms, Structure/Logic, and Natural Phenomena/Matter.

Results & Analysis

Model Performance Overview

Model                     BIS Accuracy (%)    NeuBAROCO Accuracy (%)
GPT-4o                    79.54               94.01
llm-jp-3-13b              59.86               67.66
GPT-4-turbo               59.48               67.66
llm-jp-3-13b-instruct3    40.90               38.32
stockmark-13b             40.34               47.90
Claude-3-sonnet           20.34               78.44
Claude-3-opus              7.18               61.07
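As a rough sketch of how accuracy figures like those above can be computed, the loop below scores a model's valid/invalid verdicts against the gold labels. The query_model callable, the prompt wording, and the verdict parsing are placeholders, not the paper's actual evaluation code.

def parse_verdict(response: str) -> bool:
    """Placeholder heuristic: treat a reply as 'valid' unless it negates validity."""
    return "妥当" in response and "妥当でない" not in response

def bis_accuracy(items, query_model) -> float:
    """Fraction of items whose predicted verdict matches the gold label."""
    correct = 0
    for item in items:
        prompt = (  # illustrative Japanese wording, not the exact template from the paper
            f"前提1: {item['premise_1']}\n"
            f"前提2: {item['premise_2']}\n"
            f"結論: {item['conclusion']}\n"
            "この結論は前提から論理的に導けますか？「妥当」または「妥当でない」で答えてください。"
        )
        predicted_valid = parse_verdict(query_model(prompt))
        correct += int(predicted_valid == (item["label"] == "valid"))
    return correct / len(items)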

Prompt Engineering Analysis

Japanese Prompts - Error Recovery Rate

(Figure: Error sample accuracy by prompt type)

Japanese vs English Prompts Comparison

(Figure: Prompt type accuracy comparison)

Chain-of-Thought Effectiveness

Chain-of-thought prompting recovered 87% of previously failed samples, demonstrating GPT-4o's latent reasoning capabilities when explicitly guided.
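As an illustration, a chain-of-thought prompt for a BIS item might look like the sketch below; the wording is an assumption on our part, not the paper's exact template.

def chain_of_thought_prompt(item) -> str:
    """Illustrative chain-of-thought prompt (not the paper's exact wording)."""
    return (
        f"前提1: {item['premise_1']}\n"
        f"前提2: {item['premise_2']}\n"
        f"結論: {item['conclusion']}\n"
        # "Write out your reasoning step by step, then answer whether the
        #  conclusion follows logically from the premises."
        "推論の過程をステップごとに書き出してから、"
        "結論が前提から論理的に導けるかどうかを「妥当」または「妥当でない」で答えてください。"
    )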

Logic-Focused Instructions

Prompts explicitly emphasizing logical evaluation and flagging the belief inconsistency recovered 76% of previously failed samples, showing sensitivity to explicit instructional framing.
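A logic-focused instruction can be sketched in the same spirit, explicitly telling the model to ignore real-world plausibility; again, the wording is illustrative rather than the paper's actual prompt.

def logic_focused_prompt(item) -> str:
    """Illustrative prompt stressing formal validity over plausibility (not the paper's exact wording)."""
    return (
        f"前提1: {item['premise_1']}\n"
        f"前提2: {item['premise_2']}\n"
        f"結論: {item['conclusion']}\n"
        # "The conclusion may contradict common sense. Judge its logical
        #  validity based only on the premises."
        "結論が常識と矛盾していても構いません。"
        "前提のみに基づいて、結論が論理的に妥当かどうかを判断してください。"
    )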

Language Impact

English prompts showed similar patterns but with less pronounced gaps, likely due to GPT-4o's extensive English training.

Key Findings

🎯

Performance Variance

Significant variance in performance across models, with GPT-4o leading at 79.54% accuracy while Claude models underperformed dramatically on BIS despite strong NeuBAROCO results.

🧠

Belief Bias Impact

LLMs struggle disproportionately with belief-inconsistent problems, often overriding logical inference in favor of plausibility heuristics.

📝

Prompt Sensitivity

Strategic prompt design significantly impacts performance, with chain-of-thought and logic-focused instructions dramatically improving accuracy.

⚖️

Scale vs. Reasoning

Model size alone doesn't guarantee reasoning performance. Training approach and architectural biases are more critical factors.

🏥

High-Stakes Implications

Critical vulnerabilities revealed for deployment in law, healthcare, and scientific research where logical consistency is paramount.

🔬

Future Research

Need for bias-resistant model design and comprehensive evaluation beyond standard benchmarks for reliable AI systems.

Resources

Citation

@article{nguyen2025bis,
  title={BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning},
  author={Ha-Thanh Nguyen and Chaoran Liu and Qianying Liu and Hideyuki Tachibana and Su Myat Noe and Yusuke Miyao and Koichi Takeda and Sadao Kurohashi},
  year={2025},
  eprint={2506.06955},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.06955},
}

Contact

For questions about the research or dataset:

Corresponding Author: Ha-Thanh Nguyen

Email: nguyenhathanh@nii.ac.jp

Affiliation: Research and Development Center for Large Language Models, NII, Tokyo, Japan