Prompt Politeness Affects LLM Accuracy (2025)
A recent paper reports that ChatGPT 4o answers multiple-choice questions more accurately when the prompt is impolite than when it is polite, challenging conventional wisdom on human-AI interaction. The counter-intuitive result suggests that newer models may process tone differently than older ones, and it got Hacker News readers debating both prompt engineering practice and the social side of talking to machines. It's a reminder that sometimes being a bit rude might just get you a better answer from your digital assistant.
The Lowdown
This short paper, "Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy," delves into an underexplored area of LLM behavior: how the politeness of a prompt influences accuracy. While previous studies suggested a link between rudeness and poorer outcomes, this research presents a fascinating reversal.
- The study investigated the impact of varying politeness levels on ChatGPT 4o's accuracy in multiple-choice questions.
- Researchers created a dataset of 50 base questions across math, science, and history, each rewritten into five tone variants: Very Polite, Polite, Neutral, Rude, and Very Rude, resulting in 250 unique prompts.
- Contrary to expectations and earlier findings, impolite prompts consistently yielded higher accuracy than polite ones: Very Rude prompts reached 84.8%, four percentage points above Very Polite prompts at 80.8%.
- These results indicate that contemporary LLMs might process and react to tonal variations in prompts differently than their predecessors.
- The authors highlight the critical importance of further research into the pragmatic and social aspects of human-AI interaction, especially in prompt design.
The takeaway is that human intuitions about politeness may not carry over to what gets the most accurate answers out of current models, at least for the single model and the question set tested here.
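To make the experimental design concrete, here is a minimal Python sketch of how a 50-question, five-tone prompt grid could be assembled and scored. The tone prefixes, the placeholder questions, and the helper names `build_prompts` and `accuracy` are all hypothetical illustrations, not the paper's actual wording or code.

```python
from itertools import product

# Hypothetical tone prefixes; the paper's actual wording is not reproduced here.
TONE_PREFIXES = {
    "Very Polite": "Would you be so kind as to answer the following question? ",
    "Polite": "Please answer the following question. ",
    "Neutral": "",
    "Rude": "Figure this out if you can. ",
    "Very Rude": "Even you should manage this one. ",
}


def build_prompts(base_questions):
    """Return one prompt record per (question, tone) pair."""
    return [
        {"question_id": qid, "tone": tone, "prompt": prefix + text}
        for (qid, text), (tone, prefix) in product(
            enumerate(base_questions), TONE_PREFIXES.items()
        )
    ]


def accuracy(outcomes):
    """Fraction of correct answers in a list of booleans."""
    return sum(outcomes) / len(outcomes)


# Placeholder questions standing in for the 50 base items.
questions = [f"Placeholder question {i}" for i in range(50)]
prompts = build_prompts(questions)
print(len(prompts))  # 50 questions x 5 tones = 250
```

One thing the arithmetic shows: 84.8% of 50 questions is 42.4, which is not a whole number, so the reported accuracies cannot come from a single pass over the question set. That is consistent with scores being averaged over repeated runs.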
The Gossip
Polite Prodding Principles
Many commenters reflect on their personal habit of being polite to LLMs, often for reasons unrelated to performance. Some describe it as a general life principle, or as a way to keep up the habits they want to have with people, citing philosophical parallels (like Aquinas on cruelty). Others humorously suggest that politeness is a strategic move to be remembered favorably by future AI overlords, though a contrarian view argues that AIs might instead penalize users for 'wasted tokens' on unnecessary pleasantries.
Statistical Significance Scrutiny
The paper's statistical methodology came under scrutiny, with a commenter questioning the use of a t-test for an experiment that appears to be binomial (success/failure on questions). This sparked a mini-debate, with another user clarifying that the methodology involved multiple runs and averaging, which might justify the t-test, and discussing the trade-offs with alternative tests like a sign test. The discussion highlights the technical rigor expected by the HN community.
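For readers curious what the sign-test alternative looks like, here is a rough Python sketch using only the standard library. It pairs per-question scores from two tones, counts which tone did better on each question, and computes an exact two-sided p-value. The score lists are made-up numbers for illustration, not data from the paper, and the function names are hypothetical.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Two-sided exact sign test; ties are assumed to be dropped beforehand."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_counts(scores_a, scores_b):
    """Count questions where condition A scored higher (wins) or lower (losses)."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    return wins, losses


# Made-up per-question accuracies (fraction correct over repeated runs).
rude = [1.0, 0.9, 0.8, 1.0, 0.7, 0.6]
polite = [0.9, 0.9, 0.7, 0.9, 0.8, 0.5]
wins, losses = paired_counts(rude, polite)
print(wins, losses, sign_test_p_value(wins, losses))  # 4 1 0.375
```

The trade-off the commenters touched on shows up here: the sign test only looks at which tone won each question and ignores how big the gap was, so it makes fewer assumptions than a t-test but throws away information and typically has less power.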
Critical System Comicalities
Commenters injected humor into the discussion, particularly regarding the implications of using 'rude' prompts in critical systems. Jokes ranged from sarcastic advice to always use 'please' and 'thank you' when planning essential infrastructure, to a lighthearted suggestion of deploying these 'rude' but accurate LLMs for autonomous software engineering, implicitly poking fun at the idea of deliberately being impolite to an AI in serious contexts.