Is It Possible to Make Chatbots Virtuous? Investigating a Virtue-Based Design Methodology Applied to LLMs

Authors: Matthew P. Lad, Louisa Conwill, Megan Levis Scheirer

Published: 2026-02-03

arXiv ID: 2602.03155v1

Added to Library: 2026-02-04 03:03 UTC

📄 Abstract

With the rapid growth of Large Language Models (LLMs), criticism of their societal impact has also grown. Work in Responsible AI (RAI) has focused on the development of AI systems aimed at reducing harm. Responding to RAI's criticisms and the need to bring the wisdom traditions into HCI, we apply Conwill et al.'s Virtue-Guided Technology Design method to LLMs. We cataloged new ethical design patterns for LLMs and evaluated them through interviews with technologists. Participants valued that the patterns provided more accuracy and robustness, better safety, new research opportunities, increased access and control, and reduced waste. Their concerns were that the patterns could be vulnerable to jailbreaking, were generalizing models too widely, and had potential implementation issues. Overall, participants reacted positively while also acknowledging the tradeoffs involved in ethical LLM design.

🔍 Key Points

  • Applied Conwill et al.'s Virtue-Guided Technology Design method to Large Language Models (LLMs), responding to criticisms raised in Responsible AI (RAI) and to calls to bring the wisdom traditions into HCI.
  • Cataloged new ethical design patterns for LLMs and evaluated them through interviews with technologists.
  • Participants valued the patterns for greater accuracy and robustness, better safety, new research opportunities, increased access and control, and reduced waste.
  • Participants raised concerns that the patterns could be vulnerable to jailbreaking, might generalize models too widely, and could face implementation issues.
  • Overall, participants reacted positively while acknowledging the tradeoffs involved in ethical LLM design.

💡 Why This Paper Matters

This paper brings a virtue-based design methodology to bear on LLMs at a moment when criticism of their societal impact is growing. Where Responsible AI has largely focused on reducing harm, the authors instead catalog concrete ethical design patterns grounded in Conwill et al.'s Virtue-Guided Technology Design method and validate them through interviews with technologists. This work matters because it offers a constructive, pattern-based complement to harm-reduction approaches, while candidly surfacing the tradeoffs involved in ethical LLM design.

🎯 Why It's Interesting for AI Security Researchers

This paper is relevant to AI security researchers because design-level interventions shape how LLMs behave under adversarial pressure. Notably, participants in the authors' interviews flagged vulnerability to jailbreaking as a concern with the proposed ethical design patterns, situating this work at the intersection of value-driven design and adversarial robustness. For researchers assessing model safety, the cataloged patterns and the practitioner-identified tradeoffs offer a complementary lens on how ethical design choices interact with security evaluations.

📚 Read the Full Paper