Papers
This is a collection of scientific papers I recommend reading. The following posts contain some of my thoughts from when I worked on these topics; perhaps you will find them as interesting as I do.
Large Language Model Exploitation
Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
Abstract Despite efforts to align large language models to produce harmless responses, they are still vulnerable to jailbreak prompts that elicit unrestricted behaviour. In this work, we investigate persona modulation as a black-box jailbreaking method to steer a target model to take on personalities that are willing to comply with …

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Abstract Safety is critical to the usage of large language models (LLMs). Multiple techniques such as data filtering and supervised fine-tuning have been developed to strengthen LLM safety. However, currently known techniques presume that corpora used for safety alignment of LLMs are solely interpreted by semantics. This assumption, however, does …