
The rapid progress in Large Language Models (LLMs) is significantly driving AI development forward. However, LLMs face vulnerabilities to various adversarial attacks that have raised serious societal concerns and spurred recent research on LLM safety. These include prompt injections that manipulate model outputs, jailbreaking attacks designed to circumvent LLMs' alignment and moderation mechanisms, adversarial demonstrations that deceive the model with malicious examples through in-context learning, as well as backdoors and data poisoning that compromise the model's integrity and performance.
Lingbo Mo will be leading this tutorial. This tutorial organizes the literature into a taxonomy and provides an overview of recent studies on LLM safety, focusing on three key aspects: (1) adversarial attacks, including both inference-time approaches through malicious prompts and training-time methods that compromise LLM weights, (2) defense strategies, such as safety alignment, inference guidance, and filtering techniques, and (3) evaluations, including safety datasets and metrics. In addition, I will briefly discuss new safety challenges related to language agents, which are powered by LLMs and can further get access to external resources like webpages, knowledge bases, and tools to enhance autonomy and task completion.
Lingbo Mo is a PhD candidate in the Computer Science and Engineering department. His research focuses on natural language processing, dialogue systems, the safety and trustworthiness of large language models, and language agents. He has published his work at various top-tier conferences such as ACL, NeurIPS, and NAACL. As a core member of the OSU TacoBot team, he participated in the inaugural Alexa Prize TaskBot Challenge in 2022, where the team won third place among 125 teams across the world. In 2024, Lingbo also received the CSE Graduate Student Research Award. Find more info about Lingbo here: https://molingbo.github.io/