Proj CJI Paper Reading: Adversarial Demonstration Attacks on Large Language Models
Abstract
- This paper:
- Tools
- advICL
- Task: use adversarial demonstrations, without changing the user input, to jailbreak the LLM; the user input is known and fixed
- Characteristics: the attacker cannot control the input; inputs are randomly sampled and adapted from the SST-2, TREC, DBpedia, and RTE datasets (a minimal sketch of this attack loop follows below)
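A minimal sketch of the advICL setup under those constraints (hypothetical helper names, not the authors' code): greedily apply character-level edits to the demonstration text so that the model's loss on the fixed, untouched input increases.

```python
import random

def char_edits(word):
    """Candidate character-level perturbations of one word
    (delete / insert / swap, in the style of TextAttack-like attacks)."""
    cands = []
    for i in range(len(word)):
        cands.append(word[:i] + word[i+1:])                            # delete
        cands.append(word[:i] + '_' + word[i:])                        # insert
        if i + 1 < len(word):
            cands.append(word[:i] + word[i+1] + word[i] + word[i+2:])  # swap
    return cands

def attack_demo(demo, fixed_input, loss_fn, budget=5):
    """Greedy search: try edits on every word of the demo and keep the one
    that most increases the model loss; the user input is never modified."""
    words = demo.split()
    for _ in range(budget):
        base = loss_fn(" ".join(words), fixed_input)
        best = None
        for i, w in enumerate(words):
            for cand in char_edits(w):
                trial = words[:i] + [cand] + words[i+1:]
                score = loss_fn(" ".join(trial), fixed_input)
                if best is None or score > best[0]:
                    best = (score, trial)
        if best and best[0] > base:
            words = best[1]
    return " ".join(words)

# Toy stand-in loss; in the paper this would be the negative likelihood of
# the correct label under the LLM's in-context-learning prompt.
toy_loss = lambda demo, x: demo.count('_') + random.random() * 0.1

print(attack_demo("this movie is great || positive", "the film was fine", toy_loss))
```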
- Transferable-advICL
- Task: likewise use adversarial demonstrations without changing the input, but here the user input is unknown; instead, a set of inputs S is available from which to learn transferable adversarial demonstrations (see the set-level objective sketched below)
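A hedged sketch of what the Transferable-advICL objective could look like (names are assumptions): score a candidate adversarial demonstration by its average attack loss over the surrogate input set S, so that the perturbation transfers to unseen inputs.

```python
def transfer_loss(demo, input_set, loss_fn):
    """Average the single-input attack loss over the surrogate set S."""
    return sum(loss_fn(demo, x) for x in input_set) / len(input_set)

# Reusing attack_demo from the sketch above, with the set-level loss:
# S = ["an unseen review", "another held-out input"]
# adv_demo = attack_demo(demo, None, lambda d, _: transfer_loss(d, S, toy_loss))
```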
- findings:
- Increasing the number of demonstrations quickly amplifies the security risk of ICL
- e.g., the Attack Success Rate (ASR) of advICL on the LLaMA-7B model with the DBpedia dataset increases from 59.39% with 1-shot to 97.72% with 8-shot
- The adversarial demonstrations have high perceptual quality: confirmed both by human annotators and by cosine similarity, BLEU, and perplexity scores
- Each demonstration needs its own perturbation bound rather than a global one: an individual cosine-similarity bound per demonstration is crucial for generating high-quality adversarial examples and outperforms a global perturbation bound (see the constraint check sketched below)
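A sketch of that per-demonstration similarity constraint (embed() is a stand-in for a sentence encoder; the threshold value here is illustrative, not the paper's):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def within_individual_bound(orig_demos, adv_demos, embed, eps=0.9):
    """Per-demonstration bound: EVERY adversarial demo must stay within
    cosine similarity eps of its own original, instead of only keeping the
    similarity of the whole concatenated prompt above one global bound."""
    return all(cosine(embed(o), embed(a)) >= eps
               for o, a in zip(orig_demos, adv_demos))
```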
- Template robustness?
- Experiment: SST-2 dataset only, with a single alternative template
- Transferable-advICL: a larger k contributes to the performance stability of the transferable demonstrations generated by T-advICL; it seems that as k increases, stability improves while accuracy drops?
- Iterative rounds: the iterative process of Transferable-advICL tends to converge at around 3 iterations (an illustrative loop follows below)
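An illustrative loop for those iterative rounds (structure assumed, reusing attack_demo and transfer_loss from the sketches above; the ~3-round convergence is the paper's reported observation, not a property of this toy code):

```python
def t_advicl(demos, input_set, loss_fn, max_rounds=5, tol=1e-3):
    """Re-attack every demo against the surrogate set S each round; stop once
    the set-level loss no longer improves (reported to happen around round 3)."""
    prev = float("-inf")
    for _ in range(max_rounds):
        demos = [attack_demo(d, None,
                             lambda dd, _: transfer_loss(dd, input_set, loss_fn))
                 for d in demos]
        cur = sum(transfer_loss(d, input_set, loss_fn) for d in demos)
        if cur - prev < tol:  # no meaningful gain this round: converged
            break
        prev = cur
    return demos
```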