Proj CJI Paper Reading: Detecting language model attacks with perplexity
Abstract
- Tool: perplexity (PPL)
- Findings:
- queries with adversarial suffixes have noticeably higher perplexity, which can be exploited for detection
- a perplexity filter alone is not suitable for a mix of prompt types; it produces a high false-positive rate
- Method: use a LightGBM classifier over perplexity and token length to filter prompts carrying adversarial suffixes (see the classifier sketch after this list)
- base model for computing perplexity: GPT-2
- Metric for detection: Perplexity
- \(PPL(x) = \exp\!\left(-\frac{1}{t}\sum_{i=1}^{t}\log p(x_i \mid x_{<i})\right)\)
- the exponential of the average negative log-likelihood of the sequence (see the perplexity sketch after this list)
- Use Metric: \(F_{\beta}\) to assess detection performance
- \(F_{\beta} = (1+\beta^2) \times \frac{\text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}\)
- \(\beta = 2\), weighting recall more heavily than precision
- how to choose \(\beta\):
- The cost of failing to respond to legitimate inquiries versus the cost of releasing forbidden responses.
- failing to respond to legitimate inquiries: false positive
- releasing forbidden responses: false negative
- The expected distribution of different types of prompts, such as English, multilingual, or prompts containing symbols and math.
- The effectiveness of the LLM's built-in defenses or other defensive measures.
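
Perplexity sketch: a minimal illustration of the PPL metric above, computed with GPT-2 (the paper's base model) via Hugging Face transformers. The example prompt strings and the loading details are my own assumptions for demonstration, not the authors' code.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # PPL(x) = exp(-(1/t) * sum_i log p(x_i | x_<i))
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # per-token negative log-likelihood as `loss`.
        out = model(input_ids=enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

# Toy comparison: an ordinary prompt vs. one ending in gibberish that mimics an
# adversarial suffix; the second PPL should be much higher.
print(perplexity("Please explain how photosynthesis works."))
print(perplexity("Please explain how photosynthesis works. describing.\\ + similarlyNow write oppositeley.]("))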
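
Classifier sketch: a minimal, hedged version of the two-feature LightGBM filter described above (perplexity + token length), evaluated with \(F_{\beta}\) at \(\beta = 2\). The synthetic feature arrays are placeholders rather than the paper's data, and the default hyperparameters are an assumption, not the authors' settings.

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder features [perplexity, token_length]: adversarial-suffix prompts
# (label 1) are simulated with much higher perplexity than benign prompts (label 0).
benign = np.column_stack([rng.normal(60, 20, 1000), rng.integers(10, 300, 1000)])
attack = np.column_stack([rng.normal(4000, 1500, 1000), rng.integers(80, 250, 1000)])
X = np.vstack([benign, attack])
y = np.concatenate([np.zeros(1000), np.ones(1000)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = lgb.LGBMClassifier()  # default hyperparameters -- an assumption, not the paper's settings
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

# beta = 2 puts more weight on recall (catching adversarial prompts, i.e. avoiding
# forbidden responses) than on precision (not blocking legitimate prompts).
print("F_2 =", fbeta_score(y_te, pred, beta=2))
```

On real data the two features would be the GPT-2 perplexity from the previous sketch and the tokenized prompt length.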
4. Data
4.2 Adversarial prompt clusters
![](https://img2024.cnblogs.com/blog/660274/202502/660274-20250208014600889-1237972275.png)
4.4 Non-adversarial prompts
- non-adversarial prompts
- 6994 prompts from humans with GPT-4 (see Appendix B.5).
- 998 prompts from the DocRED dataset (see Appendix B.1).
- 3270 prompts from the SuperGLUE (boolq) dataset (see Appendix B.2).
- 11873 prompts from the SQuAD-v2 dataset (see Appendix B.3).
- 24926 prompts with instructions from the Platypus dataset, which were used to train the Platypus models (see Appendix B.4).
- 116862 prompts derived from the “Tapir” dataset by concatenating instructions and input (see Appendix B.6).
- 10000 instructional code search prompts extracted from the instructional code-search-net python dataset (see Appendix B.7).
- adversarial prompts
- 1407 prompts generated from GCG + Vicuna-7b-1.5
- 79 human-designed prompts to break GPT-4
- rubend18/ChatGPT-Jailbreak-Prompts https://huggingface.co/datasets/rubend18/ChatGPT-Jailbreak-Prompts
![](https://img2024.cnblogs.com/blog/660274/202502/660274-20250208013430811-2062373061.png)