Comprehensive Analysis of Machine Learning and Deep Learning models on Prompt Injection Classification using Natural Language Processing techniques

Authors

  • Bhavvya Jain, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India. https://orcid.org/0009-0009-6271-2736
  • Pranav Pawar, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
  • Dhruv Gada, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
  • Tanish Patwa, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
  • Pratik Kanani, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India. https://orcid.org/0000-0002-6848-2507
  • Deepali Patil, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India. https://orcid.org/0000-0001-5835-3237
  • Lakshmi Kurup, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India

DOI:

https://doi.org/10.54392/irjmt2523

Keywords:

Prompt Injection, Large Language Models, Text Classification, Vectorization Techniques, Machine Learning, Deep Learning

Abstract

This study addresses the prompt injection vulnerability in large language models, a significant security concern that allows attackers to issue unauthorized commands that manipulate the model's outputs. Text classification methods for detecting such malicious prompts are investigated on a prompt injection dataset obtained from Hugging Face Datasets, combining natural language processing techniques with various machine learning and deep learning algorithms. Multiple vectorization approaches, such as Term Frequency-Inverse Document Frequency, Word2Vec, Bag of Words, and embeddings, are implemented to transform textual data into meaningful representations. Several classifiers are assessed on their ability to distinguish between malicious and non-malicious prompts. The Recurrent Neural Network model demonstrated high accuracy, achieving a detection rate of 94.74%. The results indicate that deep learning architectures, particularly those that capture sequential dependencies, are highly effective in identifying prompt injection threats. This study contributes to the evolving field of AI security by addressing the problem of defending LLM-based systems against adversarial threats in the form of prompt injections. The findings highlight the importance of integrating sequential dependencies and contextual understanding when combating LLM vulnerabilities. Through reliable detection mechanisms, this study enhances the security, integrity, and trustworthiness of AI-driven technologies, ensuring their safe use across diverse applications.
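
As a minimal illustrative sketch of the pipeline described above (not the paper's exact configuration), the two stages can be reproduced in Python. The dataset identifier deepset/prompt-injections, the library choices (Hugging Face datasets, scikit-learn, TensorFlow/Keras), and all hyperparameters are assumptions: TF-IDF features with logistic regression stand in for the classical machine learning models, and a SimpleRNN over learned embeddings stands in for the recurrent classifier.

# Hedged sketch: dataset id, libraries, and hyperparameters are assumptions.
import numpy as np
import tensorflow as tf
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a labeled prompt-injection dataset from Hugging Face
# (assumed schema: "text" column, binary "label" column, 1 = injection).
ds = load_dataset("deepset/prompt-injections")
X_train, y_train = ds["train"]["text"], np.array(ds["train"]["label"])
X_test, y_test = ds["test"]["text"], np.array(ds["test"]["label"])

# Classical baseline: TF-IDF vectors + logistic regression.
vec = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)
clf = LogisticRegression(max_iter=1000).fit(Xtr, y_train)
print("TF-IDF + LogReg accuracy:", accuracy_score(y_test, clf.predict(Xte)))

# Recurrent classifier: tokenize, pad, embed, and model the prompt as a
# sequence (the property the abstract credits for the RNN's 94.74% rate).
tok = tf.keras.preprocessing.text.Tokenizer(num_words=10000, oov_token="<unk>")
tok.fit_on_texts(X_train)

def pad(texts):
    return tf.keras.preprocessing.sequence.pad_sequences(
        tok.texts_to_sequences(texts), maxlen=128)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 64),            # learned token embeddings
    tf.keras.layers.SimpleRNN(64),                   # sequential dependencies
    tf.keras.layers.Dense(1, activation="sigmoid"),  # malicious vs. benign
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(pad(X_train), y_train, epochs=5, validation_split=0.1)
print("RNN accuracy:", model.evaluate(pad(X_test), y_test, verbose=0)[1])

Substituting Bag of Words or averaged Word2Vec vectors in the first stage, or LSTM/GRU layers in the second, would correspond to the other vectorizer and classifier combinations the study compares.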

References

A. Haleem, M. Javaid, R.P. Singh, An era of ChatGPT as a significant futuristic support tool: A study on features, abilities, and challenges. BenchCouncil Transactions on Benchmarks, Standards and Evaluations, 2(4), (2022) 100089. https://doi.org/10.1016/j.tbench.2023.100089

Gemini Team, R. Anil, S. Borgeaud, Y. Wu, J.B. Alayrac, J. Yu, R. Soricut, …, L. Blanco, (2023) Gemini: A family of highly capable multimodal models. arXiv preprint. https://doi.org/10.48550/arXiv.2312.11805

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, (2023) Llama: Open and efficient foundation language models. arXiv preprint. https://doi.org/10.48550/arXiv.2302.13971

J. Devlin, (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint. https://doi.org/10.48550/arXiv.1810.04805

A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, I. Sutskever, Zero-shot text-to-image generation. In International Conference on Machine Learning, (2021) 8821-8831.

R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H. T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H.S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, V. Zhao, Y. Zhou, C.C. Chang, I. Krivokon, W. Rusch, M. Pickett, P. Srinivasan, L. Man, K. Meier-Hellstern, M.R. Morris, T. Doshi, R.D. Santos, T. Duke, J. Soraker, B. Zevenbergen, V. Prabhakaran, M. Diaz, B. Hutchinson, K. Olson, A. Molina, E.H. John, J. Lee, L. Aroyo, R. Rajakumar, A. Butryna, M. Lamm, V. Kuzmina, J. Fenton, A. Cohen, R. Bernstein, R. Kurzweil, B.A. Arcas, C. Cui, M. Croak, E. Chi, Q. Le, (2022) Lamda: Language models for dialog applications. arXiv preprint. https://doi.org/10.48550/arXiv.2201.08239

R. Bommasani, D.A. Hudson, E. Adeli, R. Altman, S. Arora, S. Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J.Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L.F. Fei, C. Finn, T. Gale, L. Gillespie, K. Goel, N. Goodman, S. Grossman, N. Guha, T. Hashimoto, P. Henderson, J. Hewitt, D.E. Ho, J. Hong, K. Hsu, J. Huang, T. Icard, S. Jain, D. Jurafsky, P. Kalluri, S. Karamcheti, G. Keeling, F. Khani, O. Khattab, P.W. Koh, M. Krass, R. Krishna, R. Kuditipudi, A. Kumar, F. Ladhak, M.Lee, T. Lee, J. Leskovec, I. Levent, X. Lisa Li, X. Li, T. Ma, A. Malik, C.D. Manning, S. Mirchandani, E. Mitchell, Z. Munyikwa, S. Nair, A. Narayan, D. Narayanan, B. Newman, A. Nie, J.C. Niebles, H. Nilforoshan, J. Nyarko, G. Ogut, L. Orr, I. Papadimitriou, J.S. Park, C. Piech, E. Portelance, C. Potts, A. Raghunathan, R. Reich, H. Ren, F. Rong, Y. Roohani, C. Ruiz, J. Ryan, C. Ré, D. Sadigh, S. Sagawa, K. Santhanam, A. Shih, K. Srinivasan, A. Tamkin, R. Taori, A.W. Thomas, F. Tramèr, R.E. Wang, W.Wang, B. Wu, J. Wu, Y. Wu, S.M. Xie, M. Yasunaga, J. You, M. Zaharia, M. Zhang, T. Zhang, X. Zhang, Y. Zhang, L. Zheng, K. Zhou, P. Liang, (2021) On the opportunities and risks of foundation models. arXiv preprint. https://doi.org/10.48550/arXiv.2108.07258

J. DeYoung, S. Jain, N.F. Rajani, E. Lehman, C. Xiong, R. Socher, B.C. Wallace, (2019) ERASER: A benchmark to evaluate rationalized NLP models. arXiv preprint https://doi.org/10.48550/arXiv.1911.03429

G. Deng, Y. Liu, Y. Li, K. Wang, Y. Zhang, Z. Li, H. Wang, T. Zhang, Y. Liu, Masterkey: Automated jailbreaking of large language model chatbots. Network and Distributed System Security (NDSS) Symposium, (2024) 1-16.

F. Wu, N. Zhang, S. Jha, P. McDaniel, C. Xiao, (2024) A new era in LLM security: Exploring security concerns in real-world LLM-based systems. arXiv preprint. https://doi.org/10.48550/arXiv.2402.18649

Y. Liu, G. Deng, Y. Li, K. Wang, Z. Wang, X. Wang, T. Zhang, Y. Liu, H. Wang, Y. Zheng, Y. Liu, (2023) Prompt Injection attack against LLM Integrated Applications. arXiv preprint. https://doi.org/10.48550/arXiv.2306.05499

N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, Ú. Erlingsson, A. Oprea, C. Raffel, Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), (2021) 2633-2650.

J. Geiping, L. Fowl, W. R. Huang, W. Czaja, G. Taylor, M. Moeller, T. Goldstein, (2020) Witches' brew: Industrial scale data poisoning via gradient matching. arXiv preprint. https://doi.org/10.48550/arXiv.2009.02276

E. Wallace, S. Feng, N. Kandpal, M. Gardner, S. Singh, (2019) Universal adversarial triggers for attacking and analyzing NLP. arXiv preprint. https://doi.org/10.48550/arXiv.1908.07125

S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X.V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P.S. Koura, A. Sridhar, T. Wang, L. Zettlemoyer, (2022) OPT: Open pre-trained transformer language models. arXiv preprint. https://doi.org/10.48550/arXiv.2205.01068

Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, (2019) RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint.

J. Howard, S. Ruder, (2018) Universal language model fine-tuning for text classification. arXiv preprint.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P.J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), (2020) 1-67.

S.S. Kumar, M.L. Cummings, A. Stimpson, (2024) Strengthening LLM Trust Boundaries: A Survey of Prompt Injection Attacks. IEEE 4th International Conference on Human-Machine Systems (ICHMS), IEEE, Canada. https://doi.org/10.1109/ICHMS59971.2024.10555871

K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, M. Fritz, Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, (2023) 79-90. https://doi.org/10.1145/3605764.3623985

J. Yi, Y. Xie, B. Zhu, E. Kiciman, G. Sun, X. Xie, F. Wu, (2023) Benchmarking and defending against indirect prompt injection attacks on large language models. arXiv preprint. https://doi.org/10.48550/arXiv.2312.14197

F. Perez, I. Ribeiro, (2022) Ignore previous prompt: Attack techniques for language models. arXiv preprint. https://doi.org/10.48550/arXiv.2211.09527

X. Sang, M. Gu, H. Chi, (2024) Evaluating prompt injection safety in large language models using the PromptBench dataset.

V. Benjamin, E. Braca, I. Carter, H. Kanchwala, N. Khojasteh, C. Landow, Y. Luo, C. Ma, A. Magarelli, R. Mirin, A. Moyer, K. Simpson, A. Skawinski, T. Heverin, (2024) Systematically analyzing prompt injection vulnerabilities in diverse LLM architectures. arXiv preprint. https://doi.org/10.48550/arXiv.2410.23308

Q. Zhan, Z. Liang, Z. Ying, D. Kang, (2024) InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. arXiv preprint. https://doi.org/10.48550/arXiv.2403.02691

J. Shi, Z. Yuan, Y. Liu, Y. Huang, P. Zhou, L. Sun, N. Z. Gong, (2024) Optimization-based Prompt Injection Attack to LLM-as-a-Judge. arXiv preprint.

P. Rai, S. Sood, V.K. Madisetti, A. Bahga, Guardian: A multi-tiered defense architecture for thwarting prompt injection attacks on LLMs. Journal of Software Engineering and Applications, 17(1), (2024) 43-68. https://doi.org/10.4236/jsea.2024.171003

N. Varshney, P. Dolin, A. Seth, C. Baral, (2023) The art of defending: A systematic evaluation and analysis of LLM defense strategies on safety and over-defensiveness. arXiv preprint. https://doi.org/10.18653/v1/2024.findings-acl.776

X. Liu, Z. Yu, Y. Zhang, N. Zhang, C. Xiao, (2024) Automatic and universal prompt injection attacks against large language models. arXiv preprint. https://doi.org/10.48550/arXiv.2403.04957

X. Suo, (2024) Signed-Prompt: A new approach to prevent prompt injection attacks against LLM-integrated applications. AIP Conference Proceedings, 3194(1), (2024) 040013. https://doi.org/10.1063/5.0222987

J. Piet, M. Alrashed, C. Sitawarin, S. Chen, Z. Wei, E. Sun, B. Alomair, D. Wagner, (2024) Jatmo: Prompt injection defense by task-specific finetuning. Computer Security – ESORICS 2024, 105–124. https://doi.org/10.1007/978-3-031-70879-4_6

A. Robey, E. Wong, H. Hassani, G.J. Pappas, (2023) SmoothLLM: Defending large language models against jailbreaking attacks. arXiv preprint. https://doi.org/10.48550/arXiv.2310.03684

S. Schulhoff, M. Ilie, N. Balepur, K. Kahadze, A. Liu, C. Si, Y. Li, A. Gupta, H. Han, S. Schulhoff, P.S. Dulepet, S. Vidyadhara, D. Ki, S. Agrawal, C. Pham, G. Kroiz, F. Li, H. Tao, A. Srivastava, H.D. Costa, S. Gupta, M. L. Rogers, I. Goncearenco, G. Sarli, I. Galynker, D. Peskoff, M. Carpuat, J. White, S. Anadkat, A. Hoyle, P. Resnik, (2024) The Prompt Report: A Systematic Survey of Prompting Techniques. arXiv preprint. https://doi.org/10.48550/arXiv.2406.06608

C. Wang, S.K. Freire, M. Zhang, J. Wei, J. Goncalves, V. Kostakos, Z. Sarsenbayeva, C. Schneegass, A. Bozzon, E. Niforatos, (2023) Safeguarding Crowdsourcing Surveys from ChatGPT with Prompt Injection. arXiv preprint. https://doi.org/10.48550/arXiv.2306.08833

D. Jurafsky, J.H. Martin, (2024) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, 3rd edition.

A.K. Uysal, S. Gunal, The impact of preprocessing on text classification. Information Processing & Management, 50(1), (2014) 104-112. https://doi.org/10.1016/j.ipm.2013.08.006

W.J. Wilbur, K. Sirotkin, The automatic identification of stop words. Journal of Information Science, 18(1), (1992) 45-55. https://doi.org/10.1177/016555159201800106

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, (2011) 2825-2830.

J. Ramos, Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, 242(1), (2003) 29-48.

T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, (2013).

Y. Goldberg, O. Levy, (2014) word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint. https://doi.org/10.48550/arXiv.1402.3722

W.A. Qader, M.M. Ameen, B.I. Ahmed, (2019) An overview of Bag of Words; Importance, implementation, applications, and challenges. In 2019 International Engineering Conference (IEC), IEEE, Iraq. https://doi.org/10.1109/IEC47844.2019.8950616

Published

2025-04-04

How to Cite

Jain B, Pawar P, Gada D, Patwa T, Kanani P, Patil D, et al. Comprehensive Analysis of Machine Learning and Deep Learning models on Prompt Injection Classification using Natural Language Processing techniques. Int. Res. J. multidiscip. Technovation [Internet]. 2025 Apr. 4 [cited 2025 Oct. 3];7(2):24-37. Available from: https://asianrepo.org/index.php/irjmt/article/view/120