NETWORK INTRUSION DETECTION USING BIG DATA ANALYTICS: A PYSPARK AND HIVE APPROACH FOR UNSW-NB15
Abstract
Network intrusion detection remains a critical challenge as cyber threats continue to grow in complexity and scale. This study investigates the application of big data analytics to intrusion detection using the UNSW-NB15 dataset. Apache Hive was used for large-scale querying and feature analysis, while PySpark was used for advanced analytics, including descriptive statistics, correlation analysis, hypothesis testing, and dimensionality reduction. A Random Forest (RF) classifier was developed and evaluated on both binary and multi-class intrusion detection tasks. The classifier achieved 99.99% accuracy in binary classification and 98.62% accuracy in multi-class classification, highlighting the effectiveness of combining Hive and PySpark for scalable intrusion detection. These findings underscore the importance of big data frameworks in strengthening cybersecurity defence systems.
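The two accuracy figures are measured over different label spaces, which is why the binary score can exceed the multi-class one: a prediction that names the wrong attack category is an error in the multi-class task but still a correct "attack" call in the binary task. The following is a minimal sketch of that relationship in plain Python rather than PySpark. It assumes the benign class is named "Normal" and that the binary label is derived by collapsing every attack category to 1; the sample labels are invented for illustration and are not results from the study.

```python
from typing import List, Sequence

NORMAL = "Normal"  # assumed name of the benign class in the attack-category column


def accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def to_binary(labels: Sequence[str]) -> List[int]:
    """Collapse attack categories to 0 (normal) or 1 (any attack)."""
    return [0 if label == NORMAL else 1 for label in labels]


# Toy data, invented for illustration only (not results from the study).
true_cat = ["Normal", "DoS", "Exploits", "Normal", "Fuzzers", "Generic"]
pred_cat = ["Normal", "Exploits", "Exploits", "Normal", "Fuzzers", "Normal"]

# Multi-class: "DoS" predicted as "Exploits" counts as an error.
multi_acc = accuracy(true_cat, pred_cat)

# Binary: the same prediction is correct, since both labels mean "attack".
binary_acc = accuracy(to_binary(true_cat), to_binary(pred_cat))

print(f"multi-class accuracy: {multi_acc:.3f}")
print(f"binary accuracy:      {binary_acc:.3f}")
```

On this toy sample, four of six category predictions match, while five of six binary predictions match, so the binary accuracy is the higher of the two.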
Keywords:
Cybersecurity, Network Intrusion Detection, Big Data Analytics, Apache Spark, Apache Hive, UNSW-NB15, Random Forest, Machine Learning
DOI: https://doi.org/10.5281/zenodo.17135943
License
Copyright (c) 2025 Daniel U. Okon

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.