Comparative Analysis of Algorithms for Sensitive Outlier Protection in Privacy Preserving Data Mining

Muhammad Ikhwan Burhan, Andi Nurfadillah Ali, A. Inayah Auliyah, Muhaimin Hading

Abstract


Data mining is a crucial method in the realm of Big Data for extracting valuable predictive insights from extensive datasets. In the contemporary digital landscape, a significant difficulty is preserving individual privacy during data mining, particularly in safeguarding sensitive outliers that may contain personal information. Outliers are data points that diverge markedly from the overall trend and frequently carry highly specific or sensitive information. This paper compares the efficacy of several clustering algorithms used for outlier detection: PAM (Partitioning Around Medoids), CLARA (Clustering Large Applications), CLARANS (Clustering Large Applications Based on Randomised Search), and ECLARANS (Enhanced CLARANS). The study aims to evaluate how well each algorithm identifies outliers and to examine the usefulness of the privacy protection strategy employed, the Gaussian Perturbation Random method. The experiment uses two health datasets: the Diabetes Dataset from the National Institute of Diabetes and Digestive and Kidney Diseases and the Wisconsin Breast Cancer Dataset. The two datasets were chosen because of their multivariate features, which exhibit adequate data variation for outlier detection. The results indicate that the CLARA algorithm identified more outliers than the other algorithms, with the diabetes dataset yielding the highest count (65 outliers). CLARA's advantage in identifying outliers within extensive datasets is attributed to its use of a sampling methodology. In contrast, the PAM, CLARANS, and ECLARANS algorithms identified the same number of outliers in both datasets. ECLARANS showed superior time efficiency on the diabetes dataset, whereas CLARA was the most efficient on the breast cancer dataset.
The Gaussian Perturbation Random technique was employed to protect the identified sensitive outliers. The findings indicate that this strategy maintains privacy without compromising detection accuracy. The method provides a dependable means of safeguarding individual privacy in health data mining, a domain characterised by significant privacy concerns.
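
As a rough illustration of the general idea of perturbing detected outliers with Gaussian noise, the sketch below adds zero-mean noise only to rows that have already been flagged as outliers. It is not the authors' implementation: the function name `perturb_outliers`, the `scale` parameter, the choice to scale noise by each column's standard deviation, and the toy data are all assumptions made for this example.

```python
import random
import statistics


def perturb_outliers(rows, outlier_idx, scale=0.1, seed=0):
    """Return a copy of `rows` with Gaussian noise added to flagged rows only.

    Assumption for this sketch: the noise for each column has mean 0 and
    standard deviation `scale` times that column's population standard
    deviation computed over all rows.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n_cols = len(rows[0])
    col_std = [statistics.pstdev([r[j] for r in rows]) for j in range(n_cols)]
    flagged = set(outlier_idx)

    result = []
    for i, row in enumerate(rows):
        if i in flagged:
            # Sensitive outlier: mask each value with column-scaled noise.
            result.append(
                [v + rng.gauss(0.0, scale * col_std[j]) for j, v in enumerate(row)]
            )
        else:
            # Non-outlier rows are copied unchanged.
            result.append(list(row))
    return result


# Toy data (hypothetical): the last row is treated as an already-detected outlier.
toy_rows = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [50.0, 90.0]]
protected = perturb_outliers(toy_rows, outlier_idx=[3], scale=0.1, seed=42)
```

In this sketch a larger `scale` hides the original outlier values more strongly but distorts the data more, which is the privacy and accuracy trade-off the abstract refers to.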





DOI: https://doi.org/10.33365/jtk.v19i2.4754



Copyright (c) 2025 Muhammad Ikhwan Burhan, Andi Nurfadillah Ali, A. Inayah Auliyah, Muhaimin Hading

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


Jurnal Tekno Kompak
Published by Universitas Teknokrat Indonesia
Organized by the D3 Accounting Information Systems Study Program (Program Studi D3 Sistem Informasi Akuntansi), Universitas Teknokrat Indonesia
Jl. Zainal Abidin Pagaralam, No.9-11, Labuhanratu, Bandarlampung, Indonesia
Telephone: 0721 70 20 22
W : http://ejurnal.teknokrat.ac.id/index.php/teknokompak
E  : teknokompak@teknokrat.ac.id.


