DEV Community

Rescuing the Signal: How PCA Salvages Accuracy from Catastrophic Data Poisoning

The project investigates the "Garbage In, Garbage Out" problem in machine learning using the Scikit-Learn Digits Dataset of handwritten digits. The dataset was intentionally corrupted with high levels of Gaussian noise to simulate real-world data imperfections.

Three machine learning models, Gaussian Naive Bayes, K-Nearest Neighbors, and a Multi-Layer Perceptron, were tested on the noisy data. The performance of all three dropped drastically, to near-random guessing, highlighting the impact of poor data quality.

Principal Component Analysis (PCA) was then employed to denoise the data and mitigate the effects of the added noise. Configured to retain 80% of the variance, PCA effectively filtered out the random noise. The KNN and MLP models recovered a significant portion of their accuracy after PCA was applied; Gaussian Naive Bayes improved as well, but did not recover as fully due to its assumption of independent pixels.

The project demonstrates the effectiveness of PCA as a data remediation technique and underscores the importance of data cleaning and pre-processing for robust, reliable model performance. Future research could explore Convolutional Neural Networks, which might eliminate the need for a separate denoising step. The project's code is available on the author's GitHub repository.
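The experiment described above can be sketched in a few lines of scikit-learn. This is a minimal reconstruction, not the author's actual code: the noise level, train/test split, and model hyperparameters are assumed for illustration, while the 80% variance threshold comes from the article.

```python
# Sketch of the "noise -> PCA -> recovery" experiment.
# Assumed settings: noise_std, test_size, random seeds, MLP hyperparameters.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)  # pixel values range from 0 to 16

# Simulate catastrophic "data poisoning" with heavy additive Gaussian noise.
noise_std = 8.0  # assumed noise level, comparable to the pixel range itself
X_noisy = X + rng.normal(0.0, noise_std, size=X.shape)

X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.25, random_state=0, stratify=y
)

# PCA keeping 80% of the variance: a float n_components tells scikit-learn
# to choose the smallest number of components reaching that variance ratio.
pca = PCA(n_components=0.80).fit(X_train)
X_train_d = pca.transform(X_train)
X_test_d = pca.transform(X_test)

models = {
    "GaussianNB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
}

for name, model in models.items():
    noisy_acc = model.fit(X_train, y_train).score(X_test, y_test)
    denoised_acc = model.fit(X_train_d, y_train).score(X_test_d, y_test)
    print(f"{name}: noisy={noisy_acc:.2f}  after PCA={denoised_acc:.2f}")
```

Because the added noise is isotropic (spread evenly across all 64 pixel directions) while the digit signal concentrates in a few high-variance directions, truncating to the top components discards proportionally more noise than signal.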