Synthetic data is generated to mimic real data, and its quality determines whether it can stand in for the original. That quality can be assessed along three principles: individual plausibility, usefulness, and privacy. Individual plausibility asks how realistic each single sample is; usefulness asks whether the dataset as a whole preserves the statistical properties and relationships needed for downstream tasks; and privacy requires that the synthetic data not leak confidential information from the original data.
To evaluate synthetic data, several methods can be employed. At the sample level, the question can be framed as a binary classification problem: train a classifier to distinguish synthetic samples from real ones. If the classifier achieves high accuracy, the synthetic samples are easy to tell apart and therefore not realistic enough; if it performs no better than chance, they are individually plausible. At the dataset level, statistical distribution tests and visual inspections can be used to compare the synthetic and real datasets.
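The sample-level check can be sketched as a "discriminator" test. This is a minimal illustration, not the article's exact code: the real and synthetic arrays below are toy stand-ins drawn from slightly different Gaussians. An AUC near 0.5 means the classifier cannot separate the two sources (good); an AUC near 1.0 means the synthetic samples are easy to spot.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=(1000, 4))       # stand-in for real data
synthetic = rng.normal(loc=0.5, scale=1.2, size=(1000, 4))  # stand-in for synthetic data

# Label each row by its source: 0 = real, 1 = synthetic
X = np.vstack([real, synthetic])
y = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"real-vs-synthetic AUC: {auc:.2f}")
# Here the AUC lands well above 0.5, because the toy "synthetic"
# distribution is deliberately shifted away from the "real" one.
```

Using AUC rather than raw accuracy makes the test robust if the two classes are of unequal size.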
The Synthetic Data Vault (SDV) package can be used to generate the synthetic data, and several methods can then assess its quality: classification evaluation, univariate distribution tests, visual inspection using dimension-reduction techniques, and comparing the performance of models trained on synthetic versus real data.
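The univariate distribution test can be sketched with a two-sample Kolmogorov-Smirnov test per column. This assumes both datasets are DataFrames with matching columns; the column names and data below are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Toy stand-ins: "age" matches across datasets, "income" does not
real = pd.DataFrame({"age": rng.normal(40, 10, 500),
                     "income": rng.normal(50, 15, 500)})
synth = pd.DataFrame({"age": rng.normal(40, 10, 500),
                      "income": rng.normal(70, 15, 500)})

results = {}
for col in real.columns:
    stat, p_value = ks_2samp(real[col], synth[col])
    results[col] = p_value
    # p >= 0.05: no evidence the two distributions differ
    print(f"{col}: KS p-value = {p_value:.3f}")
```

A low p-value for a column, as for "income" here, flags a marginal distribution the generator failed to reproduce.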
In the example provided, the synthetic data generated with the Synthetic Data Vault failed to fool the classifier, indicating that the synthetic samples are not realistic enough. The univariate distribution tests showed that only two of the four variables had similar distributions in the real and synthetic datasets, and visual inspection using dimension-reduction techniques further highlighted the differences between the two datasets.
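The dimension-reduction inspection can be sketched with PCA: project both datasets into 2D using components fitted on the real data, then compare where the two point clouds land. Normally one would scatter-plot the projections; this sketch (with toy stand-in arrays) just compares the projected centroids numerically.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=(500, 6))       # stand-in for real data
synthetic = rng.normal(1.0, 1.0, size=(500, 6))  # stand-in for synthetic data

# Fit PCA on the real data only, then project both datasets into 2D
pca = PCA(n_components=2).fit(real)
real_2d = pca.transform(real)
synth_2d = pca.transform(synthetic)

centroid_gap = np.linalg.norm(real_2d.mean(axis=0) - synth_2d.mean(axis=0))
print(f"distance between projected centroids: {centroid_gap:.2f}")
# A gap that is large relative to the spread of the real cloud signals
# that the synthetic data occupies a different region of feature space.
```

t-SNE or UMAP can be substituted for PCA when the structure is nonlinear; the inspection logic is the same.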
Finally, comparing models trained on synthetic data with models trained on real data showed that the synthetic data failed to capture complex relationships between features, making it unusable for prediction tasks.
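This last comparison is the "train on synthetic, test on real" pattern, sketched below under assumed toy data: the real target depends on a feature interaction, while the stand-in "synthetic" set matches the marginals but breaks that interaction. Fitting the same model on each training set and scoring both on held-out real data exposes the gap.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_real = rng.normal(size=(1000, 5))
# Real target depends on an interaction between features 0 and 1
y_real = (X_real[:, 0] * X_real[:, 1] > 0).astype(int)

# A toy "synthetic" set: same marginals, but the feature-target
# relationship is destroyed (labels are random)
X_synth = rng.normal(size=(1000, 5))
y_synth = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X_real, y_real, random_state=0)

model_real = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
model_synth = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_synth, y_synth)

acc_real = accuracy_score(y_te, model_real.predict(X_te))
acc_synth = accuracy_score(y_te, model_synth.predict(X_te))
print(f"trained on real: {acc_real:.2f}, trained on synthetic: {acc_synth:.2f}")
```

A large gap between the two scores, as engineered here, is exactly the signal that the synthetic data missed relationships the downstream model needs.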
