Synthetic Data Generation: Tabular Models and Privacy

Synthetic Data Generation (SDG) is the process of creating artificial data that maintains the statistical properties of a real dataset while protecting sensitive individual records. In tabular data contexts, this is primarily achieved through generative models that learn the joint probability distribution of the input features.

1. The SDV Framework (Synthetic Data Vault)

The **Synthetic Data Vault (SDV)** is the industry-standard ecosystem for generating synthetic tabular data. It provides high-level abstractions for training and evaluating generative models.

1.1 CTGAN (Conditional Tabular GAN)

Standard GANs struggle with tabular data due to non-Gaussian distributions and categorical variables. **CTGAN** addresses this via:

* **Mode-specific Normalization:** Handles multi-modal continuous distributions using a Variational Gaussian Mixture model (VGM).

* **Conditional Generator:** Allows sampling from specific categories (e.g., "Gender=Female") to ensure the discriminator sees enough minority class samples during training.

1.2 TVAE (Tabular Variational Autoencoder)

TVAE uses a modified VAE architecture to handle the discrete/continuous mix of tabular data. It is often faster to train than CTGAN and more robust to "Mode Collapse," though it may produce slightly "blurrier" statistical relationships.

2. Differential Privacy (DP) in Synthesis

Generating data from a model trained on private records still carries the risk of **Membership Inference Attacks**. To mitigate this, we integrate **Differential Privacy**.

2.1 The DP Mechanism

We inject calibrated noise into the training process (e.g., via DP-SGD).

* **Epsilon ($\epsilon$):** The "privacy budget." Smaller $\epsilon$ means more privacy but lower data fidelity.

* **Delta ($\delta$):** The probability of the privacy guarantee failing (usually set to $10^{-5}$ or lower).

2.2 The Fidelity-Privacy Trade-off

A strictly private model ($\epsilon < 1$) will often fail to capture complex correlations (e.g., the relationship between age and income). High-fidelity models ($\epsilon > 10$) provide great utility for ML training but offer weaker legal/mathematical guarantees against re-identification.

3. Evaluation Metrics

1. **Statistical Fidelity:** Using metrics like the Kolmogorov-Smirnov test for continuous columns and Chi-Squared for categorical ones.

2. **ML Efficacy:** Train a model on synthetic data, test it on real data. If the performance is comparable to training on real data, the synthesis is successful.

3. **Privacy Protection:** Measuring "Distance to Nearest Neighbor" (DCR) to ensure synthetic rows aren't just slightly perturbed copies of real rows.

4. Practical Implementation

Using SDV, a typical workflow involves:

1. **Metadata Detection:** Defining primary keys and column types.

2. **Model Training:** Fitting a `CTGAN` or `GaussianCopula` synthesizer.

3. **Sampling:** Generating a new dataset of arbitrary size.

4. **Audit:** Running a `DiagnosticReport` to verify statistical adherence.