What is the information leakage of an iterative randomized learning algorithm about its training data, when the internal state of the algorithm is private? How much does each individual training epoch contribute to the information leakage through the released model? We study this problem for noisy gradient descent algorithms, and model the dynamics of Rényi differential privacy loss throughout the training process. Our analysis traces a provably tight bound on the Rényi divergence between the pair of probability distributions over parameters of models trained on neighboring datasets. We prove that the privacy loss converges exponentially fast for smooth and strongly convex loss functions, which is a significant improvement over composition theorems (which over-estimate the privacy loss by upper-bounding its total value over all intermediate gradient computations). For Lipschitz, smooth, and strongly convex loss functions, we prove optimal utility with a small gradient complexity for noisy gradient descent algorithms. We provide additional details about our analysis in the full version of the paper.
Machine learning models leak a significant amount of information about their training data through their parameters and predictions.
The general method to compute the differential privacy bound for gradient perturbation-based learning algorithms is to view the process as a sequence of (identical) differential privacy mechanisms, and to compute the composition of their bounds. However, this over-estimates the privacy loss of the released model when the intermediate updates are never released.
We present a novel analysis of the privacy dynamics of noisy gradient descent with smooth and strongly convex loss functions. We construct a pair of continuous-time Langevin diffusion processes that trace the noisy gradient descent updates on two neighboring datasets, and track how the Rényi divergence between them evolves over time.
This guarantee shows that the privacy loss converges exponentially in the number of iterations $K$, instead of growing proportionally with $K$ as in the composition-based analysis of the same algorithms. Our bound captures a strong privacy amplification due to the dynamics (and convergence) of differential privacy over the noisy gradient descent algorithm with private internal state.
Let $\D = (\x_1, \x_2, \cdots, \x_\size)$ be a dataset of size $\size$ with records taken from a universe $\Univ$. For a given machine learning algorithm, let ${\loss{\x}{\ptheta} : \Univ \times \thetaspace \rightarrow \R}$ be the loss of a parameter vector $\ptheta \in \C$ on the data point $\x$, where $\C$ is a closed convex set (which can be all of $\mathbb{R}^d$).
A generic formulation of the optimization problem for learning the model parameters is empirical risk minimization (ERM) with the following objective, where $\Loss_\D(\ptheta)$ is the empirical loss of the model with parameter vector $\ptheta$ on a dataset $\D$.
\[\ptheta^* = \underset{\ptheta \in \C}{\arg \min}~\Loss_\D(\ptheta), \quad \text{where} \quad \Loss_\D(\ptheta)=\frac{1}{\size} \sum_{\x \in \D}\loss{\x}{\ptheta}.\]Releasing this optimization output (i.e., $\ptheta^*$) can leak information about the dataset $\D$, hence violating data privacy. To mitigate this risk, there exist randomized algorithms to ensure that the ($\q$-Rényi) privacy loss of the ERM algorithm is upper-bounded by $\eps$, i.e., the algorithm satisfies ${(\q, \eps)\text{-RDP}}$.
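For concreteness, the snippet below instantiates this ERM objective with a simple per-example squared loss, $\loss{\x}{\ptheta} = \frac{1}{2}\norm{\ptheta - \x}^2$; this particular loss, the toy dataset, and the function names are illustrative assumptions on our part (for this loss, the exact ERM minimizer is simply the dataset mean).

```python
import numpy as np

def empirical_loss(theta, data):
    """L_D(theta) = (1/n) * sum_i 0.5 * ||theta - x_i||^2 (illustrative squared loss)."""
    return 0.5 * np.mean(np.sum((data - theta) ** 2, axis=1))

def erm_minimizer(data):
    """For this particular loss, the exact (non-private) ERM solution is the dataset mean."""
    return data.mean(axis=0)

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 5))          # a toy dataset with n=100 records in R^5
theta_star = erm_minimizer(D)
print(empirical_loss(theta_star, D))   # smallest achievable empirical loss on D
```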
In this paper, our objective is to model and analyze the dynamics of differential privacy for a popular randomized ERM algorithm called Noisy Gradient Descent (Algorithm 1).
More precisely, we focus on the privacy loss of releasing only the last iterate of this algorithm.
We emphasize that our goal is to construct a theoretical framework for analyzing the privacy loss of releasing the output $\ptheta_K$ of the algorithm, assuming private internal states (i.e., $\ptheta_1, \cdots, \ptheta_{K-1}$). In the end, we aim to provide an RDP bound that is tight, thus facilitating optimal empirical risk minimization with differential privacy.
To analyze the privacy loss of noisy GD, which is a discrete-time stochastic process, we first interpolate each discrete update from $\ptheta_k$ to $\ptheta_{k+1}$ with a piece-wise continuously differentiable diffusion process. Let $\D$ and $\D'$ be a pair of arbitrarily chosen neighboring datasets. Given step-size $\step$ and initial parameter vector $\ptheta_0=\ptheta_0'$, the respective $k$'th discrete updates in Algorithm 1 on the neighboring datasets $\D$ and $\D'$ are
\(\begin{aligned} \ptheta_{k+1} &= \proj{\C}{\ptheta_{k} - \step \nabla{\Loss_\D(\ptheta_k)} + \sqrt{2\step\sig^2} Z_k},\\\\ \ptheta_{k+1}' &= \proj{\C}{\ptheta_{k}'-\step\nabla{\Loss_\D(\ptheta_k')}+ \sqrt{2\step\sig^2} Z_k }, \end{aligned}\) with $Z_k \sim \mathcal{N}(0,\mathbb{I}_d)$.
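Below is a minimal sketch of these noisy projected updates in Python (our reading of Algorithm 1); the squared loss, the Euclidean-ball constraint set $\C$, and all hyperparameter values are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

def grad_empirical_loss(theta, data):
    # Gradient of (1/n) * sum_i 0.5 * ||theta - x_i||^2, i.e., theta - mean(data).
    return theta - data.mean(axis=0)

def project(theta, radius=10.0):
    # Euclidean projection onto the convex set C = {theta : ||theta||_2 <= radius}.
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def noisy_gd(data, eta=0.1, sigma=1.0, K=100, seed=0):
    """Noisy GD: theta_{k+1} = Proj_C(theta_k - eta * grad L_D(theta_k) + sqrt(2*eta*sigma^2) * Z_k)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(data.shape[1])                 # theta_0
    for _ in range(K):
        z = rng.standard_normal(theta.shape)        # Z_k ~ N(0, I_d)
        theta = theta - eta * grad_empirical_loss(theta, data) + np.sqrt(2 * eta * sigma**2) * z
        theta = project(theta)
    return theta                                    # only the final iterate theta_K is released

D = np.random.default_rng(1).normal(size=(100, 5))
theta_K = noisy_gd(D)
```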
These two discrete jumps can be interpolated with two stochastic processes $\thet_t$ and $\thet_t'$ over time $\step k \leq t\leq\step (k+1)$, respectively (visualized below via dotted lines).
At the start of each step, $t=\eta k$, the random variables $\thet_{\eta k}$ and $\thet_{\eta k}'$ model the distributions of $\ptheta_k$ and $\ptheta_k'$ in the noisy GD processes, respectively. Immediately after $t=\eta k$, both random variables undergo the identical mapping $\phi(\cdot)$, defined as $\phi(\theta) = \theta - \eta U_+(\theta)$. During time $\eta k< t <\eta (k+1)$, the two processes diffuse along opposing vector fields ($-U_-(\cdot)$ and $U_-(\cdot)$, respectively) with Brownian perturbation. At the end of the step, i.e., as $t \rightarrow \eta (k+1)$, we project $\thet_t$ and $\thet_t'$ onto the convex set $\C$, and obtain $\thet_{\eta (k + 1)}$ and $\thet_{\eta (k + 1)}'$.
Repeating the construction for $k=0,\cdots,K-1$, we define two piece-wise continuous diffusion processes $\{ \thet_t \}_{t \geq 0}$ and $\{ \thet_t' \}_{t \geq 0}$ whose distributions at time $t=\step k$ are consistent with $\ptheta_k$ and $\ptheta_{k}'$ in the noisy GD processes for any $k \in \{0,\cdots, K\}$.
The Rényi divergence $\Ren{\q}{\thet_{\eta \K}}{\thet_{\eta \K}'}$ reflects the Rényi privacy loss of noisy GD with $\K$ steps. Conditioned on observing $\ptheta_\kk$ and $\ptheta_\kk'$, the processes $\{\Theta_t\}_{\step\kk< t< \step(\kk+1)}$ and $\{\Theta_t'\}_{\step\kk< t<\step(\kk+1)}$ are Langevin diffusions along the vector fields $-U_-(\ptheta_{\kk})$ and $U_-(\ptheta_{\kk}')$, respectively, for duration $\step$. That is, conditioned on observing $\ptheta_\kk$ and $\ptheta_\kk'$, the two diffusion processes have the following stochastic differential equations (SDEs), respectively.
\[\dif{\thet_t} = - U_-(\theta_k)\dif{t} + \sqrt{2\sig^2}\deltW{t}, \quad \dif{\thet_t'} = U_-(\theta_k')\dif{t} + \sqrt{2\sig^2}\deltW{t},\]where $\deltW{t}\sim\sqrt{dt}\mathcal N(0,I_d)$ describes the Wiener process on $\thetaspace$.
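The sketch below simulates one such interval of the coupled tracing diffusions with an Euler-Maruyama discretization. It assumes $U_{\pm}(\ptheta) = \frac{1}{2}\left(\graD{\Loss_\D(\ptheta)} \pm \graD{\Loss_{\D'}(\ptheta)}\right)$, i.e., the half-sum and half-difference of the two loss gradients (our reading of the construction), and that both processes share the same Brownian increments; composing $\phi(\cdot)$ with a full interval of drift $\mp U_-$ then recovers the discrete updates above in distribution.

```python
import numpy as np

def coupled_interpolation_step(theta, theta_p, grad_D, grad_Dp, eta, sigma, substeps=100, rng=None):
    """One interval [eta*k, eta*(k+1)] of the coupled tracing diffusions (Euler-Maruyama sketch).

    Assumes U_plus / U_minus are the half-sum / half-difference of the loss gradients on D and D'.
    """
    rng = rng or np.random.default_rng()
    u_plus = lambda th: 0.5 * (grad_D(th) + grad_Dp(th))
    u_minus = lambda th: 0.5 * (grad_D(th) - grad_Dp(th))

    # Constant drifts for this interval, evaluated at the start-of-interval parameters.
    drift, drift_p = -u_minus(theta), u_minus(theta_p)

    # Identical mapping phi(theta) = theta - eta * U_plus(theta), applied to both processes.
    x, x_p = theta - eta * u_plus(theta), theta_p - eta * u_plus(theta_p)

    # Diffuse along opposing drifts with *shared* Brownian increments dW_t.
    dt = eta / substeps
    for _ in range(substeps):
        dw = np.sqrt(dt) * rng.standard_normal(np.shape(theta))
        x = x + drift * dt + np.sqrt(2 * sigma**2) * dw
        x_p = x_p + drift_p * dt + np.sqrt(2 * sigma**2) * dw
    return x, x_p   # both are then projected onto C, as in the discrete update
```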
The joint effect of the drag force (i.e., $-U_-(\theta_k)$) and the Brownian fluctuations on the (conditional) probability density $p(\thet_t = \ptheta \mid \thet_{\eta k} = \ptheta_k)$ of the random variable $\Theta_t$ given $\Theta_{\eta k} = \ptheta_k$ is characterized by the following Fokker-Planck equations.
\(\begin{aligned} \doh{p_{t|\step\kk}(\ptheta|\ptheta_\kk)}{t} &= \divR{\left(p_{t|\step\kk}(\ptheta|\ptheta_\kk) U_-(\theta_k)\right)} + \sig^2 \lapL{p_{t|\step\kk}(\ptheta|\ptheta_\kk)} \\\\ \doh{p_{t | \step \kk}'(\ptheta|\ptheta_\kk')}{t} &= - \divR{\left(p_{t|\step\kk}'(\ptheta|\ptheta_\kk')U_-(\theta_k')\right)} + \sig^2 \lapL{p_{t|\step\kk}'(\ptheta|\ptheta_\kk')} \end{aligned}\)
By taking expectations over the probability density functions $p_{\eta k}(\ptheta_\kk)$ and $p_{\eta k}'(\ptheta_\kk')$ on both sides of the above equations, we obtain the partial differential equations that model the evolution of the (unconditioned) probability density functions $p_t(\ptheta)$ and $p_t'(\ptheta)$ of the coupled tracing diffusions.
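Concretely, under our reading of this step, if we define the (conditional-expectation) vector fields $V_t(\ptheta) = \Expec{\ptheta_\kk \sim p_{\step\kk \mid t}}{U_-(\ptheta_\kk) \mid \ptheta}$ and, analogously, $V_t'(\ptheta)$ for the primed process, the unconditioned densities evolve as
\[\doh{p_{t}(\ptheta)}{t} = \divR{\left(p_{t}(\ptheta)\, V_t(\ptheta)\right)} + \sig^2 \lapL{p_{t}(\ptheta)}, \qquad \doh{p_{t}'(\ptheta)}{t} = -\divR{\left(p_{t}'(\ptheta)\, V_t'(\ptheta)\right)} + \sig^2 \lapL{p_{t}'(\ptheta)}.\]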
Since both vector fields $V_t$ and $V_t'$ are (conditional) expectations of $U_-(\cdot)$, half the difference between the loss gradients on the two neighboring datasets, it is easy to see that their magnitude is bounded by the sensitivity of the total gradient.
\[\begin{aligned} \norm{V_t(\ptheta)} &\leq \frac{1}{2} \Expec{\ptheta_k \sim p_{\step k | t}}{\norm{\graD{\Loss_\D(\ptheta_k)} - \graD{\Loss_{\D'}(\ptheta_k)} } | \ptheta} \\\\ &\leq \frac{1}{2\size} \underbrace{\underset{\x, \x' \in \Univ}{\max} \underset{\ptheta \in \thetaspace}{\max} \norm{\graD{\loss{\x}{\ptheta}} - \graD{\loss{\x'}{\ptheta}}}}_{\text{Sensitivity $\sen{g}$ of loss gradient}} \end{aligned}\]As a result, the two vector fields also stay uniformly close to each other, i.e., $\norm{V_t(\ptheta) - V_t'(\ptheta)} \leq \frac{\sen{g}}{\size}$ for all $t \geq 0$ and $\ptheta \in \thetaspace$.
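As a quick illustrative example (our choice, not the paper's): for the per-example squared loss $\loss{\x}{\ptheta} = \frac{1}{2}\norm{\ptheta - \x}^2$, the gradient difference does not depend on $\ptheta$ at all,
\[\graD{\loss{\x}{\ptheta}} - \graD{\loss{\x'}{\ptheta}} = \x' - \x, \qquad \text{so} \qquad \sen{g} = \underset{\x, \x' \in \Univ}{\max}\,\norm{\x - \x'},\]
i.e., the sensitivity equals the diameter of the data domain. More generally, if $\loss{\x}{\cdot}$ is $L$-Lipschitz for every $\x$, then $\sen{g} \leq 2L$.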
Via these two density evolution equations, we analyze the differential privacy dynamics of the noisy gradient descent updates along the coupled tracing diffusions.
The Rényi divergence (privacy loss) $\Ren{\q}{\thet_t}{\thet_t'}$ between the coupled tracing diffusion processes increases over time, as the vector fields $V_t, V_t'$ underlying the two processes are different. We refer to this phenomenon as privacy erosion. This increase is determined by the amount of change in the probability density functions of the coupled tracing diffusions, characterized by the Fokker-Planck equations for diffusions under different vector fields. Using it, we compute a bound on the rate (partial derivative) of $\Ren{\q}{\thet_t}{\thet_t'}$ over time in the following lemma, to model privacy erosion between the two diffusion processes.
Lemma 2 bounds the rate of privacy loss with various terms. Generally speaking, the term $\frac{\sen{g}^2}{4\sigma^2\size^2}$ bounds the worst-case privacy loss growth due to a noisy gradient update, while the ratio $\frac{I_{\alpha}(\Theta_t\lVert\Theta_t')}{E_{\alpha}(\Theta_t\lVert\Theta_t')}$ captures a privacy amplification effect that reduces the bound on the rate of privacy loss, as both $\Gren{\q}{\thet_{t}}{\thet_{t}'}$ and $\Fren{\q}{\thet_{t}}{\thet_{t}'}$ are always non-negative. Following is the breakdown of the terms in Lemma 2:
Since the ratio $I_\q/E_\q$ is always non-negative by definition, the Rényi divergence (privacy loss) rate is bounded by the first component (a constant) for any fixed $\q$. Therefore, this naïve privacy analysis gives a linear RDP guarantee for Langevin diffusion, which resembles the moment accountant analysis.
Although Lemma 2 bounds the Rényi privacy loss rate, the Rényi information term $\Gren{\q}{\thet_{t}}{\thet_{t}'}$ depends on the unknown distributions of $\thet_{t}, \thet_{t}'$, and is intractable to control in general. Even with explicit expressions for the distributions of $\thet_{t}, \thet_{t}'$, the calculation would involve integration over $\thetaspace$, which is computationally prohibitive for large $d$. We show that when the loss function $\loss{\x}{\cdot}$ is $\lambda$-strongly convex and $\beta$-smooth, the $I_\q/E_\q$ term in Lemma 2 can be controlled along the coupled tracing diffusions. More specifically, by choosing a step size $\step \leq \frac{1}{\beta}$ and initial parameter distribution $\thet_0, \thet_0' \sim \proj{\C}{\mathcal{N} (0, \sig^2 I_d)}$, we show that for any $t \geq 0$,
\[\frac{\Gren{\q}{\thet_t}{\thet_t'}}{\Fren{\q}{\thet_t}{\thet_t'}} \geq \frac{\lambda} {\q^2\sigma^2}\times \left[\Ren{\q}{\thet_t}{\thet_t'} + \q(\q - 1) \doh{\Ren{\q}{\thet_t}{\thet_t'}}{\q}\right].\]We prove this inequality by showing that, under the stated conditions, the coupled tracing diffusions satisfy the Gaussian isoperimetric inequality, also known as the log-Sobolev inequality.
On plugging the above inequality into Lemma 2, we get the following partial differential inequality (PDI) that models the dynamics of the Rényi privacy loss, i.e., it describes how the Rényi privacy loss changes as a function of time $t$ and order $\q$ during the interval $\eta k < t < \eta (k + 1)$. For brevity, let $R(\q,t)$ represent $\Ren{\q}{\thet_t}{\thet_t'}$. We have
\[\doh{R(\q,t)}{t} \leq \frac{1}{\gamma}\frac{\q S_g^2}{4\sig^2\size^2} - \lambda(1-\gamma) \left[ \frac{R(\q,t)}{\q} + (\q-1)\doh{R(\q,t)}{\q} \right].\]Here $\gamma \in (0,1)$ is a trade-off constant. In our coupled tracing diffusion construction, no additional privacy loss is incurred at integer multiples of $\step$, as both the mapping $\phi(\cdot)$ and the projection $\proj{\C}{\cdot}$ are identical for the coupled processes, and can therefore be seen as post-processing steps (as shown below).
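To see why this PDI yields an exponentially converging bound, here is a hedged sketch (not the full proof, which handles the dependence on the order more carefully): fix $\gamma$, and suppose the solution is linear in the order, $R(\q, t) = \q\, r(t)$ with $r(0) = 0$. Then $\frac{R(\q,t)}{\q} + (\q-1)\doh{R(\q,t)}{\q} = \q\, r(t)$, and the PDI reduces to an ordinary differential inequality whose solution converges exponentially:
\[\frac{\dif{r(t)}}{\dif{t}} \leq \frac{S_g^2}{4\gamma\sig^2\size^2} - \lambda(1-\gamma)\, r(t) \quad \Longrightarrow \quad R(\q, \step\K) \leq \frac{\q S_g^2}{4\gamma(1-\gamma)\lambda\sig^2\size^2}\left(1 - e^{-\lambda(1-\gamma)\step\K}\right).\]
For instance, taking $\gamma = \tfrac{1}{2}$ gives $R(\q, \step\K) \leq \frac{\q S_g^2}{\lambda\sig^2\size^2}\left(1 - e^{-\lambda\step\K/2}\right)$, which has exactly the converging shape discussed next.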
Solving this PDI and unrolling it over all $K$ iterations yields the privacy guarantee stated in Theorem 1. The figure below demonstrates how this RDP guarantee for noisy GD converges with the number of iterations $\K$. On the y-axis, we show the $\eps$ guaranteed for noisy GD under a reasonable choice of hyperparameters. The RDP order $\q$ linearly scales the asymptotic guarantee, but does not affect the convergence rate of the RDP guarantee. The strong convexity parameter $\lambda$, however, affects both the asymptotic guarantee and the convergence rate: the larger $\lambda$ is, the stronger the asymptotic RDP guarantee and the faster the convergence.
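The short script below reproduces this qualitative behavior by plotting the closed-form expression obtained from the PDI sketch above (with $\gamma = \tfrac{1}{2}$); the hyperparameter values are arbitrary placeholders, not the ones used for the paper's figure.

```python
import numpy as np
import matplotlib.pyplot as plt

def rdp_bound(K, alpha=2.0, lam=1.0, eta=0.05, sigma=1.0, S_g=1.0, n=1000):
    """alpha * S_g^2 / (lam * sigma^2 * n^2) * (1 - exp(-lam * eta * K / 2)), from the PDI sketch."""
    return alpha * S_g**2 / (lam * sigma**2 * n**2) * (1.0 - np.exp(-lam * eta * K / 2.0))

K = np.arange(1, 2001)
for lam in (0.5, 1.0, 2.0):
    plt.plot(K, rdp_bound(K, lam=lam), label=f"strong convexity lambda = {lam}")
plt.xlabel("number of iterations K")
plt.ylabel("RDP bound (epsilon)")
plt.legend()
plt.show()   # larger lambda: smaller asymptote and faster convergence
```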
Differential privacy guarantees are upper bounds on the privacy loss of an algorithm; thus, it is crucial to also analyze their tightness (i.e., how close they are to the exact privacy loss). We prove that our guarantee in Theorem 1 is tight. To this end, we construct an instance of the ERM optimization problem for which the Rényi privacy loss of the noisy GD algorithm grows at an order matching our guarantee in Theorem 1.
It is very challenging to lower bound the exact Rényi privacy loss ${\Ren{\q}{\thet_{\K\step}}{\thet_{\K\step}'}}$ in general, as this requires an explicit expression for the probability distribution over the last-iterate parameters $\ptheta_\K$. Computing a closed-form expression is, however, feasible when the loss gradients are linear, because after a sequence of linear transformations and Gaussian noise additions, the parameters follow a Gaussian distribution. Therefore, we construct such an ERM objective, compute the exact privacy loss, and prove the following lower bound.
We prove this lower bound using the $\ell_2$-squared norm loss as the ERM objective: $\underset{\theta\in\mathbb{R}^d}{\min}\sum_{i=1}^n\frac{\lVert\theta-\x_i\rVert_2^2}{n}$. We assume a bounded data domain so that the gradient has finite sensitivity. With start parameter $\theta_0 = 0^d$, the $k^{\text{th}}$ step parameter $\theta_k$ is distributed as a Gaussian with mean ${\mu_k=\step\bar{\x}\sum_{i=0}^{k-1}(1-\step)^i}$ and variance ${\sigma_k^2=\frac{2\eta\sigma^2}{\size^2}\sum_{i=0}^{k-1}(1-\step)^{2i}}$ in each dimension, where $\bar{\x}=\sum_{i=1}^n\x_i/n$ is the empirical dataset mean. We explicitly compute the privacy loss at any step $\K$, which is lower bounded by ${\frac{\q S_g^2}{4\sig^2n^2}(1-e^{-\eta\K})}$.
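The sketch below makes this calculation concrete, using the closed form for the Rényi divergence between two Gaussians with equal isotropic variance, $\Ren{\q}{\mathcal{N}(\mu, s^2 I_d)}{\mathcal{N}(\mu', s^2 I_d)} = \frac{\q\norm{\mu - \mu'}^2}{2 s^2}$. It assumes the gradient of the empirical loss is $\theta - \bar{\x}$ (absorbing constants into the step size), no active projection, and a pair of hypothetical neighboring datasets that realize the worst-case sensitivity, so the constants may differ slightly from the expressions above.

```python
import numpy as np

def renyi_gaussian(mu1, mu2, var, alpha):
    """R_alpha(N(mu1, var*I) || N(mu2, var*I)) = alpha * ||mu1 - mu2||^2 / (2 * var)."""
    return alpha * np.sum((mu1 - mu2) ** 2) / (2.0 * var)

def last_iterate_distribution(xbar, eta, sigma, K):
    """Mean and per-coordinate variance of theta_K for the quadratic objective, with theta_0 = 0."""
    mu, var = np.zeros_like(xbar), 0.0
    for _ in range(K):
        mu = (1.0 - eta) * mu + eta * xbar                   # affine recursion of the mean
        var = (1.0 - eta) ** 2 * var + 2.0 * eta * sigma**2  # variance injected by each noisy step
    return mu, var

# Two hypothetical neighboring datasets differing in a single record (worst case for sensitivity).
n, d, eta, sigma, alpha, K = 1000, 5, 0.05, 1.0, 2.0, 2000
D = np.zeros((n, d))
D_prime = D.copy()
D_prime[0, 0] = 1.0                                          # records confined to a bounded domain
mu, var = last_iterate_distribution(D.mean(axis=0), eta, sigma, K)
mu_p, _ = last_iterate_distribution(D_prime.mean(axis=0), eta, sigma, K)
print(renyi_gaussian(mu, mu_p, var, alpha))                  # exact privacy loss of releasing theta_K
```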
Our RDP guarantee converges fast to $\frac{\alpha S_g^2}{\sig^2n^2}$, which matches the lower bound at every step $\K$, up to a constant of $4$. This immediately shows the tightness of our converging RDP guarantee throughout the training process of the converging noisy GD algorithm.
The randomness required for satisfying differential privacy can adversely affect the utility of the trained model. The standard way to measure the utility of a randomized ERM algorithm (for example, $\mathcal{A}_{\text{Noisy-GD}}$) is to quantify its worst-case excess empirical risk, which is defined as
\[\max_{\D \in \Domain}\Expec{}{\Loss_{\D}(\ptheta) - \Loss_{\D}(\ptheta^*)},\]where $\ptheta$ is the output of the randomized algorithm $\mathcal{A}_{\text{Noisy-GD}}$ on $\D$, $\ptheta^*$ is the optimal solution to the standard (no privacy) ERM on $\Loss_\D$, and the expectation is computed over the randomness of the algorithm.
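As a small illustration of this metric (reusing the toy quadratic objective and the `noisy_gd` sketch from above, which are our own illustrative choices and not the paper's setup), the inner expectation for a fixed dataset $\D$ can be estimated by Monte Carlo over the algorithm's randomness; the outer maximum over datasets is omitted here.

```python
import numpy as np

def excess_empirical_risk(data, run_algorithm, loss, trials=100):
    """Monte-Carlo estimate of E[L_D(theta) - L_D(theta*)] for a fixed dataset D."""
    theta_star = data.mean(axis=0)              # exact ERM minimizer of the quadratic loss
    optimum = loss(theta_star, data)
    gaps = [loss(run_algorithm(data, seed=s), data) - optimum for s in range(trials)]
    return float(np.mean(gaps))

# Example usage with the empirical_loss and noisy_gd sketches defined earlier:
# D = np.random.default_rng(2).normal(size=(1000, 5))
# risk = excess_empirical_risk(D, lambda data, seed: noisy_gd(data, sigma=1.0, K=200, seed=seed),
#                              empirical_loss)
```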
We provide the optimal excess empirical risk (utility) of the noisy GD algorithm under an $(\eps, \delta)$-DP constraint. Here, optimality means the smallest upper bound on the excess empirical risk that can be guaranteed under an $(\eps, \delta)$-DP constraint by tuning the algorithm's hyperparameters (such as the noise variance $\sig^2$ and the number of iterations $\K$).
This utility matches the following theoretical lower bound in Bassily et al.
Our utility matches this lower bound up to a factor of $\log(1/\delta)$, when assuming $\frac{\beta}{\lambda^2}=O(1)$. This improves upon previous gradient perturbation methods.
Our algorithm also has a significantly smaller gradient complexity than noisy SGD.
We have developed a novel theoretical framework for analyzing the dynamics of privacy loss for noisy gradient descent algorithms. Our theoretical results show that by hiding the internal state of the training algorithm (over many iterations over the whole data), we can tightly analyze the rate of information leakage throughout training, and derive a bound that is significantly tighter than that of composition-based approaches.
Future Work. Our main result is a tight privacy guarantee for Noisy GD on smooth and strongly convex loss functions. These assumptions are very similar to those of the prior work on privacy amplification by iteration. Relaxing these assumptions is an important direction for future work.