Secure and Private Machine Learning

Subhankar Mishra

School of Computer Sciences
NISER


Big Data Era

  • A lot of data is collected
    • By companies like Amazon, Facebook, Apple, Google, etc.
    • By governments
  • The data collected is highly personal and sensitive
    • Browsing history, purchase history, speech, geolocation, health, etc.

Risks of exposure?

  • Credentials (credit cards, passwords)
    • Theft of money and property
  • Identification (name, biometric data)
    • Identity theft
  • Information about you (e.g., medical status)
    • Discrimination

Privacy and Utility

  • Increasing privacy typically reduces utility
  • Take measures that protect privacy without destroying utility
  • The goal is to find better trade-offs between privacy and utility

Privacy vs Security

  • Privacy - how your data is handled; your rights with respect to your personal information
  • Security - protection of your data against unauthorized access

Examples

  • The bank shares your data with an ad agency. Privacy or security breach?
  • The bank's entire database is leaked on the web. Privacy or security breach?

Example

Name    Had Chocolate?
A       Yes
B       No
C       Yes
D       No

Assume that if this data is leaked, it will cause havoc :)
We want to know how many people had chocolate.


What do we do?

Strategy 1

$$y_i = \begin{cases} x_i & \text{with probability } 1\\ 1 - x_i & \text{with probability } 0 \end{cases}$$

Perfectly accurate, no privacy.


What do we do?

Strategy 2

$$y_i = \begin{cases} x_i & \text{with probability } 1/2\\ 1 - x_i & \text{with probability } 1/2 \end{cases}$$

Perfectly private, no accuracy.


What do we do?

Strategy 3

$$y_i = \begin{cases} x_i & \text{with probability } 1/2 + \gamma\\ 1 - x_i & \text{with probability } 1/2 - \gamma \end{cases}$$

$\gamma \in [0, 1/2]$

Also called Randomized Response


What do we do?

Strategy 3 - Randomized Response

Secure each row (see the sketch after this list):

  • Flip a coin
  • If heads
    • Flip a second coin
      • Answer Yes if heads
      • Answer No if tails
  • If tails, don't change the value
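
A minimal sketch of this mechanism in Python (the function name and the default gamma are illustrative, not from the slides; gamma = 1/4 reproduces the two-coin scheme above):

```python
import random

def randomized_response(true_answer: bool, gamma: float = 0.25) -> bool:
    # Report the true answer with probability 1/2 + gamma, the flipped answer otherwise.
    # gamma = 1/4 matches the two coins: with probability 1/2 keep the truth,
    # otherwise a second fair coin picks Yes or No at random.
    if random.random() < 0.5 + gamma:
        return true_answer
    return not true_answer
```

With gamma = 1/2 this reduces to Strategy 1 (perfect accuracy, no privacy), and with gamma = 0 to Strategy 2 (perfect privacy, no accuracy).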

Example

Name    Had Chocolate?
A       Yes
B       No
C       Yes
D       No

Now each row is real with probability 1/2 and random with probability 1/2, and we cannot tell which rows are which.

This gives the individuals "plausible deniability".


Analysis?

  • We want to know how many people had chocolate.
  • Let p be the true fraction of people who had chocolate.
  • The expected fraction of "yes" answers is

$$\frac{1}{4}(1-p) + \frac{3}{4}p = \frac{1}{4} + \frac{p}{2}$$

  • So we can recover p from the observed fraction of "yes" answers: $\hat{p} = 2 \cdot (\text{observed fraction}) - \frac{1}{2}$ (see the sketch below).
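
A small sketch of this correction, assuming the reported answers come from the coin-flip mechanism above (the helper name is hypothetical):

```python
def estimate_true_fraction(reported_answers) -> float:
    # E[fraction of "yes"] = 1/4 + p/2, so solving for p gives p_hat = 2 * observed - 1/2.
    observed = sum(reported_answers) / len(reported_answers)
    return 2 * observed - 0.5

# Toy reports (Yes, No, Yes, No); the correction only becomes reliable for large surveys.
print(estimate_true_fraction([True, False, True, False]))  # 0.5
```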


Differential Privacy

Formal definition

A randomized computation $M$ satisfies $\epsilon$-differential privacy if, for any adjacent datasets $x$ and $x'$, and any subset $C$ of possible outcomes $\mathrm{Range}(M)$,

$$\Pr[M(x) \in C] \leq \exp(\epsilon) \times \Pr[M(x') \in C]$$

  • Adding or removing a single row changes the output distribution by at most a factor of $e^\epsilon$
  • A smaller $\epsilon$ gives a stronger privacy guarantee but typically lower utility; a larger $\epsilon$ allows better utility but leaks more information (see the sketch below)

Dwork, Cynthia, and Aaron Roth. "The algorithmic foundations of differential privacy." Foundations and Trends in Theoretical Computer Science 9.3-4 (2014): 211-407.
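
As a concrete check (a sketch, not part of the original slides): randomized response reports the true bit with probability $1/2 + \gamma$, so the probability of any output differs by at most a factor of $(1/2+\gamma)/(1/2-\gamma)$ between the two possible values of an individual's bit, i.e. it satisfies $\epsilon$-differential privacy with $\epsilon = \ln\frac{1/2+\gamma}{1/2-\gamma}$.

```python
import math

def randomized_response_epsilon(gamma: float) -> float:
    # Privacy parameter of randomized response: output probabilities for x_i = 1
    # versus x_i = 0 differ by at most a factor of (1/2 + gamma) / (1/2 - gamma).
    assert 0 <= gamma < 0.5
    return math.log((0.5 + gamma) / (0.5 - gamma))

# The two-coin scheme keeps the true answer with probability 3/4 (gamma = 1/4),
# so each individual's bit is protected with epsilon = ln(3) ≈ 1.10.
print(randomized_response_epsilon(0.25))
```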


Idea behind Differential Privacy

  • Two databases D1 and D2 are neighbouring if they agree on everything except a single entry.

  • If the mechanism behaves nearly identically on D1 and D2, then an attacker cannot tell whether D1 or D2 was used (and hence cannot learn much about any individual).



Alternative to differential privacy?

  • Withhold sensitive information?
    • The fact that a value is withheld can itself reveal that sensitive information is present!
  • Allow only group (aggregate) queries?
    • Differencing attacks still work: compare "How many people are happy?" with "How many people other than Ram are happy?"
  • k-Anonymity
    • Does not work for high-dimensional data
    • Prone to linkage attacks
    • Destroys utility

Narayanan, Arvind, and Vitaly Shmatikov. "How to break anonymity of the netflix prize dataset." arXiv preprint cs/0610105 (2006).


Differential Privacy in the world

  • US Census Bureau
    Abowd, John M. "The US Census Bureau adopts differential privacy." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.
  • Apple
    Cormode, Graham, et al. "Privacy at scale: Local differential privacy in practice." Proceedings of the 2018 International Conference on Management of Data. 2018.
  • Microsoft
    Ding, Bolin, Janardhan Kulkarni, and Sergey Yekhanin. "Collecting telemetry data privately." Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017

Let's design an ML system with sensitive user data


Sensitive Data

Anything that must be protected against unauthorized access.


Privacy Nightmare

  • Build an app that collects user data
  • Use the data however we want
  • But the data is personal

However, we need the data!

  • The performance of the model depends on the data

Best source of data?

Devices that we use everyday


So what if the data never leaves your device?

How would we train our ML model then?


Federated Learning (FL)!

  • Train a centralised model
  • On decentralised data

Federated Learning. https://federated.withgoogle.com. Accessed on 1st November, 2020


FL - Data stays on device

  • Data stays on the device
  • Design a smart way to train the central model

FL - Bring the training to the device

  • The idea is to bring the training to the device
  • Rather than bringing the user's data to the central server

But Training is resource heavy

  • Battery
  • Computing Power

Select the devices that are

  • charging
  • on Wi-Fi
  • idle

Steps

Step 1 - Select a subset of eligible devices

Step 2 - Each device receives a training model


Step 3 - Train on the device

  • Because the amount of data on each device is small
  • Training time is low (see the sketch below)
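
For illustration only, a hypothetical on-device update step for a linear model with squared loss (function and variable names are not from the slides):

```python
import numpy as np

def client_update(global_w, local_x, local_y, lr=0.1, epochs=5):
    # Start from the model received from the server, run a few epochs of gradient
    # descent on the local data only, and return the new weights - never the data.
    w = global_w.copy()
    for _ in range(epochs):
        pred = local_x @ w
        grad = local_x.T @ (pred - local_y) / len(local_y)
        w -= lr * grad
    return w

# Example: 10 local examples with 3 features (synthetic).
x, y = np.random.randn(10, 3), np.random.randn(10)
print(client_update(np.zeros(3), x, y))
```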

Step 4 - Send the trained model back

  • Once local training is finished
  • The updated model is sent back to the server
  • Not the data

At Center - FedAvg

$$\min_{w \in \mathbb{R}^d} f(w), \qquad f(w) = \sum_{k=1}^K \frac{n_k}{n} F_k(w)$$
where

  • $F_k(w) = \frac{1}{n_k} \sum_{i \in P_k} f_i(w)$
  • $f_i(w) = \ell(x_i, y_i, w)$ is the loss of the prediction on example $(x_i, y_i)$
  • $K$ clients; $P_k$ is the set of data points on client $k$, $n_k = |P_k|$, and $n = \sum_k n_k$

H. B. McMahan, E. Moore, D. Ramage, S. Hampson and B. A. y Arcas, "Communication-Efficient Learning of Deep Networks from Decentralized Data", arXiv preprint arXiv:1602.05629, 2017.
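
On the server side, FedAvg forms the weighted average of the returned client models with weights $n_k/n$, mirroring the objective above. A minimal sketch with hypothetical numbers:

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    # Weighted average of client parameters: w = sum_k (n_k / n) * w_k.
    n = sum(client_sizes)
    return sum((n_k / n) * w_k for w_k, n_k in zip(client_weights, client_sizes))

updates = [np.array([0.2, 1.0]), np.array([0.4, 0.8]), np.array([0.1, 1.2])]
sizes = [100, 300, 600]  # n_k: number of local examples on each client
print(fed_avg(updates, sizes))
```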


So we good then?

Not exactly


Problem 1: Model inversion attacks!

  • There are attacks that reconstruct the training data from the model itself
  • The goal would then be to protect the model updates themselves, e.g. by encrypting them

Chen, Si, Ruoxi Jia, and Guo-Jun Qi. "Improved Techniques for Model Inversion Attack." arXiv preprint arXiv:2010.04092 (2020).


Secure Aggregation

  • The model update is encrypted
  • With a key that the server does not have

Zhang, Yuheng, et al. "The secret revealer: generative model-inversion attacks against deep neural networks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.


Secure Aggregation

  • Every device applies the secure aggregation protocol
  • It adds zero-sum masks that scramble its trained model update
  • When the server adds up all the masked updates
    • all the masks cancel out (see the sketch below)

Bonawitz, Keith, et al. "Practical secure aggregation for privacy-preserving machine learning." Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 2017.
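
A toy illustration of the zero-sum masking idea (not the actual protocol of Bonawitz et al., which additionally uses key agreement and handles device dropouts):

```python
import numpy as np

def add_pairwise_masks(updates, seed=0):
    # For every pair of clients (i, j), client i adds a random mask and client j
    # subtracts the same mask, so all masks cancel when the server sums the updates.
    rng = np.random.default_rng(seed)
    masked = [u.astype(float) for u in updates]
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            mask = rng.normal(size=updates[0].shape)
            masked[i] = masked[i] + mask
            masked[j] = masked[j] - mask
    return masked

updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
masked = add_pairwise_masks(updates)
print(sum(masked))   # equals the true sum [9. 12.], even though each
print(sum(updates))  # individual masked update looks like random noise
```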


Problem 2: What if a model memorizes data?

  • What if the training data contains rare or unique records?
  • Then the model can memorize them verbatim

Differential Privacy

  • Add noise to the data (differential privacy)
  • Even if the model memorizes, it only memorizes noisy data

Carlini, Nicholas, et al. "The secret sharer: Evaluating and testing unintended memorization in neural networks." 28th USENIX Security Symposium (USENIX Security 19). 2019.

https://secml.github.io/
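
A minimal sketch of the noise-adding idea, shown here with the Laplace mechanism on a simple count (in federated training one would instead clip and noise the model updates, which is not shown):

```python
import numpy as np

def private_count(values, epsilon):
    # Adding or removing one person changes the count by at most 1 (sensitivity 1),
    # so Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    return sum(values) + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

had_chocolate = [1, 0, 1, 0]  # toy data from the earlier example
print(private_count(had_chocolate, epsilon=0.5))
```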


Problem 3: How to test?

  • Now that the model is trained, how do we test?
  • There is no data at the central server
  • All the data resides on the users' devices

Test at the devices!

  • We can just test at the users' devices
  • Some devices can be training
  • Some devices can be testing

Summary

  • Learn from all individuals
  • Without learning about any individual

Work from our lab

  1. Sudipta Paul, Poushali Sengupta, and Subhankar Mishra. Flaps: Federated learning and privately scaling. arXiv preprint arXiv:2009.06005, 2020.
  2. Poushali Sengupta, Sudipta Paul, and Subhankar Mishra. Buds: Balancing utility and differential privacy by shuffling. 11th ICCCNT, 2020 (accepted).
  3. Poushali Sengupta, Sudipta Paul, and Subhankar Mishra. Learning with differential privacy. Handbook of Research on Cyber Crime and Information Privacy, 2021.
  4. Sudipta Paul and Subhankar Mishra. Ara: Aggregated rappor and analysis for centralized differential privacy. SN Computer Science, 1(1):22, 2020.

Thank you




Is that all?

Fairness

To understand fairness, we need to understand discrimination:

  • Disparate treatment: intentionally treating an individual differently based on their membership in a particular class.
  • Disparate impact: a policy that looks neutral but negatively affects members of one class more than others.

Pessach, Dana, and Erez Shmueli. "Algorithmic fairness." arXiv preprint arXiv:2001.09784 (2020).

Sources of bias/discrimination

Some sources:

  • Data (sensitive attributes: race, gender, age, etc.)
  • Biased data
  • Missing data
  • Incorrect data