User Profiling and Anonymization

Profiling has been a trending topic for the last couple of years. Every search you make, every web page you visit, your mouse movements, even your voice, steps, and heartbeats are being recorded. Most of the time, the reason is to provide a better user experience. Companies show you the items you may want to buy and the topics you may want to read about, based not only on your actions but also on predictions made with different kinds of algorithms and techniques.


With every mobile application or web page login, we allow them to collect information about us. So one question you may ask is: are we secure? Well, there is no simple answer to that question.


In previous years, several datasets of user profiles have been shared publicly. Although personal information had been removed from the shared data, attackers were still able to re-identify people by looking at it. The reason is that every user profile is unique. There is no other person in this world who uses a mouse like you, watches the same movies in the same order, and also likes pineapple pizza like you.


There are several methods for anonymizing profiles. Some of these methods change the information about you or keep only a limited part of it, so that your record is no longer unique and an attacker cannot single you out unless they have some background knowledge about you. For example, a method may not store your exact age but record it as the range 35-40. Below, you can find the most widely used methods.


1. Differential Privacy

A system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about the individuals in it. The idea behind differential privacy is that if the effect of making an arbitrary single substitution in the database is small enough, the query result cannot be used to infer much about any single individual, and it therefore provides privacy.
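One common way to get this property for a counting query is the Laplace mechanism. The sketch below is only an illustration: the function names, the sample ages, and the epsilon value are assumptions made for this example, not part of any particular library.

```python
import random


def laplace_noise(scale, rng=random):
    # The difference of two independent exponential samples with mean
    # `scale` follows a Laplace distribution centered at 0.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=random):
    # Substituting one record changes a count by at most 1,
    # so the sensitivity of this query is 1.
    sensitivity = 1.0
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical data: how many users are between 35 and 40?
ages = [23, 35, 37, 41, 52, 38, 29]
noisy = private_count(ages, lambda a: 35 <= a < 40, epsilon=0.5)
print(noisy)  # the true answer is 3; the printed value varies per run
```

A smaller epsilon means more noise and stronger privacy, while a larger epsilon gives answers closer to the true count.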


2. K-Anonymity

In the sanitized dataset, each individual is indistinguishable from at least k-1 others in the same group (equivalence class). It does not provide privacy if the sensitive values in an equivalence class lack diversity or if the attacker has background knowledge.
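As a rough sketch, a k-anonymity check can group the records by their identifying attributes and look at the smallest group. The records, the column names, and the helper names below are made up for illustration. The age is generalized into a 5-year range, where the label "35-40" covers ages 35 up to, but not including, 40.

```python
from collections import Counter


def generalize_age(age, width=5):
    # Replace an exact age with a range label, e.g. 36 -> "35-40".
    lo = (age // width) * width
    return f"{lo}-{lo + width}"


def is_k_anonymous(records, quasi_identifiers, k):
    # Count how many records share each combination of identifying values.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(size >= k for size in groups.values())


# Hypothetical records.
records = [
    {"age": 36, "zip": "06100", "diagnosis": "flu"},
    {"age": 38, "zip": "06100", "diagnosis": "asthma"},
    {"age": 41, "zip": "06100", "diagnosis": "flu"},
    {"age": 44, "zip": "06100", "diagnosis": "diabetes"},
]
sanitized = [{**r, "age": generalize_age(r["age"])} for r in records]

print(is_k_anonymous(records, ["age", "zip"], k=2))    # False
print(is_k_anonymous(sanitized, ["age", "zip"], k=2))  # True
```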


3. L-Diversity

It maintains diversity within each group with respect to the possible values of the sensitive attributes. It can be instantiated with a metric based on entropy. It prevents attacks based on homogeneity and some other attacks.
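Two simple ways to check this are sketched below: counting the distinct sensitive values in each group, and the entropy-based variant, which requires the entropy of every group to be at least log(l). The records and function names are assumptions for this example.

```python
import math
from collections import Counter, defaultdict


def sensitive_values_by_group(records, quasi_identifiers, sensitive):
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive])
    return groups


def is_distinct_l_diverse(records, quasi_identifiers, sensitive, l):
    # Every group must contain at least l different sensitive values.
    groups = sensitive_values_by_group(records, quasi_identifiers, sensitive)
    return all(len(set(values)) >= l for values in groups.values())


def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def is_entropy_l_diverse(records, quasi_identifiers, sensitive, l):
    # Every group must have entropy of at least log(l); the small
    # tolerance guards against floating-point rounding.
    groups = sensitive_values_by_group(records, quasi_identifiers, sensitive)
    return all(entropy(v) >= math.log(l) - 1e-12 for v in groups.values())


# Hypothetical sanitized records: the first group is homogeneous.
rows = [
    {"age": "35-40", "diagnosis": "flu"},
    {"age": "35-40", "diagnosis": "flu"},
    {"age": "40-45", "diagnosis": "flu"},
    {"age": "40-45", "diagnosis": "asthma"},
]
print(is_distinct_l_diverse(rows, ["age"], "diagnosis", l=2))  # False
```

Here the dataset is 2-anonymous, but anyone who knows that a person is in the 35-40 group learns the diagnosis, which is exactly the homogeneity problem.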


4. T-Closeness

The distribution of the sensitive attributes in each group must be close to their distribution in the global population. T is a threshold on the distance between the two distributions, and it must not be exceeded.
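The check can be sketched as follows. As a simplification, this example measures the distance with the total variation distance, which suits categorical values where all categories are treated as equally far apart; other distance measures are possible. The records and function names are assumptions for illustration.

```python
from collections import Counter, defaultdict


def distribution(values):
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}


def total_variation(p, q):
    # Half of the summed absolute differences between two distributions.
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def max_distance_to_global(records, quasi_identifiers, sensitive):
    global_dist = distribution([r[sensitive] for r in records])
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive])
    return max(
        total_variation(distribution(values), global_dist)
        for values in groups.values()
    )


def is_t_close(records, quasi_identifiers, sensitive, t):
    return max_distance_to_global(records, quasi_identifiers, sensitive) <= t


# Hypothetical records: each group holds only one diagnosis,
# so both groups are far from the global 50/50 distribution.
skewed = [
    {"age": "35-40", "diagnosis": "flu"},
    {"age": "35-40", "diagnosis": "flu"},
    {"age": "40-45", "diagnosis": "asthma"},
    {"age": "40-45", "diagnosis": "asthma"},
]
print(max_distance_to_global(skewed, ["age"], "diagnosis"))  # 0.5
print(is_t_close(skewed, ["age"], "diagnosis", t=0.2))       # False
```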


Biometric Data

For biometric data, the situation is a little different. Once your biometric information is stolen, there is no way back, since you cannot replace your face or fingerprint the way you replace a password. Malicious attackers may open a bank account in your name or do other illegal things. The solution for biometric data is to store it with some noise added by well-designed algorithms, so that if the data is stolen, it cannot be used to impersonate you. When you show your face to a screen or put your finger on a scanner, a changed version of your data is compared with the stored one to find a match.
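To make the idea more concrete, here is a toy sketch in the spirit of what is often called cancelable biometrics. Note the assumptions: instead of additive noise it uses a seeded random projection as the transformation, the feature vector is made up, and all names and the threshold are hypothetical. It is not a secure scheme, only an illustration of storing and comparing a transformed template.

```python
import math
import random


def make_projection(seed, in_dim, out_dim):
    # The seed acts as a replaceable key: a new seed gives a new transformation.
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def transform(features, projection):
    return [sum(w * x for w, x in zip(row, features)) for row in projection]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def matches(stored_template, fresh_features, projection, threshold=0.9):
    # The fresh sample goes through the same transformation before comparison.
    fresh_template = transform(fresh_features, projection)
    return cosine_similarity(stored_template, fresh_template) >= threshold


# Hypothetical feature vector extracted from a face or fingerprint.
enrolled = [0.2, 0.8, 0.5, 0.1, 0.9, 0.4, 0.7, 0.3]
projection = make_projection(seed=1234, in_dim=8, out_dim=32)
stored = transform(enrolled, projection)  # only this is stored

fresh = [x + 0.01 for x in enrolled]      # a slightly different new scan
print(matches(stored, fresh, projection))
```

If the stored template leaks, the service can issue a new seed and enroll you again, so the leaked template no longer matches under the new transformation.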


Finally, there is no escape from sharing information unless you stop using technological devices altogether. As far as I know, big tech companies have strict regulations about user privacy. Even so, I would not fully trust them, since their policies are updated frequently. Moreover, anonymization and privacy issues are not settled yet and are still hot research topics. My suggestion is to be cautious when sharing personal information. The movies you like may not be harmful, but attackers may use your personal information, and it may lead to unbelievable damage.