It is widely accepted that many companies such as Facebook, Google, Amazon, etc have vast amounts of data on our friends, interests, and spending habits amongst other things. At times, for example for data mining or scientific collaboration, it can be useful for companies to access internal and external data. However, they are rightfully blocked by Privacy laws.
Hence, there is an increasing work in thinking of ways to properly anonymise data to enable mining. Aggregation of data to wash PII (personally identifiable information) is one way to achieve this, but can lose important granularity and detail.
Raffael Strassnig, VP Data Scientist at Barclays retail bank, spoke at a summit last month to stress the importance of protecting privacy. Anonymising data at scale is a very hard problem but Strassnig’s team have implemented an algorithm by a PhD candidate and modified it to work on Barclay’s Big Data. The method involves:
“clustering the data into k-means clusters, with no cluster overlapping, the clusters being a certain size to comply with k-anonymity constraint, and minimising the loss of data when applying the procedure to the dataset by using a dissimilarity measure”
Future developments in application of Machine Learning techniques may enable use of PII without anonymisation. Until then, the Data Science team at Barclays is leading the way in protecting their users’ data while processing it.