Metadata is everywhere. It is attached to everything you tweet, every picture you take and every status update you post on Facebook. It’s used by police and security forces to identify people who try to hide their identities and locations, while associated metadata in selfies can inadvertently ensnare criminals unaware that the data can destroy their alibi.
And metadata on Twitter can also be used to identify each and every one of us with extreme precision, according to a new paper by researchers at University College London and the Alan Turing Institute. Your tweets, it turns out, no matter how anonymous you might think they are, can be traced back to you with unerring accuracy. All someone needs to do is look at the metadata.
The scientists used tweets and the associated metadata to identify any user in a group of 10,000 Twitter users with 96.7 per cent accuracy. Even when muddling up to 60 per cent of the metadata, the model could still pinpoint a single person with more than 95 per cent accuracy.
“Metadata is much larger when compared to the actual content of a tweet,” says Savvas Zannettou, a PhD student at the Cyprus University of Technology. People wrongly assume that because the data is online, they aren’t vulnerable to identification, adds Beatrice Perez of University College London, a co-author of the paper.
No right-thinking person would tell a total stranger what their address is if approached on the street. But they might tell them how often they turn their bedroom light on and off. “That’s the mentality with metadata,” says Perez. “People think it’s not a big deal. But couple it with another piece of information and I know when you’re home or not.”
It’s a commonly held belief, agrees Zannettou. “The average person doesn’t recognise that she can be easily identified using metadata.” Most Twitter users, he reckons, have no idea that Twitter holds 144 pieces of metadata on them, which is publicly accessible through the site’s API.
Being anonymous won’t help
The researchers took a corpus of five million Twitter users and ran 14 pieces of metadata from their tweets (including the time the account was created, the time a tweet was published, and the number of favourites, followers and following) through three different machine learning algorithms.
The algorithm that identified individual accounts most accurately was also one of the most basic, say the researchers. It showed that an individual can be identified with near-perfect accuracy using just a handful of pieces of metadata.
The model is first trained on a known dataset of users, learning how each of them behaves on Twitter from the metadata of their tweets. When it is then run “in the wild” on new tweets from the same users, it can pick those behavioural patterns out of the metadata and attribute each tweet to a specific individual.
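To make the train-then-identify idea concrete, here is a minimal sketch in Python. It is not the researchers' code: the article does not name the three algorithms, so a plain nearest-neighbour classifier stands in for them, the four field names are illustrative, and every user is synthetic. Because invented users are far tidier than real ones, the accuracy it prints says nothing about the study's figures.

```python
import random

# Hypothetical fields and value ranges; the study used 14 fields,
# of which the article names only a few.
RANGES = {
    "account_created": (1_200_000_000, 1_500_000_000),  # Unix timestamp
    "followers": (0, 5000),
    "following": (0, 2000),
    "favourites": (0, 10000),
}
FIELDS = tuple(RANGES)


def make_profile(rng):
    """Invent a baseline metadata profile for one synthetic user."""
    return {field: rng.randint(lo, hi) for field, (lo, hi) in RANGES.items()}


def make_tweet_metadata(profile, rng):
    """Metadata attached to one tweet: counts drift a little, creation time is fixed."""
    row = []
    for field in FIELDS:
        value = profile[field]
        if field != "account_created":
            value = max(0, value + rng.randint(-5, 5))
        row.append(value)
    return row


def fit_scaler(rows):
    """Record each field's minimum and span so all fields weigh comparably."""
    return [(min(col), (max(col) - min(col)) or 1) for col in zip(*rows)]


def scale(row, scaler):
    """Map one metadata row onto roughly the 0..1 range per field."""
    return [(value - lo) / span for value, (lo, span) in zip(row, scaler)]


def nearest_user(row, training):
    """Return the user id of the closest training row (1-nearest neighbour)."""
    def squared_distance(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], row))
    return min(training, key=squared_distance)[1]


def run_experiment(n_users=100, tweets_per_user=5, seed=0):
    """Train on labelled tweets, then identify the authors of unseen ones."""
    rng = random.Random(seed)
    profiles = [make_profile(rng) for _ in range(n_users)]
    # Known, labelled tweets used for training.
    train = [(make_tweet_metadata(p, rng), uid)
             for uid, p in enumerate(profiles) for _ in range(tweets_per_user)]
    # New tweets from the same users, used for evaluation.
    test = [(make_tweet_metadata(p, rng), uid)
            for uid, p in enumerate(profiles) for _ in range(tweets_per_user)]
    scaler = fit_scaler([row for row, _ in train])
    train_scaled = [(scale(row, scaler), uid) for row, uid in train]
    correct = sum(nearest_user(scale(row, scaler), train_scaled) == uid
                  for row, uid in test)
    return correct / len(test)


print(f"identification accuracy: {run_experiment():.3f}")
```

The point of the sketch is that no tweet text is involved at any stage. The classifier sees only numbers such as follower counts and account creation times, and that is enough to tell the synthetic users apart.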
Trying to anonymise the data collected by social networks isn’t the answer, says Perez. “It’s very hard to anonymise a data set,” she explains. Triangulation using one or more sets of data is easy to do, and can often undo any attempts to remove identifying information.
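A toy illustration of that kind of triangulation, sketched in Python under loud assumptions: the records, handles and field names below are all invented, and the matching rule (an exact match on two metadata fields) is the simplest possible one rather than anything described in the paper.

```python
# "Anonymised" release: handles stripped, metadata left in place (invented values).
released = [
    {"account_created": "2011-03-14", "followers": 212},
    {"account_created": "2014-07-02", "followers": 58},
    {"account_created": "2016-11-20", "followers": 1020},
]

# A second, public dataset that still carries handles (also invented).
public = [
    {"handle": "@alice_example", "account_created": "2011-03-14", "followers": 212},
    {"handle": "@bob_example", "account_created": "2014-07-02", "followers": 58},
    {"handle": "@carol_example", "account_created": "2016-11-20", "followers": 1020},
    {"handle": "@dave_example", "account_created": "2016-11-20", "followers": 1020},
]


def link(released_rows, public_rows, keys):
    """Re-identify released rows whose key fields match exactly one public row."""
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["handle"])
    matches = {}
    for position, row in enumerate(released_rows):
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means the row is re-identified
            matches[position] = candidates[0]
    return matches


matches = link(released, public, keys=("account_created", "followers"))
print(matches)
```

The first two released rows each match a single handle and are re-identified. The third matches two handles and stays ambiguous, which is the only protection it has.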
Perez and her colleagues proved that by obfuscating the dataset they had from Twitter, removing some fields to try to make it more difficult for their system to pinpoint individuals. “If we had a few data points not blurred, it was still easy,” she says. The identification rate stayed largely stable right up to the point at which all unique elements were removed and it became impossible to discern one person from any other.
Things are likely to improve following the introduction of GDPR in late May. “I think we’re going to find more scrutiny around metadata,” explains Pat Walshe, a data protection consultant. Article 25 of GDPR calls for “data protection by design and by default”. That requirement, also called data minimisation, means companies may process only the specific data needed to carry out a task.
But the bigger question, beyond whether it’s right or not that companies can hold so much identifying information about us all, is whether the average person values their privacy in the first place. “For sure, the average user should care,” says Zannettou. “But I’m sceptical if they do.”