This is the abstract of a current privacy conference proposal.
Many Big Data and online businesses proceed on a naive assumption that data in the "public domain" is up for grabs; technocrats are often surprised that conventional data protection laws can be interpreted to cover the extraction of PII from raw data. On the other hand, orthodox privacy frameworks don't cater for the way PII can be created in future from raw data collected today. This presentation will bridge the conceptual gap between data analytics and privacy, and offer new dynamic consent models to civilize the trade in PII for goods and services.
It’s often said that technology has outpaced privacy law, yet by and large that's just not the case. Technology has certainly outpaced decency, with Big Data and biometrics in particular becoming increasingly invasive. However OECD data privacy principles set out over thirty years ago still serve us well. Outside the US, rights-based privacy law has proven effective against today's technocrats' most worrying business practices, based as they are on taking liberties with any data that comes their way. To borrow from Niels Bohr, technologists who are not surprised by data privacy have probably not understood it.
The cornerstone of data privacy in most places is the Collection Limitation principle, which holds that organizations should not collect Personally Identifiable Information beyond their express needs. It is the conceptual cousin of security's core Need-to-Know Principle, and the best starting point for Privacy-by-Design. The Collection Limitation principle is technology neutral and thus blind to the manner of collection. Whether PII is collected directly by questionnaire or indirectly via biometric facial recognition or data mining, data privacy laws apply.
If anonymity is important, what is the legal basis for defending it?
I find that conventional data privacy law in most places around the world already protects anonymity, insofar as the act of de-anonymization represents an act of PII Collection - the creation of a named record. As such, de-anonymization cannot be lawfully performed without an express need to do so, or consent.
Cynics have been asking the same rhetorical question "is privacy dead?" for at least 40 years. Certainly information technology and ubiquitous connectivity have made it nearly impossible to hide, and so anonymity is critically ill. But privacy is not the same thing as secrecy; privacy is a state where those who know us, respect the knowledge they have about us. Privacy generally doesn't require us hiding from anyone; it requires restraint on the part of those who hold Personal Information about us.
The typical public response to data breaches, government surveillance and invasions like social media facial recognition is vociferous. People in general energetically assert their rights not to be tracked online, and not to have their personal information exploited behind their backs. These reactions show that the idea of privacy is alive and well.
The end of anonymity perhaps
Against a backdrop of spying revelations and excesses by social media companies especially in regards to facial recognition, there have been recent calls for a "new jurisprudence of anonymity"; see Yale law professor Jed Rubenfeld writing in the Washington Post of 13 Jan 2014. I wonder if there is another way to crack the nut? Because any new jurisprudence is going to take a very long time.
Instead, I suggest we leverage the way most international privacy law and privacy experience -- going back decades -- is technology neutral with regard to the method of collection. In some jurisdictions like Australia, the term "collection" is not even defined in privacy law. Instead, the law just uses the normal plain English sense of the word, when it frames principles like Collection Limitation: basically, you are not allowed to collect (by any means) Personally Identifiable Information without a good, express reason. It means that if PII gets into a data system, the system is accountable under privacy law for that PII, no matter how it got there.
This technology neutral view of PII collection has satisfying ramifications for all the people who intuit that Big Data has got too "creepy". We can argue that if a named record is produced afresh by a Big Data process (especially if that record is produced without the named person being aware of it, and from raw data that was originally collected for some other purpose) then that record has logically been collected. Whether PII is collected directly, or collected indirectly, or is in fact created by an obscure process, privacy law is largely agnostic.
Prof Rubenfeld wrote:
- "The NSA program isn’t really about gathering data. It's about mining data. All the data are there already, digitally stored and collected by telecom giants, just waiting." [italics in original]
I suggest that the output of the data mining, if it is personally identifiable and especially if it has been rendered identifiable by processing previously anonymous raw data, is a fresh collection by the mining operation. As such, the miners should be accountable for their newly minted PII, just as though they had gathered it directly from the persons concerned.
For now, I don't want to go further and argue the rights and wrongs of surveillance. I just want to show a new way to frame the privacy questions in surveillance and big data, making use of existing jurisprudence. If I am right and the NSA is in effect collecting PII as it goes about its data mining, then that provides a possibly fresh understanding of what's going on, within which we can objectively analyse the rights and wrongs.
I am actually the first to admit that within this frame, the NSA might still be justified in mining data, and there might be no actual technical breach of information privacy law, if for instance the NSA enjoys a law enforcement exemption. These are important questions that need to be debated, but elsewhere (see my recent blog on our preparedness to actually have such a debate). My purpose right now is to frame a way to defend anonymity using as much existing legal infrastructure as possible.
But Collection is not limited everywhere
There is an important legal-technical question in all this: Is the collection of PII actually regulated? In Europe, Australia, New Zealand and in dozens of countries, collection is limited, but in the USA, there is no general restriction against collecting PII. America has no broad data protection law, and in any case, the Fair Information Practice Principles (FIPPs) don't include a Collection Limitation principle.
So there may be few regulations in the USA that would carry my argument there! Nevertheless, surely we can use international jurisprudence in Collection Limitation instead of creating new American jurisprudence around anonymity?
So I'd like to put the following questions to Jed Rubenfeld:
- Do technology neutral Collection Limitation Principles in theory provide a way to bring de-anonymised data into scope for data privacy laws? Is this a way to address people's concerns with Big Data?
- How does international jurisprudence around Collection Limitation translate to American schools of legal thought?
- Does this way of looking at the problem create new impetus for Collection Limitation to be introduced into American privacy principles, especially the FIPPs?
Appendix: "Applying Information Privacy Norms to Re-Identification"
In 2013 I presented some of these ideas to an online symposium at the Harvard Law School Petrie-Flom Center, on the Law, Ethics & Science of Re-identification Demonstrations. What follows is an extract from that presentation, in which I spell out carefully the argument -- which was not obvious to some at the time -- that when genetics researchers combine different data sets to demonstrate re-identification of donated genomic material, they are in effect collecting patient PII. I argue that this type of collection should be subject to ethics committee approval just as if the researchers were collecting the identities from the patients directly.
... I am aware of two distinct re-identification demonstrations that have raised awareness of the issues recently. In the first, Yaniv Erlich [at MIT's Whitehead Institute] used what I understand are new statistical techniques to re-identify a number of subjects that had donated genetic material anonymously to the 1000 Genomes project. He did this by correlating genes in the published anonymous samples with genes in named samples available from genealogical databases. The 1000 Genomes consent form reassured participants that re-identification would be "very hard". In the second notable demo, Latanya Sweeney re-identified volunteers in the Personal Genome Project using her previously published method of using a few demographic values (such as date of birth, sex and postal code) extracted from the otherwise anonymous records.
A great deal of the debate around these cases has focused on the consent forms and the research subjects’ expectations of anonymity. These are important matters for sure, yet for me the ethical issue in de-anonymisation demonstrations is more about the obligations of third parties doing the identification who had nothing to do with the original informed consent arrangements. The act of recording a person’s name against erstwhile anonymous data represents a collection of personal information. The implications for genomic data re-identification are clear.
Let’s consider Subject S who donates her DNA, ostensibly anonymously, to a Researcher R1, under some consent arrangement which concedes there is a possibility that S will be re-identified. And indeed, some time later, an independent researcher R2 does identify S and links her to the DNA sample. The fact is that R2 has collected personal information about S. If R2 has no relationship with S, then S has not consented to this new collection of her personal information.
Even if the consent form signed at the time of the original collection includes a disclaimer that absolute anonymity cannot be guaranteed, re-identifying the DNA sample later represents a new collection, one that has been undertaken without any consent. Given that S has no knowledge of R2, there can be no implied consent in her original understanding with R1, even if absolute anonymity was disclaimed.
Naturally the re-identification demonstrations have served a purpose. It is undoubtedly important that the limits of anonymity be properly understood, and the work of Yaniv and Latanya contributes to that. Nevertheless, these demonstrations were undertaken without the knowledge, much less the consent, of the individuals concerned. I contend that bioinformaticians using clever techniques to attach names to anonymous samples need ethics approval, just as they would if they were taking fresh samples from the people concerned.
See also my letter to the editor of Science magazine.
Yesterday it was reported by The Verge that anonymous hackers have accessed Snapchat's user database and posted 4.6 million user names and phone numbers. In an apparent effort to soften the blow, two digits of the phone numbers were redacted. So we might assume this is a "white hat" exercise, designed to shame Snapchat into improving their security. Indeed, a few days ago Snapchat themselves said they had been warned of vulnerabilities in their APIs that would allow a mass upload of user records.
The response of many has been, well, so what? Some people have casually likened Snapchat's list to a public White Pages; others have played it down as "just email addresses".
Let's look more closely. The leaked list was not in fact public names and phone numbers; it was user names and phone numbers. User names might often be email addresses but these are typically aliases; people frequently choose email addresses that reveal little or nothing of their real world identity. We should assume there is intent in an obscure email address for the individual to remain secret.
Identity theft has become a highly organised criminal enterprise. Crime gangs patiently acquire multiple data sets over many months, sometimes years, gradually piecing together detailed personal profiles. It's been shown time and time again by privacy researchers (perhaps most notably Latanya Sweeney) that re-identification is enabled by linking diverse data sets. And for this purpose, email addresses and phone numbers are superbly valuable indices for correlating an individual's various records. Your email address is common across most of your social media registrations. And your phone number allows your real name and street address to be looked up from reverse White Pages. So the Snapchat breach could be used to join aliases or email addresses to real names and addresses via the phone numbers. For a social engineering attack on a call centre -- or even to open a new bank account -- an identity thief can go an awful long way with real name, street address, email address and phone number.
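To make the linkage concrete, here is a minimal sketch of how a phone number can act as a join key between an alias list and a reverse directory. Everything in it is hypothetical: the records, the field names and the `link_records` helper are invented for illustration, and real directories and breach dumps are of course far messier.

```python
# Hypothetical data: every name, number and field here is invented.
leaked_accounts = [
    {"username": "nightowl77", "phone": "5550100"},
    {"username": "quietfox", "phone": "5550199"},
]

# Stand-in for a reverse White Pages lookup, keyed by phone number.
reverse_directory = {
    "5550100": {"name": "Alice Example", "address": "1 Sample Street"},
}


def link_records(accounts, directory):
    """Join alias records to directory entries on the shared phone number."""
    linked = []
    for account in accounts:
        entry = directory.get(account["phone"])
        if entry is not None:
            # The alias now sits in the same record as a real name and address.
            linked.append({**account, **entry})
    return linked


profiles = link_records(leaked_accounts, reverse_directory)
for profile in profiles:
    print(profile)
```

Only the first alias is matched in this toy example, but the point stands: neither list names the person behind the alias on its own, and the join does.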
I was asked in an interview to compare the theft of phone numbers with the theft of social security numbers. I surprised the interviewer when I said phone numbers are probably even more valuable to the highly organised ID thief, for they can be used to index names in public directories, and to link different data sets, in ways that SSNs (or credit card numbers for that matter) cannot.
So let us start to treat all personal information -- especially when aggregated in bulk -- more seriously! And let's be more cautious in the way we categorise personal or Personally Identifiable Information (PII).
Importantly, most regulatory definitions of PII already embody the proper degree of caution. Look carefully at the US government definition of Personally Identifiable Information:
- information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other personal or identifying information that is linked or linkable to a specific individual (underline added).
This means that items of data can constitute PII if other data can be combined to identify the person concerned. That is, the fragments are regarded as PII even if it is the whole that does the identifying.
And remember that the middle I in PII stands for Identifiable, and not, as many people presume, Identifying. To meet the definition of PII, data need not uniquely identify a person, it merely needs to be directly or indirectly identifiable with a person. And this is how it should be when we heed the way information technologies enable identification through linkages.
Almost anywhere else in the world, data stores like Snapchat's would automatically fall under data protection and information privacy laws; regulators would take a close look at whether the company had complied with the OECD Privacy Principles, and whether Snapchat's security measures were fit for purpose given the PII concerned. But in the USA, companies and commentators alike still have trouble working out how serious these breaches are. Each new breach is treated in an ad hoc manner, often with people finessing the difference between credit card numbers -- as in the recent Target breach -- and "mere" email addresses like those in the Snapchat and Epsilon episodes.
Surely the time has come to simply give proper regulatory protection to all PII.
The cover of Newsweek magazine on 27 July 1970 featured an innocent couple being menaced by cameras and microphones and new technologies like computer punch cards and paper tape. The headline hollered “IS PRIVACY DEAD?”.
The same question has been posed every few years ever since.
In 1999, Sun Microsystems boss Scott McNealy told us we have “zero privacy” and should “get over” it; in 2008, Ed Giorgio from the Office of the US Director of National Intelligence chillingly asserted that “privacy and security are a zero-sum game”; Facebook’s Mark Zuckerberg proclaimed in 2010 that privacy was no longer a “social norm”. And now the scandal around secret surveillance programs like PRISM and the Five Eyes’ related activities looks like another fatal blow to privacy. But the fact that cynics, security zealots and information magnates have been asking the same rhetorical question for over 40 years suggests that the answer is No!
PRISM, as revealed by whistle blower Ed Snowden, is a Top Secret electronic surveillance program of the US National Security Agency (NSA) to monitor communications traversing most of the big Internet properties including, allegedly, Apple, Facebook, Google, Microsoft, Skype, Yahoo and YouTube. Relatedly, intelligence agencies have evidently also been obtaining comprehensive call records from major telephone companies, eavesdropping on international optic fibre cables, and breaking into the cryptography many take for granted online.
In response, forces lined up at tweet speed on both sides of the stereotypical security-privacy divide. The “hawks” say privacy is a luxury in these times of terror, if you've done nothing wrong you have nothing to fear from surveillance, and in any case, much of the citizenry evidently abrogates privacy in the way they take to social networking. On the other side, libertarians claim this indiscriminate surveillance is the stuff of the Stasi, and by destroying civil liberties, we let the terrorists win.
Governments of course are caught in the middle. President Obama defended PRISM on the basis that we cannot have 100% security and 100% privacy. Yet frankly that’s an almost trivial proposition. It's motherhood. And it doesn’t help to inform any measured response to the law enforcement challenge, for we don’t have any tools that would let us design a computer system to an agreed specification in the form of, say “98% Security + 93% Privacy”. It’s silly to use the language of “balance” when we cannot measure the competing interests objectively.
Politicians say we need a community debate over privacy and national security, and they’re right (if not fully conscientious in framing the debate themselves). Are we ready to engage with these issues in earnest? Will libertarians and hawks venture out of their respective corners in good faith, to explore this difficult space?
I suggest one of the difficulties is that all sides tend to confuse privacy with secrecy. They’re not the same thing.
Privacy is a state of affairs where those who have Personal Information (PII) about us are constrained in how they use it. In daily life, we have few absolute secrets, but plenty of personal details. Not many people wish to live their lives underground; on the contrary we actually want to be well known by others, so long as they respect what they know about us. Secrecy is a sufficient but not necessary condition for privacy. Robust privacy regulations mandate strict limits on what PII is collected, how it is used and re-used, and how it is shared.
Therefore I am a privacy optimist. Yes, obviously too much PII has broken the banks in cyberspace, yet it is not necessarily the case that any “genie” is “out of the bottle”.
If PII falls into someone’s hands, privacy and data protection legislation around the world provides strong protection against re-use. For instance, in Australia Google was found to have breached the Privacy Act when its StreetView cars recorded unencrypted Wi-Fi transmissions; the company cooperated in deleting the data concerned. In Europe, Facebook’s generation of tag suggestions without consent by biometric processes was ruled unlawful; regulators there forced Facebook to cease facial recognition and delete all old templates.
We might have a better national security debate if we more carefully distinguished privacy and secrecy.
I see no reason why Big Data should not be a legitimate tool for law enforcement. I have myself seen powerful analytical tools used soon after a terrorist attack to search out patterns in call records in the vicinity to reveal suspects. Until now, there has not been the technological capacity to use these tools pro-actively. But with sufficient smarts, raw data and computing power, it is surely a reasonable proposition that – with proper and transparent safeguards in place – population-wide communications metadata can be screened to reveal organised crimes in the making.
A more sophisticated and transparent government position might ask the public to give up a little secrecy in the interests of national security. The debate should not be polarised around the falsehood that security and privacy are at odds. Instead we should be debating and negotiating appropriate controls around selected metadata to enable effective intelligence gathering while precluding unexpected re-use. If (and only if) credible and verifiable safeguards can be maintained to contain the use and re-use of personal communications data, then so can our privacy.
For me the awful thing about PRISM is not that metadata is being mined; it’s that we weren’t told about it. Good governments should bring the citizenry into their confidence.
Are we prepared to honestly debate some awkward questions?
- Has the world really changed in the past 10 years such that surveillance is more necessary now? Should the traditional balances of societal security and individual liberties enshrined in our traditional legal structures be reviewed for a modern world?
- Has the Internet really changed the risk landscape, or is it just another communications mechanism? Is the Internet properly accommodated by centuries-old constitutions?
- How can we have confidence in government authorities to contain their use of communications metadata? Is it possible for trustworthy new safeguards to be designed?
Many years ago, cryptographers adopted a policy of transparency. They have forsaken secret encryption algorithms, so that the maths behind these mission critical mechanisms is exposed to peer review and ongoing scrutiny. Secret algorithms are fragile in the long term because it’s only a matter of time before someone exposes them and weakens their effectiveness. Security professionals have a saying: “There is no security in obscurity”.
For precisely the same reason, we must not have secret government monitoring programs either. If the case is made that surveillance is a necessary evil, then it would actually be in everyone’s interests for governments to run their programs out in the open.
As we head towards 2014, de-identification of personal data sets is going to be a hot issue. I saw several things at last week's Constellation Connected Enterprise conference (CCE) that will make sure of this!
First, recall that in Australia a new definition of Personal Information (PI or "PII") means that anonymous data that can potentially be re-identified in future may have to be classified as PII today. I recently discussed how security and risk practitioners can deal with the uncertainty in re-identifiability.
And there's a barrage of new tracking, profiling and interior geo-location technologies (like Apple's iBeacon) which typically come with a promise of anonymity. See for example Tesco's announcement of face scanning for targeting adverts at their UK petrol stations.
The promise of anonymity is crucial, but it is increasingly hard to keep. Big Data techniques that join de-identified information to other data sets are able to find correlations and reverse the anonymisation process. The science of re-identification started with the work of Dr Latanya Sweeney who famously identified a former governor and his medical records using zip codes and electoral roll data; more recently we've seen DNA "hackers" who can unmask anonymous DNA donors by joining genomic databases to public family tree information.
At CCE we saw many exciting Big Data developments, which I'll explore in more detail in coming weeks. Business Intelligence as-a-service is expanding rapidly, and is being flipped by innovative vendors to align (whether consciously or not) with customer centric Vendor Relationship Management models of doing business. And there are amazing new tools for enriching unstructured data, like newly launched Paxata's Adaptive Data Preparation Platform. More to come.
With the ability to re-identify data comes Big Responsibilities. I believe that to help businesses meet their privacy promises, we're going to need new tools to measure de-identification and hence gauge the risk of re-identification. It seems that some new generation data analytics products will allow us to run what-if scenarios to help understand the risks.
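As a rough illustration of what such a measurement might look like, the sketch below computes two simple indicators over a de-identified table: the size of the smallest group of records sharing the same quasi-identifier values (the "k" of k-anonymity) and the fraction of records that are unique on those values. This is my own toy example, not a description of any vendor's product; the function names, field names and sample records are all assumptions made for illustration.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group; k = 1 means at least one record stands alone."""
    return min(group_sizes(records, quasi_identifiers).values())


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination is unique."""
    sizes = group_sizes(records, quasi_identifiers)
    singles = sum(1 for count in sizes.values() if count == 1)
    return singles / len(records)


# Hypothetical de-identified records (no names, only demographic values).
sample_records = [
    {"birth_year": 1970, "sex": "F", "postcode": "2000"},
    {"birth_year": 1970, "sex": "F", "postcode": "2000"},
    {"birth_year": 1985, "sex": "M", "postcode": "2010"},
    {"birth_year": 1985, "sex": "M", "postcode": "2011"},
]

# What-if scenario: compare releasing the postcode with withholding it.
with_postcode = ["birth_year", "sex", "postcode"]
without_postcode = ["birth_year", "sex"]

print(k_anonymity(sample_records, with_postcode),
      unique_fraction(sample_records, with_postcode))      # 1 0.5
print(k_anonymity(sample_records, without_postcode),
      unique_fraction(sample_records, without_postcode))   # 2 0.0
```

In this toy table, releasing the postcode leaves half the records unique, while withholding it puts every record in a group of at least two. Indicators like these only describe the table itself; actual re-identification risk also depends on what other data sets an attacker can join against.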
Just before CCE I also came across some excellent awareness raising materials from Voltage Security in Cupertino. Voltage CTO Terence Spies shared with me his "Deidentification Taxonomy" reproduced here with his kind permission. Voltage are leaders in Format Preserving Encryption and Tokenization -- typically used to hide credit card numbers from thieves in payment systems -- and they're showing how the tools may be used more broadly for de-identifying databases. I like the way Terence has characterised the reversibility (or not) of de-identification approaches, and further broken out various tokenization technologies.
Reference: Voltage Security. Reproduced with permission.
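To give a feel for what reversibility means in practice, here is a toy sketch of vault-based tokenization, in which each card number is swapped for a random string of digits of the same length and the mapping is kept in a lookup table. To be clear, this is not Format Preserving Encryption and not how Voltage's products work; the `TokenVault` class and its methods are hypothetical names of my own, and a real system would need secure storage, access control and much more.

```python
import secrets


class TokenVault:
    """Toy tokenizer: replaces a digit string with a random token of equal length."""

    def __init__(self):
        self._token_for = {}   # original value -> token
        self._value_for = {}   # token -> original value

    def tokenize(self, card_number):
        # Reuse the existing token so the same input always maps to the same output.
        if card_number in self._token_for:
            return self._token_for[card_number]
        while True:
            token = "".join(secrets.choice("0123456789") for _ in card_number)
            # Retry on the rare chance of a collision or of echoing the input.
            if token not in self._value_for and token != card_number:
                break
        self._token_for[card_number] = token
        self._value_for[token] = card_number
        return token

    def detokenize(self, token):
        # Reversal is only possible for whoever holds the vault.
        return self._value_for[token]


vault = TokenVault()
token = vault.tokenize("4111111111111111")
print(token)                    # a random 16-digit string
print(vault.detokenize(token))  # the original number
```

In this sketch the de-identification is reversible only via the vault: destroy or withhold the table and the tokens reveal nothing about the originals, whereas an encryption-based scheme is reversible by anyone who holds the key.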
These are the foundations of the important new science of de-identification. Privacy engineers need to work hard at understanding re-identification, so that consumers do not lose faith in the important promises that so much of the data collected from their daily movements through cyberspace is indeed anonymous.
The cover of Newsweek magazine on 27 July 1970 featured a cartoon couple cowed by computer and communications technology, and the urgent all-caps headline “IS PRIVACY DEAD?”
Four decades on, Newsweek is dead, but we’re still asking the same question.
Every generation or so, our notions of privacy are challenged by a new technology. In the 1880s (when Warren and Brandeis developed the first privacy jurisprudence) it was photography and telegraphy; in the 1970s it was computing and consumer electronics. And now it’s the Internet, a revolution that has virtually everyone connected to everyone else (and soon everything) everywhere, and all of the time. Some of the world’s biggest corporations now operate with just one asset – information – and a vigorous “publicness” movement rallies around the purported liberation of shedding what are said by writers like Jeff Jarvis (in his 2011 book “Public Parts”) to be old fashioned inhibitions. Online Social Networking, e-health, crowd sourcing and new digital economies appear to have shifted some of our societal fundamentals.
However the past decade has seen a dramatic expansion of countries legislating data protection laws, in response to citizens’ insistence that their privacy is as precious as ever. And consumerized cryptography promises absolute secrecy. Privacy has long stood in opposition to the march of invasive technology: it is the classical immovable object met by an irresistible force.
So how robust is privacy? And will the latest technological revolution finally change privacy forever?
Soaking in information
We live in a connected world. Young people today may have grown tired of hearing what a difference the Internet has made, but a crucial question is whether relatively new networking technologies and sheer connectedness are exerting novel stresses to which social structures have yet to adapt. If “knowledge is power” then the availability of information probably makes individuals today more powerful than at any time in history. Search, maps, Wikipedia, Online Social Networks and 3G are taken for granted. Unlimited deep technical knowledge is available in chat rooms; universities are providing a full gamut of free training via Massive Open Online Courses (MOOCs). The Internet empowers many to organise in ways that are unprecedented, for political, social or business ends. Entirely new business models have emerged in the past decade, and there are indications that political models are changing too.
Most mainstream observers still tend to talk about the “digital” economy but many think the time has come to drop the qualifier. Important services and products are, of course, becoming inherently digital and whole business categories such as travel, newspapers, music, photography and video have been massively disrupted. In general, information is the lifeblood of most businesses. There are countless technology-billionaires whose fortunes have been made in industries that did not exist twenty or thirty years ago. Moreover, some of these businesses only have one asset: information.
Banks and payments systems are getting in on the action, innovating at a hectic pace to keep up with financial services development. There is a bewildering array of new alternative currencies like Linden dollars, Facebook Credits and Bitcoins – all of which can be traded for “real” (reserve bank-backed) money in a number of exchanges of varying reputation. At one time it was possible for Entropia Universe gamers to withdraw dollars at ATMs against their virtual bank balances.
New ways to access finance have arisen, such as peer-to-peer lending and crowd funding. Several so-called direct banks in Australia exist without any branch infrastructure. Financial institutions worldwide are desperate to keep up, launching amongst other things virtual branches and services inside Online Social Networks (OSNs) and even virtual worlds. Banks are of course keen to not have too many sales conducted outside the traditional payments system where they make their fees. Even more strategically, banks want to control not just the money but the way the money flows, because it has dawned on them that information about how people spend might be even more valuable than what they spend.
Privacy in an open world
For many of us, on a personal level, real life is a dynamic blend of online and physical experiences. The distinction between digital relationships and flesh-and-blood ones seems increasingly arbitrary; in fact we probably need new words to describe online and offline interactions more subtly, without implying a dichotomy.
Today’s privacy challenges are about more than digital technology: they really stem from the way the world has opened up. The enthusiasm of many for such openness – especially in Online Social Networking – has been taken by some commentators as a sign of deep changes in privacy attitudes. Facebook's Mark Zuckerberg for instance said in 2010 that “People have really gotten comfortable not only sharing more information and different kinds, but more openly and with more people - and that social norm is just something that has evolved over time”. And yet serious academic investigation of the Internet’s impact on society is (inevitably) still in its infancy. Social norms are constantly evolving but it’s too early to tell if they have reached a new and more permissive steady state. The views of information magnates in this regard should be discounted given their vested interest in their users' promiscuity.
At some level, privacy is about being closed. And curiously for a fundamental human right, the desire to close off parts of our lives is relatively fresh. Arguably it’s even something of a “first world problem”. Formalised privacy appears to be an urban phenomenon, unknown as such to people in villages where everyone knew everyone – and their business. It was only when large numbers of people congregated in cities that they became concerned with privacy. For then they felt the need to structure the way they related to large numbers of people – family, friends, work mates, merchants, professionals and strangers – in multi-layered relationships. So privacy was born of the first industrial revolution. It has taken prosperity and active public interest to create the elaborate mechanisms that protect our personal privacy from day to day and which we take for granted today: the postal services, direct dial telephones, telecommunications regulations, individual bedrooms in large houses, cars in which we can escape for a while, and now of course the mobile handset.
Privacy is about respect and control. Simply put, if someone knows me, then they should respect what they know; they should exercise restraint in how they use that knowledge, and be guided by my wishes. Generally, privacy is not about anonymity or secrecy. Of course, if we live life underground then unqualified privacy can be achieved, yet most of us exist in diverse communities where we actually want others to know a great deal about us. We want merchants to know our shipping address and payment details, healthcare providers to know our intimate details, hotels to know our travel plans and so on. Practical privacy means that personal information is not shared arbitrarily, and that individuals retain control over the tracks of their lives.
Big Data: Big Future
Big Data tools are being applied everywhere, from sifting telephone call records to spot crimes in the planning, to DNA and medical research. Every day, retailers use sophisticated data analytics to mine customer data, ostensibly to better uncover true buyer sentiments and continuously improve their offerings. Some department stores are interested in predicting such major life changing events as moving house or falling pregnant, because then they can target whole categories of products to their loyal customers.
Real time Big Data will become embedded in our daily lives, through several synchronous developments. Firstly, computing power, storage capacity and high speed Internet connectivity all continue to improve at exponential rates. Secondly, there are more and more “signals” for data miners to choose from. No longer do you have to consciously tell your OSN what you like or what you’re doing, because new augmented reality devices are automatically collecting audio, video and locational data, and trading it around a complex web of digital service providers. And miniaturisation is leading to a whole range of smart appliances, smart cars and even smart clothes with built-in or ubiquitous computing.
The privacy risks are obvious, and yet the benefits are huge. So how should we think about the balance in order to optimise the outcome? Let’s remember that information powers the new digital economy, and the business models of many major new brands like Facebook, Twitter, Foursquare and Google incorporate a bargain for Personal Information. We obtain fantastic services from these businesses “for free” but in reality they are enabled by all that information we give out as we search, browse, like, friend, tag, tweet and buy.
The more innovation we see ahead, the more certain it seems that data will be the core asset of cyber enterprises. To retain and even improve our privacy in the unfolding digital world, we must be able to visualise the data flows that we’re engaged in, evaluate what we get in return for our information, and determine a reasonable trade of costs and benefits.
Is Privacy Dead? If the same rhetorical question needs to be asked over and over for decades, then it’s likely the answer is no.
I was invited to give a speech to launch Australian Privacy Awareness Week #2013PAW on April 29. This is an edited version of my speaking notes.
What does privacy mean to technologists?
I'm a technologist who stumbled into privacy. Some 12 years ago I was doing a big security review at a utility company. Part of their policy document set was a privacy statement posted on the company's website. I was asked to check it out. It said things like 'We collect the following information about you [the customer] ... If you ever want a copy of the information we have about you, please call the Privacy Officer ...'. I had a hunch this was problematic, so I took the document to the chief IT architect. He had never seen the privacy statement before, so that was the first problem. Moreover, he advised there was no way they could readily furnish complete customer details, for their CRM databases were all over the place. So IT was disenfranchised in the privacy statement, and the undertakings it contained were impractical.
Clearly there was a lot going on in privacy that we technologists needed to know. So with an inquiring mind, I took it upon myself to read the Privacy Act. And I was amazed by what I found. In fact I wrote a paper in 2003 about the ramifications for IT of the 10 National Privacy Principles, and that kicked off my privacy sub-career.
Ever since I've found time and time again a shortfall in the understanding that "technologists" as a class have regarding data privacy. There is a gap between technology and the law. IT professionals may receive privacy training but as soon as they hear the well-meaning slogan "Privacy Is Not A Technology Issue" they tend to say 'thank god: that's one thing I don't need to worry about'. Conversely, privacy laws are written with some naivety about how information flows in modern IT and how it aggregates automatically in standard computer systems. For instance, several clauses in Australian privacy law refer expressly to making 'annotations' in the 'records' as if they're all paper based, with wide margins.
The gap is perpetuated to some extent by the popular impression that the law has not kept up with the march of technology. As a technologist, I have to say I am not cynical about the law; I actually find that principles-based data privacy law anticipates almost all of the current controversies in cyberspace (though not quite all, as we shall see).
So let's look at a couple of simple technicalities that technologists don't often comprehend.
What Privacy Law actually says
Firstly there is the very definition of Personal Information. Lay people and engineers tend to intuit that Personal Information [or equivalently what is known in the US as Personally Identifiable Information] is the stuff of forms and questionnaires and call centres. So technologists can be surprised that the definition of Personal Information covers a great deal more. Look at the definition from the Australian federal Privacy Act:
Information or an opinion, whether true or not, about an individual whose identity is apparent, or can reasonably be ascertained, from the information or opinion.
So if metadata or event logs in a computer system are personally identifiable, then they constitute Personal Information, even if this data has been completely untouched by human hands.
Then there is the crucial matter of collection. Our privacy legislation, like that of most OECD countries, is technology neutral with regard to the manner of collection of Personal Information. Indeed, the term "collection" is not defined in the Privacy Act. The word is used in its plain English sense. So if Personal Information has wound up in an information system, it doesn't matter if it was gathered directly from the individual concerned, or whether it has instead been imported or found in the public domain or generated almost from scratch by some algorithm: the Personal Information has been collected and as such is covered by the Collection Principle of the Privacy Act. That is to say:
An organisation must not collect Personal Information unless the information is necessary for one or more of its functions or activities.
Editorial Note: One of the core differences between most international privacy law and the American environment is that there is no Collection Limitation in the Fair Information Practice Principles (FIPPs). The OECD approach tries to head privacy violations "off at the pass" by discouraging collection of PII if it is not expressly needed, but in the US business sector there is no such inhibition.
Now let's look at some of the missteps that have resulted from technologists accidentally overlooking these technicalities (or perhaps technocrats more deliberately ignoring them).
1. Google StreetView Wi-Fi collection
Google StreetView cars collect Wi-Fi hub coordinates (as landmarks for Google's geo-location services). On their own, Wi-Fi locations are unidentified, but it was found that the StreetView software was also inadvertently collecting Wi-Fi network traffic, some of which contained Personal Information (like user names and even passwords). The Australian and Dutch Privacy Commissioners found Google was in breach of their respective data protection laws.
Many technologists, I found, argued that Wi-Fi data in the "public domain" is not private, and "by definition" (so they liked to say) it categorically could not be private. Therefore they believed Google was within its rights to do whatever it liked with such data. But the argument fails to grasp the technicality that our privacy laws basically do not distinguish "public" from "private". In fact the words "public" and "private" are not operable in the Privacy Act (which is really more of a data protection law). If data is identifiable, then privacy sanctions attach to it.
The lesson for Big Data privacy is this: it doesn't much matter if Personal Information is sourced from the public domain: you are still subject to Collection and Use Limitation principles (among others) once it is in your custody.
2. Facebook facial recognition
Facebook photo tagging creates biometric templates used to subsequently generate tag suggestions. Before displaying suggestions, Facebook's facial recognition algorithms run in the background over all photo albums. When they make a putative match and record a deduced name against a hitherto anonymous piece of image data, the Facebook system has collected Personal Information.
European privacy regulators in mid 2012 found biometric data collection without consent to be a serious breach, and by late 2012 had forced Facebook to shut down facial recognition and tag suggestions in the EU. This was quite a show of force over one of the most powerful companies of the digital age.
The lesson for Big Data privacy is this: it doesn't much matter if you generate Personal Information almost out of thin air, using sophisticated data processing algorithms: you are still subject to Privacy Principles, such as Openness as well as Collection and Use Limitation.
3. Target's pregnancy predictions
The department store Target in the US was found by New York Times investigative journalists to be experimenting with statistical methods for identifying that a regular customer is likely to be pregnant, by looking for trends in her buying habits. Retail strategists are keen to win the loyalty of pregnant women so as to secure their lucrative business through the expensive early years of parenting.
There are all sorts of issues here. One technicality I wish to draw out is that in Australia, the privacy implications would be amplified by the fact that tagging someone in a database as pregnant [even if that prediction is wrong!] creates health information, and therefore represents a collection of Sensitive Information. Express informed consent is required in advance of collecting Sensitive Information. So if Australian stores want to use Big Data techniques, they may need to disclose to their customers up front that health information might be extracted by mining their buying habits, and obtain express consent for the algorithms to run. Remember Australia sets a low bar for privacy breaches: simply collecting Sensitive Personal Information may be a breach even before it is used for anything or disclosed.
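The consent-before-inference idea can be sketched in a few lines of Python. This is a minimal illustration only, not any retailer's actual system: the `Customer` record, the `health_inference` consent label, the signal products and the toy `predict_pregnancy` rule are all hypothetical. The point it tries to show is that writing the inferred flag into the record is itself the collection, so the consent check has to come before the algorithm runs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Hypothetical signal products; a real model would be statistical.
PREGNANCY_SIGNALS = {"prenatal vitamins", "unscented lotion", "cotton balls"}


@dataclass
class Customer:
    customer_id: str
    purchases: List[str]
    consents: Set[str] = field(default_factory=set)       # express consents on file
    inferred: Dict[str, bool] = field(default_factory=dict)  # mined attributes


def predict_pregnancy(purchases: List[str]) -> bool:
    """Toy stand-in for a model: two or more signal products in the basket."""
    return len(PREGNANCY_SIGNALS & set(purchases)) >= 2


def infer_health_flag(customer: Customer) -> bool:
    """Run the inference only if express consent was given in advance.

    Storing the flag creates health information about the customer,
    whether or not the prediction is right, so without consent we do
    not run the model or store anything. Returns True if inference ran.
    """
    if "health_inference" not in customer.consents:
        return False
    customer.inferred["likely_pregnant"] = predict_pregnancy(customer.purchases)
    return True
```

With this shape, a customer who has not opted in ends up with an empty `inferred` record no matter what is in her basket, which is the outcome the Sensitive Information rule is aiming for.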
Note also there is already a latent problem in Australia for grocery stores that sell medicinals online, and this has nothing to do with Big Data. St Johns Wort for example may seem innocuous but it indicates that a customer has (or believes they have) depression. IT security managers might not have thought about the implications of logging mental health information in ordinary old web servers and databases.
4. "DNA Hacking"
In February this year, research was published where a subset of anonymous donors to a DNA research program in the UK were identified by cross-matching genes to data in US based public genealogy databases. All of a sudden, the ethics of re-identifying genetic material has become a red hot topic. Much attention is focusing on the nature of the informed consent; different initiatives (like the Personal Genome Project and 1000 Genomes) give different levels of comfort about the possibility of re-identification. Absolute anonymity is typically disclaimed but donors in some projects are reassured that re-identification will be 'difficult'.
But regardless of the consent given by a Subject (1st party) to a researcher (2nd party), a nice legal problem arises when a separate 3rd party takes anonymous data and re-identifies it without consent. Technically the 3rd party has collected Personal Information, as per the principles discussed above, and that may require consent under privacy laws. Following on from the European facial recognition precedent, I contend that re-identification of DNA without consent is likely to be ruled problematic (if not unlawful) in some jurisdictions. And it is therefore unethical in all fair minded jurisdictions.
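To make the third-party scenario concrete, here is a rough Python sketch of re-identification as a simple join between an anonymous research set and a named public database. Everything in it is an assumption for illustration: the field names, the toy data and the exact matching on a tuple of marker values are hypothetical, and real studies use far more involved statistical methods. What it shows is only the structural point: each match produces a record that did not exist before, with a name attached to a formerly anonymous sample.

```python
def reidentify(anonymous_samples, named_genealogy):
    """Attach names to anonymous samples by matching shared marker profiles.

    Both inputs are lists of dicts (hypothetical schema). Every entry in
    the returned list is a newly created named record, which is the act
    of collection discussed in the text.
    """
    # Index the public, named database by marker profile.
    names_by_markers = {rec["markers"]: rec["surname"] for rec in named_genealogy}

    collected = []
    for sample in anonymous_samples:
        surname = names_by_markers.get(sample["markers"])
        if surname is not None:
            collected.append({"sample_id": sample["sample_id"], "surname": surname})
    return collected


# Toy data: neither input on its own links a name to a research sample.
anonymous = [
    {"sample_id": "S-001", "markers": (12, 14, 29)},
    {"sample_id": "S-002", "markers": (11, 13, 30)},
]
genealogy = [{"surname": "Example", "markers": (12, 14, 29)}]

new_records = reidentify(anonymous, genealogy)
```

Neither input list contains a record tying a donor's name to a sample; `new_records` does. That new record is held by the 3rd party, who never dealt with the Subject at all.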
Big Data's big challenge
So principles-based data protection laws have proven very powerful in the cases of Google's StreetView Wi-Fi collection and Facebook's facial recognition (even though these scenarios could not have been envisaged with any precision 30 years ago when the OECD privacy principles were formulated). And they seem to neatly govern DNA re-identification and data mining for health information, insofar as we can foresee how these activities may conflict with legislated principles and might therefore be brought to book. But there is one area where our data privacy principles may struggle to cope with Big Data: openness.
Orthodox privacy management involves telling individuals What information is collected about them, Why it is needed, When it is collected, and How. But with Big Data, even if a company wants to be completely transparent, it may not know what Personal Information lies waiting to be mined and discovered in the data, nor when exactly this discovery might be done.
An underlying theme in Big Data business models is data mining, or perhaps more accurately, data refining, as shown in the diagram here. An increasing array of data processing techniques are applied to vast stores of raw information (like image data in the example) to extract metadata and increasingly valuable knowledge.
There is nothing intrinsically wrong with a business model that extracts value from raw information, even if it converts anonymous data into Personal Information. But the privacy promise enshrined in OECD data protection laws – namely to be open with individuals about what is known about them and why – can become hard to honour.
There is a bargain at the heart of most social media companies today, in which Personal Information is traded for a rich array of free services. The bargain is opaque; the "infomopolies" are coy about the value they attach to the Personal Information of their members.
If Online Social Networks were more open about their business models, I think it likely that most members would still be happy with the bargain. After all, Google, Facebook, Twitter et al have become indispensable for many of us. They do deliver fantastic value. But the Personal Information trade needs to be transparent.
"Big Privacy" Principles
In conclusion, I offer some expanded principles for protecting privacy in Big Data.
Exercise restraint: More than ever, remember that privacy is essentially about restraint. If a business knows me, then privacy means simply that the business is restrained in how it uses that knowledge.
Meta transparency: We're at the very start of the Big Data age. Who knows what lies ahead? Meta transparency means not only being open about what Personal Information is collected and why, but also being open about the business model and the emerging tools.
Engage customers in a fair value deal: Most savvy digital citizens appreciate there is no such thing as a free lunch; they already know at some level that "free" digital services are paid for by trading Personal Information. Many netizens have learned already to manage their own privacy in an ad hoc way, for instance obfuscating or manipulating the personal details they divulge. Ultimately consumers and businesses alike will do better by engaging in a real deal that sets out how PI is truly valued and leveraged.
No it doesn't, it only means the end of anonymity.
Anonymity is not the same thing as privacy. Anonymity keeps people from knowing what you're doing, and it's a vitally important quality in many settings. But in general we usually want people (at least some people) to know what we're up to, so long as they respect that knowledge. That's what privacy is all about. Anonymity is a terribly blunt instrument for protecting privacy, and it's also fragile. If anonymity is all you have, then you're in deep trouble when someone manages to defeat it.
New information technologies have clearly made anonymity more difficult, yet it does not follow that we must lose our privacy. Instead, these developments bring into stark relief the need for stronger regulatory controls that compel restraint in the way third parties deal with Personal Information that comes into their possession.
A great example is Facebook's use of facial recognition. When Facebook members innocently tag one another in photos, Facebook creates biometric templates with which it then automatically processes all photo data (previously anonymous), looking for matches. This is how they can create tag suggestions, but Facebook is notoriously silent on what other applications it has for facial recognition. Now and then we get a hint, with, for example, news of the Facedeals start up last year. Facedeals accesses Facebook's templates (under conditions that remain unclear) and uses them to spot customers as they enter a store to automatically check them in. It's classic social technology: kinda sexy, kinda creepy, but clearly in breach of Collection, Use and Disclosure privacy principles.
And indeed, European regulators have found that Facebook's facial recognition program is unlawful. The chief problem is that Facebook never properly disclosed to members what goes on when they tag one another, and they never sought consent to create biometric templates with which to subsequently identify people throughout their vast image stockpiles. Facebook has been forced to shut down their facial recognition operations in Europe, and they've destroyed their historical biometric data.
So privacy regulators in many parts of the world have real teeth. They have proven that re-identification of anonymous data by facial recognition is unlawful, and they have managed to stop a very big and powerful company from doing it.
This is how we should look at the implications of the DNA 'hacking'. Indeed, Melissa Gymrek from the Whitehead Institute said in an interview: "I think we really need to learn to deal with the fact that we cannot ever make data sets truly anonymous, and that I think the key will be in regulating how we are allowed to use this genetic data to prevent it from being used maliciously."
Perhaps this episode will bring even more attention to the problem in the USA, and further embolden regulators to enact broader privacy protections there. Perhaps the very extremeness of the DNA hacking does not spell the end of privacy so much as its beginning.
I had a letter published in Science magazine about the recently publicised re-identification of anonymously donated DNA data. It has been shown that there is enough named genetic information online, in genealogical databases for instance, that anonymous DNA posted in research databases can be re-identified. This is a sobering result indeed. But does it mean that 'privacy is dead'?
No. The fact is that re-identification of erstwhile anonymous data represents an act of collection of PII and is subject to the Collection Limitation Principle in privacy law around the world. This is essentially the same scenario as Facebook using biometric facial recognition to identify people in photos. European regulators recently found Facebook to have breached privacy law and have forced Facebook to shut down their facial recognition feature.
I expect that the very same legal powers will permit regulators to sanction the re-identification of DNA. There are legal constraints on what can be done with 'anonymous' data no matter where you get it from: under some data privacy laws, attaching names to such data constitutes a Collection of PII, and as such, is subject to consent rules and all sorts of other principles. As a result, bioinformatics researchers will have to tread carefully, justifying their ends and their means before ethics committees. And corporations who seek to exploit the ability to put names on anonymous genetic data may face the force of the law as Facebook did.
To summarise: Let's assume Subject S donates their DNA, ostensibly anonymously, to a Researcher R1, under some consent arrangement which concedes there is a possibility that S will be re-identified. And indeed, some time later, an independent researcher R2 does identify S as belonging to the DNA sample. The fact that many commentators seem oblivious to is this: R2 has Collected Personal Information (or PII) about S. If R2 has no relationship with S, then S has not consented to this new collection of her PII. In jurisdictions with strict Collection Limitation (like the EU, Australia and elsewhere) it seems to me to be a legal privacy breach for R2 to collect PII by way of DNA re-identification without express consent, regardless of whether R1 has conceded to S that it might happen. Even in the US, where the protections might not be so strict, there remains a question of ethics: should R2 conduct themselves in a manner that might be unlawful in other places?
The text of my letter to Science follows, and after that, I'll keep posting follow ups.
Science 8 February 2013:
Vol. 339 no. 6120 p. 647
Yaniv Erlich at the Whitehead Institute for Biomedical Research used his hacking skills to decipher the names of anonymous DNA donors ("Genealogy databases enable naming of anonymous DNA donor," J. Bohannon, 18 January, p. 262). A little-known legal technicality in international data privacy laws could curb the privacy threats of reverse identification from genomes. "Personal information" is usually defined as any data relating to an individual whose identity is readily apparent from the data. The OECD Privacy Principles are enacted in over 80 countries worldwide. Privacy Principle No. 1 states: "There should be limits to the collection of personal data and any such data should be obtained by lawful and fair means and, where appropriate, with the knowledge or consent of the data subject." The principle is neutral regarding the manner of collection. Personal information may be collected directly from an individual or indirectly from third parties, or it may be synthesized from other sources, as with "data mining."
Computer scientists and engineers often don't know that recording a person's name against erstwhile anonymous data is technically an act of collection. Even if the consent form signed at the time of the original collection includes a disclaimer that absolute anonymity cannot be guaranteed, re-identifying the information later signifies a new collection. The new collection of personal information requires its own consent; the original disclaimer does not apply when third parties take data and process it beyond the original purpose for collection. Educating those with this capability about the legal meaning of collection should restrain the misuse of DNA data, at least in those jurisdictions that strive to enforce the OECD principles.
It also implies that bioinformaticians working "with little more than the Internet" to attach names to samples may need ethics approval, just as they would if they were taking fresh samples from the people concerned.
Lockstep Consulting Pty Ltd
Five Dock Sydney, NSW 2046, Australia.
In an interview with Science Magazine on Jan 18, the Whitehead Institute's Melissa Gymrek discussed the re-identification methods, and the potential to protect against them. She concluded: "I think we really need to learn to deal with the fact that we cannot ever make data sets truly anonymous, and that I think the key will be in regulating how we are allowed to use this genetic data to prevent it from being used maliciously."
I agree completely. We need regulations. Elsewhere I've argued that anonymity is an inadequate way to protect privacy, and that we need a balance of regulations and Privacy Enhancing Technologies. And it's for this reason that I am not fatalistic about the fact that anonymity can be broken, because we have the procedural means to see that privacy is still preserved.
The mea culpa is a classic, straight out of the Zuckerberg copybook. They say they were misunderstood. They say they don't want to sell photos to ad men. They say members will always own their photos. But ownership is a red herring and the whole exercise is likely a stalking horse, designed to distract people from more significant issues around metadata and Facebook's ever deepening ability to infer PII.
Firstly, let's be clear that greater sharing follows the acquisition as night follows day. I noted at the time that the only way to understand Facebook's billion dollar spend on Instagram is around the value to be mined from the mother lode of photo data. In particular, image analysis and facial recognition grant Instagram and Facebook x-ray vision into their members' daily lives. They can work out what people are doing, with whom they're doing it, when and where. With these tools, they're moving quickly from collecting Personally Identifiable Information when it is volunteered by users, to PII that is observed and inferred. The quality and quantity of the PII flux is driven up dramatically. No longer is the lifeblood of Facebook -- the insights they have on 15% of the world's population -- filtered by what users elect to post and Like and tag, but now that information is raw, unexpurgated and automated.
Now ask where the money in photo data is to be made. It's not in selling candid snapshots of folks enjoying branded products. It's in the intelligence that image data yield about how people lead their lives. This intelligence is Facebook's one and only asset.
So it is metadata that we need to worry about. In its initial update to the Terms, Instagram said this: "[You] agree that a business or other entity may pay us to display your username, likeness, photos (along with any associated metadata), and/or actions you take, in connection with paid or sponsored content or promotions, without any compensation to you." In over 6,000 words "metadata" is mentioned just twice, parenthetically, and without any definition. Metadata is figuring more and more in the privacy discourse, and that's great, but we need to look beyond the usual stuff like geolocation and camera type embedded in the JPEGs. Much more important now is the latent identifiable personal content in images. Image analysis and image search provide endless new possibilities for infomopolies to extract value from photos.
A great deal of this week's outcry has focused on things like the lack of compensation, and all of Instagram's apology today is around the ownership of photos. But ownership is moot if they reserve their right to use and disclose metadata in any way they like. What actually matters is the individual's ability to understand and control what is done with any PII about them, including metadata. When the German privacy regulator acted against Facebook's facial recognition practices earlier this year, the principle they applied from OECD style legislation is that there are limits to what can be collected about individuals without their consent. The regulator ruled it unlawful for Facebook to extract biometric information from images when their users innocently think they're only tagging people in photos.
So when I read Instagram's excuse, I don't see any truly meaningful self-restraint in the way they can exploit image data. Their switch is not even a tactical retreat, for as yet, they're not giving anything up.