First Day Reflections from CIS Monterey.
Follow along on Twitter at #CISmcc (for the Monterey Conference Centre).
The Cloud Identity Summit really is the top event on the identity calendar. The calibre of the speakers, the relevance and currency of the material, the depth and breadth of the cohort, and the international spread are all unsurpassed. It's been great to meet old cyber-friends in "XYZ Space" at last -- like Emma Lindley from the UK and Lance Peterman. And to catch up with such talented folks like Steffen Sorensen from New Zealand once again.
A day or two before, Ian Glazer of Salesforce asked in a tweet what we were expecting to get out of CIS. And I replied that I hoped to change my mind about something. It's unnerving to have your understanding and assumptions challenged by the best in the field ... OK, sometimes it's outright embarrassing ... but that's what these events are all about. A very wise lawyer said to me once, around 1999 at the dawn of e-commerce, that he had changed his mind about authentication a few times up to that point, and that he fully expected to change his mind again and again.
I spent most of Saturday in Open Identity Foundation workshops. OIDF chair Don Thibeau enthusiastically stressed two new(ish) initiatives: Mobile Connect in conjunction with the mobile carrier trade association GSM Association @GSMA, and HIE Connect for the health sector. For the uninitiated, HIE means Health Information Exchange, namely a hub for sharing structured e-health records among hospitals, doctors, pharmacists, labs, e-health records services, allied health providers, insurers, drug & device companies, researchers and carers; for the initiated, we know there is some language somewhere in which the letters H.I.E. stand for "Not My Lifetime".
But seriously, one of the best (and pleasantly surprising) things about HIE Connect as the OIDF folks tell it, is the way its leaders unflinchingly take for granted the importance of privacy in the exchange of patient health records. Because honestly, privacy is not a given in e-health. There are champions on the new frontiers like genomics that actually say privacy may not be in the interests of the patients (or more's the point, the genomics businesses). And too many engineers in my opinion still struggle with privacy as something they can effect. So it's great -- and believe me, really not obvious -- to hear the HIE Connects folks -- including Debbie Bucci from the US Dept of Health and Human Services, and Justin Richer of Mitre and MIT -- dealing with it head-on. There is a compelling fit for the OAUTH and OIDC protocols here, with their ability to manage discrete pieces of information about users (patients) and to permission them all separately. Having said that, Don and I agree that e-health records permissioning and consent is one of the great UI/UX challenges of our time.
Justin also highlighted that the RESTful patterns emerging for fine-grained permissions management in healthcare are not confined to healthcare. Debbie added that the ability to query rare events without undoing privacy is also going to be a core defining challenge in the Internet of Things.
MyPOV: We may well see tremendous use cases for the fruits of HIE Exchange before they're adopted in healthcare!
In the afternoon, we heard from Canadian and British projects that have been working with the Open Identity Exchange (OIX) program now for a few years each.
Emma Lindley presented the work they've done in the UK Identity Assurance Program (IDAP) with social security entitlements recipients. These are not always the first types of users we think of for sophisticated IDAM functions, but in Britain, local councils see enormous efficiency dividends from speeding up the issuance of eg disabled parking permits, not to mention reducing imposters, which cost money and lead to so much resentment of the well deserved. Emma said one Attributes Exchange beta project reduced the time taken to get a 'Blue Badge' permit from 10 days to 10 minutes. She went on to describe the new "Digital Sources of Trust" initiative which promises to reconnect under-banked and under-documented sections of society with mainstream financial services. Emma told me the much-abused word "transformational" really does apply here.
MyPOV: The Digital Divide is an important issue for me, and I love to see leading edge IDAM technologies and business processes being used to do something about it -- and relatively quickly.
Then Andre Boysen of SecureKey led a discussion of the Canadian identity ecosystem, which he said has stabilised nicely around four players: Federal Government, Provincial Govt, Banks and Carriers. Lots of operations and infrastructure precedents from the payments industry have carried over.
Andre calls the smart driver license of British Columbia the convergence of "street identity and digital identity".
MyPOV: That's great news - and yet comparable jurisdictions like Australia and the USA still struggle to join governments and banks and carriers in an effective identity synthesis without creating great privacy and commercial anxieties. All three cultures are similarly allergic to identity cards, but only in Canada have they managed to supplement drivers licenses with digital identities with relatively high community acceptance. In nearly a decade, Australia has been at a standstill in its national understanding of smartcards and privacy.
For mine, the CIS Quote of the Day came from Scott Rice of the Open ID Foundation. We all know the stark problem in our industry of the under-representation of Relying Parties in the grand federated identity projects. IdPs and carriers so dominate IDAM. Scott asked us to imagine a situation where "The auto industry was driven by steel makers". Governments wouldn't put up with that for long.
Can someone give us the figures? I wonder if Identity and Access Management is already more economically ore important than cars?!
Cheers from Monterey, Day 1.
We live in an age where billionaires are self-made on the back of the most intangible of assets – the information they have amassed about us. That information used to be volunteered in forms and questionnaires and contracts but increasingly personal information is being observed and inferred.
The modern world is awash with data. It’s a new and infinitely re-usable raw material. Most of the raw data about us is an invisible by-product of our mundane digital lives, left behind by the gigabyte by ordinary people who do not perceive it let alone understand it.
Many Big Data and digital businesses proceed on the basis that all this raw data is up for grabs. There is a particular widespread assumption that data in the "public domain" is free-for-all, and if you’re clever enough to grab it, then you’re entitled to extract whatever you can from it.
In the webinar, I'll try to show how some of these assumptions are naive. The public is increasingly alarmed about Big Data and averse to unbridled data mining. Excessive data mining isn't just subjectively 'creepy'; it can be objectively unlawful in many parts of the world. Conventional data protection laws turn out to be surprisingly powerful in in the face of Big Data. Data miners ignore international privacy laws at their peril!
Today there are all sorts of initiatives trying to forge a new technology-privacy synthesis. They go by names like "Privacy Engineering" and "Privacy by Design". These are well meaning efforts but they can be a bit stilted. They typically overlook the strengths of conventional privacy law, and they can miss an opportunity to engage the engineering mind.
It’s not politically correct but I believe we must admit that privacy is full of contradictions and competing interests. We need to be more mature about privacy. Just as there is no such thing as perfect security, there can never be perfect privacy either. And is where the professional engineering mindset should be brought in, to help deal with conflicting requirements.
If we’re serious about Privacy by Design and Privacy Engineering then we need to acknowledge the tensions. That’s some of the thinking behind Constellation's new Big Privacy compact. To balance privacy and Big Data, we need to hold a conversation with users that respects the stresses and strains, and involves them in working through the new privacy deal.
The webinar will cover these highlights of the Big Privacy pact:
- Respect and Restraint
- Super transparency
- And a fair deal for Personal Information.
Have a disruptive technology implementation story? Get recognised for your leadership. Apply for the 2014 SuperNova Awards for leaders in disruptive technology.
For the past year, oncologists at the Memorial Sloan Kettering Cancer Centre in New York have been training IBM’s Watson – the artificial intelligence tour-de-force that beat allcomers on Jeopardy – to help personalise cancer care. The Centre explains that "combining [their] expertise with the analytical speed of IBM Watson, the tool has the potential to transform how doctors provide individualized cancer treatment plans and to help improve patient outcomes". Others are speculating already that Watson could "soon be the best doctor in the world".
I have no doubt that when Watson and things like it are available online to doctors worldwide, we will see overall improvements in healthcare outcomes, especially in parts of the world now under-serviced by medical specialists [having said that, the value of diagnosing cancer in poor developing nations is questionable if they cannot go on to treat it]. As with Google's self-driving car, we will probably get significant gains eventually, averaged across the population, from replacing humans with machines. Yet some of the foibles of computing are not well known and I think they will lead to surprises.
For all the wondrous gains made in Artificial Intelligence, where Watson now is the state-of-the art, A.I. remains algorithmic, and for that, it has inherent limitations that don't get enough attention. Computer scientists and mathematicians have know for generations that some surprisingly straightforward problems have no algorithmic solution. That is, some tasks cannot be accomplished by any universal step-by-step codified procedure. Examples include the Halting Problem and the Travelling Salesperson Problem. If these simple challenges have no algorithm, we need be more sober in our expectations of computerised intelligence.
A key limitation of any programmed algorithm is that it must make its decisions using a fixed set of inputs that are known and fully characterised (by the programmer) at design time. If you spring an unexpected input on any computer, it can fail, and yet that's what life is all about -- surprises. No mathematician seriously claims that what humans do is somehow magic; most believe we are computers made of meat. Nevertheless, when paradoxes like the Halting Problem abound, we can be sure that computing and cognition are not what they seem. We should hope these conundrums are better understood before putting too much faith in computers doing deep human work.
And yet, predictably, futurists are jumping ahead to imagine "Watson apps" in which patients access the supercomputer for themselves. Even if there were reliable algorithms for doctoring, I reckon the "Watson app" is a giant step, because of the complex way the patient's conditions are assessed and data is gathered for the diagnosis. That is, the taking of the medical history.
In these days of billion dollar investments in electronic health records (EHRs), we tend to think that medical decisions are all about the data. When politicians announce EHR programs they often boast that patients won't have to go through the rigmarole of giving their history over and over again to multiple doctors as they move through an episode of care. This is actually a serious misunderstanding of the importance in clinical decision-making of the interaction between medico and patient when the history is taken. It's subtle. The things a patient chooses to tell, the things they seem to be hiding, and the questions that make them anxious, all guide an experienced medico when taking a history, and provide extra cues (metadata if you will) about the patient’s condition.
Now, Watson may well have the ability to navigate this complexity and conduct a very sophisticated Q&A. It will certainly have a vastly bigger and more reliable memory of cases than any doctor, and with that it can steer a dynamic patient questionnaire. But will Watson be good enough to be made available direct to patients through an app, with no expert human mediation? Or will a host of new input errors result from patients typing their answers into a smart phone or speaking into a microphone, without any face-to-face subtlety (let alone human warmth)? It was true of mainframes and it’s just as true of the best A.I.: Bulldust in, bulldust out.
Finally, Watson's existing linguistic limitations are not to be underestimated. It is surely not trivial that Watson struggles with puns and humour. Futurist Mark Pesce when discussing Watson remarked in passing that scientists don’t understand the "quirks of language and intelligence" that create humour. The question of what makes us laugh does in fact occupy some of the finest minds in cognitive and social science. So we are a long way from being able to mechanise humour. And this matters because for the foreseeable future, it puts a great deal of social intercourse beyond AI's reach.
In between the extremes of laugh-out-loud comedy and a doctor’s dry written notes lies a spectrum of expressive subtleties, like a blush, an uncomfortable laugh, shame, and the humiliation that goes with some patients’ lived experience of illness. Watson may understand the English language, but does it understand people?
Watson can answer questions, but good doctors ask a lot of questions too. When will this amazing computer be able to hold the sort of two-way conversation that we would call a decent "bedside manner"?
Have a disruptive technology implementation story? Get recognised for your leadership. Apply for the 2014 SuperNova Awards for leaders in disruptive technology.
The latest Snowden revelations include the NSA's special programs for extracting photos and identifying from the Internet. Amongst other things the NSA uses their vast information resources to correlate location cues in photos -- buildings, streets and so on -- with satellite data, to work out where people are. They even search especially for passport photos, because these are better fodder for facial recognition algorithms. The audacity of these government surveillance activities continues to surprise us, and their secrecy is abhorrent.
Yet an ever greater scale of private sector surveillance has been going on for years in social media. With great pride, Facebook recently revealed its R&D in facial recognition. They showcased the brazenly named "DeepFace" biometric algorithm, which is claimed to be 97% accurate in recognising faces from regular images. Facebook has made a swaggering big investment in biometrics.
Data mining needs raw material, there's lots of it out there, and Facebook has been supremely clever at attracting it. It's been suggested that 20% of all photos now taken end up in Facebook. Even three years ago, Facebook held 10,000 times as many photographs as the Library of Congress:
And Facebook will spend big buying other photo lodes. Last year they tried to buy Snapchat for the spectacular sum of three billion dollars. The figure had pundits reeling. How could a start-up company with 30 people be worth so much? All the usual dot com comparisons were made; the offer seemed a flight of fancy.
But no, the offer was a rational consideration for the precious raw material that lies buried in photo data.
Snapchat generates at least 100 million new images every day. Three billion dollars was, pardon me, a snap. I figure that at a ballpark internal rate of return of 10%, a $3B investment is equivalent to $300M p.a. so even if the Snapchat volume stopped growing, Facebook would have been paying one cent for every new snap, in perpetuity.
These days, we have learned from Snowden and the NSA that communications metadata is just as valuable as the content of our emails and phone calls. So remember that it's the same with photos. Each digital photo comes from a device that embeds within the image metadata usually including the time and place of when the picture was taken. And of course each Instagram or Snapchat is a social post, sent by an account holder with a history and rich context in which the image yields intimate real time information about what they're doing, when and where.
- When you access or use our Services, we automatically collect information about you, including:
- Usage Information: When you send or receive messages via our Services, we collect information about these messages, including the time, date, sender and recipient of the Snap. We also collect information about the number of messages sent and received between you and your friends and which friends you exchange messages with most frequently.
- Log Information: We log information about your use of our websites, including your browser type and language, access times, pages viewed, your IP address and the website you visited before navigating to our websites.
- Device Information: We may collect information about the computer or device you use to access our Services, including the hardware model, operating system and version, MAC address, unique device identifier, phone number, International Mobile Equipment Identity ("IMEI") and mobile network information. In addition, the Services may access your device's native phone book and image storage applications, with your consent, to facilitate your use of certain features of the Services.
Snapchat goes on to declare it may use any of this information to "personalize and improve the Services and provide advertisements, content or features that match user profiles or interests" and it reserves the right to share any information with "vendors, consultants and other service providers who need access to such information to carry out work on our behalf".
So back to the data mining: nothing stops Snapchat -- or a new parent company -- running biometric facial recognition over the snaps as they pass through the servers, to extract additional "profile" information. And there's an extra kicker that makes Snapchats extra valuable for biometric data miners. The vast majority of Snapchats are selfies. So if you extract a biometric template from a snap, you already know who it belongs to, without anyone having to tag it. Snapchat would provide a hundred million auto-calibrations every day for facial recognition algorithms! On Facebook, the privacy aware turn off photo tagging, but with Snapchats, self identification is inherent to the experience and is unlikely to be ever be disabled.
As I've discussed before, the morbid thrill of Snowden's spying revelations has tended to overshadow his sober observations that when surveillance by the state is probably inevitable, we need to be discussing accountability.
While we're all ventilating about the NSA, it's time we also attended to private sector spying and properly debated the restraints that may be appropriate on corporate exploitation of social data.
Personally I'm much more worried that an infomopoly has all my selfies.
Have a disruptive technology implementation story? Get recognised for your leadership. Apply for the 2014 SuperNova Awards for leaders in disruptive technology.
I've just completed a major new Constellation Research report looking at how today's privacy practices cope with Big Data. The report draws together my longstanding research on the counter-intuitive strengths of technology-neutral data protection laws, and melds it with my new Constellation colleagues' vast body of work in data analytics. The synergy is honestly exciting and illuminating.
Big Data promises tremendous benefits for a great many stakeholders but the potential gains are jeopardised by the excesses of a few. Some cavalier online businesses are propelled by a naive assumption that data in the "public domain" is up for grabs, and with that they often cross a line.
For example, there are apps and services now that will try to identify pictures you take of strangers in public, by matching them biometrically against data supersets compiled from social networking sites and other publically accessible databases. Many find such offerings quite creepy but they may be at a loss as to what to do about it, or even how to think through the issues objectively. Yet the very metaphor of data mining holds some of the clues. If, as some say, raw data is like crude oil, just waiting to be mined and exploited by enterprising prospecters, then surely there are limits, akin to mining permits?
Many think the law has not kept pace with technology, and that digital innovators are free to do what they like with any data they can get their hands on. But technologists repreatedly underestimate the strength of conventional data protection laws and regulations. The extraction of PII from raw data may be interpreted under technology neutral privacy principles as an act of Collection and as such is subject to existing statutes. Around the world, Google thus found they are not actually allowed to gather Personal Data that happens to be available in unencrypted Wi-Fi transmission as StreetView cars drive by homes and offices. And Facebook found they are not actually allowed to automatically identify people in photos through face recognition without consent. And Target probably would find, if they tried it outside the USA, that they cannot flag selected female customers as possibly pregnant by analysing their buying habits.
On the other hand, orthodox privacy policies and static user agreements do not cater for the way personal data can be conjured tomorrow from raw data collected today. Traditional privacy regimes require businesses to spell out what personally identifiable information (PII) they collect and why, and to restrict secondary usage. Yet with Big Data, with the best will in the world, a company might not know what data analytics will yield down the track. If mutual benefits for business and customer alike might be uncovered, a freeze-frame privacy arrangement may be counter-productive.
Thus the fit between data analytics and data privacy standards is complex and sometimes surprising. While existing laws are not to be underestimated, we do need something new. As far as I know it was Ray Wang in his Harvard Business Review blog who first called for a fresh privacy compact amongst users and businesses.
The spirit of data privacy is simply framed: organisations that know us should respect the knowledge they have, they should be open about what they know, and they should be restrained in what they do with it. In the Age of Big Data, let's have businesses respect the intelligence they extract from data mining, just as they should respect the knowledge they collect directly through forms and questionnaires.
I like the label "Big Privacy"; it is grandly optimistic, like "Big Data" itself, and at the same time implies a challenge to do better than regular privacy practices.
Ontario Privacy Commissioner Dr Ann Cavoukian writes about Big Privacy, describing it simply as "Privacy By Design writ large". But I think there must be more to it than that. Big Data is quantitatively but also qualitatively different from ordinary data analyis.
To summarise the basic elements of a Big Data compact:
- Respect and Restraint: In the face of Big Data’s temptations, remember that privacy is not only about what we do with PII; just as important is what we choose not to do.
- Super transparency: Who knows what lies ahead in Big Data? If data privacy means being open about what PII is collected and why, then advanced privacy means going further, telling people more about the business models and the sorts of results data mining is expected to return.
- Engage customers in a fair deal for PII: Information businesses ought to set out what PII is really worth to them (especially when it is extracted in non-obvious ways from raw data) and offer a fair "price" for it, whether in the form of "free" products and services, or explicit payment.
- Really innovate in privacy: There’s a common refrain that “privacy hampers innovation” but often that's an intellectually lazy cover for reserving the right to strip-mine PII. Real innovation lies in business practices which create and leverage PII while honoring privacy principles.
My report, "Big Privacy" Rises to the Challenges of Big Data may be downloaded from the Constellation Research website.
This is the abstract of a current privacy conference proposal.
Many Big Data and online businesses proceed on a naive assumption that data in the "public domain" is up for grabs; technocrats are often surprised that conventional data protection laws can be interpreted to cover the extraction of PII from raw data. On the other hand, orthodox privacy frameworks don't cater for the way PII can be created in future from raw data collected today. This presentation will bridge the conceptual gap between data analytics and privacy, and offer new dynamic consent models to civilize the trade in PII for goods and services.
It’s often said that technology has outpaced privacy law, yet by and large that's just not the case. Technology has certainly outpaced decency, with Big Data and biometrics in particular becoming increasingly invasive. However OECD data privacy principles set out over thirty years ago still serve us well. Outside the US, rights-based privacy law has proven effective against today's technocrats' most worrying business practices, based as they are on taking liberties with any data that comes their way. To borrow from Niels Bohr, technologists who are not surprised by data privacy have probably not understood it.
The cornerstone of data privacy in most places is the Collection Limitation principle, which holds that organizations should not collect Personally Identifiable Information beyond their express needs. It is the conceptual cousin of security's core Need-to-Know Principle, and the best starting point for Privacy-by-Design. The Collection Limitation principle is technology neutral and thus blind to the manner of collection. Whether PII is collected directly by questionnaire or indirectly via biometric facial recognition or data mining, data privacy laws apply.
If anonymity is important, what is the legal basis for defending it?
I find that conventional data privacy law in most places around the world already protects anonymity, insofar as the act of de-anonymization represents an act of PII Collection - the creation of a named record. As such, de-anonymization cannot be lawfully performed without an express need to to do, or consent.
Cynics have been asking the same rhetorical question "is privacy dead?" for at least 40 years. Certainly information technology and ubiquitous connectivity have made it nearly impossible to hide, and so anonymity is critically ill. But privacy is not the same thing as secrecy; privacy is a state where those who know us, respect the knowledge they have about us. Privacy generally doesn't require us hiding from anyone; it requires restraint on the part of those who hold Personal Information about us.
The typical public response to data breaches, government surveillance and invasions like social media facial recognition is vociferous. People in general energetically assert their rights to not be tracked online, or to have their personal information exploited behind their backs. These reactions show that the idea of privacy alive and well.
The end of anonymity perhaps
Against a backdrop of spying revelations and excesses by social media companies especially in regards to facial recognition, there have been recent calls for a "new jurisprudence of anonymity"; see Yale law professor Jed Rubenfeld writing in the Washington Post of 13 Jan 2014. I wonder if there is another way to crack the nut? Because any new jurisprudence is going to take a very long time.
Instead, I suggest we leverage the way most international privacy law and privacy experience -- going back decades -- is technology neutral with regards to the method of collection. In some jurisdictions like Australia, the term "collection" is not even defined in privacy law. Instead, the law just uses the normal plain English sense of the word, when it frames principles like Collection Limitation: basically, you are not allowed to collect (by any means) Personally Identifiable Information without a good reasonable express reason. It means that if PII gets into a data system, the system is accountable under privacy law for that PII, no matter how it got there.
This technology neutral view of PII collection has satisfying ramifications for all the people who intuit that Big Data has got too "creepy". We can argue that if a named record is produced afresh by a Big Data process (especially if that record is produced without the named person being aware of it, and from raw data that was originally collected for some other purpose) then that record has logically been collected. Whether PII is collected directly, or collected indirectly, or is in fact created by an obscure process, privacy law is largely agnostic.
Prof Rubenfeld wrote:
- "The NSA program isn’t really about gathering data. It's about mining data. All the data are there already, digitally stored and collected by telecom giants, just waiting." [italics in original]
I suggest that the output of the data mining, if it is personally identifiable and especially if it has been rendered identifiable by processing previously anonymous raw data, has is a fresh collection by the mining operation. As such, the miners should be accountable for their newly minted PII, just as though they had collected gathered it directly from the persons concerned.
For now, I don't want to go further and argue the rights and wrongs of surveillance. I just want to show a new way to frame the privacy questions in surveillance and big data, making use of existing jurisprudence. If I am right and the NSA is in effect collecting PII as it goes about its data mining, then that provides a possibly fresh understanding of what's going on, within which we can objectively analyse the rights and wrongs.
I am actually the first to admit that within this frame, the NSA might still be justified in mining data, and there might be no actual technical breach of information privacy law, if for instance the NSA enjoys a law enforcement exemption. These are important questions that need to be debated, but elsewhere (see my recent blog on our preparedness to actually have such a debate). My purpose right now is to frame a way to defend anonymity using as much existing legal infrastructure as possible.
But Collection is not limited everywhere
There is an important legal-technical question in all this: Is the collection of PII actually regulated? In Europe, Australia, New Zealand and in dozens of countries, collection is limited, but in the USA, there is no general restriction against collecting PII. America has no broad data protection law, and in any case, the Fair Information Practice Principles (FIPPs) don't include a Collection Limitation principle.
So there may be few regulations in the USA that would carry my argument there! Nevertheless, surely we can use international jurisprudence in Collection Limitation instead of creating new American jurisprudence around anonymity?
So I'd like to put the following questions Jed Rubenfeld:
- Do technology neutral Collection Limitation Principles in theory provide a way to bring de-anonymised data into scope for data privacy laws? Is this a way to address peoples' concerns with Big Data?
- How does international jurisprudence around Collection Limitation translate to American schools of legal thought?
- Does this way of looking at the problem create new impetus for Collection Limitation to be introduced into American privacy principles, especially the FIPPs?
Appendix: "Applying Information Privacy Norms to Re-Identification"
In 2013 I presented some of these ideas to an online symposium at the Harvard Law School Petrie-Flom Center, on the Law, Ethics & Science of Re-identification Demonstrations. What follows is an extract from that presentation, in which I spell out carefully the argument -- which was not obvious to some at the time -- that when genetics researchers combine different data sets to demonstrate re-identification of donated genomic material, they are in effect collecting patient PII. I argue that this type of collection should be subject to ethics committee approval just as if the researchers were collecting the identities from the patients directly.
... I am aware of two distinct re-identification demonstrations that have raised awareness of the issues recently. In the first, Yaniv Erlich [at MIT's Whitehead Institute] used what I understand are new statistical techniques to re-identify a number of subjects that had donated genetic material anonymously to the 1000 Genomes project. He did this by correlating genes in the published anonymous samples with genes in named samples available from genealogical databases. The 1000 Genomes consent form reassured participants that re-identification would be "very hard". In the second notable demo, Latanya Sweeney re-identified volunteers in the Personal Genome Project using her previously published method of using a few demographic values (such as date or birth, sex and postal code) extracted from the otherwise anonymous records.
A great deal of the debate around these cases has focused on the consent forms and the research subjects’ expectations of anonymity. These are important matters for sure, yet for me the ethical issue in de-anonymisation demonstrations is more about the obligations of third parties doing the identification who had nothing to do with the original informed consent arrangements. The act of recording a person’s name against erstwhile anonymous data represents a collection of personal information. The implications for genomic data re-identification are clear.
Let’s consider Subject S who donates her DNA, ostensibly anonymously, to a Researcher R1, under some consent arrangement which concedes there is a possibility that S will be re-identified. And indeed, some time later, an independent researcher R2 does identify S and links her to the DNA sample. The fact is that R2 has collected personal information about S. If R2 has no relationship with S, then S has not consented to this new collection of her personal information.
Even if the consent form signed at the time of the original collection includes a disclaimer that absolute anonymity cannot be guaranteed, re-identifying the DNA sample later represents a new collection, one that has been undertaken without any consent. Given that S has no knowledge of R2, there can be no implied consent in her original understanding with R1, even if absolute anonymity was disclaimed.
Naturally the re-identification demonstrations have served a purpose. It is undoubtedly important that the limits of anonymity be properly understood, and the work of Yaniv and Latanya contribute to that. Nevertheless, these demonstrations were undertaken without the knowledge much less the consent of the individuals concerned. I contend that bioinformaticians using clever techniques to attach names to anonymous samples need ethics approval, just as they would if they were taking fresh samples from the people concerned.
See also my letter to the editor of Science magazine.
Yesterday it was reported by The Verge that anonymous hackers have accessed Snapchat's user database and posted 4.6 million user names and phone numbers. In an apparent effort to soften the blow, two digits of the phone numbers were redacted. So we might assume this is a "white hat" exercise, designed to shame Snapchat into improving their security. Indeed, a few days ago Snapchat themselves said they had been warned of vulnerabilities in their APIs that would allow a mass upload of user records.
The response of many has been, well, so what? Some people have casually likened Snapchat's list to a public White Pages; others have played it down as "just email addresses".
Let's look more closely. The leaked list was not in fact public names and phone numbers; it was user names and phone numbers. User names might often be email addresses but these are typically aliases; people frequently choose email addresses that reveal little or nothing of their real world identity. We should assume there is intent in an obscure email address for the individual to remain secret.
Identity theft has become a highly organised criminal enterprise. Crime gangs patiently acquire multiple data sets over many months, sometimes years, gradually piecing together detailed personal profiles. It's been shown time and time again by privacy researchers (perhaps most notably Latanya Sweeney) that re-identification is enabled by linking diverse data sets. And for this purpose, email addresses and phone numbers are superbly valuable indices for correlating an individual's various records. Your email address is common across most of your social media registrations. And your phone number allows your real name and street address to be looked up from reverse White Pages. So the Snapchat breach could be used to join aliases or email addresses to real names and addresses via the phone numbers. For a social engineering attack on a call centre -- or even to open a new bank account -- an identity thief can go an awful long way with real name, street address, email address and phone number.
I was asked in an interview to compare the theft of stolen phone numbers with social security numbers. I surprised the interviewer when I said phone numbers are probably even more valuable to the highly organised ID thief, for they can be used to index names in public directories, and to link different data sets, in ways that SSNs (or credit card numbers for that matter) cannot.
So let us start to treat all personal inormation -- especially when aggregated in bulk -- more seriously! And let's be more cautious in the way we categorise personal or Personally Identifiable Information (PII).
Importantly, most regulatory definitions of PII already embody the proper degree of caution. Look carefully at the US government definition of Personally Identifiable Information:
- information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other personal or identifying information that is linked or linkable to a specific individual (underline added).
This means that items of data can constitute PII if other data can be combined to identify the person concerned. That is, the fragments are regarded as PII even if it is the whole that does the identifying.
And remember that the middle I in PII stands for Identifiable, and not, as many people presume, Identifying. To meet the definition of PII, data need not uniquely identify a person, it merely needs to be directly or indirectly identifiable with a person. And this is how it should be when we heed the way information technologies enable identification through linkages.
Almost anywhere else in the world, data stores like Snapchat's would automatically fall under data protection and information privacy laws; regulators would take a close look at whether the company had complied with the OECD Privacy Principles, and whether Snapchat's security measures were fit for purpose given the PII concerned. But in the USA, companies and commentators alike still have trouble working out how serious these breaches are. Each new breach is treated in an ad hoc manner, often with people finessing the difference between credit card numbers -- as in the recent Target breach -- and "mere" email addresses like those in the Snapchat and Epsilon episodes.
Surely the time has come to simply give proper regulatory protection to all PII.
The cover of Newsweek magazine on 27 July 1970 featured an innocent couple being menaced by cameras and microphones and new technologies like computer punch cards and paper tape. The headline hollered “IS PRIVACY DEAD?”.
The same question has been posed every few years ever since.
In 1999, Sun Microsystems boss Scott McNally urged us to “get over” the idea we have “zero privacy”; in 2008, Ed Giorgio from the Office of the US Director of National Intelligence chillingly asserted that “privacy and security are a zero-sum game”; Facebook’s Mark Zuckerberg proclaimed in 2010 that privacy was no longer a “social norm”. And now the scandal around secret surveillance programs like PRISM and the Five Eyes’ related activities looks like another fatal blow to privacy. But the fact that cynics, security zealots and information magnates have been asking the same rhetorical question for over 40 years suggests that the answer is No!
PRISM, as revealed by whistle blower Ed Snowden, is a Top Secret electronic surveillance program of the US National Security Agency (NSA) to monitor communications traversing most of the big Internet properties including, allegedly, Apple, Facebook, Google, Microsoft, Skype, Yahoo and YouTube. Relatedly, intelligence agencies have evidently also been obtaining comprehensive call records from major telephone companies, eavesdropping on international optic fibre cables, and breaking into the cryptography many take for granted online.
In response, forces lined up at tweet speed on both sides of the stereotypical security-privacy divide. The “hawks” say privacy is a luxury in these times of terror, if you've done nothing wrong you have nothing to fear from surveillance, and in any case, much of the citizenry evidently abrogates privacy in the way they take to social networking. On the other side, libertarians claim this indiscriminate surveillance is the stuff of the Stasi, and by destroying civil liberties, we let the terrorists win.
Governments of course are caught in the middle. President Obama defended PRISM on the basis that we cannot have 100% security and 100% privacy. Yet frankly that’s an almost trivial proposition. It's motherhood. And it doesn’t help to inform any measured response to the law enforcement challenge, for we don’t have any tools that would let us design a computer system to an agreed specification in the form of, say “98% Security + 93% Privacy”. It’s silly to us the language of “balance” when we cannot measure the competing interests objectively.
Politicians say we need a community debate over privacy and national security, and they’re right (if not fully conscientious in framing the debate themselves). Are we ready to engage with these issues in earnest? Will libertarians and hawks venture out of their respective corners in good faith, to explore this difficult space?
I suggest one of the difficulties is that all sides tend to confuse privacy for secrecy. They’re not the same thing.
Privacy is a state of affairs where those who have Personal Information (PII) about us are constrained in how they use it. In daily life, we have few absolute secrets, but plenty of personal details. Not many people wish to live their lives underground; on the contrary we actually want to be well known by others, so long as they respect what they know about us. Secrecy is a sufficient but not necessary condition for privacy. Robust privacy regulations mandate strict limits on what PII is collected, how it is used and re-used, and how it is shared.
Therefore I am a privacy optimist. Yes, obviously too much PII has broken the banks in cyberspace, yet it is not necessarily the case that any “genie” is “out of the bottle”.
If PII falls into someone’s hands, privacy and data protection legislation around the world provides strong protection against re-use. For instance, in Australia Google was found to have breached the Privacy Act when its StreetView cars recorded unencrypted Wi-Fi transmissions; the company cooperated in deleting the data concerned. In Europe, Facebook’s generation of tag suggestions without consent by biometric processes was ruled unlawful; regulators there forced Facebook to cease facial recognition and delete all old templates.
We might have a better national security debate if we more carefully distinguished privacy and secrecy.
I see no reason why Big Data should not be a legitimate tool for law enforcement. I have myself seen powerful analytical tools used soon after a terrorist attack to search out patterns in call records in the vicinity to reveal suspects. Until now, there has not been the technological capacity to use these tools pro-actively. But with sufficient smarts, raw data and computing power, it is surely a reasonable proposition that – with proper and transparent safeguards in place – population-wide communications metadata can be screened to reveal organised crimes in the making.
A more sophisticated and transparent government position might ask the public to give up a little secrecy in the interests of national security. The debate should not be polarised around the falsehood that security and privacy are at odds. Instead we should be debating and negotiating appropriate controls around selected metadata to enable effective intelligence gathering while precluding unexpected re-use. If (and only if) credible and verifiable safeguards can be maintained to contain the use and re-use of personal communications data, then so can our privacy.
For me the awful thing about PRISM is not that metadata is being mined; it’s that we weren’t told about it. Good governments should bring the citizenry into their confidence.
Are we prepared to honestly debate some awkward questions?
- Has the world really changed in the past 10 years such that surveillance is more necessary now? Should the traditional balances of societal security and individual liberties enshrined in our traditional legal structures be reviewed for a modern world?
- Has the Internet really changed the risk landscape, or is it just another communications mechanism. Is the Internet properly accommodated by centuries old constitutions?
- How can we have confidence in government authorities to contain their use of communications metadata? Is it possible for trustworthy new safeguards to be designed?
Many years ago, cryptographers adopted a policy of transparency. They have forsaken secret encryption algorithms, so that the maths behind these mission critical mechanisms is exposed to peer review and ongoing scrutiny. Secret algorithms are fragile in the long term because it’s only a matter of time before someone exposes them and weakens their effectiveness. Security professionals have a saying: “There is no security in obscurity”.
For precisely the same reason, we must not have secret government monitoring programs either. If the case is made that surveillance is a necessary evil, then it would actually be in everyone’s interests for governments to run their programs out in the open.
As we head towards 2014, de-identification of personal data sets is going to be a hot issue. I saw several things at last week's Constellation Connected Enterprise conference (CCE) that will make sure of this!
First, recall that in Australia a new definition of Personal Information (PI or "PII") means that anonymous data that can potentially be re-identified in future may have to be classified as PII today. I recently discussed how security and risk practitioners can deal with the uncertainty in re-identifiability.
And there's a barrage of new tracking, profiling and interior geo-location technologies (like Apple's iBeacon) which typically come with a promise of anonymity. See for example Tesco's announcement of face scanning for targeting adverts at their UK petrol stations.
The promise of anonymity is crucial, but it is increasingly hard to keep. Big Data techniques that join de-identified information to other data sets are able to ind correlations and reverse the anonymisation process. The science of re-identification started with the work of Dr Latanya Sweeny who famously identified a former governor and his medical records using zip codes and electoral roll data; more recently we've seen DNA "hackers" who can unmask anonymous DNA donors by joining genomic databases to public family tree information.
At CCE we saw many exciting Big Data developments, which I'll explore in more detail in coming weeks. Business Intelligence as-a-service is expanding rapidly, and is being flipped my innovative vendors to align (whether consciously or not) with customer centric Vendor Relationship Management models of doing business. And there are amazing new tools for enriching unstructured data, like newly launched Paxata's Adaptive Data Preparation Platform. More to come.
With the ability to re-identify data comes Big Responsibilities. I believe that to help businesses meet their privacy promises, we're going to need new tools to measure de-identification and hence gauge the risk of re-identification. It seems that some new generation data analytics products will allow us to run what-if scenarios to help understand the risks.
Just before CCE I also came across some excellent awareness raising materials from Voltage Security in Cupertino. Voltage CTO Terence Spies shared with me his "Deidentification Taxonomy" reproduced here with his kind permission. Voltage are leaders in Format Preserving Encryption and Tokenization -- typically used to hide credit card numbers from thieves in payment systems -- and they're showing how the tools may be used more broadly for de-identifying databases. I like the way Terence has characterised the reversibility (or not) of de-identification approaches, and further broken out various tokenization technologies.
Reference: Voltage Security. Reproduced with permission.
These are the foundations of the important new science of de-identification. Privacy engineers need to work hard at re-identification, so that consumers do not lose faith in the important promises made that so much data collected from their daily movements through cyber space are indeed anonymous.