Second Day Reflections from CIS Monterey.
Follow along on Twitter at #CISmcc (for the Monterey Conference Centre).
The attributes push
At CIS 2013 in Napa a year ago, several of us sensed a critical shift in focus amongst the identerati - from identity to attributes. OIX launched the Attributes Exchange Network (AXN) architecture, important commentators like Andrew Nash were saying, 'hey, attributes are more interesting than identity', and my own #CISnapa talk went so far as to argue we should forget about identity altogether. There was a change in the air, but still, it was all pretty theoretical.
Twelve months on, and the Attributes push has become entirely practical. If there was a Word Cloud for the NSTIC session, my hunch is that "attributes" would dominate over "identity". Several live NSTIC pilots are all about the Attributes.
ID.me is a new company started by US military veterans, with the aim of improving access for the veterans community to discounted goods and services and other entitlements. Founders Matt Thompson and Blake Hall are not identerati -- they're entirely focused on improving online access for their constituents to a big and growing range of retailers and services, and offer a choice of credentials for proving veterans bona fides. It's central to the ID.me model that users reveal as little as possible about their personal identities, while having their veterans' status and entitlements established securely and privately.
Another NSTIC pilot Relying Party is the financial service sector infrastructure provider Broadridge. Adrian Chernoff, VP for Digital Strategy, gave a compelling account of the need to change business models to take maximum advantage of digital identity. Broadridge recently announced a JV with Pitney Bowes called Inlet, which will enable the secure sharing of discrete and validated attributes - like name, address and social security number - in an NSTIC compliant architecture.
Yesterday I said in my CISmcc diary that I hoped to change my mind at #CISmcc about something, and half way through Day 2, I was delighted it was already happening. I've got a new attitude about NSTIC.
Over the past six months, I had come to fear NSTIC (http://www.nist.gov/nstic/) had lost its way. It's hard to judge totally accurately when lurking on the webcast from Sydney (at 4:00am) but the last plenary seemed pedestrian to me. And I'm afraid to say that some NSTIC committees have got a little testy. But today's NSTIC session here was a turning point. Not only are there a number of truly exciting pilots showing real progress, but Jeremy Grant has credible plans for improving accountability and momentum, and the new technology lead Paul Grassi is thinking outside the box and speaking out of school. The whole program seems fresh all over again.
In a packed presentation, Grassi impressed me enormously on a number of points:
- Firstly, he advocates a pragmatic NSTIC-focused extension of the old US government Authentication Guide NIST SP 800-63. Rather than a formal revision, a companion document might be most realistic. Along the way, Grassi really nailed an issue which we identity professionals need to talk about more: language. He said that there are words in 800-63 that are "never used anywhere else in systems development". No wonder, as he says, it's still "hard to implement identity"!
- Incidentally I chatted some more with Andrew Hughes about language; he is passionate about terms, and highlights that our term "Relying Party" is an especially terrible distraction for Service Providers whose reason-for-being has nothing to do with "relying" on anyone!
- Secondly, Paul Grassi wants to "get very aggressive on attributes", including emphasis on practical measurement (since that's really what NIST is all about). I don't think I need to say anything more about that than Bravo!
- And thirdly, Grassi asked "What if we got rid of LOAs?!". This kind of iconoclastic thinking is overdue, and was floated as part of a broad push to revamp the way government's orthodox thinking on Identity Assurance is translated to the business world. Grassi and Grant don't say LOAs can or should be abandoned by government, but they do see that shoving the rounded business concepts of identity into government's square hole has not done anyone much credit.
Just one small part of NSTIC annoyed me today: the persistent idea that federation hubs are inherently simpler than one-to-one authentication. They showed the following classic sort of 'before and after' shots, where it seems self-evident that a hub (here the Federal Cloud Credential Exchange FCCX) reduces complexity. The reality is that multilateral brokered arrangements between RPs and IdPs are far more complex than simple bilateral direct contracts. And moreover, the new forms of agreements are novel and untested in real world business. The time, cost and unpredictability of working out these new arrangements are not properly accounted for and have often been fatal to identity federations.
The dog barks and this time the caravan turns around
One of the top talking points at #CISmcc has of course been FIDO. The FIDO Alliance goes from strength to strength; we heard they have over 130 members now (remember it started with four or five less than 18 months ago). On Saturday afternoon there was a packed-out FIDO showcase with six vendors showing real FIDO-ready products. And today there was a three hour deep dive into the two flagship FIDO protocols: UAF (which enables better sharing of strong authentication signals such that passwords may be eliminated) and U2F (which standardises and strengthens Two Factor Authentication).
FIDO's marketing messages are improving all the time, thanks to a special focus on strategic marketing which was given its own working group. In particular, the Alliance is steadily clarifying the distinction between identity and authentication, and sticking adamantly to the latter. In other words, FIDO is really all about the attributes. FIDO leaves identity as a problem to be addressed further up the stack, and dedicates itself to strengthening the authentication signal sent from end-point devices to servers.
The protocol tutorials were excellent, going into detail about how "Attestation Certificates" are used to convey the qualities and attributes of authentication hardware (such as device model, biometric modality, security certifications, elapsed time since last user verification etc) thus enabling nice fine-grained policy enforcement on the RP side. To my mind, UAF and U2F show how nature intended PKI to have been used all along!
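To make the idea of fine-grained policy enforcement on the RP side a little more concrete, here is a minimal sketch in Python. It is not the FIDO wire format or any real FIDO library; the class names, field names and thresholds are all hypothetical, chosen only to mirror the kinds of attributes mentioned above (device model, biometric modality, security certification, elapsed time since last user verification).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthenticatorAttributes:
    """Hypothetical attributes an RP might learn about an authenticator."""
    device_model: str
    biometric_modality: str              # e.g. "fingerprint" or "none"
    security_certified: bool
    seconds_since_user_verification: int


@dataclass(frozen=True)
class RelyingPartyPolicy:
    """Hypothetical RP policy expressed over those attributes."""
    allowed_modalities: frozenset
    require_certification: bool
    max_seconds_since_verification: int


def is_acceptable(attrs: AuthenticatorAttributes, policy: RelyingPartyPolicy) -> bool:
    """Return True only if every policy condition is satisfied."""
    modality_ok = attrs.biometric_modality in policy.allowed_modalities
    certification_ok = attrs.security_certified or not policy.require_certification
    freshness_ok = (
        attrs.seconds_since_user_verification
        <= policy.max_seconds_since_verification
    )
    return modality_ok and certification_ok and freshness_ok


# Illustrative policy: certified fingerprint devices, verified in the last minute.
strict_policy = RelyingPartyPolicy(
    allowed_modalities=frozenset({"fingerprint"}),
    require_certification=True,
    max_seconds_since_verification=60,
)

fresh_device = AuthenticatorAttributes("ExampleModel-1", "fingerprint", True, 15)
stale_device = AuthenticatorAttributes("ExampleModel-1", "fingerprint", True, 600)

print(is_acceptable(fresh_device, strict_policy))
print(is_acceptable(stale_device, strict_policy))
```

The point of the sketch is that the RP's decision turns entirely on attributes of the authenticator, and says nothing about who the user is.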
Some confusion remains as to why FIDO has two protocols. I heard some quiet calls for UAF and U2F to converge, yet that would seem to put the elegance of U2F at risk. And it's noteworthy that U2F is being taken beyond the original one time password 2FA, with at least one biometric vendor at the showcase claiming to use it instead of the heavier UAF.
Surprising use cases
Finally, today brought more fresh use cases from cohorts of users we socially privileged identity engineers for the most part rarely think about. Another NSTIC pilot partner is AARP, a membership organization providing "information, advocacy and service" to older people, retirees and other special needs groups. AARP's Jim Barnett gave a compelling presentation on the need to extend from the classic "free" business models of Internet services, to new economically sustainable approaches that properly protect personal information. Barnett stressed that "free" has been great and 'we wouldn't be where we are today without it' but it's just not going to work for health records for example. And identity is central to that.
There's so much more I could report if I had time. But I need to get some sleep before another packed day. All this changing my mind is exhausting.
Cheers again from Monterey.
We live in an age where billionaires are self-made on the back of the most intangible of assets – the information they have amassed about us. That information used to be volunteered in forms and questionnaires and contracts but increasingly personal information is being observed and inferred.
The modern world is awash with data. It’s a new and infinitely re-usable raw material. Most of the raw data about us is an invisible by-product of our mundane digital lives, left behind by the gigabyte by ordinary people who do not perceive it let alone understand it.
Many Big Data and digital businesses proceed on the basis that all this raw data is up for grabs. There is a particular widespread assumption that data in the "public domain" is free-for-all, and if you’re clever enough to grab it, then you’re entitled to extract whatever you can from it.
In the webinar, I'll try to show how some of these assumptions are naive. The public is increasingly alarmed about Big Data and averse to unbridled data mining. Excessive data mining isn't just subjectively 'creepy'; it can be objectively unlawful in many parts of the world. Conventional data protection laws turn out to be surprisingly powerful in the face of Big Data. Data miners ignore international privacy laws at their peril!
Today there are all sorts of initiatives trying to forge a new technology-privacy synthesis. They go by names like "Privacy Engineering" and "Privacy by Design". These are well meaning efforts but they can be a bit stilted. They typically overlook the strengths of conventional privacy law, and they can miss an opportunity to engage the engineering mind.
It’s not politically correct but I believe we must admit that privacy is full of contradictions and competing interests. We need to be more mature about privacy. Just as there is no such thing as perfect security, there can never be perfect privacy either. And that is where the professional engineering mindset should be brought in, to help deal with conflicting requirements.
If we’re serious about Privacy by Design and Privacy Engineering then we need to acknowledge the tensions. That’s some of the thinking behind Constellation's new Big Privacy compact. To balance privacy and Big Data, we need to hold a conversation with users that respects the stresses and strains, and involves them in working through the new privacy deal.
The webinar will cover these highlights of the Big Privacy pact:
- Respect and Restraint
- Super transparency
- And a fair deal for Personal Information.
Have a disruptive technology implementation story? Get recognised for your leadership. Apply for the 2014 SuperNova Awards for leaders in disruptive technology.
The latest Snowden revelations include the NSA's special programs for extracting photos and identifying people from the Internet. Amongst other things the NSA uses its vast information resources to correlate location cues in photos -- buildings, streets and so on -- with satellite data, to work out where people are. They even search especially for passport photos, because these are better fodder for facial recognition algorithms. The audacity of these government surveillance activities continues to surprise us, and their secrecy is abhorrent.
Yet an ever greater scale of private sector surveillance has been going on for years in social media. With great pride, Facebook recently revealed its R&D in facial recognition. They showcased the brazenly named "DeepFace" biometric algorithm, which is claimed to be 97% accurate in recognising faces from regular images. Facebook has made a swaggering big investment in biometrics.
Data mining needs raw material, there's lots of it out there, and Facebook has been supremely clever at attracting it. It's been suggested that 20% of all photos now taken end up in Facebook. Even three years ago, Facebook held 10,000 times as many photographs as the Library of Congress.
And Facebook will spend big buying other photo lodes. Last year they tried to buy Snapchat for the spectacular sum of three billion dollars. The figure had pundits reeling. How could a start-up company with 30 people be worth so much? All the usual dot com comparisons were made; the offer seemed a flight of fancy.
But no, the offer was a rational consideration for the precious raw material that lies buried in photo data.
Snapchat generates at least 100 million new images every day. Three billion dollars was, pardon me, a snap. I figure that at a ballpark internal rate of return of 10%, a $3B investment is equivalent to $300M p.a. so even if the Snapchat volume stopped growing, Facebook would have been paying one cent for every new snap, in perpetuity.
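The back-of-envelope figure above can be checked with a few lines of Python. This is only a sketch under the assumptions stated in the text (a $3B price, a 10% rate of return, 100 million snaps a day with no growth); the 365-day year and the function name are my own additions for illustration.

```python
# Back-of-envelope check of the cost per snap quoted in the text.
PURCHASE_PRICE_USD = 3_000_000_000   # the reported offer
ASSUMED_RATE_OF_RETURN = 0.10        # ballpark internal rate of return
SNAPS_PER_DAY = 100_000_000          # "at least 100 million new images every day"
DAYS_PER_YEAR = 365                  # assumption: flat volume all year round


def cost_per_snap_cents(price_usd, rate, snaps_per_day, days_per_year=DAYS_PER_YEAR):
    """Annualised cost of the investment divided by annual snap volume, in cents."""
    annual_cost_usd = price_usd * rate
    annual_snaps = snaps_per_day * days_per_year
    return 100 * annual_cost_usd / annual_snaps


cents = cost_per_snap_cents(PURCHASE_PRICE_USD, ASSUMED_RATE_OF_RETURN, SNAPS_PER_DAY)
print(f"Annualised cost: ${PURCHASE_PRICE_USD * ASSUMED_RATE_OF_RETURN:,.0f}")
print(f"Cost per snap: {cents:.2f} cents")
```

On these assumptions the result comes out a little under one cent per snap, which is consistent with the rough figure above; any growth in snap volume would push it lower still.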
These days, we have learned from Snowden and the NSA that communications metadata is just as valuable as the content of our emails and phone calls. So remember that it's the same with photos. Each digital photo comes from a device that embeds metadata within the image, usually including the time and place the picture was taken. And of course each Instagram or Snapchat is a social post, sent by an account holder with a history and rich context in which the image yields intimate real time information about what they're doing, when and where.
- When you access or use our Services, we automatically collect information about you, including:
- Usage Information: When you send or receive messages via our Services, we collect information about these messages, including the time, date, sender and recipient of the Snap. We also collect information about the number of messages sent and received between you and your friends and which friends you exchange messages with most frequently.
- Log Information: We log information about your use of our websites, including your browser type and language, access times, pages viewed, your IP address and the website you visited before navigating to our websites.
- Device Information: We may collect information about the computer or device you use to access our Services, including the hardware model, operating system and version, MAC address, unique device identifier, phone number, International Mobile Equipment Identity ("IMEI") and mobile network information. In addition, the Services may access your device's native phone book and image storage applications, with your consent, to facilitate your use of certain features of the Services.
Snapchat goes on to declare it may use any of this information to "personalize and improve the Services and provide advertisements, content or features that match user profiles or interests" and it reserves the right to share any information with "vendors, consultants and other service providers who need access to such information to carry out work on our behalf".
So back to the data mining: nothing stops Snapchat -- or a new parent company -- running biometric facial recognition over the snaps as they pass through the servers, to extract additional "profile" information. And there's an extra kicker that makes Snapchats extra valuable for biometric data miners. The vast majority of Snapchats are selfies. So if you extract a biometric template from a snap, you already know who it belongs to, without anyone having to tag it. Snapchat would provide a hundred million auto-calibrations every day for facial recognition algorithms! On Facebook, the privacy aware turn off photo tagging, but with Snapchats, self identification is inherent to the experience and is unlikely ever to be disabled.
As I've discussed before, the morbid thrill of Snowden's spying revelations has tended to overshadow his sober observations that when surveillance by the state is probably inevitable, we need to be discussing accountability.
While we're all ventilating about the NSA, it's time we also attended to private sector spying and properly debated the restraints that may be appropriate on corporate exploitation of social data.
Personally I'm much more worried that an infomopoly has all my selfies.
I've just completed a major new Constellation Research report looking at how today's privacy practices cope with Big Data. The report draws together my longstanding research on the counter-intuitive strengths of technology-neutral data protection laws, and melds it with my new Constellation colleagues' vast body of work in data analytics. The synergy is honestly exciting and illuminating.
Big Data promises tremendous benefits for a great many stakeholders but the potential gains are jeopardised by the excesses of a few. Some cavalier online businesses are propelled by a naive assumption that data in the "public domain" is up for grabs, and with that they often cross a line.
For example, there are apps and services now that will try to identify pictures you take of strangers in public, by matching them biometrically against data supersets compiled from social networking sites and other publicly accessible databases. Many find such offerings quite creepy but they may be at a loss as to what to do about it, or even how to think through the issues objectively. Yet the very metaphor of data mining holds some of the clues. If, as some say, raw data is like crude oil, just waiting to be mined and exploited by enterprising prospectors, then surely there are limits, akin to mining permits?
Many think the law has not kept pace with technology, and that digital innovators are free to do what they like with any data they can get their hands on. But technologists repeatedly underestimate the strength of conventional data protection laws and regulations. The extraction of PII from raw data may be interpreted under technology neutral privacy principles as an act of Collection and as such is subject to existing statutes. Around the world, Google thus found they are not actually allowed to gather Personal Data that happens to be available in unencrypted Wi-Fi transmission as StreetView cars drive by homes and offices. And Facebook found they are not actually allowed to automatically identify people in photos through face recognition without consent. And Target probably would find, if they tried it outside the USA, that they cannot flag selected female customers as possibly pregnant by analysing their buying habits.
On the other hand, orthodox privacy policies and static user agreements do not cater for the way personal data can be conjured tomorrow from raw data collected today. Traditional privacy regimes require businesses to spell out what personally identifiable information (PII) they collect and why, and to restrict secondary usage. Yet with Big Data, with the best will in the world, a company might not know what data analytics will yield down the track. If mutual benefits for business and customer alike might be uncovered, a freeze-frame privacy arrangement may be counter-productive.
Thus the fit between data analytics and data privacy standards is complex and sometimes surprising. While existing laws are not to be underestimated, we do need something new. As far as I know it was Ray Wang in his Harvard Business Review blog who first called for a fresh privacy compact amongst users and businesses.
The spirit of data privacy is simply framed: organisations that know us should respect the knowledge they have, they should be open about what they know, and they should be restrained in what they do with it. In the Age of Big Data, let's have businesses respect the intelligence they extract from data mining, just as they should respect the knowledge they collect directly through forms and questionnaires.
I like the label "Big Privacy"; it is grandly optimistic, like "Big Data" itself, and at the same time implies a challenge to do better than regular privacy practices.
Ontario Privacy Commissioner Dr Ann Cavoukian writes about Big Privacy, describing it simply as "Privacy By Design writ large". But I think there must be more to it than that. Big Data is quantitatively but also qualitatively different from ordinary data analysis.
To summarise the basic elements of a Big Data compact:
- Respect and Restraint: In the face of Big Data’s temptations, remember that privacy is not only about what we do with PII; just as important is what we choose not to do.
- Super transparency: Who knows what lies ahead in Big Data? If data privacy means being open about what PII is collected and why, then advanced privacy means going further, telling people more about the business models and the sorts of results data mining is expected to return.
- Engage customers in a fair deal for PII: Information businesses ought to set out what PII is really worth to them (especially when it is extracted in non-obvious ways from raw data) and offer a fair "price" for it, whether in the form of "free" products and services, or explicit payment.
- Really innovate in privacy: There’s a common refrain that “privacy hampers innovation” but often that's an intellectually lazy cover for reserving the right to strip-mine PII. Real innovation lies in business practices which create and leverage PII while honoring privacy principles.
My report, "Big Privacy" Rises to the Challenges of Big Data may be downloaded from the Constellation Research website.
In one of the most highly anticipated sessions ever at the annual South-by-Southwest (SXSW) culture festival, NSA whistle blower Ed Snowden appeared via live video link from Russia. He joined two privacy and security champions from the American Civil Liberties Union – Chris Soghoian and Ben Wizner – to canvass the vexed tensions between intelligence and law enforcement, personal freedom, government accountability and digital business models.
These guys traversed difficult ground, with respect and much nuance. They agreed the issues are tough, and that proper solutions are non-obvious and slow-coming. The transcript is available here.
Yet afterwards the headlines and tweet stream were dominated by "Snowden's Tips" for personal online security. It was as if Snowden had been conducting a self-help workshop or a Cryptoparty. He was reported to recommend we encrypt our hard drives, encrypt our communications, and use Tor (the special free-and-open-source encrypted browser). These are mostly fine suggestions but I am perplexed why they should be the main takeaways from a complex discussion. Are people listening to Snowden's broader and more general policy lessons? I fear not. I believe people still conflate secrecy and privacy. At the macro level, the confusion makes it difficult to debate national security policy properly; at a micro level, even if crypto was practical for typical citizens, it is not a true privacy measure. Citizens need so much more than secrecy technologies, whether it's SSL-always-on at web sites, or do-it-yourself encryption.
Ed Snowden is a remarkably measured and thoughtful commentator on national security. Despite being hounded around the world, he is not given to sound bites. His principal concerns appear to be around public accountability, oversight and transparency. He speaks of the strengths and weaknesses of the governance systems already in place; he urges Congress to hold security agency heads to account.
When drawn on questions of technology, he doesn't dispense casual advice; instead he calls for multifaceted responses to our security dilemmas: more cryptological research, better random number generators, better testing, more robust cryptographic building blocks and more careful product design. Deep, complicated engineering stuff.
So how did the media, both mainstream and online alike, distill Snowden's sweeping analysis of politics, policy and engineering into three sterile and quasi-survivalist snippets?
Partly it's due to the good old sensationalism of all modern news media: everyone likes a David-and-Goliath angle where individuals face off against pitiless governments. And there's also the ruthless compression: newspapers cater for an audience with school-age reading levels and attention spans, and Twitter clips our contributions to 140 characters.
But there is also a deeper over-simplification of privacy going on which inhibits our progress.
Too often, people confuse privacy with secrecy. Privacy gets framed as a need to hide from prying eyes, and from that starting position, many advocates descend into a combative, everyone-for-themselves mindset.
However privacy has very little to do with secrecy. We shouldn't have to go underground to enjoy that fundamental human right to be let alone. The social reality is that most of us wish to lead rich and quite public lives. We actually want others to know us – to know what we do, what we like, and what we think – but all within limits. Digital privacy (or more clinically, data protection) is not about hiding; rather it is a state where those who know us are restrained in what they do with the knowledge they have about us.
Privacy is the protection you need when your affairs are not confidential!
So encryption is a sterile and very limited privacy measure. As the SXSW panellists agreed, today's encryption tools really are the preserve of deep technical specialists. Ben Wizner quipped that if the question is how can average users protect themselves online, and the answer is Tor, then "we have failed".
And the problems with cryptography are not just usability and customer experience. A fundamental challenge with the best encryption is that everyone needs to be running the tools. You cannot send out encrypted email unilaterally – you need to first make sure all your correspondents have installed the right software and they've got trusted copies of your encryption keys, or they won't be able to unscramble your messages.
Chris Soghoian also nailed the business problem that current digital revenue models are largely incompatible with encryption. The wondrous free services we enjoy from the Googles and Facebooks of the world are funded in the main by mining our data streams, figuring out our interests, habits and connections, and monetising that synthesised information. The web is in fact bankrolled by surveillance – by Big Business as opposed to government.
End-to-end encryption prevents data mining and would ruin the business model of the companies we've become attached to. If we were to get serious with encryption, we may have to cough up the true price for our modern digital lifestyles.
The SXSW privacy and security panellists know all this. Snowden in particular spent much of his time carefully reiterating many of the basics of data privacy. For instance he echoed the Collection Limitation Principle when he said of large companies that they "can't collect any data; [they] should only collect data and hold it for as long as necessary for the operation of the business". And the Openness Principle: "data should not be collected without people's knowledge and consent". If I was to summarise Snowden's SXSW presentation, I'd say privacy will only be improved by reforming the practices of both governments and big businesses, and by putting far more care into digital product development. Ed Snowden himself doesn't promote neat little technology tips.
It's still early days for the digital economy. We're experiencing an online re-run of the Wild West, with humble users understandably feeling forced to take measures into their own hands. So many individuals have become hungry for defensive online tools and tips. But privacy is more about politics and regulation than technology. I hope that people listen more closely to Ed Snowden on policy, and that his lasting legacy is more about legal reform and transparency than Do-It-Yourself encryption.
This is the abstract of a current privacy conference proposal.
Many Big Data and online businesses proceed on a naive assumption that data in the "public domain" is up for grabs; technocrats are often surprised that conventional data protection laws can be interpreted to cover the extraction of PII from raw data. On the other hand, orthodox privacy frameworks don't cater for the way PII can be created in future from raw data collected today. This presentation will bridge the conceptual gap between data analytics and privacy, and offer new dynamic consent models to civilize the trade in PII for goods and services.
It’s often said that technology has outpaced privacy law, yet by and large that's just not the case. Technology has certainly outpaced decency, with Big Data and biometrics in particular becoming increasingly invasive. However OECD data privacy principles set out over thirty years ago still serve us well. Outside the US, rights-based privacy law has proven effective against today's technocrats' most worrying business practices, based as they are on taking liberties with any data that comes their way. To borrow from Niels Bohr, technologists who are not surprised by data privacy have probably not understood it.
The cornerstone of data privacy in most places is the Collection Limitation principle, which holds that organizations should not collect Personally Identifiable Information beyond their express needs. It is the conceptual cousin of security's core Need-to-Know Principle, and the best starting point for Privacy-by-Design. The Collection Limitation principle is technology neutral and thus blind to the manner of collection. Whether PII is collected directly by questionnaire or indirectly via biometric facial recognition or data mining, data privacy laws apply.
If anonymity is important, what is the legal basis for defending it?
I find that conventional data privacy law in most places around the world already protects anonymity, insofar as the act of de-anonymization represents an act of PII Collection - the creation of a named record. As such, de-anonymization cannot be lawfully performed without an express need to do so, or consent.
Cynics have been asking the same rhetorical question "is privacy dead?" for at least 40 years. Certainly information technology and ubiquitous connectivity have made it nearly impossible to hide, and so anonymity is critically ill. But privacy is not the same thing as secrecy; privacy is a state where those who know us, respect the knowledge they have about us. Privacy generally doesn't require us hiding from anyone; it requires restraint on the part of those who hold Personal Information about us.
The typical public response to data breaches, government surveillance and invasions like social media facial recognition is vociferous. People in general energetically assert their rights not to be tracked online, and not to have their personal information exploited behind their backs. These reactions show that the idea of privacy is alive and well.
The end of anonymity perhaps
Against a backdrop of spying revelations and excesses by social media companies especially in regards to facial recognition, there have been recent calls for a "new jurisprudence of anonymity"; see Yale law professor Jed Rubenfeld writing in the Washington Post of 13 Jan 2014. I wonder if there is another way to crack the nut? Because any new jurisprudence is going to take a very long time.
Instead, I suggest we leverage the way most international privacy law and privacy experience -- going back decades -- is technology neutral with regards to the method of collection. In some jurisdictions like Australia, the term "collection" is not even defined in privacy law. Instead, the law just uses the normal plain English sense of the word, when it frames principles like Collection Limitation: basically, you are not allowed to collect (by any means) Personally Identifiable Information without a reasonable, express reason. It means that if PII gets into a data system, the system is accountable under privacy law for that PII, no matter how it got there.
This technology neutral view of PII collection has satisfying ramifications for all the people who intuit that Big Data has got too "creepy". We can argue that if a named record is produced afresh by a Big Data process (especially if that record is produced without the named person being aware of it, and from raw data that was originally collected for some other purpose) then that record has logically been collected. Whether PII is collected directly, or collected indirectly, or is in fact created by an obscure process, privacy law is largely agnostic.
Prof Rubenfeld wrote:
- "The NSA program isn’t really about gathering data. It's about mining data. All the data are there already, digitally stored and collected by telecom giants, just waiting." [italics in original]
I suggest that the output of the data mining, if it is personally identifiable and especially if it has been rendered identifiable by processing previously anonymous raw data, is a fresh collection by the mining operation. As such, the miners should be accountable for their newly minted PII, just as though they had gathered it directly from the persons concerned.
For now, I don't want to go further and argue the rights and wrongs of surveillance. I just want to show a new way to frame the privacy questions in surveillance and big data, making use of existing jurisprudence. If I am right and the NSA is in effect collecting PII as it goes about its data mining, then that provides a possibly fresh understanding of what's going on, within which we can objectively analyse the rights and wrongs.
I am actually the first to admit that within this frame, the NSA might still be justified in mining data, and there might be no actual technical breach of information privacy law, if for instance the NSA enjoys a law enforcement exemption. These are important questions that need to be debated, but elsewhere (see my recent blog on our preparedness to actually have such a debate). My purpose right now is to frame a way to defend anonymity using as much existing legal infrastructure as possible.
But Collection is not limited everywhere
There is an important legal-technical question in all this: Is the collection of PII actually regulated? In Europe, Australia, New Zealand and in dozens of countries, collection is limited, but in the USA, there is no general restriction against collecting PII. America has no broad data protection law, and in any case, the Fair Information Practice Principles (FIPPs) don't include a Collection Limitation principle.
So there may be few regulations in the USA that would carry my argument there! Nevertheless, surely we can use international jurisprudence in Collection Limitation instead of creating new American jurisprudence around anonymity?
So I'd like to put the following questions to Jed Rubenfeld:
- Do technology neutral Collection Limitation Principles in theory provide a way to bring de-anonymised data into scope for data privacy laws? Is this a way to address people's concerns with Big Data?
- How does international jurisprudence around Collection Limitation translate to American schools of legal thought?
- Does this way of looking at the problem create new impetus for Collection Limitation to be introduced into American privacy principles, especially the FIPPs?
Appendix: "Applying Information Privacy Norms to Re-Identification"
In 2013 I presented some of these ideas to an online symposium at the Harvard Law School Petrie-Flom Center, on the Law, Ethics & Science of Re-identification Demonstrations. What follows is an extract from that presentation, in which I spell out carefully the argument -- which was not obvious to some at the time -- that when genetics researchers combine different data sets to demonstrate re-identification of donated genomic material, they are in effect collecting patient PII. I argue that this type of collection should be subject to ethics committee approval just as if the researchers were collecting the identities from the patients directly.
... I am aware of two distinct re-identification demonstrations that have raised awareness of the issues recently. In the first, Yaniv Erlich [at MIT's Whitehead Institute] used what I understand are new statistical techniques to re-identify a number of subjects that had donated genetic material anonymously to the 1000 Genomes project. He did this by correlating genes in the published anonymous samples with genes in named samples available from genealogical databases. The 1000 Genomes consent form reassured participants that re-identification would be "very hard". In the second notable demo, Latanya Sweeney re-identified volunteers in the Personal Genome Project using her previously published method of using a few demographic values (such as date of birth, sex and postal code) extracted from the otherwise anonymous records.
A great deal of the debate around these cases has focused on the consent forms and the research subjects’ expectations of anonymity. These are important matters for sure, yet for me the ethical issue in de-anonymisation demonstrations is more about the obligations of third parties doing the identification who had nothing to do with the original informed consent arrangements. The act of recording a person’s name against erstwhile anonymous data represents a collection of personal information. The implications for genomic data re-identification are clear.
Let’s consider Subject S who donates her DNA, ostensibly anonymously, to a Researcher R1, under some consent arrangement which concedes there is a possibility that S will be re-identified. And indeed, some time later, an independent researcher R2 does identify S and links her to the DNA sample. The fact is that R2 has collected personal information about S. If R2 has no relationship with S, then S has not consented to this new collection of her personal information.
Even if the consent form signed at the time of the original collection includes a disclaimer that absolute anonymity cannot be guaranteed, re-identifying the DNA sample later represents a new collection, one that has been undertaken without any consent. Given that S has no knowledge of R2, there can be no implied consent in her original understanding with R1, even if absolute anonymity was disclaimed.
Naturally the re-identification demonstrations have served a purpose. It is undoubtedly important that the limits of anonymity be properly understood, and the work of Yaniv and Latanya contributes to that. Nevertheless, these demonstrations were undertaken without the knowledge, much less the consent, of the individuals concerned. I contend that bioinformaticians using clever techniques to attach names to anonymous samples need ethics approval, just as they would if they were taking fresh samples from the people concerned.
See also my letter to the editor of Science magazine.
Yesterday it was reported by The Verge that anonymous hackers have accessed Snapchat's user database and posted 4.6 million user names and phone numbers. In an apparent effort to soften the blow, two digits of the phone numbers were redacted. So we might assume this is a "white hat" exercise, designed to shame Snapchat into improving their security. Indeed, a few days ago Snapchat themselves said they had been warned of vulnerabilities in their APIs that would allow a mass upload of user records.
The response of many has been, well, so what? Some people have casually likened Snapchat's list to a public White Pages; others have played it down as "just email addresses".
Let's look more closely. The leaked list was not in fact public names and phone numbers; it was user names and phone numbers. User names might often be email addresses but these are typically aliases; people frequently choose email addresses that reveal little or nothing of their real world identity. We should assume there is intent in an obscure email address for the individual to remain secret.
Identity theft has become a highly organised criminal enterprise. Crime gangs patiently acquire multiple data sets over many months, sometimes years, gradually piecing together detailed personal profiles. It's been shown time and time again by privacy researchers (perhaps most notably Latanya Sweeney) that re-identification is enabled by linking diverse data sets. And for this purpose, email addresses and phone numbers are superbly valuable indices for correlating an individual's various records. Your email address is common across most of your social media registrations. And your phone number allows your real name and street address to be looked up from reverse White Pages. So the Snapchat breach could be used to join aliases or email addresses to real names and addresses via the phone numbers. For a social engineering attack on a call centre -- or even to open a new bank account -- an identity thief can go an awful long way with real name, street address, email address and phone number.
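To make the linkage concrete, here is a minimal sketch in Python of how a list of aliases and phone numbers can be joined to a reverse directory. All of the records, field names and the link_by_phone helper are made up for illustration; they are assumptions and do not describe the actual Snapchat data or any real directory service.

```python
# Hypothetical records, invented purely for illustration.
breached = [
    {"username": "nightowl77", "phone": "555-0101"},
    {"username": "quietreader", "phone": "555-0102"},
    {"username": "zzz_anon", "phone": "555-0199"},
]
directory = [
    {"phone": "555-0101", "name": "Alex Example", "address": "1 Sample St"},
    {"phone": "555-0102", "name": "Sam Placeholder", "address": "2 Demo Ave"},
]


def link_by_phone(breached_rows, directory_rows):
    """Join alias records to named directory records on the phone number."""
    # Index the directory by phone number so each lookup is a dict access.
    by_phone = {row["phone"]: row for row in directory_rows}
    linked = []
    for row in breached_rows:
        match = by_phone.get(row["phone"])
        if match is not None:
            # The alias is now attached to a real name and street address.
            linked.append({
                "username": row["username"],
                "name": match["name"],
                "address": match["address"],
                "phone": row["phone"],
            })
    return linked


profiles = link_by_phone(breached, directory)
```

Neither list names anyone against an alias on its own, yet the join yields a profile for every phone number that appears in both. The alias with no directory entry stays unlinked only until another data set turns up that carries its number.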
I was asked in an interview to compare the theft of phone numbers with the theft of social security numbers. I surprised the interviewer when I said phone numbers are probably even more valuable to the highly organised ID thief, for they can be used to index names in public directories, and to link different data sets, in ways that SSNs (or credit card numbers for that matter) cannot.
So let us start to treat all personal information -- especially when aggregated in bulk -- more seriously! And let's be more cautious in the way we categorise personal or Personally Identifiable Information (PII).
Importantly, most regulatory definitions of PII already embody the proper degree of caution. Look carefully at the US government definition of Personally Identifiable Information:
- information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other personal or identifying information that is linked or linkable to a specific individual (underline added).
This means that items of data can constitute PII if other data can be combined to identify the person concerned. That is, the fragments are regarded as PII even if it is the whole that does the identifying.
And remember that the middle I in PII stands for Identifiable, and not, as many people presume, Identifying. To meet the definition of PII, data need not uniquely identify a person; it merely needs to be directly or indirectly identifiable with a person. And this is how it should be when we heed the way information technologies enable identification through linkages.
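A toy calculation in Python shows why fragments deserve to be treated as PII. The four records, the field names and the unique_fraction helper below are hypothetical, chosen so that no single field singles anyone out while a combination of two fields singles out everyone.

```python
from collections import Counter

# Hypothetical "anonymous" records, invented for illustration.
records = [
    {"birth_year": 1970, "sex": "F", "postcode": "2000"},
    {"birth_year": 1980, "sex": "F", "postcode": "3000"},
    {"birth_year": 1980, "sex": "M", "postcode": "2000"},
    {"birth_year": 1970, "sex": "M", "postcode": "3000"},
]


def unique_fraction(rows, fields):
    """Fraction of rows whose values for `fields` are shared with no other row."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    singled_out = sum(1 for key in keys if counts[key] == 1)
    return singled_out / len(rows)


alone = unique_fraction(records, ["sex"])                  # 0.0
combined = unique_fraction(records, ["sex", "postcode"])   # 1.0
```

In this contrived set, each field taken alone is shared by two people, but sex and postcode together isolate every record. That is the sense in which data can be identifiable without being identifying.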
Almost anywhere else in the world, data stores like Snapchat's would automatically fall under data protection and information privacy laws; regulators would take a close look at whether the company had complied with the OECD Privacy Principles, and whether Snapchat's security measures were fit for purpose given the PII concerned. But in the USA, companies and commentators alike still have trouble working out how serious these breaches are. Each new breach is treated in an ad hoc manner, often with people finessing the difference between credit card numbers -- as in the recent Target breach -- and "mere" email addresses like those in the Snapchat and Epsilon episodes.
Surely the time has come to simply give proper regulatory protection to all PII.
Facebook's challenge to the Collection Limitation Principle
An extract from our chapter in the forthcoming Encyclopedia of Social Network Analysis and Mining (to be published by Springer in 2014).
Stephen Wilson, Lockstep Consulting, Sydney, Australia.
Anna Johnston, Salinger Privacy, Sydney, Australia.
- Facebook's business practices pose a risk of non-compliance with the Collection Limitation Principle (OECD Privacy Principle No. 1, and corresponding Australian National Privacy Principles NPP 1.1 through 1.4).
- Privacy problems will likely remain while Facebook's business model remains unsettled, for the business is largely based on collecting and creating as much Personal Information as it can, for subsequent and as yet unspecified monetization.
- If an OSN business doesn't know how it is eventually going to make money from Personal Information, then it has a fundamental difficulty with the Collection Limitation principle.
Facebook is an Internet and societal phenomenon. Launched in 2004, in just a few years it has claimed a significant proportion of the world's population as regular users, becoming by far the most dominant Online Social Network (OSN). With its success has come a good deal of controversy, especially over privacy. Does Facebook herald a true shift in privacy values? Or, despite occasional reckless revelations, are most users no more promiscuous than they were eight years ago? We argue it's too early to draw conclusions about society as a whole from the OSN experience to date. In fact, under laws that currently stand, many OSNs face a number of compliance risks in dozens of jurisdictions.
Over 80 countries worldwide now have enacted data privacy laws, around half of which are based on privacy principles articulated by the OECD. Amongst these are the Collection Limitation Principle which requires businesses to not gather more Personal Information than they need for the tasks at hand, and the Use Limitation Principle which dictates that Personal Information collected for one purpose not be arbitrarily used for others without consent.
Overt collection, covert collection (including generation) and "innovative" secondary use of Personal Information are the lifeblood of Facebook. While Facebook's founder would have us believe that social mores have changed, a clash with orthodox data privacy laws creates challenges for the OSN business model in general.
This article examines a number of areas of privacy compliance risk for Facebook. We focus on how Facebook collects Personal Information indirectly, through the import of members' email address books for "finding friends", and by photo tagging. Taking Australia's National Privacy Principles from the Privacy Act 1988 (Cth) as our guide, we identify a number of potential breaches of privacy law, and issues that may be generalised across all OECD-based privacy environments.
Australian law tends to use the term "Personal Information" rather than "Personally Identifiable Information" although they are essentially synonymous for our purposes.
Terms of reference: OECD Privacy Principles and Australian law
The Organisation for Economic Cooperation and Development has articulated eight privacy principles for helping to protect personal information. The OECD Privacy Principles are as follows:
- 1. Collection Limitation Principle
- 2. Data Quality Principle
- 3. Purpose Specification Principle
- 4. Use Limitation Principle
- 5. Security Safeguards Principle
- 6. Openness Principle
- 7. Individual Participation Principle
- 8. Accountability Principle
Of most interest to us here are principles one and four:
- Collection Limitation Principle: There should be limits to the collection of personal data and any such data should be obtained by lawful and fair means and, where appropriate, with the knowledge or consent of the data subject.
- Use Limitation Principle: Personal data should not be disclosed, made available or otherwise used for purposes other than those specified in accordance with [the Purpose Specification] except with the consent of the data subject, or by the authority of law.
At least 89 countries have some sort of data protection legislation in place [Greenleaf, 2012]. Of these, in excess of 30 jurisdictions have derived their particular privacy regulations from the OECD principles. One example is Australia.
We will use Australia's National Privacy Principles (NPPs) in the Privacy Act 1988 as our terms of reference for analysing some of Facebook's systemic privacy issues. In Australia, Personal Information is defined as: information or an opinion (including information or an opinion forming part of a database), whether true or not, and whether recorded in a material form or not, about an individual whose identity is apparent, or can reasonably be ascertained, from the information or opinion.
Indirect collection of contacts
One of the most significant collections of Personal Information by Facebook is surely the email address book of those members that elect to have the site help "find friends". This facility provides Facebook with a copy of all contacts from the address book of the member's nominated email account. It's the very first thing that a new user is invited to do when they register. Facebook refer to this as "contact import" in the Data Use Policy (accessed 10 August 2012).
"Find friends" is curtly described as "Search your email for friends already on Facebook". A link labelled "Learn more" in fine print leads to the following additional explanation:
- "Facebook won't share the email addresses you import with anyone, but we will store them on your behalf and may use them later to help others search for people or to generate friend suggestions for you and others. Depending on your email provider, addresses from your contacts list and mail folders may be imported. You should only import contacts from accounts you've set up for personal use." [underline added by us].
Without any further elaboration, new users are invited to enter their email address and password if they have a cloud based email account (such as Hotmail, gmail, Yahoo and the like). These types of services have an API through which any third party application can programmatically access the account, after presenting the user name and password.
It is entirely possible that casual users will not fully comprehend what is happening when they opt in to have Facebook "find friends". Further, there is no indication that, by default, imported contact details are shared with everyone. The underlined text in the passage quoted above shows Facebook reserves the right to use imported contacts to make direct approaches to people who might not even be members.
Importing contacts represents an indirect collection by Facebook of Personal Information of others, without their authorisation or even knowledge. The short explanatory information quoted above is not provided to the individuals whose details are imported and therefore does not constitute a Collection Notice. Furthermore, it leaves the door open for Facebook to use imported contacts for other, unspecified purposes. The Data Use Policy imposes no limitations as to how Facebook may make use of imported contacts.
Privacy harms are possible in social networking if members blur the distinction between work and private lives. Recent research has pointed to the risky use of Facebook by young doctors, involving inappropriate discussion of patients [Moubarak et al, 2010]. Even if doctors are discreet in their online chat, we are concerned that they may run foul of the Find Friends feature exposing their connections to named patients. Doctors on Facebook who happen to have patients in their web mail address books can have associations between individuals and their doctors become public. In mental health, sexual health, family planning, substance abuse and similar sensitive fields, naming patients could be catastrophic for them.
While most healthcare professionals may use a specific workplace email account which would not be amenable to contacts import, many allied health professionals, counselors, specialists and the like run their sole practices as small businesses, and naturally some will use low cost or free cloud-based email services. Note that the substance of a doctor's communications with their patients over web mail is not at issue here. The problem of exposing associations between patients and doctors arises simply from the presence of a name in an address book, even if the email was only ever used for non-clinical purposes such as appointments or marketing.
Photo tagging and biometric facial recognition
One of Facebook's most "innovative" forms of Personal Information Collection would have to be photo tagging and the creation of biometric facial recognition templates.
Photo tagging and "face matching" has been available in social media for some years now. On photo sharing sites such as Picasa, this technology "lets you organize your photos according to the people in them" in the words of the Picasa help pages. But in more complicated OSN settings, biometrics has enormous potential to both enhance the services on offer and to breach privacy.
In thinking about facial recognition, we start once more with the Collection Principle. Importantly, nothing in the Australian Privacy Act circumscribes the manner of collection; no matter how a data custodian comes to be in possession of Personal Information (being essentially any data about a person whose identity is apparent) they may be deemed to have collected it. When one Facebook member tags another in a photo on the site, then the result is that Facebook has overtly but indirectly collected PI about the tagged person.
Facial recognition technologies are deployed within Facebook to allow its servers to automatically make tag suggestions; in our view this process constitutes a new type of Personal Information Collection, on a potentially vast scale.
Biometric facial recognition works by processing image data to extract certain distinguishing features (like the separation of the eyes, nose, ears and so on) and computing a numerical data set known as a template that is highly specific to the face, though not necessarily unique. Facebook's online help indicates that they create templates from multiple tagged photos; if a user removes a tag from one of their photos, that image is not used in the template.
Facebook subsequently makes tag suggestions when a member views photos of their friends. They explain the process thus:
- "We are able to suggest that your friend tag you in a picture by scanning and comparing your friend's pictures to information we've put together from the other photos you've been tagged in".
So we see that Facebook must be more or less continuously checking images from members' photo albums against its store of facial recognition templates. When a match is detected, a tag suggestion is generated and logged, ready to be displayed next time the member is online.
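The matching step can be pictured with a deliberately simplified sketch in Python. This is not Facebook's algorithm, which is not public. It assumes templates are short numeric vectors and that a match means the nearest stored template lies within a distance threshold. The vectors, the threshold value and the suggest_tag function are all hypothetical.

```python
import math

# Hypothetical stored templates: one numeric vector per tagged member.
templates = {
    "member_a": [0.10, 0.80, 0.30],
    "member_b": [0.90, 0.20, 0.50],
}


def suggest_tag(face_vector, known_templates, threshold=0.25):
    """Return the name of the nearest template if it is close enough, else None."""
    best_name, best_dist = None, float("inf")
    for name, template in known_templates.items():
        dist = math.dist(face_vector, template)  # Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    # Returning a name here is the moment a name is attached to a face
    # that carried no name in the uploaded photo.
    return best_name if best_dist <= threshold else None


suggestion = suggest_tag([0.12, 0.78, 0.33], templates)  # close to member_a
no_match = suggest_tag([0.50, 0.50, 0.90], templates)    # too far from both
```

However the real system is engineered, the output of any such comparison is a name held against an image, generated without the subject taking part.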
What concerns us is that the proactive creation of biometric matches constitutes a new type of PI Collection, for Facebook must be attaching names -- even tentatively, as metadata -- to photos. This is a covert and indirect process.
Photos of anonymous strangers are not Personal Information, but metadata that identifies people in those photos most certainly is. Thus facial recognition is converting hitherto anonymous data -- uploaded in the past for personal reasons unrelated to photo tagging let alone covert identification -- into Personal Information.
Facebook limits the ability to tag photos to members who are friends of the target. This is purportedly a privacy enhancing feature, but unfortunately Facebook has nothing in its Data Use Policy to limit the use of the biometric data compiled through tagging. Restricting tagging to friends is likely to actually benefit Facebook for it reduces the number of specious or mischievous tags, and it probably enhances accuracy by having faces identified only by those who know the individuals.
A fundamental clash with the Collection Limitation Principle
In Australian privacy law, as with the OECD framework, the first and foremost privacy principle concerns Collection. Australia's National Privacy Principle NPP 1 requires that an organisation refrain from collecting Personal Information unless (a) there is a clear need to collect that information; (b) the collection is done by fair means, and (c) the individual concerned is made aware of the collection and the reasons for it.
The core business model of many Online Social Networks is to take advantage of Personal Information, in many and varied ways. From the outset, Facebook founder, Mark Zuckerberg, appears to have been enthusiastic for information built up in his system to be used by others. In 2004, he told a colleague "if you ever need info about anyone at Harvard, just ask" (as reported by Business Insider). Since then, Facebook has experienced a string of privacy controversies, including the "Beacon" sharing feature in 2007, which automatically imported members' activities on external websites and re-posted the information on Facebook for others to see.
Facebook's privacy missteps are characterised by the company using the data it collects in unforeseen and barely disclosed ways. Yet this is surely what Facebook's investors expect the company to be doing: innovating in the commercial exploitation of personal information. The company's huge market valuation derives from a widespread faith in the business community that Facebook will eventually generate huge revenues. An inherent clash with privacy arises from the fact that Facebook is a pure play information company: its only significant asset is the information it holds about its members. There is a market expectation that this asset will be monetized and maximised. Logically, anything that checks the network's flux in Personal Information -- such as the restraints inherent in privacy protection, whether adopted from within or imposed from without -- must affect the company's futures.
Perhaps the toughest privacy dilemma for innovation in commercial Online Social Networking is that these businesses still don't know how they are going to make money from their Personal Information lode. Even if they wanted to, they cannot tell what use they will eventually make of it, and so a fundamental clash with the Collection Limitation Principle remains.
An earlier version of this article was originally published by LexisNexis in the Privacy Law Bulletin (2010).
- Greenleaf G., "Global Data Privacy Laws: 89 Countries, and Accelerating", Privacy Laws & Business International Report, Issue 115, Special Supplement, February 2012; Queen Mary School of Law Legal Studies Research Paper No. 98/2012
- Moubarak G., Guiot A. et al "Facebook activity of residents and fellows and its impact on the doctor--patient relationship" J Med Ethics, 15 December 2010
The cover of Newsweek magazine on 27 July 1970 featured an innocent couple being menaced by cameras and microphones and new technologies like computer punch cards and paper tape. The headline hollered “IS PRIVACY DEAD?”.
The same question has been posed every few years ever since.
In 1999, Sun Microsystems boss Scott McNealy told us we have “zero privacy” and should “get over it”; in 2008, Ed Giorgio from the Office of the US Director of National Intelligence chillingly asserted that “privacy and security are a zero-sum game”; Facebook’s Mark Zuckerberg proclaimed in 2010 that privacy was no longer a “social norm”. And now the scandal around secret surveillance programs like PRISM and the Five Eyes’ related activities looks like another fatal blow to privacy. But the fact that cynics, security zealots and information magnates have been asking the same rhetorical question for over 40 years suggests that the answer is No!
PRISM, as revealed by whistle blower Ed Snowden, is a Top Secret electronic surveillance program of the US National Security Agency (NSA) to monitor communications traversing most of the big Internet properties including, allegedly, Apple, Facebook, Google, Microsoft, Skype, Yahoo and YouTube. Relatedly, intelligence agencies have evidently also been obtaining comprehensive call records from major telephone companies, eavesdropping on international optic fibre cables, and breaking into the cryptography many take for granted online.
In response, forces lined up at tweet speed on both sides of the stereotypical security-privacy divide. The “hawks” say privacy is a luxury in these times of terror, if you've done nothing wrong you have nothing to fear from surveillance, and in any case, much of the citizenry evidently abrogates privacy in the way they take to social networking. On the other side, libertarians claim this indiscriminate surveillance is the stuff of the Stasi, and by destroying civil liberties, we let the terrorists win.
Governments of course are caught in the middle. President Obama defended PRISM on the basis that we cannot have 100% security and 100% privacy. Yet frankly that’s an almost trivial proposition. It's motherhood. And it doesn’t help to inform any measured response to the law enforcement challenge, for we don’t have any tools that would let us design a computer system to an agreed specification in the form of, say, “98% Security + 93% Privacy”. It’s silly to use the language of “balance” when we cannot measure the competing interests objectively.
Politicians say we need a community debate over privacy and national security, and they’re right (if not fully conscientious in framing the debate themselves). Are we ready to engage with these issues in earnest? Will libertarians and hawks venture out of their respective corners in good faith, to explore this difficult space?
I suggest one of the difficulties is that all sides tend to confuse privacy for secrecy. They’re not the same thing.
Privacy is a state of affairs where those who have Personal Information (PII) about us are constrained in how they use it. In daily life, we have few absolute secrets, but plenty of personal details. Not many people wish to live their lives underground; on the contrary we actually want to be well known by others, so long as they respect what they know about us. Secrecy is a sufficient but not necessary condition for privacy. Robust privacy regulations mandate strict limits on what PII is collected, how it is used and re-used, and how it is shared.
Therefore I am a privacy optimist. Yes, obviously too much PII has broken the banks in cyberspace, yet it is not necessarily the case that any “genie” is “out of the bottle”.
If PII falls into someone’s hands, privacy and data protection legislation around the world provides strong protection against re-use. For instance, in Australia Google was found to have breached the Privacy Act when its StreetView cars recorded unencrypted Wi-Fi transmissions; the company cooperated in deleting the data concerned. In Europe, Facebook’s generation of tag suggestions without consent by biometric processes was ruled unlawful; regulators there forced Facebook to cease facial recognition and delete all old templates.
We might have a better national security debate if we more carefully distinguished privacy and secrecy.
I see no reason why Big Data should not be a legitimate tool for law enforcement. I have myself seen powerful analytical tools used soon after a terrorist attack to search out patterns in call records in the vicinity to reveal suspects. Until now, there has not been the technological capacity to use these tools pro-actively. But with sufficient smarts, raw data and computing power, it is surely a reasonable proposition that – with proper and transparent safeguards in place – population-wide communications metadata can be screened to reveal organised crimes in the making.
A more sophisticated and transparent government position might ask the public to give up a little secrecy in the interests of national security. The debate should not be polarised around the falsehood that security and privacy are at odds. Instead we should be debating and negotiating appropriate controls around selected metadata to enable effective intelligence gathering while precluding unexpected re-use. If (and only if) credible and verifiable safeguards can be maintained to contain the use and re-use of personal communications data, then so can our privacy.
For me the awful thing about PRISM is not that metadata is being mined; it’s that we weren’t told about it. Good governments should bring the citizenry into their confidence.
Are we prepared to honestly debate some awkward questions?
- Has the world really changed in the past 10 years such that surveillance is more necessary now? Should the traditional balances of societal security and individual liberties enshrined in our traditional legal structures be reviewed for a modern world?
- Has the Internet really changed the risk landscape, or is it just another communications mechanism? Is the Internet properly accommodated by centuries old constitutions?
- How can we have confidence in government authorities to contain their use of communications metadata? Is it possible for trustworthy new safeguards to be designed?
Many years ago, cryptographers adopted a policy of transparency. They have forsaken secret encryption algorithms, so that the maths behind these mission critical mechanisms is exposed to peer review and ongoing scrutiny. Secret algorithms are fragile in the long term because it’s only a matter of time before someone exposes them and weakens their effectiveness. Security professionals have a saying: “There is no security in obscurity”.
For precisely the same reason, we must not have secret government monitoring programs either. If the case is made that surveillance is a necessary evil, then it would actually be in everyone’s interests for governments to run their programs out in the open.