A big part of my research agenda in the Digital Safety theme at Constellation is privacy. And what a vexed topic it is! It's hard to even know how to talk about privacy. For many years, folks have covered privacy in more or less academic terms, drawing on sociology, politics and pop psychology, joining privacy to human rights, and crafting various new legal models.
Meanwhile the data breaches get worse, and most businesses have just bumped along.
When you think about it, it’s obvious really: there’s no such thing as perfect privacy. The real question is not about ‘fundamental human rights’ versus business, but rather, how can we optimise a swarm of competing interests around the value of information?
Privacy is emerging as one of the most critical and strategic of our information assets. If we treat privacy as an asset, instead of a burden, businesses can start to cut through this tough topic.
But here’s an urgent issue. A recent regulatory development means privacy may just stop a lot of business getting done. It's the European Court of Justice decision to shut down the US-EU Safe Harbor arrangement.
The privacy Safe Harbor was a work-around negotiated by the US Department of Commerce, allowing companies to send personal data from Europe into the US.
But the Safe Harbor is no more. It's been ruled unlawful. So it’s a big, big problem for European operations, many multinationals, and especially US cloud service providers.
At Constellation we've researched cloud geography and previously identified competitive opportunities for service providers to differentiate and compete on privacy. But now this is an urgent issue.
It's time American businesses stopped getting caught out by global privacy rulings. There shouldn't be too many surprises here, if you understand what data protection means internationally. Even the infamous "Right To Be Forgotten" ruling on Google’s search engine – which strikes so many technologists as counterintuitive – was a rational and even predictable outcome of decades-old data privacy law.
The leading edge of privacy is all about Big Data. And we ain't seen nothin' yet!
Look at artificial intelligence, Watson Health, intelligent personal assistants, hackable cars, and the Internet of Everything where everything is instrumented, and you see information assets multiplying exponentially. Privacy is actually just one part of this. It’s another dimension of information, one that can add value, but not in a neat linear way. The interplay of privacy, utility, usability, efficiency, efficacy, security, scalability and so on is incredibly complex.
The broader issue is Digital Safety: safety for your customers, and safety for your business.
A new effort dubbed Project Enigma "guarantees" us privacy, by way of a certain technology. Never mind that Enigma's "magic" (their words) comes from the blockchain and that it's riddled with assumptions; the very idea of technology-based perfection in privacy is profoundly misguided.
Enigma is not alone; the vast majority of 'Privacy Enhancing Technologies' (PETs) are in fact secrecy or anonymity solutions. Anonymity is a blunt and fragile tool for privacy; if encryption, for instance, is broken, you still need the rule of law to stem abuse. I wonder why people still conflate privacy and anonymity. Plainly, privacy is the protection you need when your affairs are not secret.
In any event, few people need or want to live underground. We actually want merchants and institutions and employers and doctors to know us in reasonable detail, but we insist they exercise restraint in what they do with that knowledge.
Consider a utopian architecture where things could be made totally secret between you and a correspondent. How would you choose to share something with more than one party, like a health record, or a party invitation? How would you delegate someone to share something with others on your behalf? How would you withdraw permissions? How would it work in a heterogeneous IT environment? And above all, how would you control all the personal information created about you behind your back, unseen, beyond your reach?
Privacy is about restraint. It's less about what we do with someone’s personal information than what we don’t do. So it’s more political than technological. Privacy can only really be managed through rules. Of course rules and enforcement are imperfect, but let’s not be utopian about privacy. Just as there is no such thing as absolute security, there is no perfect privacy either.
For 35 years now, a body of data protection jurisprudence has been built on top of the original OECD Privacy Principles. The most elaborate and energetically enforced privacy regulations are in Europe (although well over 100 countries have privacy laws at last count). By and large, the European privacy regime is welcomed by the roughly 700 million citizens whose interests it protects.
Over the years, this legal machinery has produced results that occasionally surprise the rest of the world. Among these was the "Right To Be Forgotten", a ruling of the European Court of Justice (ECJ) which requires web search operators in some cases to block material that is inaccurate, irrelevant or excessive. And this week, the ECJ determined that the U.S. "Safe Harbor" arrangement (a set of pragmatic work-arounds that have permitted the import of personal information from Europe by American companies) is invalid.
These strike me as entirely logical outcomes of established technology-neutral privacy law. The Right To Be Forgotten simply treats search results as synthetic personal information, collected algorithmically, and applies regular privacy principles: if a business collects personal information, then lawful limits apply no matter how it's collected. And the self-regulated Safe Harbor was found to not provide the strength of safeguards that Europeans have come to expect. Its inadequacies are old news; action by the court has been a long time coming.
In parallel with steadily developing privacy law, an online business ecosystem has evolved, centred on the U.S. and based on the limitless resource that is information. Fabulous products, services and unprecedented economic success have flowed. But the digital rush (like gold and oil rushes before it) has brought calamity. A shaken American populace, subject to daily breaches, spying and exploitation, is left wondering who and what will ever keep them safe in cyberspace.
So it's honestly a mystery to me why every European privacy advance is met with such reflexive condemnation in America.
The OECD Privacy Principles safeguard individuals by controlling the flow of information about them. In the decades since the principles were framed, digital technologies and business models have radically expanded how information is created and how it moves. Personal information is now produced as if by magic (by wizards who make billions by their tricks). But the basic privacy principles are steadfastly the same, and are manifestly more important than ever. You know, that's what good laws are like.
A huge proportion of the American public would cheer for better data protection. We all know they deserve it. If American institutions had a better track record of respecting and protecting the data commons, then they'd be entitled to bluster about European privacy. But as things stand in Silicon Valley and Washington, moral outrage should be directed at the businesses and governments who sit on their hands over data breaches and surveillance, instead of those who do something about it.
The identerati sometimes refer to the challenge of “binding carbon to silicon”. That’s a poetic way of describing how the field of Identity and Access Management (IDAM) is concerned with associating carbon-based life forms (as geeks fondly refer to people) with computers (or silicon chips).
To securely bind users’ identities or attributes to their computerised activities is indeed a technical challenge. In most conventional IDAM systems, there is only circumstantial evidence of who did what and when, in the form of access logs and audit trails, most of which can be tampered with or counterfeited by a sufficiently determined fraudster. To create a lasting, tamper-resistant impression of what people do online requires some sophisticated technology (in particular, digital signatures created using hardware-based cryptography).
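To give a flavour of what that binding looks like, here is a minimal sketch of signing and verifying an audit record, using the open-source pyca/cryptography library for Python. It's an illustration only: the key pair is generated in software here, whereas a real deployment would create and hold the private key in tamper-resistant hardware (smart card, TPM or secure element), and the log entry shown is invented.

```python
# Minimal sketch: binding a user's action to a digital signature.
# The key pair is generated in software purely for illustration; a real
# deployment would keep the private key in tamper-resistant hardware.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

# Key pair notionally bound to one user at enrolment time.
private_key = ec.generate_private_key(ec.SECP256R1())
public_key = private_key.public_key()

# The audit record we want a lasting, tamper-resistant impression of.
log_entry = b"2015-10-09T14:02:11Z alice APPROVED payment #4211"

# Signed at the moment of the transaction, by the user's own key.
signature = private_key.sign(log_entry, ec.ECDSA(hashes.SHA256()))

# Anyone holding the (certified) public key can verify the record later.
public_key.verify(signature, log_entry, ec.ECDSA(hashes.SHA256()))
print("Original entry verified")

# Unlike a plain access log, a single altered byte breaks verification.
try:
    tampered = log_entry.replace(b"APPROVED", b"DECLINED")
    public_key.verify(signature, tampered, ec.ECDSA(hashes.SHA256()))
except InvalidSignature:
    print("Tampered entry rejected")
```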
On the other hand, working out looser associations between people and computers is the stock-in-trade of social networking operators and Big Data analysts. So many signals are emitted as a side effect of routine information processing today that even the shyest of users may be uncovered by third parties with sufficient analytics know-how and access to data.
So privacy is in peril. For the past two years, big data breaches have only got bigger: witness the losses at Target (110 million records), eBay (145 million), Home Depot (109 million) and JPMorgan Chase (83 million) to name a few. Breaches have got deeper, too. Most notably, in June 2015 the U.S. federal government’s Office of Personnel Management (OPM) revealed it had been hacked, with the loss of detailed background profiles on 15 million past and present employees.
I see a terrible systemic weakness in the standard practice of information security. Look at the OPM breach: what was going on that led to application forms for employees dating back 15 years remaining in a database accessible from the Internet? What was the real need for this availability? Instead of relying on firewalls and access policies to protect valuable data from attack, enterprises need to review which data needs to be online at all.
We urgently need to reduce the exposed attack surface of our information assets. But in the information age, the default has become to make data as available as possible. This liberality is driven both by the convenience of having all possible data on hand, just in case it might be handy one day, and by the plummeting cost of mass storage. But it's also the result of a technocratic culture that knows "knowledge is power," and gorges on data.
In communications theory, Metcalfe’s Law states that the value of a network is proportional to the square of the number of devices that are connected. This is an objective mathematical reality, but technocrats have transformed it into a moral imperative. Many think it axiomatic that good things come automatically from inter-connection and information sharing; that is, the more connection the better. Openness is an unexamined rallying call for both technology and society. “Publicness” advocate Jeff Jarvis wrote (admittedly provocatively) that: “The more public society is, the safer it is”. And so a sort of forced promiscuity is shaping up as the norm on the Internet of Things. We can call it "superconnectivity", with a nod to the special state of matter where electrical resistance drops to zero.
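For a concrete sense of the numbers behind that square law, here is a back-of-the-envelope sketch in plain Python (the device counts are arbitrary): the number of possible pairwise connections, n(n-1)/2, grows roughly as n squared, and the exposed attack surface compounds along with the notional value.

```python
# Back-of-the-envelope illustration of Metcalfe's Law: a network of n
# devices has n*(n-1)/2 possible pairwise connections, i.e. growth of
# order n squared. The same square law applies to the attack surface.
for n in (10, 100, 1_000, 1_000_000):
    connections = n * (n - 1) // 2
    print(f"{n:>9,} devices -> {connections:>17,} possible connections")
```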
In thinking about privacy on the IoT, a key question is this: how much of the data emitted from Internet-enabled devices will actually be personal data? If great care is not taken in the design of these systems, the unfortunate answer will be most of it.
My latest investigation into IoT privacy uses the example of the Internet connected motor car. "Rationing Identity on the Internet of Things" will be released soon by Constellation Research.
And don't forget Constellation's annual innovation summit, Connected Enterprise at Half Moon Bay outside San Francisco, November 4th-6th. Early bird registration closes soon.
The Biometrics Institute has received Australian government assistance to fund the next stage of the development of a new privacy Trust Mark. And Lockstep Consulting is again working with the Institute to bring this privacy initiative to fruition.
A detailed feasibility study was undertaken by Lockstep in the first half of 2015, involving numerous privacy advocates, regulators and vendors in Europe, the US, New Zealand and Australia.
We found strong demand for a reputable, non-trivial B2C biometrics certification.
Privacy advocates are generally supportive of a new Trust Mark, however they stress that a Trust Mark can be counter-productive if it is too easy to obtain, biased by industry interests, and/or poorly policed. There is general agreement that a credible trust mark should be non-trivial, and consequently, that the criteria be reasonably prescriptive. The reality of a strong Trust Mark is that not all architectures and solution instances will be compatible with the certification criteria.
The next stage of the Biometrics Institute project will deliver technical criteria for the award of the Trust Mark, and a PIA (Privacy Impact Assessment) template. A condition of the Trust Mark will be that a PIA is undertaken.
Please contact Steve Wilson at Lockstep firstname.lastname@example.org or Isabelle Moeller (Biometrics Institute CEO) email@example.com, if you'd like to receive further details of the Stage 1 findings, or would like to contribute to the technical research in Stage 2.
An unpublished letter to The New Yorker magazine, August 2015.
Kelefa Sanneh ("The Hell You Say", Aug 10 & 17) poses a question close to the heart of society’s analog-to-digital conversion: What is speech?
Internet policy makers worldwide are struggling with a recent European Court of Justice decision which grants some rights to individuals to have search engines like Google block results that are inaccurate, irrelevant or out of date. Colloquially known as the "Right To Be Forgotten" (RTBF), the ruling has raised the ire of many Americans in particular, who typically frame it as yet another attack on free speech. Better defined as a right to be de-listed, RTBF makes search providers consider the impact on individuals of search algorithms, alongside their commercial interests. For there should be no doubt – search is very big business. Google and its competitors use search to get to know people, so they can sell better advertising.
Search results are categorically not the sort of text which contributes to "democratic deliberation". Free speech may be many things but surely not the mechanical by-products of advertising processes. To protect search results as such mocks the First Amendment.
Some of my other RTBF thoughts:
- Search is not passive reproduction; Google makes the public domain more public.
- Google's deeply divided Advisory Council was strangely silent on the business nature of search.
- Search results are a special form of Big Data, and not the sort of thing that counts as speech.
On July 23, BlackBerry hosted its second annual Security Summit, once again in New York City. As with last year’s event, this was a relatively intimate gathering of analysts and IT journalists, brought together for the lowdown on BlackBerry’s security and privacy vision.
By his own account, CEO John Chen has met plenty of scepticism over his diverse and, some say, chaotic product and services portfolio. And yet it’s beginning to make sense. There is a strong, credible thread running through Chen’s initiatives. It all has to do with the Internet of Things.
Disclosure: I traveled to the Blackberry Security Summit as a guest of Blackberry, which covered my transport and accommodation.
The Growth Continues
In 2014, John Chen opened the show with the announcement he was buying the German voice encryption firm Secusmart. That acquisition appears to have gone well for all concerned; they say nobody has left the new organisation in the 12 months since. News of BlackBerry’s latest purchase - of crisis communications platform AtHoc - broke a few days before this year’s Summit, and it was only the most recent addition to the family. In the past 12 months, BlackBerry has been busy spending $150M on inorganic growth, picking up:
Chen has also overseen an additional $100M expenditure in the same timeframe on organic security expansion (over and above baseline product development). Amongst other things BlackBerry has:
The Growth Explained - Secure Mobile Communications
Executives from different business units and different technology horizontals all organised their presentations around what is now a comprehensive security product and services matrix. It looks like this (before adding AtHoc):
BlackBerry is striving to lead in Secure Mobile Communications. In that context the highlights of the Security Summit for mine were as follows.
The Internet of Things
BlackBerry’s special play is in the Internet of Things. It’s the consistent theme that runs through all their security investments, because as COO Marty Beard says, IoT involves a lot more than machine-to-machine communications. It’s more about how to extract meaningful data from unbelievable numbers of devices, with security and privacy. That is, IoT for BlackBerry is really a security-as-a-service play.
Chief Security Officer David Kleidermacher repeatedly stressed the looming challenge of “how to patch and upgrade devices at scale”.
- MyPOV: Functional upgrades for smart devices will of course be part and parcel of IoT, but at the same time, we need to work much harder to significantly reduce the need for reactive security patches. I foresee an angry consumer revolt if things that never were computers start to behave and fail like computers. A radically higher standard of quality and reliability is required. Just look at the Jeep Uconnect debacle, where it appears Chrysler eventually thought better of foisting a patch on car owners and instead opted for a much more expensive vehicle recall. It was BlackBerry’s commitment to ultra high reliability software that really caught my attention at the 2014 Security Summit, and it convinces me they grasp what’s going to be required to make ubiquitous computing properly seamless.
Refreshingly, COO Beard preferred to talk about the economic value of the IoT, rather than the bazillions of devices we are all getting a little jaded about. He said the IoT would entail some $4 trillion worth of technology within a decade, and that the global economic impact could be $11 trillion.
BlackBerry’s real time operating system QNX is in 50 million cars today.
AtHoc is a secure crisis communications service, with its roots in the first responder environment. It’s used by three million U.S. government workers today, and the company is now pushing into healthcare.
Founder and CEO Guy Miasnik explained that emergency communications involves more than just outbound alerts to people dealing with disasters. Critical to crisis management is the secure inbound collection of information from remote users. AtHoc is also not just about data transmission (as important as that is); it works at the application layer too, enabling sophisticated workflow management. This allows procedures, for example, to be defined for certain events, guiding sets of users and devices through expected responses, and escalating issues if things don’t get done as expected.
We heard more about BlackBerry’s collaboration with Oxford University on the Centre for High Assurance Computing Excellence, first announced in April at the RSA Conference. CHACE is concerned with a range of fundamental topics, including formal methods for verifying program correctness (an objective that resonates with BlackBerry’s secure operating system division QNX) and new security certification methodologies, with technical approaches based on the Common Criteria of ISO 15408 but with more agile administration to reduce that standard’s overhead and infamous rigidity.
CSO Kleidermacher announced that CHACE will work with the Diabetes Technology Society on a new healthcare security standards initiative. The need for improved medical device security was brought home vividly by an enthralling live demonstration of hacking a hospital drug infusion pump. These vulnerabilities have been exposed before at hacker conferences but BlackBerry’s demo was especially clear and informative, and crafted for a non-technical executive audience.
- MyPOV: The message needs to be broadcast loud and clear: there are life-critical machines in widespread use, built on commercial computing platforms, without any careful thought for security. It’s a shameful and intolerable situation.
I was impressed by BlackBerry’s privacy line. It's broader and more sophisticated than that of most security companies, going way beyond the obvious matters of encryption and VPNs. In particular, the firm champions identity plurality. For instance, WorkLife by BlackBerry, powered by Movirtu technology, realizes multiple identities on a single phone. BlackBerry is promoting this capability in the health sector especially, where there is rarely a clean separation of work and life for professionals. Chen said he wants to “separate work and private life”.
The health sector in general is one of the company’s two biggest business development priorities (the other being automotive). In addition to sophisticated telephony like virtual SIMs, they plan to extend AtHoc into healthcare messaging, and have tasked the CHACE think-tank with medical device security. These actions complement BlackBerry’s fine words about privacy.
So BlackBerry’s acquisition plan has gelled. It now has perhaps the best secure real time OS for smart devices, a hardened device-independent Mobile Device Management backbone, new data-centric privacy and rights management technology, remote certificate management, and multi-layered emergency communications services that can be diffused into mission-critical rules-based e-health settings and, eventually, automated M2M messaging. It’s a powerful portfolio that makes strong sense in the Internet of Things.
BlackBerry says IoT is 'much more than device-to-device'. It’s more important to be able to manage secure data being emitted from ubiquitous devices in enormous volumes, and to service those things – and their users – seamlessly. For BlackBerry, the Internet of Things is really all about the service.
The Australian government is to revamp the troubled Personally Controlled Electronic Health Record (PCEHR). In line with the Royle Review from Dec 2013, it is reported that patient participation is to change from the current Opt-In model to Opt-Out; see "Govt to make e-health records opt-out" by Paris Cowan, IT News.
That is to say, patient data from hospitals, general practice, pathology and pharmacy will be added by default to a central longitudinal health record, unless patients take steps (yet to be specified) to disable sharing.
The main reason for switching the consent model is simply to increase the take-up rate. But it's a much bigger change than many seem to realise.
The government is asking the community to trust it to hold essentially all medical records. Are the PCEHR's security and privacy safeguards up to scratch to take on this grave responsibility? I argue the answer is no, on two grounds.
Firstly there is the practical matter of the PCEHR's security performance to date. It's not good, based on publicly available information. On multiple occasions, prescription details have been uploaded from community pharmacy to the wrong patient's records. There have been a few excuses made for this error, with blame sheeted home to the pharmacy. But from a systems perspective – and health care is all about systems – you cannot pass the buck like that. Pharmacists are using a PCEHR system that was purportedly designed for them. And it was subject to system-wide threat & risk assessments that informed the architecture and design of not just the electronic records system but also the patient and healthcare provider identification modules. How can it be that the PCEHR allows such basic errors to occur?
Secondly and really fundamentally, you simply cannot invert the consent model as if it's a switch in the software. The privacy approach is deep in the DNA of the system. Not only must PCEHR security be demonstrably better than experience suggests, but it must be properly built in, not retrofitted.
Let me explain how the consent approach crops up deep in the architecture of something like PCEHR. During analysis and design, threat & risk assessments (TRAs) and privacy impact assessments (PIAs) are undertaken, to identify things that can go wrong, and to specify security and privacy controls. These controls generally comprise a mix of technology, policy and process mechanisms. For example, if there is a risk of patient data being sent to the wrong person or system, that risk can be mitigated a number of ways, including authentication, user interface design, encryption, contracts (that obligate receivers to act responsibly), and provider and patient information. The latter are important because, as we all should know, there is no such thing as perfect security. Mistakes are bound to happen.
One of the most fundamental privacy controls is participation. Individuals usually have the ultimate option of staying away from an information system if they (or their advocates) are not satisfied with the security and privacy arrangements. Now, these are complex matters to evaluate, and it's always best to assume that patients do not in fact have a complete understanding of the intricacies, the pros and cons, and the net risks. People need time and resources to come to grips with e-health records, so a default opt-in affords them that breathing space. And it errs on the side of caution, by requiring a conscious decision to participate. In stark contrast, a default opt-out policy embodies a position that the scheme operator believes it knows best, and is prepared to make the decision to participate on behalf of all individuals.
Such a position strikes many as beyond the pale, just on principle. But if opt-out is the adopted policy position, then clearly it has to be based on a risk assessment where the pros indisputably out-weigh the cons. And this is where making a late switch to opt-out is unconscionable.
You see, in an opt-in system, during analysis and design, whenever a risk is identified that cannot be managed down to negligible levels by way of technology and process, the ultimate safety net is that people don't have to use the PCEHR at all. Falling back on opt-in this way is a formal risk management ploy, a standard part of the risk manager's toolkit. In an opt-in system, patients sign an agreement in which they accept some risk. And the whole security design is predicated on that.
Look at the most recent PIA done on the PCEHR in 2011; section 9.1.6 "Proposed solutions - legislation" makes it clear that opt-in participation is core to the existing architecture. The PIA makes a "critical legislative recommendation" including:
- a number of measures to confirm and support the 'opt in' nature of the PCEHR for consumers (Recommendations 4.1 to 4.3) [and] preventing any extension of the scope of the system, or any change to the 'opt in' nature of the PCEHR.
The PIA at section 2.2 also stresses that a "key design feature of the PCEHR System ... is opt in – if a consumer or healthcare provider wants to participate, they need to register with the system." And that the PCEHR is "not compulsory – both consumers and healthcare providers choose whether or not to participate".
A PDF copy of the PIA report, which was publicly available at the Dept of Health website for a few years after 2011, is archived here.
The fact is that if the government changes the PCEHR from opt-in to opt-out, it will invalidate the security and privacy assessments done to date. The PIAs and TRAs will have to be repeated, and the project must be prepared for major redesign.
The Royle Review report (PDF) did in fact recommend "a technical assessment and change management plan for an opt-out model ..." (Recommendation 14) but I am not aware that such a review has taken place.
To look at the seriousness of this another way, think about "Privacy by Design", the philosophy that's being steadily adopted across government. In 2014 NEHTA wrote in a submission (PDF) to the Australian Privacy Commissioner:
- The principle that entities should employ “privacy by design” by building privacy into their processes, systems, products and initiatives at the design stage is strongly supported by NEHTA. The early consideration of privacy in any endeavour ensures that the end product is not only compliant but meets the expectations of stakeholders.
One of the tenets of Privacy by Design is that you cannot bolt on privacy after a design is done. Privacy must be designed into the fabric of any system from the outset. All the way along, PCEHR has assumed opt-in, and the last PIA enshrined that position.
If the government were to ignore its own Privacy by Design credo, and not revisit the PCEHR architecture, it would be an amazing breach of the public's trust in the healthcare system.
Ray Wang tells us that writing a book and launching a company are incredibly fulfilling things to do – but ideally, not at the same time. He thought it would take a year to write "Disrupting Digital Business", but since it overlapped with building Constellation Research, it took three! Then again, his book is all the richer for that experience.
Ray is on a world-wide book tour (tweeting under the hash tag #cxotour). I was thrilled to participate in the Melbourne leg last week. We convened a dinner at Melbourne restaurant The Deck and were joined by a good cross section of Australian private and public sector businesses. There were current and recent executives from Energy Australia, Rio Tinto, the Victorian Government and Australia Post among others, plus the founders of several exciting local start-ups. And we were lucky to have special guests Brian Katz and Ben Robbins - two renowned mobility gurus.
The format for all the launch events includes one or two short topical speeches from Constellation analysts and Associates, plus a fireside chat with Ray. In Melbourne, we were joined by two of Australia's deep digital economy experts, Gavin Heaton and Joanne Jacobs. Gavin got us going on the night, surveying the importance of innovation, and the double-edged opportunities and threats of digital disruption.
Then Ray spoke off-the-cuff about his book, summarising years of technology research and analysis, and a great many cases of business disruption, old and new. Ray has an encyclopedic grasp of tech-driven successes and failures going back decades, yet his presentations are always up-to-the-minute and full of practical can-do calls to action. He's hugely engaging, and having him on a small stage for a change lets him have a real conversation with the audience.
Speaking with no notes and PowerPoint-free, Ray ranged across all sorts of disruptions in all sorts of sectors, including:
- Sony's double cassette Walkman (which Ray argues playfully was their "last innovation")
- Coca Cola going digital, and the speculative "ten cent sip"
- the real lesson of the iPhone: geeks spend time arguing about whether Apple's technology is original or appropriated, when the point is their phone disrupted 20 or more other business models
- the contrasting Boeing 787 Dreamliner and Airbus A380 mega jumbo – radically different ways to maximise the one thing that matters to airlines: dollars per passenger-mile, and
- Uber, which observers don't always fully comprehend as a rich mix of mobility, cloud and Big Data.
And I closed the scheduled part of the evening with a provocation on privacy. I asked the group to think about what it means to call any online business practice "creepy". Have community norms and standards really changed in the move online? What's worse: government surveillance for political ends, or private sector surveillance for profit? If we pay for free online services with our personal information, do regular consumers understand the bargain? And if cynics have been asking "Is Privacy Dead?" for over 100 years, doesn't it mean the question is purely rhetorical? Who amongst us truly wants privacy to be over?!
The discussion quickly attained a life of its own - muscular, but civilized. And it provided ample proof that whatever you think about privacy, it is complicated and surprising, and definitely disruptive! (For people who want to dig further into the paradoxes of modern digital privacy, Ray and I recently recorded a nice long chat about it).
Here are some of the Digital Disruption tour dates coming up:
Acknowledgement: Daniel Barth-Jones kindly engaged with me after this blog was initially published, and pointed out several significant factual errors, for which I am grateful.
In 2014, the New York City Taxi & Limousine Commission (TLC) released a large "anonymised" dataset containing 173 million taxi rides taken in 2013. Soon after, software developer Vijay Pandurangan managed to reverse the hashed taxi registration numbers. Subsequently, privacy researcher Anthony Tockar went on to combine public photos of celebrities getting in or out of cabs, to recreate their trips. See Anna Johnston's analysis here.
This re-identification demonstration has been used by some to bolster a general claim that anonymity online is increasingly impossible.
On the other hand, medical research advocates like Columbia University epidemiologist Daniel Barth-Jones argue that the practice of de-identification can be robust and should not be dismissed as impractical on the basis of demonstrations such as this. The identifiability of celebrities in these sorts of datasets is a statistical anomaly, reasons Barth-Jones, and should not be used to frighten regular people out of participating in medical research on anonymised data. He wrote in a blog that:
- "However, it would hopefully be clear that examining a miniscule proportion of cases from a population of 173 million rides couldn’t possibly form any meaningful basis of evidence for broad assertions about the risks that taxi-riders might face from such a data release (at least with the taxi medallion/license data removed as will now be the practice for FOIL request data)."
As a health researcher, Barth-Jones is understandably worried that re-identification of small proportions of special cases is being used to exaggerate the risks to ordinary people. He says that the HIPAA de-identification protocols, if properly applied, leave no significant risk of re-identification. But even if that's the case, HIPAA processes are not applied to data across the board. The TLC data was described as "de-identified", and the fact that any people at all (even stand-out celebrities) could be re-identified from it does create a broad basis for concern – "de-identified" is not what it seems. Barth-Jones stresses that in the TLC case, the de-identification was fatally flawed [technically: it's no use hashing data like registration numbers with limited value ranges, because the hashed values can be reversed by brute force] but my point is this: who among us can tell the difference between poorly de-identified and "properly" de-identified?
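To make that bracketed point concrete, here is a minimal Python sketch of such a brute-force reversal. The four-character number format used below is a simplified stand-in for the real medallion formats (an assumption for illustration only); the principle is that any identifier drawn from a small, known value space can be recovered from its hash by simply hashing every candidate.

```python
# Sketch of why hashing a small identifier space is not de-identification.
# The format here (digit, letter, digit, digit) is a simplified stand-in
# for real medallion formats; any identifier with a small, known value
# space can be recovered from its hash by enumeration.
import hashlib
from itertools import product

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

# A "de-identified" value as released in the dataset.
published_hash = md5_hex("7A42")  # pretend we only ever see the hash

# Build a reverse lookup table over the whole space of valid numbers:
# 10 * 26 * 10 * 10 = 26,000 candidates -- trivial to hash them all.
digits, letters = "0123456789", "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lookup = {
    md5_hex("".join(chars)): "".join(chars)
    for chars in product(digits, letters, digits, digits)
}

print(lookup[published_hash])  # -> 7A42: the "anonymous" hash is undone
```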
And how long can "properly de-identified" last? What does it mean to say casually that only a "minuscule proportion" of data can be re-identified? In this case, the re-identification of celebrities was helped by the fact that lots of photos of them are readily available on social media; but with so many photos of everyone now in the public domain, regular people are becoming easier to identify too.
But my purpose here is not to play what-if games, and I know Daniel advocates statistically rigorous measures of identifiability. We agree on that -- in fact, over the years, we have agreed on most things. The point I am trying to make in this blog post is that, just as nobody should exaggerate the risk of re-identification, nor should anyone play it down. Claims of de-identification are made almost daily for really crucial datasets, like compulsorily retained metadata, public health data, biometric templates, social media activity used for advertising, and web searches. Some of these claims are made with statistical rigor, using formal standards like the HIPAA protocols; but other times the claim is casual, made with no qualification, with the aim of comforting end users.
"De-identified" is a helluva promise to make, with far-reaching ramifications. Daniel says de-identification researchers use the term with caution, knowing there are technical qualifications around the finite probability of individuals remaining identifiable. But my position is that the fine print doesn't translate to the general public who only hear that a database is "anonymous". So I am afraid the term "de-identified" is meaningless outside academia, and in casual use is misleading.
Barth-Jones objects to the conclusion that "it's virtually impossible to anonymise large data sets" but in an absolute sense, that claim is surely true. If any proportion of people in a dataset may be identified, then that data set is plainly not "anonymous". Moreover, as statistics and mathematical techniques (like facial recognition) improve, and as more ancillary datasets (like social media photos) become accessible, the proportion of individuals who may be re-identified will keep going up. [Readers who wish to pursue these matters further should look at the recent Harvard Law School online symposium on "Re-identification Demonstrations", hosted by Michelle Meyer, in which Daniel Barth-Jones and I participated, among many others.]
Both sides of this vexed debate need more nuance. Privacy advocates have no wish to quell medical research per se, nor do they call for absolute privacy guarantees, but we do seek full disclosure of the risks, so that the cost-benefit equation is understood by all. One of the obvious lessons in all this is that "anonymous" or "de-identified" on their own are poor descriptions. We need tools that meaningfully describe the probability of re-identification. If statisticians and medical researchers take "de-identified" to mean "there is an acceptably small probability, namely X percent, of identification" then let's have that fine print. Absent the detail, lay people can be forgiven for thinking re-identification isn't going to happen. Period.
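As a sketch of what that fine print could look like, here is a toy illustration using k-anonymity, one well-known (and imperfect) formal measure of identifiability; the records and quasi-identifiers below are invented for the example, and real measures would need to account for background knowledge and ancillary datasets as well.

```python
# Toy sketch of k-anonymity, one (imperfect) way to put a number on
# re-identification risk. For each combination of quasi-identifiers we
# count how many records share it; the worst-case chance of singling
# out an individual within a class of k matching records is 1/k.
# The records below are invented for the example.
from collections import Counter

records = [
    {"postcode": "2000", "year_of_birth": 1975, "sex": "F"},
    {"postcode": "2000", "year_of_birth": 1975, "sex": "F"},
    {"postcode": "2000", "year_of_birth": 1975, "sex": "F"},
    {"postcode": "3141", "year_of_birth": 1982, "sex": "M"},  # unique
]
quasi_identifiers = ("postcode", "year_of_birth", "sex")

classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
k = min(classes.values())

print(f"Dataset is {k}-anonymous; worst-case re-identification "
      f"probability is {1 / k:.0%}")  # k = 1 here: 100%, i.e. identified
```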
And we need policy and regulatory mechanisms to curb inappropriate re-identification. Anonymity is a brittle, essentially temporary, and inadequate privacy tool.
I argue that the act of re-identification ought to be treated as an act of Algorithmic Collection of PII, and regulated as just another type of collection, albeit an indirect one. If a statistical process results in a person's name being added to a hitherto anonymous record in a database, it is as if the data custodian went to a third party and asked them "do you know the name of the person this record is about?". The fact that the data custodian was clever enough to avoid having to ask anyone about the identity of people in the re-identified dataset does not alter the privacy responsibilities arising. If the effect of an action is to convert anonymous data into personally identifiable information (PII), then that action collects PII. And in most places around the world, any collection of PII automatically falls under privacy regulations.
It looks like we will never guarantee anonymity, but the good news is that for privacy, we don't actually need to. Privacy is the protection you need when your affairs are not anonymous, for privacy is a regulated state where organisations that have knowledge about you are restrained in what they do with it. Equally, the ability to de-anonymise should be restricted in accordance with orthodox privacy regulations. If a party chooses to re-identify people in an ostensibly de-identified dataset, without a good reason and without consent, then that party may be in breach of data privacy laws, just as they would be if they collected the same PII by conventional means like questionnaires or surveillance.
Surely we can all agree that re-identification demonstrations serve to shine a light on the comforting claims made by governments for instance that certain citizen datasets can be anonymised. In Australia, the government is now implementing telecommunications metadata retention laws, in the interests of national security; the metadata we are told is de-identified and "secure". In the UK, the National Health Service plans to make de-identified patient data available to researchers. Whatever the merits of data mining in diverse fields like law enforcement and medical research, my point is that any government's claims of anonymisation must be treated critically (if not skeptically), and subjected to strenuous and ongoing privacy impact assessment.
Privacy, like security, can never be perfect. Privacy advocates must avoid giving the impression that they seek unrealistic guarantees of anonymity. There must be more to privacy than identity obscuration (to use a more technically correct term than "de-identification"). Medical research should proceed on the basis of reasonable risks being taken in return for beneficial outcomes, with strong sanctions against abuses including unwarranted re-identification. And then there wouldn't need to be a moral panic over re-identification if and when it does occur, because anonymity, while highly desirable, is not essential for privacy in any case.