I gave a short speech at the launch of Australian Privacy Awareness Week #2013PAW on April 29. This is an edited version of my speaking notes.
What does privacy mean to technologists?
I'm a technologist who stumbled into privacy. Some 12 years ago I was doing a big security review at a utility company. Part of their policy document set was a privacy statement posted on the company's website. I was asked to check it out. It said things like 'We the company collect the following information about you [the customer] ... If you ever want a copy of the information we have about you, please call the Privacy Officer ...'. I had a hunch this was problematic, so I took it to the chief IT architect. He had never seen the statement before, and advised there was no way they could readily furnish complete customer details, for their CRM databases were all over the place.
Clearly there was a lot going on in privacy that we technologists needed to know. So with an inquiring mind, I read the Privacy Act. And I was amazed by what I found. In fact I wrote a paper in 2003 about the ramifications for IT of the 10 National Privacy Principles, and that kicked off my privacy sub-career.
Ever since, I've found time and again a shortfall in the understanding that technologists as a class have of data privacy. There is a gap between technology and the law. IT professionals may receive privacy training, but as soon as they hear the well-meaning slogan "Privacy Is Not A Technology Issue" they tend to say 'thank god: that's one thing I don't need to worry about'. Conversely, privacy laws are written with some naivety about how information flows in modern IT and how it aggregates automatically in standard computer systems. For instance, several clauses in Australian privacy law refer expressly to making 'annotations' in the 'records', as if they're all paper based, with wide margins.
The gap is perpetuated to some extent by the popular impression that the law has not kept up with the march of technology. As a technologist, I have to say I am not cynical about the law; I actually find that principles-based data privacy law anticipates almost all of the current controversies in cyberspace (though not quite all, as we shall see).
So let's look at a couple of simple technicalities that technologists don't often comprehend.
What Privacy Law actually says
Firstly there is the very definition of Personal Information. Lay people and engineers tend to intuit that Personal Information [or equivalently what is known in the US as Personally Identifiable Information] is the stuff of forms and questionnaires and call centres. So technologists can be surprised that the definition of Personal Information covers a great deal more. Look at the definition from the Australian federal Privacy Act:
Information or an opinion, whether true or not, about an individual whose identity is apparent, or can reasonably be ascertained, from the information or opinion.
So if metadata or event logs in a computer system are personally identifiable, then they constitute Personal Information, even if this data has been completely untouched by human hands.
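To make the point concrete, here is a minimal sketch in Python. The log format and the values are hypothetical, but the shape is that of any ordinary web server access log: no human ever reads most of these lines, yet each one records an identifiable person's behaviour, and so constitutes Personal Information under the definition above.

```python
import re

# A typical web server access log entry (hypothetical format and values).
log_line = ('10.1.2.3 - alice.smith [29/Apr/2013:10:02:11 +1000] '
            '"GET /account/statements HTTP/1.1" 200 5120')

pattern = re.compile(
    r'(?P<ip>\S+) - (?P<user>\S+) \[(?P<when>[^\]]+)\] "(?P<request>[^"]+)"')
m = pattern.match(log_line)

# The authenticated username makes this record Personal Information:
# the individual's identity is reasonably ascertainable from the log itself,
# even though the line was written and stored entirely by machines.
record = m.groupdict()
print(record["user"])      # alice.smith
print(record["request"])   # GET /account/statements HTTP/1.1
```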
Then there is the crucial matter of collection. Our privacy legislation, like that of most OECD countries, is technology neutral with regard to the manner of collection of Personal Information. Indeed, the term "collection" is not defined in the Privacy Act; the word is used in its plain English sense. So if Personal Information has wound up in an information system, it doesn't matter whether it was gathered directly from the individual concerned, imported, found in the public domain, or generated almost from scratch by some algorithm: the Personal Information has been collected, and as such is covered by the Collection Principle of the Privacy Act. That is to say:
An organisation must not collect Personal Information unless the information is necessary for one or more of its functions or activities.
Editorial Note: One of the core differences between most international privacy law and the American environment is that there is no Collection Limitation in the Fair Information Practice Principles (FIPPs). The OECD approach tries to head privacy violations "off at the pass" by discouraging collection of PII if it is not expressly needed, but in the US business sector there is no such inhibition.
Now let's look at some of the missteps that have resulted from technologists accidentally overlooking these technicalities (or perhaps technocrats more deliberately ignoring them).
1. Google StreetView Wi-Fi collection
Google StreetView cars collect Wi-Fi hub coordinates (as landmarks for Google's geo-location services). On their own, Wi-Fi locations are unidentified, but it was found that the StreetView software was also inadvertently collecting Wi-Fi network traffic, some of which contained Personal Information (like user names and even passwords). The Australian and Dutch Privacy Commissioners found Google to be in breach of their respective data protection laws.
Many technologists, I found, argued that Wi-Fi data in the "public domain" is not private, and "by definition" (so they liked to say) categorically could not be private. Therefore they believed Google was within its rights to do whatever it liked with such data. But the argument fails to grasp the technicality that our privacy laws basically do not distinguish public from "private". In fact the words "public" and "private" are not operable in the Privacy Act (which is really more of a data protection law). If data is identifiable, then privacy sanctions attach to it.
The lesson for Big Data privacy is this: it doesn't much matter if Personal Information is sourced from the public domain: you are still subject to Collection and Use Limitation principles (among others) once it is in your custody.
2. Facebook facial recognition
Facebook photo tagging creates biometric templates used to subsequently generate tag suggestions. Before displaying suggestions, Facebook's facial recognition algorithms run in the background over all photo albums. When they make a putative match and record a deduced name against a hitherto anonymous piece of image data, the Facebook system has collected Personal Information.
European privacy regulators in mid 2012 found biometric data collection without consent to be a serious breach, and by late 2012 had forced Facebook to shut down facial recognition and tag suggestions in the EU. This was quite a show of force over one of the most powerful companies of the digital age.
The lesson for Big Data privacy is this: it doesn't much matter if you generate Personal Information almost out of thin air, using sophisticated data processing algorithms: you are still subject to Privacy Principles, such as Openness as well as Collection and Use Limitation.
3. Target's pregnancy predictions
The department store Target in the US was found by New York Times investigative journalists to be experimenting with statistical methods for identifying that a regular customer is likely to be pregnant, by looking for trends in her buying habits. Retail strategists are keen to win the loyalty of pregnant women so as to secure their lucrative business through the expensive early years of parenting.
There are all sorts of issues here. One technicality I wish to draw out is that in Australia, the privacy implications would be amplified by the fact that tagging someone in a database as pregnant [even if that prediction is wrong!] creates health information, and therefore represents a collection of Sensitive Information. Express informed consent is required in advance of collecting Sensitive Information. So if Australian stores want to use Big Data techniques, they may need to disclose to their customers up front that health information might be extracted by mining their buying habits, and obtain express consent for the algorithms to run. Remember Australia sets a low bar for privacy breaches: simply collecting Sensitive Personal Information may be a breach even before it is used for anything or disclosed.
Note also there is already a latent problem in Australia for grocery stores that sell medicinals online, and this has nothing to do with Big Data. St Johns Wort for example may seem innocuous but it indicates that a customer has (or believes they have) depression. IT security managers might not have thought about the implications of logging mental health information in ordinary old web servers and databases.
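A toy sketch shows how easily this happens. The product names and the condition mapping below are assumptions for illustration only; the point is that an ordinary order-logging routine collects health (i.e. Sensitive) information the moment the row is written, with no data mining needed.

```python
# Hypothetical mapping from innocuous-looking products to the health
# conditions they imply. Any real list would be far longer.
HEALTH_INDICATORS = {
    "st johns wort": "depression",
    "glucose test strips": "diabetes",
    "nicotine patches": "smoking cessation",
}

def sensitivity_of(order_items):
    """Return the health conditions implied by a customer's basket."""
    implied = []
    for item in order_items:
        condition = HEALTH_INDICATORS.get(item.lower())
        if condition:
            implied.append(condition)
    return implied

basket = ["Milk", "St Johns Wort", "Bread"]
print(sensitivity_of(basket))  # ['depression'] -- health information,
# collected as soon as the order is logged, before any use or disclosure.
```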
4. "DNA Hacking"
In February this year, research was published in which a subset of anonymous donors to a DNA research program in the UK were identified by cross-matching genes to data in US-based public genealogy databases. All of a sudden, the ethics of re-identifying genetic material has become a red-hot topic. Much attention is focusing on the nature of the informed consent; different initiatives (like the Personal Genome Project and 1,000 Genomes) give different levels of comfort about the possibility of re-identification. Absolute anonymity is typically disclaimed, but donors in some projects are reassured that re-identification will be 'difficult'.
But regardless of the consent given by a Subject (1st party) to a researcher (2nd party), a nice legal problem arises when a separate 3rd party takes anonymous data and re-identifies it without consent. Technically, the 3rd party has collected Personal Information, as per the principles discussed above, and that may require consent under privacy laws. Following on from the European facial recognition precedent, I contend that re-identification of DNA without consent is likely to be ruled problematic (if not unlawful) in some jurisdictions. And it is therefore unethical in all fair-minded jurisdictions.
Big Data's big challenge
So principles-based data protection laws have proven very powerful in the cases of Google's StreetView Wi-Fi collection and Facebook's facial recognition (even though these scenarios could not have been envisaged with any precision 30 years ago when the OECD privacy principles were formulated). And they seem to neatly govern DNA re-identification and data mining for health information, insofar as we can foresee how these activities may conflict with legislated principles and might therefore be brought to book. But there is one area where our data privacy principles may struggle to cope with Big Data: openness.
Orthodox privacy management involves telling individuals What information is collected about them, Why it is needed, When it is collected, and How. But with Big Data, even if a company wants to be completely transparent, it may not know what Personal Information lies waiting to be mined and discovered in the data, nor when exactly this discovery might be done.
An underlying theme in Big Data business models is data mining, or perhaps more accurately, data refining, as shown in the diagram here. An increasing array of data processing techniques are applied to vast stores of raw information (like image data in the example) to extract metadata and increasingly valuable knowledge.
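The refining idea can be sketched in a few lines of Python. The stage names, fields and the person's name are all illustrative assumptions; what matters is that each successive stage adds value, and at some point an anonymous record quietly becomes Personal Information.

```python
# A minimal sketch of "data refining": successive processing stages turn
# raw, anonymous records into increasingly valuable -- and increasingly
# identifiable -- knowledge.

def extract_metadata(raw_photo):
    # First refinement: EXIF-style metadata -- time and place, no name yet.
    return {"timestamp": raw_photo["taken"], "gps": raw_photo["gps"]}

def link_identity(metadata, face_match):
    # A later stage adds a deduced name; at this point the record
    # becomes Personal Information, i.e. a fresh act of collection.
    refined = dict(metadata)
    refined["person"] = face_match
    return refined

raw = {"taken": "2013-04-29T10:00", "gps": (-33.86, 151.21), "pixels": "..."}
meta = extract_metadata(raw)
pii = link_identity(meta, face_match="Alice Example")
print("person" in meta, "person" in pii)  # False True
```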
There is nothing intrinsically wrong with a business model that extracts value from raw information, even if it converts anonymous data into Personal Information. But the privacy promise enshrined in OECD data protection laws – namely to be open with individuals about what is known about them and why – can become hard to honour.
There is a bargain at the heart of most social media companies today, in which Personal Information is traded for a rich array of free services. The bargain is opaque; the "infomopolies" are coy about the value they attach to the Personal Information of their members.
If Online Social Networks were more open about their business models, I think it likely that most members would still be happy with the bargain. After all, Google, Facebook, Twitter et al have become indispensable for many of us. They do deliver fantastic value. But the Personal Information trade needs to be transparent.
"Big Privacy" Principles
In conclusion, I offer some expanded principles for protecting privacy in Big Data.
Exercise restraint: More than ever, remember that privacy is essentially about restraint. If a business knows me, then privacy means simply that the business is restrained in how it uses that knowledge.
Meta transparency: We're at the very start of the Big Data age. Who knows what lies ahead? Meta transparency means not only being open about what Personal Information is collected and why, but also being open about the business model and the emerging tools.
Engage customers in a fair value deal: Most savvy digital citizens appreciate there is no such thing as a free lunch; they already know at some level that "free" digital services are paid for by trading Personal Information. Many netizens have learned already to manage their own privacy in an ad hoc way, for instance obfuscating or manipulating the personal details they divulge. Ultimately consumers and businesses alike will do better by engaging in a real deal that sets out how PI is truly valued and leveraged.
Many of the identerati campaign on Twitter and on the blogosphere for a federated new order, where banks in particular should be able to deal with new customers based on those customers' previous registrations with other banks. Why, they ask, should a bank put you through all that identity proofing palaver when you must have already passed muster at any number of banks before? Why can't your new bank pick up the fact that you've been identified already? The plea to federate makes a lot of sense, but as I've argued previously, the devil is in the legals.
Funnily enough, a clue as to the nature of this problem is contained in the disclaimers on many of the identerati’s blogs and Twitter accounts:
"These are personal opinions only and do not reflect the position of my employer".
Come on. We all know that’s bullshit.
The bloggers I’m talking about are thought leaders at their employers. Many of them have written the book on identity. They're chairing the think tanks. What they say goes! So their blogs do in fact reflect very closely what their employers think.
So why the disclaimer? It's a legal technicality. A company’s lawyers do not want the firm held liable for the consequences of a random reader following an opinion provided outside the very tightly controlled conditions of a consulting contract; the lawyers do not want any remarks in a blog to be taken as advice.
And it's the same with federated identity. Accepting another bank's identification of an individual is something that cannot be done casually. Regardless of the common sense embodied in federated identity, the banks’ lawyers are saying to all institutions, sure, we know you're all putting customers through the same identity proofing protocols, but unless there is a contract in place, you must not rely on another bank's process; you have to do it yourself.
Now, there is a way to chip away at the tall walls of legal habit. This is going to sound a bit semantic, but we are talking about legal technicalities here, and semantics is the name of the game. Instead of Bank X representing to Bank Y that X can provide the "Identity" of a new customer, Bank X could provide a digitally notarised copy of some of the elements of the identity proofing. Elements could be provided as digitally signed messages saying "Here's a copy of Steve’s gas bill" or "Here's a copy of Steve’s birth certificate which we have previously verified". We could all stop messing around with abstract identities (which in the fine print mean different things to different Relying Parties) and instead drop down a level and exchange information about verified claims, or "identity assertions". Individual RPs could then pull together the elements of identity they need, add them up to an identification fit for their own purpose, and avoid the implications of having third parties "provide identity". The semantics would be easier if we only sought to provide elements of identity. All IdPs could be simplified and streamlined as Attribute Providers.
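The idea can be sketched in code. This is a deliberately simplified illustration: the shared secret, the claim fields and the issuer names are all assumptions, and a real scheme would use public-key signatures and a standard token format rather than a shared HMAC key. But it shows the shape of an "identity assertion": Bank X signs a verifiable statement about one element of its identity proofing, which a Relying Party can check and combine with others.

```python
import hashlib
import hmac
import json

BANK_X_KEY = b"demo-shared-secret"   # assumption: stand-in for real key material

def sign_claim(claim: dict, key: bytes) -> dict:
    """Bank X notarises one element of identity proofing as a signed claim."""
    payload = json.dumps(claim, sort_keys=True).encode()
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "sig": sig}

def verify_claim(assertion: dict, key: bytes) -> bool:
    """A Relying Party checks the claim is intact before relying on it."""
    payload = json.dumps(assertion["claim"], sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, assertion["sig"])

assertion = sign_claim(
    {"issuer": "Bank X", "subject": "Steve",
     "statement": "utility bill sighted and verified",
     "verified_on": "2013-04-01"},
    BANK_X_KEY,
)
print(verify_claim(assertion, BANK_X_KEY))  # True
```

Note that nothing here "provides identity": the assertion is just one verified element, and it is the Relying Party that decides what to make of it.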
In my recent post "Identity is in the eye of the beholder" I tried to unpack the language of "identity provision". I argued that IdPs do not and cannot "provide identity" because identification is carried out by Relying Parties. It may seem like a sterile view in these days of "self-narrated" and bring-your-own identities, but I think the truth is that identity is actually determined by Relying Parties. The state of being "identified" may be assisted (to a very great extent) by information provided by others, including so-called "Identity" Providers, but ultimately it is the RP that identifies me.
I note that the long standing dramaturgical analysis of social identity of Erving Goffman actually says the same thing, albeit in a softer way. That school of thought holds that identity is an emergent property, formed by the way we think others see us. In a social setting there are in effect many Relying Parties, all impressing upon us their sense of who we are. We reach an equilibrium over time, after negotiating all the different interrelating roles in the play of life. And the equilibrium can be starkly disrupted in what I've called the "High School Reunion Effect". So we do not actually curate our own identities with complete self-determination, but rather we allow our identities to be moulded dynamically to fit the expectations of those around us.
Now, in the digital realm, things are so much simpler, you might even say more elegant in an engineering fashion. I'd like to think that the dramaturgical frame sets a precedent for having identities impressed upon us. We should not take offense at this, and we should temper what we mean by "user centric" identities: it need not mean freely expressing all of our identities.
For more precision, maybe it would be useful to get into the habit of specifying the context whenever we talk of a Digital Identity. So here's a bit of mathematical nomenclature, but don't worry, it's not strenuous!
Let's designate the identification performed by a Relying Party RP on a Subject S as IRP-S.
If the RP has drawn on information provided by an "Identity Provider" (running with the dominant language for now), then we can write the identification as a function of the IdP:
Identification = IRP-S(IdP)
But it is still true that the state of identification is reached by the RP and not the IdP.
We can generalise from this to imagine Relying Parties using more than one IdP in making the identification of a subject:
Identification = IRP-S(IdP1,IdP2)
And then we could take things one step further, to recognise that the distinction between "identity providers" and "attribute providers" is arbitrary. So the most general formulation would show identification being a function of a number of attributes verified by the RP either for itself or on its behalf by external attribute providers:
Identification = IRP-S(A1,A2,...,An)
(where the source of the attribute information could be indicated in various ways).
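The formulation can be rendered as a tiny sketch in Python. The attribute names, sources and the simple all-required rule are illustrative assumptions; the point being made is structural: the identification function belongs to the RP, which decides for itself when the verified attributes it holds add up to an identification fit for its own purpose.

```python
# Sketch of Identification = I_RP-S(A1, A2, ..., An): the Relying Party,
# not any "Identity Provider", evaluates the attributes it has gathered.

def identify(verified_attributes, required):
    """The RP's own identification function over attributes from any source."""
    held = {a["name"] for a in verified_attributes}
    # Identified if and only if every attribute this RP requires is held.
    return required <= held

attributes = [
    {"name": "name",          "source": "Bank X"},
    {"name": "date_of_birth", "source": "Bank X"},
    {"name": "address",       "source": "utility bill"},
]

# Each RP sets its own bar for the same Subject:
print(identify(attributes, {"name", "address"}))                 # True
print(identify(attributes, {"name", "address", "passport_no"}))  # False
```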
The work we're trying to start in Australia on a Claims Verification ecosystem reflects this kind of thinking -- it may be more powerful and more practicable to have RPs assemble their knowledge of Subjects from a variety of sources.
"Here's your cool new identifier! It's so easy to use. No passwords to forget, no cards to leave behind. And it's so high tech, you're going to be able to use it for everything eventually: payments, banking, e-health, the Interwebs, unlocking your house or office, starting your car!
"Oh, one thing though, there are some little clues about your identifier around the place. Some clues are on the ATM, some clues are in Facebook and others in Siri. There may be a few in your trash. But it's nothing to worry about. It's hard for hackers to decipher the clues. Really quite hard.
"What's that you say? What if some hacker does figure out the puzzle? Gosh, um, we're not exactly sure, but we got some guys doing their PhDs on that issue. Sorry? Will we give you a new identifier in the meantime? Well, no actually, we can't do that right now. Ok, no other questions? Cool!
No it doesn't, it only means the end of anonymity.
Anonymity is not the same thing as privacy. Anonymity keeps people from knowing what you're doing, and it's a vitally important quality in many settings. But in general we want people (at least some people) to know what we're up to, so long as they respect that knowledge. That's what privacy is all about. Anonymity is a terribly blunt instrument for protecting privacy, and it's also fragile. If anonymity is all you have, then you're in deep trouble when someone manages to defeat it.
New information technologies have clearly made anonymity more difficult, yet it does not follow that we must lose our privacy. Instead, these developments bring into stark relief the need for stronger regulatory controls that compel restraint in the way third parties deal with Personal Information that comes into their possession.
A great example is Facebook's use of facial recognition. When Facebook members innocently tag one another in photos, Facebook creates biometric templates with which it then automatically processes all photo data (previously anonymous), looking for matches. This is how they can create tag suggestions, but Facebook is notoriously silent on what other applications it has for facial recognition. Now and then we get a hint, with, for example, news of the Facedeals start up last year. Facedeals accesses Facebook's templates (under conditions that remain unclear) and uses them to spot customers as they enter a store to automatically check them in. It's classic social technology: kinda sexy, kinda creepy, but clearly in breach of Collection, Use and Disclosure privacy principles.
And indeed, European regulators have found that Facebook's facial recognition program is unlawful. The chief problem is that Facebook never properly disclosed to members what goes on when they tag one another, and they never sought consent to create biometric templates with which to subsequently identify people throughout their vast image stockpiles. Facebook has been forced to shut down their facial recognition operations in Europe, and they've destroyed their historical biometric data.
So privacy regulators in many parts of the world have real teeth. They have proven that re-identification of anonymous data by facial recognition is unlawful, and they have managed to stop a very big and powerful company from doing it.
This is how we should look at the implications of the DNA 'hacking'. Indeed, Melissa Gymrek from the Whitehead Institute said in an interview: "I think we really need to learn to deal with the fact that we cannot ever make data sets truly anonymous, and that I think the key will be in regulating how we are allowed to use this genetic data to prevent it from being used maliciously."
Perhaps this episode will bring even more attention to the problem in the USA, and further embolden regulators to enact broader privacy protections there. Perhaps the very extremeness of the DNA hacking does not spell the end of privacy so much as its beginning.
Biometrics seems to be going gang busters in the developing world. I fear we're seeing a new wave of technological imperialism. In this post I will examine whether the biometrics field is mature enough for the lofty social goal of empowering the world's poor and disadvantaged with "identity".
The independent Center for Global Development has released a report "Identification for Development: The Biometrics Revolution" which looks at 160 different identity programs using biometric technologies. By and large, it's a study of the vital social benefits to poor and disadvantaged peoples when they gain an official identity and are able to participate more fully in their countries and their markets.
The CGD report covers some of the kinks in how biometrics work in the real world, like the fact that a minority of people are unable to enrol and subsequently need to be treated carefully and fairly. But I feel the report takes biometric technology for granted. In contrast, independent experts have shown there is insufficient science for biometric performance to be predicted in the field. I conclude biometrics are not ready to support such major public policy initiatives as ID systems.
The state of the science of biometrics
I recently came across a weighty assessment of the science of biometrics presented by one of the gurus, Jim Wayman, and his colleagues to the NIST IBPC 2010 biometric testing conference. The paper entitled "Fundamental issues in biometric performance testing: A modern statistical and philosophical framework for uncertainty assessment" should be required reading for all biometrics planners and pundits.
Here are some important extracts:
[Technology] testing on artificial or simulated databases tells us only about the performance of a software package on that data. There is nothing in a technology test that can validate the simulated data as a proxy for the “real world”, beyond a comparison to the real world data actually available. In other words, technology testing on simulated data cannot logically serve as a proxy for software performance over large, unseen, operational datasets. [p15, emphasis added].
In a scenario test, [False Non Match Rate and False Match Rate] are given as rates averaged over total transactions. The transactions often involve multiple data samples taken of multiple persons at multiple times. So influence quantities extend to sampling conditions, persons sampled and time of sampling. These quantities are not repeatable across tests in the same lab or across labs, so measurands will be neither repeatable nor reproducible. We lack metrics for assessing the expected variability of these quantities between tests and models for converting that variability to uncertainty in measurands. [p17].
To explain, a biometric "technology test" is when a software package is exercised on a standardised data set, usually in a bake-off such as NIST's own biometric performance tests over the years. And a "scenario test" is when the biometric system is tested in the lab using actual test subjects. The meaning of the two dense sentences underlined by me in the extracts is: technology test results from one data set do not predict performance on any other data set or scenario, and biometrics practitioners still have no way to predict the accuracy of their solutions in the real world.
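For readers unfamiliar with the metrics, here is a minimal sketch of how FMR and FNMR are computed in a technology test. The score lists and threshold are made up; Wayman and colleagues' point is precisely that rates computed this way describe only the dataset they were measured on, and do not transfer to the field.

```python
# False Match Rate (FMR): fraction of different-person comparisons that
# wrongly score at or above the decision threshold.
# False Non-Match Rate (FNMR): fraction of same-person comparisons that
# wrongly score below it.

def rates(genuine_scores, impostor_scores, threshold):
    fnmr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    fmr = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return fmr, fnmr

genuine = [0.91, 0.87, 0.95, 0.62, 0.88]   # same-person comparison scores
impostor = [0.10, 0.35, 0.71, 0.05, 0.22]  # different-person scores

fmr, fnmr = rates(genuine, impostor, threshold=0.7)
print(fmr, fnmr)  # 0.2 0.2
```

Moving the threshold trades one error rate against the other, which is why a bare "accuracy" figure, quoted without the test conditions, says very little.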
The authors go on:
[To] report false match and false non-match performance metrics for [iris and face recognition] without reporting on the percentage of data subjects wearing contact lenses, the period of time between collection of the compared image sets, the commercial systems used in the collection process, pupil dilation, and lighting direction is to report "nothing at all". [pp17-18].
And they conclude, amongst other things:
[False positive and false negative] measurements have historically proved to be neither reproducible nor repeatable except in very limited cases of repeated execution of the same software package against a static database on the same equipment. Accordingly, "technology" test metrics have not aligned well with "scenario" test metrics, which have in turn failed to adequately predict field performance. [p22].
The limitations of biometric testing have repeatedly been stressed by no less an authority than the US FBI. In their State-of-the-Art Biometric Excellence Roadmap (SABER) Report, the FBI cautions that:
For all biometric technologies, error rates are highly dependent upon the population and application environment. The technologies do not have known error rates outside of a controlled test environment. Therefore, any reference to error rates applies only to the test in question and should not be used to predict performance in a different application. [p4.10]
The SABER report also highlighted a widespread weakness in biometric testing, namely that accuracy measurements usually only look at accidental errors:
The intentional spoofing or manipulation of biometrics invalidates the “zero effort imposter” assumption commonly used in performance evaluations. When a dedicated effort is applied toward fooling biometrics systems, the resulting performance can be dramatically different. [p1.4]
A few years ago, the Future of Identity in the Information Society Consortium ("FIDIS", a research network funded by the European Community’s Sixth Framework Program) wrote a major report on forensics and identity systems. FIDIS looked at the spoofability of many biometrics modalities in great detail (pp 28-69). These experts concluded:
Concluding, it is evident that the current state of the art of biometric devices leaves much to be desired. A major deficit in the security that the devices offer is the absence of effective liveness detection. At this time, the devices tested require human supervision to be sure that no fake biometric is used to pass the system. This, however, negates some of the benefits these technologies potentially offer, such as high-throughput automated access control and remote authentication. [p69]
Biometrics in public policy
To me, this is an appalling and astounding state of affairs for biometrics. The prevailing public understanding of how these technologies work is utopian, based probably on nothing more than science fiction movies and the myth of biometric uniqueness. In stark contrast, scientists warn there is no telling how biometrics will work in the field, and the FBI warns that bench testing doesn't predict resistance to attack. It's very much like the manufacturer of a safe confessing to a bank manager that they don't know how it will stand up in an actual burglary.
This situation has bedeviled enterprise and financial services security for years. Without anyone admitting it, it's possible that the slow uptake of biometrics in retail and banking (save for Japan and their odd hand vein ATMs) is a result of hard headed security officers backing off when they look deep into the tech. But biometrics is going gang busters in the developing world, with vendors thrilling to this much bigger and faster moving market.
The stakes are so very high in national ID systems, especially in the developing world, where resistance to their introduction is relatively low, for various reasons. I'm afraid there is great potential for technological imperialism, given the historical opacity of this industry and its reluctance to engage with the issues.
To be sure vendors are not taking unfair advantage of the developing world ID market, they need to answer some questions:
- Firstly, how do they respond to Jim Wayman, the FIDIS Consortium and the FBI? Is it possible to predict how finger print readers, face recognition and iris scanners are going to operate, over years and years, in remote and rural areas?
- In particular, how good is liveness detection? Can these solutions be trusted in unattended operation for such critical missions as e-voting?
- What contingency plans are in place for biometric ID theft? Can the biometric be cancelled and reissued if compromised? Wouldn't it be catastrophic for the newly empowered identity holder to find themselves cut out of the system if their biometric can no longer be trusted?
I had a letter published in Science magazine about the recently publicised re-identification of anonymously donated DNA data. It has been shown that there is enough named genetic information online, in genealogical databases for instance, that anonymous DNA posted in research databases can be re-identified. This is a sobering result indeed. But does it mean that 'privacy is dead'?
No. The fact is that re-identification of erstwhile anonymous data represents an act of collection of PII and is subject to the Collection Limitation Principle in privacy law around the world. This is essentially the same scenario as Facebook using biometric facial recognition to identify people in photos. European regulators recently found Facebook to have breached privacy law and have forced Facebook to shut down their facial recognition feature.
I expect that the very same legal powers will permit regulators to sanction the re-identification of DNA. There are legal constraints on what can be done with 'anonymous' data no matter where you get it from: under some data privacy laws, attaching names to such data constitutes a Collection of PII, and as such, is subject to consent rules and all sorts of other principles. As a result, bioinformatics researchers will have to tread carefully, justifying their ends and their means before ethics committees. And corporations who seek to exploit the ability to put names on anonymous genetic data may face the force of the law as Facebook did.
To summarise: Let's assume Subject S donates their DNA, ostensibly anonymously, to a Researcher R1, under some consent arrangement which concedes there is a possibility that S will be re-identified. And indeed, some time later, an independent researcher R2 does identify S as the source of the DNA sample. The fact many commentators seem oblivious to is this: R2 has Collected Personal Information (or PII) about S. If R2 has no relationship with S, then S has not consented to this new collection of her PII. In jurisdictions with strict Collection Limitation (like the EU, Australia and elsewhere), it seems to me to be a legal privacy breach for R2 to collect PII by way of DNA re-identification without express consent, regardless of whether R1 has conceded to S that it might happen. Even in the US, where the protections might not be so strict, there remains a question of ethics: should R2 conduct themselves in a manner that might be unlawful in other places?
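The linkage mechanics behind this scenario can be sketched in a few lines. In the sketch below, all data, names and field names are hypothetical, purely for illustration: joining an "anonymous" dataset to a named one on shared quasi-identifiers yields named records, and every successful match is, in privacy-law terms, a fresh Collection of PII by the party doing the linking.

```python
# Hypothetical illustration: re-identification by record linkage.
# R2 links "anonymous" DNA samples to a named genealogy database
# using quasi-identifiers published alongside the samples.

anonymous_dna = [
    # quasi-identifiers released with the "anonymous" sample
    {"sample_id": "S-001", "inferred_surname": "Citizen",
     "birth_year": 1972, "state": "NSW"},
]

genealogy_db = [
    # named records freely available online
    {"name": "Jane Citizen", "surname": "Citizen",
     "birth_year": 1972, "state": "NSW"},
]

def reidentify(samples, named_records):
    """Link samples to named records on shared quasi-identifiers.
    Each match attaches a name to erstwhile anonymous data --
    a new collection of Personal Information by the linker."""
    collected_pii = []
    for s in samples:
        for r in named_records:
            if (r["surname"] == s["inferred_surname"]
                    and r["birth_year"] == s["birth_year"]
                    and r["state"] == s["state"]):
                collected_pii.append(
                    {"sample_id": s["sample_id"], "name": r["name"]})
    return collected_pii

print(reidentify(anonymous_dna, genealogy_db))
```

The point of the sketch is that R2 needs no special access: the act of collection happens entirely in R2's own code, on data lawfully obtained from two different places.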
The text of my letter to Science follows, and after that, I'll keep posting follow ups.
Science 8 February 2013:
Vol. 339 no. 6120, p. 647
Yaniv Erlich at the Whitehead Institute for Biomedical Research used his hacking skills to decipher the names of anonymous DNA donors ("Genealogy databases enable naming of anonymous DNA donor," J. Bohannon, 18 January, p. 262). A little-known legal technicality in international data privacy laws could curb the privacy threats of reverse identification from genomes. "Personal information" is usually defined as any data relating to an individual whose identity is readily apparent from the data. The OECD Privacy Principles are enacted in over 80 countries worldwide. Privacy Principle No. 1 states: "There should be limits to the collection of personal data and any such data should be obtained by lawful and fair means and, where appropriate, with the knowledge or consent of the data subject." The principle is neutral regarding the manner of collection. Personal information may be collected directly from an individual or indirectly from third parties, or it may be synthesized from other sources, as with "data mining."
Computer scientists and engineers often don't know that recording a person's name against erstwhile anonymous data is technically an act of collection. Even if the consent form signed at the time of the original collection includes a disclaimer that absolute anonymity cannot be guaranteed, re-identifying the information later signifies a new collection. The new collection of personal information requires its own consent; the original disclaimer does not apply when third parties take data and process it beyond the original purpose for collection. Educating those with this capability about the legal meaning of collection should restrain the misuse of DNA data, at least in those jurisdictions that strive to enforce the OECD principles.
It also implies that bioinformaticians working "with little more than the Internet" to attach names to samples may need ethics approval, just as they would if they were taking fresh samples from the people concerned.
Lockstep Consulting Pty Ltd
Five Dock Sydney, NSW 2046, Australia.
In an interview with Science magazine on Jan 18, the Whitehead Institute's Melissa Gymrek discussed the re-identification methods, and the potential to protect against them. She concluded: "I think we really need to learn to deal with the fact that we cannot ever make data sets truly anonymous, and that I think the key will be in regulating how we are allowed to use this genetic data to prevent it from being used maliciously."
I agree completely. We need regulations. Elsewhere I've argued that anonymity is an inadequate way to protect privacy, and that we need a balance of regulations and Privacy Enhancing Technologies. And it's for this reason that I am not fatalistic about the fact that anonymity can be broken, because we have the procedural means to see that privacy is still preserved.
Last week I had the very great pleasure of participating in the first MIT Legal Hackathon, organised by Dazza Greenwood and Thomas Hardjono for the MIT Media Lab, Kerberos Consortium and wwPass. I say first because they plan to hold a monthly hangout! I hope and expect that this will become a strong, dynamic new forum for multi-disciplinary explorations of Digital Identity.
In Dazza's wrap-up of the event, he pondered the potential for "open public infrastructure for identity":
"... like a big bus of some sort for essential claims from public or other sources, utilised foundationally for identity functions."
His idea builds out logically from a proposed system of claims verification services that I presented to the hackathon, and blogged about a few weeks ago. So for discussion, here's a further development of the schematic. A variety of claims verification services would be made available over a common bus, as Dazza suggested, and used by a Relying Party to assemble the particular fractions of information they decide will make up a Subject's identity in a given transaction context.
Something I really like about this architecture is that it supports several different modes of identification. For one, an RP faced with a fresh user could seek out 'attribute providers' in real time, in the OIX or Identity Metasystem way of working. Alternatively, for well-worn e-commerce transactions where the necessary claims are well known in advance, the Subject could put together a basket of claims ahead of time and carry them in an identity wallet to be presented directly to the RP.
The diagram also shows a visualisation of the claims of interest to the RP for the transaction at hand, and the necessary degree of confidence in each of them (i.e. 90% in name, residential address and date of birth). I discussed this way of looking at different claims sets as surfaces in another blog post last year.
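To make the idea concrete, here is a minimal sketch of an RP applying a per-transaction claims policy to a basket of verified claims. All names, claim labels and confidence figures below are hypothetical assumptions for illustration, not any particular protocol.

```python
# Hypothetical sketch: a Relying Party assembles an "identity" from
# elemental claims, each verified to a required confidence by some
# claims verification service on the common bus.

REQUIRED_CLAIMS = {              # per-transaction policy set by the RP
    "name": 0.9,                 # i.e. 90% confidence required
    "residential_address": 0.9,
    "date_of_birth": 0.9,
}

def accept_subject(verified_claims):
    """verified_claims maps claim -> (value, confidence) as returned
    by attribute providers. The RP, not any IdP, decides what counts
    as an identity for this transaction."""
    identity = {}
    for claim, threshold in REQUIRED_CLAIMS.items():
        if claim not in verified_claims:
            return None          # missing claim: reject
        value, confidence = verified_claims[claim]
        if confidence < threshold:
            return None          # insufficient confidence: reject
        identity[claim] = value
    return identity              # the RP's context-specific "identity"

basket = {   # e.g. presented from the Subject's identity wallet
    "name": ("Jane Citizen", 0.95),
    "residential_address": ("1 Example St, Sydney", 0.92),
    "date_of_birth": ("1980-01-01", 0.97),
}
print(accept_subject(basket))
```

Note that the "identity" here is nothing more than the set of claims the RP chose to verify; a different RP, with a different policy, would assemble a different identity from the same Subject.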
As we rethink identity orthodoxies in forums like the MIT Legal Hackathon, I propose we shift perspectives a little. For instance:
- We should drop down a level, and focus on ways to exchange information about elements of identity, rather than rolled-up "identities" themselves; that is, we should fractionate identity into its important component parts, guided by transaction context.
- When building identification services frameworks, we should avoid imposing particular business protocols on organisations, so they remain free to select which claims and combinations of claims they want Subjects to exhibit.
- We can avoid technicalities like the difference between "authentication" and "authorization", and indeed we can remove ourselves from the philosophical debates over "identity"; the proposal simply provides uniform market-based mechanisms for parties to assert and test elemental claims as a precursor to doing business.
- Life looks much simpler under the neutral definition of "authentication" adopted by the APEC eSecurity Task Group over a decade ago: the means by which a receiver of an electronic transaction or message makes a decision to accept or reject that transaction or message.
None of this is actually radical. We've always thought about claims and attributes, all the authentication protocols deal with attributes, the good old Laws of Identity were actually all about claims, and there is infrastructure to deal with claims.
I think we just need to shift focus. We technologists shouldn't be so preoccupied with identity per se; let businesses continue to sort out identities as they see fit, and just give them the means to deal digitally with component claims. Let's not put IdPs ahead of APs. It may turn out we don't need IdPs at all. It's all about the claims, and only about the claims.
That is to say, identity is in the eye of the Relying Party.
The word "identity" seems increasingly problematic to me. It's full of contradictions. On the one hand, it's a popular view that online identity should be "user centric"; many commentators call for users to be given greater determination in how they are identified. People like the idea of "narrating" their own identities, and "bringing their own identity" to work. Yet it's not obvious how governments, banks, healthcare providers or employers for instance can grant people much meaningful say in how they are identified. These sorts of organisations impress their particular forms of identity upon us in order to formalise the relationship they have with us and manage our access to services.
The language of orthodox Federated Identity institutionalises the idea that identity is a good that is "provided" to us through a formal supply chain elaborated in architectures like the Open Identity Exchange (OIX). It might make sense in some settings for individuals to exercise a choice of IdPs, for example choosing between Facebook or Twitter to log on to a social website, but users still don't have much influence on how the IdPs operate, nor on the decision made by Relying Parties about which IdPs they elect to recognise. Think about the choice we have of credit cards: you might prefer to use Diners Club over MasterCard, but if you're shopping at a place that doesn't accept Diners, your "choice" is constrained. You cannot negotiate in real time to have the store accept your chosen instrument (instead you can choose to get yourself a MasterCard or you can choose to go to a different store).
I think the concept of "identity" is so fluid that we should probably stop using it. Or at least use it with much more self-conscious precision.
I'd like you to consider that "Identity Providers" do not in fact provide identity. They really can't provide identity at all, but only assertions -- that is, elements of identity -- that are put together by others who are impacted by the validity of those elements. The act of identification is a part of risk management. It means getting to know a Subject so as to make certain risks more manageable. And it's always done by a Relying Party.
An identity is the outcome of an identification process in which claims about a Subject are verified, to the satisfaction of the Relying Party. An "identity" is basically a handle by which the Subject is known. Recall that the Laws of Identity usefully defined a Digital Identity as a set of claims about the Digital Subject. And we all know that identity is highly context dependent; on its own, an identity like "Acct No. 12345678" means little or nothing without knowing the context as well.
This line of reasoning reminds me once again of the technology neutral, functional definition of "authentication" used by the APEC eSecurity Task Group over a decade ago: the means by which a receiver of an electronic transaction or message makes a decision to accept or reject that transaction or message. Wouldn't life be so much simpler if we stopped overloading some bits of authentication knowledge with the label "identity" and going to such lengths to differentiate other bits of knowledge as "attributes"? What we need online is better means for reliably conveying precise pieces of information about each other, relevant to the transaction at hand. That's all.
Carefully unpacking the language of identity management, we see that no Identity Provider ever actually "identifies" people. In reality, identification is always done by Relying Parties by pulling together what they need to know about a Subject for their own purposes. One IdP might say "This is Steve Wilson", another "This is Stephen Kevin Wilson", another "This is @Steve_Lockstep", another "This is Stephen Wilson, CEO of Lockstep" and yet another "This is Stephen Wilson at 100 Park Ave Jonestown Visa 4000 1234 5678 9012". None of these assertions are my "identity"! My "identity" is different at every RP, each to their need.
See also An Algebra of Identity.
I have come to believe that a systemic conceptual shortfall affects typical technologists' thinking about privacy. It may be that engineers tend to take literally the well-meaning slogan that "privacy is not a technology issue". I say this in all seriousness.
Online, we're talking about data privacy, or data protection, but systems designers tend to bring to work a spectrum of personal outlooks about privacy in the human sphere. Yet what matters is the precise wording of data privacy law, like Australia's Privacy Act. To illustrate the difference, here's the sort of experience I've had time and time again.
During the course of conducting a PIA in 2011, I spent time with the development team working on a new government database. These were good, senior people, with sophisticated understanding of information architecture. But they harboured restrictive views about privacy. An important clue was the way they referred to "private" information rather than Personal Information (or equivalently, Personally Identifiable Information, PII). After explaining that Personal Information is the operative term in Australian legislation, and reviewing its definition from the Privacy Act, we found that the team had failed to appreciate the extent of the PI in their system. They overlooked that most of their audit logs collect PI, albeit indirectly and automatically. Further, they had not appreciated that information about clients in their register provided by third parties was also PI (despite it being intuitively "less private" by virtue of originating from others). I attributed these blind spots to the developers' weak and informal frame of "private" information. Online and in data privacy law alike, things are very crisp. The definition of Personal Information -- namely any data relating to an individual whose identity is readily apparent -- sets a low bar, embracing a great many data classes and, by extension, informatics processes. It's a nice analytical definition that is readily factored into systems analysis. After the team got that, the PIA in question proceeded apace and we found and rectified several privacy risks that had gone unnoticed.
Here are some more of the many recurring misconceptions I've noticed over the past decade:
- "Personal" Information is sometimes taken to mean especially delicate information such as payment card details, rather than any information pertaining to an identifiable individual such as email addresses in many cases; an exchange between US data breach analyst Jake Kouns and me over the Epsilon incident in 2011 is revealing of a technologists' systemically narrow idea of PII;
- the act of collecting PI is sometimes regarded only in relation to direct collection from the individual concerned; technologists can overlook that PI provided by a third party to a data custodian is nevertheless being collected by the custodian, and they can fail to appreciate that generating PI internally, through event logging for instance, can also represent collection;
- even if they are aware of points such as Australia's Access and Correction Principle, database administrators can be unaware that, technically, individuals requesting a copy of information held about them should also be provided with pertinent event logs; a non-trivial case where individuals can have a genuine interest in reviewing event logs is when they want to know if an organisation's staff have been accessing their records.
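The last point lends itself to a short sketch. The log format below is a hypothetical assumption for illustration; the point is that entries recording staff access to a person's record are themselves Personal Information about that person, and so belong in a response to their access request.

```python
# Hypothetical sketch: honouring an access request that includes
# event logs -- who accessed an individual's record, when, and how.

audit_log = [
    {"timestamp": "2013-03-01T09:15", "staff_id": "emp42",
     "action": "view",   "record_id": "cust-1001"},
    {"timestamp": "2013-03-02T14:02", "staff_id": "emp17",
     "action": "update", "record_id": "cust-2002"},
    {"timestamp": "2013-03-05T11:30", "staff_id": "emp42",
     "action": "view",   "record_id": "cust-1001"},
]

def access_request_log_extract(log, record_id):
    """Return the event-log entries pertinent to the individual whose
    record this is. These entries are PI about that individual,
    collected automatically by the system."""
    return [entry for entry in log if entry["record_id"] == record_id]

for entry in access_request_log_extract(audit_log, "cust-1001"):
    print(entry["timestamp"], entry["staff_id"], entry["action"])
```

In practice the extract would of course need its own review -- staff identifiers in the log are PI about the staff as well -- but the sketch shows why "give the individual their record" is incomplete without the logs.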
These instances, among many others in my experience working across both information security and privacy, show that ICT practitioners suffer important gaps in their understanding. Security professionals in particular may be forgiven for thinking that most legislated Privacy Principles are legal niceties irrelevant to them, for generally only one of the principles in any given set is overtly about security; see:
- no. 5 of the eight OECD Privacy Principles
- no. 4 of the five Fair Information Practice Principles in the US
- no. 8 of the ten Generally Accepted Privacy Principles of the US and Canadian accounting bodies,
- no. 4 of the ten old National Privacy Principles of Australia, and
- no. 11 of the 13 new Australian Privacy Principles (APPs).
Yet every one of the privacy principles is impacted by information technology and security practices; see Mapping Privacy requirements onto the IT function, Privacy Law & Policy Reporter, Vol. 10.1 & 10.2, 2003. I believe the gaps in the privacy knowledge of ICT practitioners are not random but are systemic, probably resulting from privacy training for non-privacy professionals being ad hoc and not properly integrated with their particular world views.
To properly deal with data privacy, ICT practitioners need to have privacy framed in a way that leads to objective design requirements. Luckily there already exist several unifying frameworks for systematising the work of dev teams. One example that resonates strongly with data privacy practice is the Threat & Risk Assessment (TRA).
The TRA is an infosec requirements analysis tool, widely practiced in the public and private sectors. There are a number of standards that guide the conduct of TRAs, such as ISO 31000. A TRA is used to systematically catalogue all foreseeable adverse events that threaten an organisation's information assets, identify candidate security controls (considering technologies, processes and personnel) to mitigate those threats, and most importantly, determine how much should be invested in each control to bring all risks down to an acceptable level. The TRA process delivers real world management decisions, understanding that non zero risks are ever present, and that no organisation has an unlimited security budget.
I have found that in practice, the TRA exercise is readily extensible as an aid to Privacy by Design. A TRA can expressly incorporate privacy as an attribute of information assets worth protecting, alongside the conventional security qualities of confidentiality, integrity and availability ("C.I.A."). A crucial subtlety here is that privacy is not the same as confidentiality, yet the two are frequently conflated. A fuller understanding of privacy leads designers to consider the Collection, Use, Disclosure and Access & Correction principles, over and above confidentiality, when they analyse information assets.
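As a sketch of how a TRA register might carry privacy alongside C.I.A., consider the fragment below. All threat events, ratings and the tolerance threshold are illustrative assumptions, not drawn from any standard; the point is simply that privacy principles can be treated as threatened attributes of an asset, scored and prioritised the same way as confidentiality or availability.

```python
# Hypothetical sketch: a TRA register extended with privacy attributes.
# Risk score = likelihood x impact; invest in a control whenever the
# score exceeds the organisation's risk tolerance.

RISK_TOLERANCE = 6   # accept residual risks scoring at or below this

threats = [
    # "attribute" names the quality of the asset that is threatened
    {"event": "database breach",
     "attribute": "confidentiality",     "likelihood": 2, "impact": 5},
    {"event": "over-collection in web form",
     "attribute": "privacy:collection",  "likelihood": 4, "impact": 3},
    {"event": "analytics reuse of PI",
     "attribute": "privacy:use",         "likelihood": 3, "impact": 4},
    {"event": "backup failure",
     "attribute": "availability",        "likelihood": 2, "impact": 2},
]

def needs_control(threat):
    """True when the risk exceeds tolerance and a candidate control
    (technology, process or personnel) should be costed."""
    return threat["likelihood"] * threat["impact"] > RISK_TOLERANCE

for t in threats:
    decision = "mitigate" if needs_control(t) else "accept"
    print(f'{t["event"]:30} {t["attribute"]:20} {decision}')
```

Notice that in this framing a Collection or Use breach surfaces in exactly the same management conversation as a confidentiality breach, which is the integration the section argues for.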
Lockstep continues to actively research the closer integration of security and privacy practices.