Big Data? Big Privacy!

I was invited to give a speech to launch Australian Privacy Awareness Week #2013PAW on April 29. This is an edited version of my speaking notes.

What does privacy mean to technologists?

I’m a technologist who stumbled into privacy. Some 12 years ago I was doing a big security review at a utility company. Part of their policy document set was a privacy statement posted on the company’s website. I was asked to check it out. It said things like ‘We collect the following information about you [the customer] … If you ever want a copy of the information we have about you, please call the Privacy Officer …’. I had a hunch this was problematic, so I took the document to the chief IT architect. He had never seen the privacy statement before, so that was the first problem. Moreover, he advised there was no way they could readily furnish complete customer details, for their CRM databases were all over the place. So IT was disenfranchised in the privacy statement, and the undertakings it contained were impractical.

Clearly there was a lot going on in privacy that we technologists needed to know. So with an inquiring mind, I took it upon myself to read the Privacy Act. And I was amazed by what I found. In fact I wrote a paper in 2003 about the ramifications for IT of the 10 National Privacy Principles, and that kicked off my privacy sub-career.

Ever since, I have found time and time again a shortfall in the understanding that “technologists” as a class have of data privacy. There is a gap between technology and the law. IT professionals may receive privacy training, but as soon as they hear the well-meaning slogan “Privacy Is Not A Technology Issue” they tend to say ‘thank god: that’s one thing I don’t need to worry about’. Conversely, privacy laws are written with some naivety about how information flows in modern IT and how it aggregates automatically in standard computer systems. For instance, several clauses in Australian privacy law refer expressly to making ‘annotations’ in the ‘records’, as if they’re all paper-based, with wide margins.

The gap is perpetuated to some extent by the popular impression that the law has not kept up with the march of technology. As a technologist, I have to say I am not cynical about the law; I actually find that principles-based data privacy law anticipates almost all of the current controversies in cyberspace (though not quite all, as we shall see).

So let’s look at a couple of simple technicalities that technologists don’t often comprehend.

What Privacy Law actually says

Firstly there is the very definition of Personal Information. Lay people and engineers tend to intuit that Personal Information [or equivalently what is known in the US as Personally Identifiable Information] is the stuff of forms and questionnaires and call centres. So technologists can be surprised that the definition of Personal Information covers a great deal more. Look at the definition from the Australian federal Privacy Act:

Information or an opinion, whether true or not, about an individual whose identity is apparent, or can reasonably be ascertained, from the information or opinion.

So if metadata or event logs in a computer system are personally identifiable, then they constitute Personal Information, even if this data has been completely untouched by human hands.
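To make the point concrete, here is a minimal sketch in Python (the log entry, its format and its contents are all invented for illustration) showing how a routine, machine-generated access log can be personally identifiable, and hence constitute Personal Information:

```python
# A minimal sketch (hypothetical log format and contents) of how routine,
# machine-generated event logs can be personally identifiable, and so fall
# within the definition of Personal Information.

import re

# An ordinary web server access log entry, written automatically:
log_line = (
    '203.0.113.7 - - [29/Apr/2013:10:15:32 +1000] '
    '"GET /account/settings?user=jane.citizen@example.com HTTP/1.1" 200 512'
)

# Even though no person has looked at this line, it contains an email address
# from which an individual's identity can reasonably be ascertained.
email = re.search(r'[\w.+-]+@[\w.-]+', log_line)
if email:
    print("This log entry is Personal Information; it identifies:", email.group())
```

Nothing here involves a form, a questionnaire or a call centre; the Personal Information arises purely as a by-product of ordinary system operation.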

Then there is the crucial matter of collection. Our privacy legislation, like that of most OECD countries, is technology neutral with regard to the manner of collection of Personal Information. Indeed, the term “collection” is not defined in the Privacy Act. The word is used in its plain English sense. So if Personal Information has wound up in an information system, it doesn’t matter whether it was gathered directly from the individual concerned, or whether it has instead been imported, or found in the public domain, or generated almost from scratch by some algorithm: the Personal Information has been collected and as such is covered by the Collection Principle of the Privacy Act. That is to say:

An organisation must not collect Personal Information unless the information is necessary for one or more of its functions or activities.

Editorial Note: One of the core differences between most international privacy law and the American environment is that there is no Collection Limitation in the Fair Information Practice Principles (FIPPs). The OECD approach tries to head privacy violations “off at the pass” by discouraging collection of PII if it is not expressly needed, but in the US business sector there is no such inhibition.

Now let’s look at some of the missteps that have resulted from technologists accidentally overlooking these technicalities (or perhaps technocrats more deliberately ignoring them).

1. Google StreetView Wi-Fi collection

Google StreetView cars collect Wi-Fi hub coordinates (as landmarks for Google’s geo-location services). On their own, Wi-Fi locations are unidentified, but it was found that the StreetView software was also inadvertently collecting Wi-Fi network traffic, some of which contained Personal Information (like user names and even passwords). The Australian and Dutch Privacy Commissioners found Google was in breach of their respective data protection laws.

Many technologists, I found, argued that Wi-Fi data in the “public domain” is not private, and that “by definition” (so they liked to say) it categorically could not be private. Therefore they believed Google was within its rights to do whatever it liked with such data. But the argument fails to grasp the technicality that our privacy laws basically do not distinguish “public” from “private”. In fact the words “public” and “private” are not operable in the Privacy Act (which is really more of a data protection law). If data is identifiable, then privacy sanctions attach to it.

The lesson for Big Data privacy is this: it doesn’t much matter if Personal Information is sourced from the public domain: you are still subject to Collection and Use Limitation principles (among others) once it is in your custody.



2. Facebook facial recognition

Facebook photo tagging creates biometric templates that are subsequently used to generate tag suggestions. Before displaying suggestions, Facebook’s facial recognition algorithms run in the background over all photo albums. When they make a putative match and record a deduced name against a hitherto anonymous piece of image data, the Facebook system has collected Personal Information (albeit indirectly).
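As a rough illustration only (the vectors, threshold and similarity measure below are invented, and are certainly not Facebook’s actual algorithms), the collection occurs at the moment a match is recorded:

```python
# A toy sketch of how face matching can turn an anonymous photo into
# Personal Information. All data and thresholds are fabricated.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Biometric templates derived from photos that users have already tagged.
templates = {"Alice Example": [0.9, 0.1, 0.3], "Bob Example": [0.2, 0.8, 0.5]}

# A face vector extracted from a hitherto anonymous, untagged photo.
anonymous_face = [0.88, 0.12, 0.31]

best_name, best_score = max(
    ((name, cosine(anonymous_face, t)) for name, t in templates.items()),
    key=lambda pair: pair[1],
)

if best_score > 0.95:  # arbitrary illustrative threshold
    # Recording the deduced name against the image is the point at which
    # Personal Information has been (indirectly) collected.
    print("Tag suggestion recorded:", best_name)
```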

European privacy regulators in mid 2012 found biometric data collection without consent to be a serious breach, and by late 2012 had forced Facebook to shut down facial recognition and tag suggestions in the EU. This was quite a show of force over one of the most powerful companies of the digital age.

The lesson for Big Data privacy is this: it doesn’t much matter if you generate Personal Information almost out of thin air, using sophisticated data processing algorithms: you are still subject to Privacy Principles, such as Openness as well as Collection Limitation and Use Limitation. You cannot use Big Data to synthesise Personal Information behind people’s backs without a good cause, just as you cannot directly ask customers spurious questions in a form.


3. Target’s pregnancy predictions

In 2012, the US department store Target was revealed by the New York Times to be developing statistical methods for identifying when regular customers become pregnant, by looking for trends in buying habits. Preferences for scented products and household chemicals shift subtly in ways that can reveal pregnancy, possibly even before the woman herself is aware of it. Retail strategists are keen to win the loyalty of pregnant women so as to secure lucrative business through the expensive early years of child rearing.

There are all sorts of issues here, and the matter is still playing out in the US. A crucial technicality I want to focus on for now is that in Australia, tagging someone in a database as pregnant (even if that prediction is wrong!) creates health information, which falls under the legislated category of Sensitive Information. Express informed consent is required in advance of collecting any Sensitive Information. So if Australian stores want to use Big Data techniques, they may need to disclose to their customers up front that health information might be extracted by mining their buying habits, and obtain express consent for the algorithms to run. Remember Australia sets a low bar for privacy breaches: simply logging Sensitive Personal Information may be a breach, even before that data is used for anything.
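To illustrate the technicality, here is a deliberately simplistic sketch (invented products and weights, nothing like Target’s real statistical model) of the point at which Sensitive Information comes into existence:

```python
# A hypothetical, greatly simplified sketch of basket analysis. The key point
# is that the instant the prediction is written to a customer record, health
# information has been collected (whether or not the prediction is correct).

pregnancy_signals = {
    "unscented lotion": 0.3,
    "calcium supplement": 0.25,
    "cotton balls (large pack)": 0.2,
}

def pregnancy_score(purchases):
    return sum(pregnancy_signals.get(item, 0.0) for item in purchases)

customer_record = {
    "id": "C1024",
    "purchases": ["unscented lotion", "calcium supplement", "bread"],
}

if pregnancy_score(customer_record["purchases"]) > 0.5:
    # This write creates Sensitive (health) Information about the customer,
    # which in Australia requires express consent obtained in advance.
    customer_record["predicted_pregnant"] = True

print(customer_record)
```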

Irrespective of Big Data, there has been a longstanding snag in Australia for grocery stores selling herbal remedies and over-the-counter medicines online. Buying habits in some cases can indicate health conditions and therefore represent clear-cut health information, which is being routinely collected without consent and without the stores being aware of the implications. St John’s Wort, for example, may seem innocuous, but it is only purchased by people who have (or believe they have) depression. IT security managers might not have thought about the implications of logging mental health information in ordinary old web servers and databases.

4. “DNA Hacking”

In February this year, research was published in which a subset of anonymous donors to a DNA research program in the UK were identified by cross-matching genes to data in US-based public genealogy databases. All of a sudden, the ethics of re-identifying genetic material has become a red-hot topic. Much attention is focusing on the nature of the informed consent; different initiatives (like the Personal Genome Project and 1000 Genomes) give different levels of comfort about the possibility of re-identification. Absolute anonymity is typically disclaimed, but donors in some projects are reassured that re-identification will be ‘difficult’.

But regardless of the consent given by a Subject (1st party) to a researcher (2nd party), a nice legal problem arises when a separate 3rd party takes anonymous data and re-identifies it without consent. Technically the 3rd party has collected Personal Information, as per the principles discussed above, and that may require consent under privacy laws. Following on from the European facial recognition precedent, I contend that re-identification of DNA without consent is likely to be ruled problematic (if not unlawful) in some jurisdictions. And it is therefore unethical in all fair-minded jurisdictions.
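The mechanics can be sketched very simply (the marker values below are entirely fabricated); the point is that the linkage step itself amounts to a collection of Personal Information by the 3rd party:

```python
# A toy illustration of cross-matching: an 'anonymous' research genome is
# linked to a public genealogy database via shared genetic markers,
# recovering a likely surname. All values are made up.

# Anonymous research donor: markers only, no name.
research_donor = {"donor_id": "ANON-42", "marker_profile": (13, 24, 15, 11)}

# Public genealogy database: surnames keyed by the same kind of markers.
genealogy_db = [
    {"surname": "Example", "marker_profile": (13, 24, 15, 11)},
    {"surname": "Sample",  "marker_profile": (14, 22, 16, 10)},
]

matches = [entry["surname"] for entry in genealogy_db
           if entry["marker_profile"] == research_donor["marker_profile"]]

if matches:
    # At this point the 'anonymous' donor has been re-identified, and
    # Personal Information has been collected, arguably without consent.
    print("Donor", research_donor["donor_id"], "likely surname:", matches[0])
```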

Big Data’s big challenge

So principles-based data protection laws have proven very powerful in the cases of Google’s StreetView Wi-Fi collection and Facebook’s facial recognition (even though these scenarios could not have been envisaged with any precision 30 years ago when the OECD privacy principles were formulated). And they seem to neatly govern DNA re-identification and data mining for health information, insofar as we can foresee how these activities may conflict with legislated principles and might therefore be brought to book. But there is one area where our data privacy principles may struggle to cope with Big Data: openness.

Orthodox privacy management involves telling individuals What information is collected about them, Why it is needed, When it is collected, and How. But with Big Data, even if a company wants to be completely transparent, it may not know what Personal Information lies waiting to be mined and discovered in the data, nor when exactly this discovery might be done.

An underlying theme in Big Data business models is data mining, or perhaps more accurately, data refining, as suggested by this diagram:

An increasing array of data processing techniques is applied to vast stores of raw information (like image data in the example) to extract metadata and increasingly valuable knowledge.
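Schematically (the stages and function names below are purely illustrative), the refining pipeline might look something like this, with a late stage liable to turn anonymous raw data into Personal Information:

```python
# A schematic sketch of 'data refining': each stage adds value, and a late
# stage may turn anonymous raw data into Personal Information.

def extract_metadata(raw_image):
    # e.g. EXIF-style metadata: where and when the photo was taken (invented values)
    return {"gps": (-33.87, 151.21), "timestamp": "2013-04-29T10:15:00"}

def recognise_faces(raw_image, metadata):
    # e.g. match against biometric templates; a match names an individual
    return dict(metadata, person="Alice Example")

def infer_knowledge(facts):
    # e.g. "this person was at this place at this time": valuable, and personal
    return "{person} was at {gps} at {timestamp}".format(**facts)

raw_image = object()  # stand-in for bulk raw image data
print(infer_knowledge(recognise_faces(raw_image, extract_metadata(raw_image))))
```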

There is nothing intrinsically wrong with a business model that extracts value from raw information, even if it converts anonymous data into Personal Information. But the privacy promise enshrined in OECD data protection laws – namely to be open with individuals about what is known about them and why – can become hard to honour.

There is a bargain at the heart of most social media companies today, in which Personal Information is traded for a rich array of free services. The bargain is opaque; the “infomopolies” are coy about the value they attach to the Personal Information of their members.

If Online Social Networks were more open about their business models, I think it likely that most of their members would still be happy with the bargain. After all, Google, Facebook, Twitter et al have become indispensable for many of us. They do deliver fantastic value. But the Personal Information trade needs to be transparent.

“Big Privacy” Principles

In conclusion, I offer some expanded principles for protecting privacy in Big Data.

Exercise restraint: More than ever, remember that privacy is essentially about restraint. If a business knows me, then privacy means simply that the business is restrained in how it uses that knowledge.

Meta transparency: We’re at the very start of the Big Data age. Who knows what lies ahead? Meta transparency means not only being open about what Personal Information is collected and why, but also being open about the business model and the emerging tools.

Engage customers in a fair value deal: Most savvy digital citizens appreciate there is no such thing as a free lunch; they already know at some level that “free” digital services are paid for by trading Personal Information. Many netizens have learned already to manage their own privacy in an ad hoc way, for instance obfuscating or manipulating the personal details they divulge. Ultimately consumers and businesses alike will do better by engaging in a real deal that sets out how PI is truly valued and leveraged.

Dynamic consent models: The most important area for law and policy to catch up with technology seems to be in consent. As businesses discover new ways to refine raw data to generate value, individuals need to be offered better visibility of what’s going on, and new ways to opt out and opt back in again depending on how they gauge the returns on offer. This is the cutting edge of privacy policy and privacy enhancing technologies.

Further reading