Big Data? Big Privacy!
I gave a short speech at the launch of Australian Privacy Awareness Week #2013PAW on April 29. This is an edited version of my speaking notes.
What does privacy mean to technologists?
I'm a technologist who stumbled into privacy. Some 12 years ago I was doing a big security review at a utility company. Part of their policy document set was a privacy statement posted on the company's website. I was asked to check it out. It said things like 'We the company collect the following information about you [the customer] ... If you ever want a copy of the information we have about you, please call the Privacy Officer ...'. I had a hunch this was problematic, so I took it to the chief IT architect. He had never seen the statement before, and advised there was no way they could readily furnish complete customer details, for their CRM databases were all over the place.
Clearly there was a lot going on in privacy that we technologists needed to know. So with an inquiring mind, I read the Privacy Act. And I was amazed by what I found. In fact I wrote a paper in 2003 about the ramifications for IT of the 10 National Privacy Principles, and that kicked off my privacy sub-career.
Ever since I've found time and time again a shortfall in the understanding that "technologists" as a class have regarding data privacy. There is a gap between technology and the law. IT professionals may receive privacy training but as soon as they hear the well-meaning slogan "Privacy Is Not A Technology Issue" they tend to say 'thank god: that's one thing I don't need to worry about'. Conversely, privacy laws are written with some naivety about how information flows in modern IT and how it aggregates automatically in standard computer systems. For instance, several clauses in Australian privacy law refer expressly to making 'annotations' in the 'records' as if they're all paper based, with wide margins.
The gap is perpetuated to some extent by the popular impression that the law has not kept up with the march of technology. As a technologist, I have to say I am not cynical about the law; I actually find that principles-based data privacy law anticipates almost all of the current controversies in cyberspace (though not quite all, as we shall see).
So let's look a a couple of simple technicalities that technologists don't often comprehend.
What Privacy Law actually says
Firstly there is the very definition of Personal Information. Lay people and engineers tend to intuit that Personal Information is the stuff of forms and questionnaires and call centres. So technologists can be surprised that the definition of Personal Information covers a great deal more. Look at the definition from the Privacy Act:
Information or an opinion, whether true or not, about an individual whose identity is apparent, or can reasonably be ascertained, from the information or opinion.
So if metadata or event logs in a computer system are personally identifiable, then they constitute Personal Information, even if this data has been completely untouched by human hands.
Then there is the crucial matter of collection. Our privacy legislation like that of most OECD countries is technology neutral with regards to the manner of collection of Personal Information. Indeed, the term "collection" is not defined in the Privacy Act. The word is used in its plain English sense. So if Personal Information has wound up in an information system, it doesn't matter if it was gathered directly from the individual concerned, or whether it has instead been imported or found in the public domain or generated almost from scratch by some algorithm: the Personal Information has been collected and as such is covered by the Collection Principle of the Privacy Act. That is to say:
An organisation must not collect Personal Information unless the information is necessary for one or more of its functions or activities.
Now let's look at some of the missteps that have resulted from technologists accidentally overlooking these technicalities (or perhaps technocrats more deliberately ignoring them).
1. Google StreetView Wi-Fi collection
Google StreetView cars collect Wi-Fi hub coordinates (as landmarks for Google's geo-location services). On their own Wi-Fi locations are unidentified, but it was found that the StreetView software was also inadvertently collecting Wi-Fi network traffic, some of which contained Personal Information (like user names and even passwords). Australian and Dutch Privacy Commissioners found Google was in breach of respective data protection laws.
Many technologists I found argued that Wi-Fi data in the "public domain" is not private, and "by definition" (so they liked to say) it categorically could not be private. Therefore they believed Google was within its rights to do whatever it liked with such data. But the argument fails to grasp the technicality that our privacy laws basically do not distinguish public from "private". In fact the words "public" and "private" are not operable in the Privacy Act (which is really more of a data protection law). If data is identifiable, then privacy sanctions attach to it.
The lesson for Big Data privacy is this: it doesn't much matter if Personal Information is sourced from the public domain: you are still subject to Collection and Use Limitation principles (among others) once it is in your custody.
2. Facebook facial recognition
Facebook photo tagging creates biometric templates used to subsequently generate tag suggestions. Before displaying suggestions, Facebook's facial recognition algorithms run in the background over all photo albums. When they make a putative match and record a deduced name against a hitherto anonymous piece of image data, the Facebook system has collected Personal Information.
European privacy regulators in mid 2012 found biometric data collection without consent to be a serious breach, and by late 2012 had forced Facebook to shut down facial recognition and tag suggestions in the EU. This was quite a show of force over one of the most powerful companies of the digital age.
The lesson for Big Data privacy is this: it doesn't much matter if you generate Personal Information almost out of thin air, using sophisticated data processing algorithms: you are still subject to Privacy Principles, such as Openness as well as Collection and Use Limitation.
3. Target's pregnancy predictions
The department store Target in the US was found by New York Times investigative journalists to be experimenting with statistical methods for identifying that a regular customer is likely to be pregnant, by looking for trends in her buying habits. Retail strategists are keen to win the loyalty of pregnant women so as to secure their lucrative business through the expensive early years of parenting.
There are all sorts of issues here. One technicality I wish to draw out is that in Australia, the privacy implications would be amplified by the fact that tagging someone in a database as pregnant [even if that prediction is wrong!] creates health information, and therefore represents a collection of Sensitive Information. Express informed consent is required in advance of collecting Sensitive Information. So if Australian stores want to use Big Data techniques, they may need to disclose to their customers up front that health information might be extracted by mining their buying habits, and obtain express consent for the algorithms to run. Remember Australia sets a low bar for privacy breaches: simply collecting Sensitive Personal Information may be a breach even before it is used for anything or disclosed.
Note also there is already a latent problem in Australia for grocery stores that sell medicinals online, and this has nothing to do with Big Data. St Johns Wort for example may seem innocuous but it indicates that a customer has (or believes they have) depression. IT security managers might not have thought about the implications of logging mental health information in ordinary old web servers and databases.
4. "DNA Hacking"
In February this year, research was published where a subset of anonymous donors to a DNA research program in the UK were identified by cross-matching genes to data in US based public genealogy databases. All of a sudden, the ethics of re-identifying genetic material has become a red hot topic. Much attention is focusing on the nature of the informed consent; different initiatives (like the Personal Genome Project and 1,000 Genomes) give different levels of comfort about the possibility of re-identification. Absolute anonymity is typically disclaimed but donors in some projects are reassured that re-identification will be 'difficult'.
But regardless of the consent given by a Subject (1st party) to a researcher (2nd party), a nice legal problem arises when a separate 3rd party takes anonymous data and re-identifies it without consent. Technically the 3rd party has collected Personal Information, as per the principles discussed above, and that may require consent under privacy laws. Following on from the European facial recognition precedent, I contend that re-identification of DNA without consent is likely to be ruled problematic (if not unlawful) in some jurisdictions. And it therefore unethical in all fair minded jurisdictions.
Big Data's big challenge
So principles-based data protection laws have proven and powerful in the cases of Google's StreetView Wi-Fi collection and Facebook's facial recognition (even though these scenarios could not have been envisaged with any precision 20 odd years ago when OECD style privacy principles were formulated). And they seem to neatly govern DNA re-identification and data mining for health information, insofar as we can foresee how these activities may conflict with legislated principles and might therefore be brought to book. But there is one area where our data privacy principles may struggle to cope with Big Data: openness.
Orthodox privacy management involves telling individuals What information is collected about them, Why it is needed, When it is collected, and How. But with Big Data, even if a company wants to be completely transparent, it may not know what Personal Information lies waiting to be mined and discovered in the data, nor when exactly this discovery might be done.
An underlying theme in Big Data business models is data mining, or perhaps more accurately, data refining, as shown in the diagram here. An increasing array of data processing techniques are applied to vast stores of raw information (like image data in the example) to extract metadata and increasingly valuable knowledge.
There is nothing intrinsically wrong with a business model that extracts value from raw information, even if it converts anonymous data into Personal Information. But the privacy promise enshrined in OECD data protection laws – namely to be open with individuals about what is known about them and why – can become hard to honour.
There is a bargain at the heart of most social media companies today, in which Personal Information is traded for a rich array of free services. The bargain is opaque; the "infomopolies" are coy about the value they attach to the Personal Information of their members.
If Online Social Networks were more open about their business models, I think it likely that most of members would still be happy with the bargain. After all, Google, Facebook, Twitter et al have become indispensable for many of us. They do deliver fantastic value. But the Personal Information trade needs to be transparent.
"Big Privacy" Principles
In conclusion, I offer some expanded principles for protecting privacy in Big Data.
Exercise constraint: More than ever, remember that privacy is essentially about restraint. If a business knows me, then privacy means simply that the business is restrained in how it uses that knowledge.
Meta transparency: We're at the very start of the Big Data age. Who knows what lies ahead? Meta transparency means not only being open about what Personal Information is collected and why, but also being open about the business model and the emerging tools.
Engage customers in a fair value deal: Most savvy digital citizens appreciate there is no such thing as a free lunch; they already know at some level that "free" digital services are paid for by trading Personal Information. Many netizens have learned already to manage their own privacy in an ad hoc way, for instance obfuscating or manipulating the personal details they divulge. Ultimately consumers and businesses alike will do better by engaging in a real deal that sets out how PI is truly valued and leveraged.
Dynamic consent models: The most important area for law and policy to catch up with technology seems to be in consent. As businesses discover new ways to refine raw data to generate value, individuals need to be offered better visibility of what's going on, and new ways to opt out and opt back in again depending on how they gauge the returns on offer. This is the cutting edge of privacy policy and privacy enhancing technologies.
Further reading
The beginning of privacy!
The headlines proclaim that the newfound ability to re-identify anonymous DNA donors means The End Of Privacy!.
No it doesn't, it only means the end of anonymity.
Anonymity is not the same thing as privacy. Anonymity keeps people from knowing what you're doing, and it's a vitally important quality in many settings. But in general we usually want people (at least some people) to know what we're up to, so long as they respect that knowledge. That's what privacy is all about. Anonymity is a terribly blunt instrument for protecting privacy, and it's also fragile. If anonymity was all you have, then you're in deep trouble when someone manages to defeat it.
New information technologies have clearly made anonymity more difficult, yet it does not follow that we must lose our privacy. Instead, these developments bring into stark relief the need for stronger regulatory controls that compel restraint in the way third parties deal with Personal Information that comes into their possession.
A great example is Facebook's use of facial recognition. When Facebook members innocently tag one another in photos, Facebook creates biometric templates with which it then automatically processes all photo data (previously anonymous), looking for matches. This is how they can create tag suggestions, but Facebook is notoriously silent on what other applications it has for facial recognition. Now and then we get a hint, with, for example, news of the Facedeals start up last year. Facedeals accesses Facebook's templates (under conditions that remain unclear) and uses them to spot customers as they enter a store to automatically check them in. It's classic social technology: kinda sexy, kinda creepy, but clearly in breach of Collection, Use and Disclosure privacy principles.
And indeed, European regulators have found that Facebook's facial recognition program is unlawful. The chief problem is that Facebook never properly disclosed to members what goes on when they tag one another, and they never sought consent to create biometric templates with which to subsequently identify people throughout their vast image stockpiles. Facebook has been forced to shut down their facial recognition operations in Europe, and they've destroyed their historical biometric data.
So privacy regulators in many parts of the world have real teeth. They have proven that re-identification of anonymous data by facial recognition is unlawful, and they have managed to stop a very big and powerful company from doing it.
This is how we should look at the implications of the DNA 'hacking'. Indeed, Melissa Gymrek from the Whitehead Institute said in an interview: "I think we really need to learn to deal with the fact that we cannot ever make data sets truly anonymous, and that I think the key will be in regulating how we are allowed to use this genetic data to prevent it from being used maliciously."
Perhaps this episode will bring even more attention to the problem in the USA, and further embolden regulators to enact broader privacy protections there. Perhaps the very extremeness of the DNA hacking does not spell the end of privacy so much as its beginning.
Posted in Social Media, Science, Privacy, Biometrics, Big Data
Letter to Science: Re-identification of DNA may need ethics approval
Introduction
I had a letter published in Science magazine about the recently publicised re-identification of anonymously donated DNA data. It has been shown that there is enough named genetic information online, in genealogical databases for instance, that anonymous DNA posted in research databases can be re-identified. This is a sobering result indeed. But does it mean that 'privacy is dead'?
No. The fact is that re-identification of erstwhile anonymous data represents an act of collection of PII and is subject to the Collection Limitation Principle in privacy law around the world. This is essentially the same scenario as Facebook using biometric facial recognition to identify people in photos. European regulators recently found Facebook to have breached privacy law and have forced Facebook to shut down their facial recognition feature.
I expect that the very same legal powers will permit regulators to sanction the re-identification of DNA. There are legal constraints on what can be done with 'anonymous' data no matter where you get it from: under some data privacy laws, attaching names to such data constitutes a Collection of PII, and as such, is subject to consent rules and all sorts of other principles. As a result, bioinformatics researchers will have to tread carefully, justifying their ends and their means before ethics committees. And corporations who seek to exploit the ability to put names on anonymous genetic data may face the force of the law as Facebook did.
To summarise: Let's assume Subject S donates their DNA, ostensibly anonymously, to a Researcher R1, under some consent arrangement which concedes there is a possibility that S will be re-identified. And indeed, some time later, an independent researcher R2 does identify S as belonging to the DNA sample. The fact that many commentators seem oblivious to is this: R2 has Collected Personal Information (or PII) about S. If R2 has no relationship with S, then S has not consented to this new collection of her PII. In jurisdictions with strict Collection Limitation (like the EU, Australia and elsewhere) then it seems to me to be a legal privacy breach for R2 to collect PII by way of DNA re-identification without express consent, regardless of whether R1 has conceded to S that it might happen. Even in the US, where the protections might not be so strict, there remains a question of ethics: should R2 conduct themselves in a manner that might be unlawful in other places?
The text of my letter to Science follows, and after that, I'll keep posting follow ups.
Legal Limits to Data Re-Identification
Science 8 February 2013:
Vol. 339 no. 6120 pp. 647
Yaniv Erlich at the Whitehead Institute for Biomedical Research used his hacking skills to decipher the names of anonymous DNA donors ("Genealogy databases enable naming of anonymous DNA donor," J. Bohannon, 18 January, p. 262). A little-known legal technicality in international data privacy laws could curb the privacy threats of reverse identification from genomes. "Personal information" is usually defined as any data relating to an individual whose identity is readily apparent from the data. The OECD Privacy Principles are enacted in over 80 countries worldwide [1]. Privacy Principle No. 1 states: "There should be limits to the collection of personal data and any such data should be obtained by lawful and fair means and, where appropriate, with the knowledge or consent of the data subject." The principle is neutral regarding the manner of collection. Personal information may be collected directly from an individual or indirectly from third parties, or it may be synthesized from other sources, as with "data mining."
Computer scientists and engineers often don't know that recording a person's name against erstwhile anonymous data is technically an act of collection. Even if the consent form signed at the time of the original collection includes a disclaimer that absolute anonymity cannot be guaranteed, re-identifying the information later signifies a new collection. The new collection of personal information requires its own consent; the original disclaimer does not apply when third parties take data and process it beyond the original purpose for collection. Educating those with this capability about the legal meaning of collection should restrain the misuse of DNA data, at least in those jurisdictions that strive to enforce the OECD principles.
It also implies that bioinformaticians working "with little more than the Internet" to attach names to samples may need ethics approval, just as they would if they were taking fresh samples from the people concerned.
Stephen Wilson
Lockstep Consulting Pty Ltd
Five Dock Sydney, NSW 2046, Australia.
E-mail: swilson@lockstep.com.au
Followup
In an interview with Science Magazine on Jan 18, the Whitehead Institute's Melissa Gymrek discussed the re-identification methods, and the potential to protect against them. She concluded: "I think we really need to learn to deal with the fact that we cannot ever make data sets truly anonymous, and that I think the key will be in regulating how we are allowed to use this genetic data to prevent it from being used maliciously.".
I agree completely. We need regulations. Elsewhere I've argued that anonymity is an inadequate way to protect privacy, and that we need a balance of regulations and Privacy Enhancing Technologies. And it's for this reason that I am not fatalistic about the fact that anonymity can be broken, because we have the procedural means to see that privacy is still preserved.
Classic Facebook stalking horse
Yesterday Instagram made its first move towards delivering the real value in its acquisition by Facebook. They revised their Privacy Policy and Terms of Use to allow greater sharing of photos with Facebook and other businesses, especially advertisers. Instagram posted a new set of Terms on Monday, the shit hit the fan, and today they back-peddled.
The mea culpa is a classic, straight out of the Zuckerberg copybook. They say they were misunderstood. They say they don't want to sell photos to ad men. They say members will always own their photos. But ownership is a red herring and the whole exercise is likely a stalking horse, designed to distract people from more significant issues around metadata and Facebook's ever deepening ability to infer PII.
Firstly, let's be clear that greater sharing follows the acquisition as night follows day. I noted at the time that the only way to understand Facebook's billion dollar spend on Instagram is around the value to be mined from the mother lode of photo data. In particular, image analysis and facial recognition grant Instagram and Facebook x-ray vision into their members' daily lives. They can work out what people are doing, with whom they're doing it, when and where. With these tools, they're moving quickly from collecting Personally Identifiable Information when it is volunteered by users, to PII that is observed and inferred. The quality and quantity of the PII flux is driven up dramatically. No longer is the lifeblood of Facebook -- the insights they have on 15% of the world's population -- filtered by what users elect to post and Like and tag, but now that information is raw, unexpurgated and automated.
Now ask where the money in photo data is to be made. It's not in selling candid snapshots of folks enjoying branded products. It's in the intelligence that image data yield about how people lead their lives. This intelligence is Facebook's one and only asset.
So it is metadata that we need to worry about. In its initial update to the Terms, Instagram said this: [You] agree that a business or other entity may pay us to display your username, likeness, photos (along with any associated metadata), and/or actions you take, in connection with paid or sponsored content or promotions, without any compensation to you.. In over 6,000 words "metadata" is mentioned just twice, parenthetically, and without any definition. Metadata is figuring more and more in the privacy discourse, and that's great, but we need to look beyond the usual stuff like geolocation and camera type embedded in the JPEGs. Much more important now is the latent identifiable personal content in images. Image analysis and image search provide endless new possibilities for infomopolies to extract value from photos.
A great deal of this week's outcry has focused on things like the lack of compensation, and all of Instagram's apology today is around the ownership of photos. But ownership is moot if they reserve their right to use and disclose metadata in any way they like. What actually matters is the individual's ability to understand and control what is done with any PII about them, including metadata. When the German privacy regulator acted against Facebook's facial recognition practices earlier this year, the principle they applied from OECD style legislation is that there are limits to what can be collected about individuals without their consent. The regulator ruled it unlawful for Facebook to extract biometric information from images when their users innocently think they're only tagging people in photos.
So when I read Instagram's excuse, I don't see any truly meaningful self-restraint in the way they can exploit image data. Their switch is not even a tactical retreat, for as yet, they're not giving anything up.
Posted in Social Networking, Privacy, Big Data
It's not too late for privacy
It's an urgent, impatient sort of line in the sand, drawn by the new masters of the universe digital, as a challenge to everyone else. C'mon, get with the program! Innovate! Don't be so precious - so very 20th century! Don't you dig that Information Wants To Be Free? Clearly, old fashioned privacy is holding us back!
The stark choice posited between privacy and digital liberation is rarely examined with much diligence; often it's actually a fatalistic response to the latest breach or the latest eye popping digital development. In fact, those who earnestly assert that privacy is dead are almost always trying to sell us something, be it a political ideology, or a social networking prospectus, or sneakers targeted at an ultra-connected, geolocated, behaviorally qualified nano market segment.
Is it really too late for privacy? Is the genie out of the bottle? Even if we accepted the ridiculous premise that privacy is at odds with progress, no it's not too late, firstly because the pessimism (or commercial opportunism) generally confuses secrecy for privacy, and secondly because frankly, we aint seen nothin yet!
Technology certainly has laid us bare. Behavioural modeling, facial recognition, Big Data mining, natural language processing and so on have given corporations x-ray vision into our digital lives. While exhibitionism has been cultivated and normalised by the infomopolists, even the most guarded social network users may be defiled by Big Data wizards who without consent upload their contact lists, pore over their photo albums, and mine their shopping histories, as is their wanton business model.
So yes, a great deal about us has leaked out into what some see as an extended public domain. And yet we can be public and retain our privacy at the same time.
Some people seem defeated by privacy's definitional difficulties, yet information privacy is simply framed, and corresponding data protection laws readily understood. Information privacy is basically a state where those who know us are restrained in what they can do with the knowledge they have about us. Privacy is about respect, and protecting individuals against exploitation. It is not about secrecy or even anonymity. There are few cases where ordinary people really want to be anonymous. We actually want businesses to know -- within limits -- who we are, where we are, what we've done, what we like, but we want them to respect what they know, to not share it with others, and to not take advantage of it in unexpected ways. Privacy means that organisations behave as though it's a privilege to know us.
Many have come to see privacy as literally a battleground. The grassroots Cryptoparty movement has come together around a belief that privacy means hiding from the establishment. Cryptoparties teach participants how to use Tor and PGP, and spread a message of resistance. They take inspiration from the Arab Spring where encryption has of course been vital for the security of protestors and organisers. The one Cryptoparty I've attended so far in Sydney opened with tributes from Anonymous, and a number of recorded talks by activists who ranged across a spectrum of social and technosocial issues like censorship, copyright, national security and Occupy. I appreciate where they're coming from, for the establishment has always overplayed its security hand. Even traditionally moderate Western countries have governments charging like china shop bulls into web filtering and ISP data retention, all in the name of a poorly characterised terrorist threat. When governments show little sympathy for netizenship, and absolutely no understanding of how the web works, it's unsurprising that sections of society take up digital arms in response.
Yet going underground with encryption is a limited privacy stratagem, for DIY crypto is incompatible with the majority of our digital dealings. In fact the most nefarious, uncontrolled and ultimately the most dangerous privacy harms come from mainstream Internet businesses and not government. Assuming one still wants to shop online, use a credit card, tweet, and hang out on Facebook, we still need privacy protections. We need limitations on how our Personally Identifiable Information (PII) is used by all the services we deal with; we need department stores to refrain from extracting sensitive health information from our shopping habits, merchants to not use our credit card numbers as customer reference numbers, and online social networks to not x-ray our photo albums by biometric face recognition. I note that some Cryptoparty bookings are managed by the US event organiser Eventbrite, which has a detailed Privacy Policy setting out how it promises to handle personal information provided by attendees. It does seems reasonable to me, but like all private sector data protection arrangements, there's a lot going on there.
So ironically, when registering for a cryptoparty, you could not use encryption! For privacy, you have to either trust Eventbrite to have a reasonable policy and to stick to it, or you might rely on government regulations, if applicable. When registering, you give a little Personal Information to the organisers, and we expect that they will be restrained in what they do with it.
Going out in public never was a license for others to invade our privacy. We ought not to respond to online privacy invasions as if cyberspace is a new Wild West. We have always relied on regulatory systems of consumer protection to curb the excesses of business and government, and we should insist on the same in the digital age. We should not have to hide away if privacy is agreed to mean respecting the PII of customers, users and citizens, and restraining what data custodians do with that precious resource.
I ask anyone who thinks it's too late to reassert our privacy to think for a minute about where we're heading. We're still in the early days of the social web, and the information "innovators" have really only just begun. Look at what they've done so far:
- Facial recognition converts vast stores of anonymous photos into PII, without consent, and without limit. Facebook's deployment of biometric technology was especially clever. For years they crowd-sourced the creation of templates and the calibration of their algorithms, without ever mentioning facial recognition in their privacy policy or help pages. Even now Facebook's Data Use Policy is entirely silent on biometric templates and what they allow themselves to do with them. Meanwhile, third party services like Facedeals are starting to use Facebook's photo resources for commercial facial recognition in public.
- Big Data. The most notorious recent example of the power of data mining comes from Target's covert research into identifying customers who are pregnant based on their buying habits. Big Data practitioners are so enamoured with their ability to extract secrets from "public" data they seem blithely unaware that by generating fresh PII from their raw materials they are in fact collecting it as far as Information Privacy Law is concerned. As such, they’re legally liable for the privacy compliance of their cleverly synthesised data, just as if they had expressly gathered it all by questionnaire.
- Natural Language Processing (NLP) is the secret sauce in Apple's Siri, allowing her to take commands -- and dictation. Every time you dictate an email or a text message to Siri, Apple gets hold of the content of telecommunications that are normally out of bounds to the phone companies. Siri is like a free PA that reports your daily activities back to the secretarial agency. There is no mention at all of Siri in Apple's Privacy Policy despite the limitless collection of intimate personal information.
It's difficult to overstate the value of facial recognition to businesses like Facebook which have just one asset: the knowledge they have about their members. Combined with image analysis and content addressable image banks, facial recognition lets Facebook work out what we're doing, when, where and with whom, pirating billions of everyday images given over by members to a business that doesn't even mention these priceless resources in its privacy policy.
As an aside, I'm not one of those who fret that technology has outstripped privacy law. Principles-based Information Prvacy law copes well with most of this technology. OECD privacy principles (enacted in over seventy countries) and the US FIPPs require that companies be transarent about what PII they collect and why, and that they limit the ways in which PII is used for unrelated purposes, and how it may be disclosed. These principles are decades old and yet they have been recently re-affirmed by German regulators recently over Facebook's surreptitious use of facial recognition. I expect that Siri will attract like scrutiny as it rolls out in continental Europe.
So what's next?
- Google Glass may, in the privacy stakes, surpass both Siri and facial recognition of static photos. If actions speak louder than words, imagine the value to Google of digitising and knowing exactly what we do in real time.
- Facial recognition as a Service and the sale of biometric templates may be tempting for the photo sharing sites. If and when biometric authentication spreads into retail payments and mobile device security, these systems will face the challenge of enrollment. It might be attractive to share face templates previously collected by Facebook and voice prints by Apple.
So, is it really too late for privacy? The infomopolists and national security zealots may hope so, but surely even cynics will see there is great deal at stake, and that it might be just a little too soon to rush to judge something as important as this.
Posted in Social Networking, Social Media, Privacy, Culture, Big Data
Types of Personal Information
It's really vital that technologists, software developers, architects and analysts appreciate that privacy law takes a broad view of "Personal Information" and how it may be collected. In essence, whenever any information pertaining to an identifiable individual comes to be in your IT system by whatever means, you may be deemed to have collected Personal Information it for the purposes of the law (for example, Australia's Privacy Act 1988 Cth). And what follows from any PI collection is a range of legal obligations relating to the 10 National Privacy Principles.
A while back I tried to illuminate the problem space from a technologist's standpoint, in a paper called Mapping privacy requirements onto the IT function (Privacy Law & Policy Reporter, 2003). At the time it seemed useful to me to break down different types of Personal Information, because I had found that most application developers only thought about questionnaires and web forms. I wrote then:
Personal data collection can be considered under five categories:
(1) overt collection via application forms, web forms, call centres, face to face interviews, questionnaires, warranty cards and so on;
(2) automatic collection, especially via audit logs and transaction histories;
(3) generated data, which includes evaluative data and inferences drawn from collected data, for the purposes of service customisation (for example buying preferences), business risk management (such as insurance risk scores from claims histories) and so on;
(4) acquired data which has been transferred from a third party, with or without payment for the data, including cases where personal information is acquired as part of a corporate takeover; and
(5) ephemeral data, which is a special category of automatic or generated data, produced as a side effect of other operations. Ephemeral data is reasonably presumed to be transient but can be inadvertently retained. For example, some systems prompt users for pre-arranged challenge-response information -- classically their mother’s maiden name -- when dealing with a forgotten password. The data provided can be left behind in computer memory or logs, or even scribbled on a sticky note by a help desk operator, and this can represent a major privacy breach if it is not protected from unauthorised parties.
This may still be a useful orientation for many engineers and technologists. They need to remember that even if it's found lying around in the public domain, or even if they've conjured it up from Big Data by clever data anaysis, if they have got their hands on Personal Information, then they have collected it.
Speaking of Big Data, I wonder if the categorisation of Personal Data could now be improved or extended?
Photo data as crude oil
It's been said that "data is the new oil". The immense stores of Personal Information gifted to Facebook, Google et al by their users are like crude oil reserves: raw material to be tapped, refined, processed and value-added.
[ Update 19 Dec 2012: Instagram has predictably revised its Privacy Policy and Terms of Use to integrate with better Facebook. They put it this way: "As part of our new collaboration, we've learned that by being able to share insights and information with each other, we can build better experiences for our users" (emphasis added). Of course it was inevitable that they would lubricate the disclosure of Instagram pictures with their owners and beyond. Facebook didn't buy them for nothing. ]
[ Update 8 Dec 2012: Some have poo-poohed the comparison with crude oil, including the New York Times' Jer Thorp. No metaphor is ever complete, and this one might distract some people, but the idea is not meant to be about fossil fuels and finite resources. Rather it alludes to the undifferentiated nature of raw data and the high tech ways in which Big Data is refined to create wondrous new products. I like the historical and political context of the oil metaphor too. Right now we are at a historical point comparable to that of the Black Gold prospectors of the 1800s; new supply chains and business models are being devised to exploit this new bounty. The parallels with the oil industry remind us that Big Data is Big Business! ]
I'm especially interested in photo data, and the rapid evolution of tools for monetising it. These tools range from embedded metadata in the uploded photos, through to increasingly sophisticated object recognition and facial recognition algorithms.
Image analysis can extract place names and product names from photos, and recognise objects. It can re-identify faces using biometric templates that users have helpfully created by tagging their friends in entirely unrelated images. Image analysis lets social media companies work out what you're doing, when and where, and who you're doing it with. If Facebook can work out from a photo that you're enjoying a coffee at a recognisable retail outlet, they don't need you to expressly "Like" it. Nor do you have to actively check in to the cafe when most phones tag their photos with geolocation data. Instead, Facebook will automatically file away another little bit of Personal Information, to be melded into the amazingly rich picture they're relentlessly building up.
The ability to extract value from photo data defines a new black-gold rush. Like petroleum engineering, Image Analysis is high tech stuff. There is extraordinary R&D going on in face recognition and object recognition, and the "infomopolies" like Apple, Google and Facebook pay big bucks for IP and startups in this space.
I think there is only one way to look at Facebook's acquisition of Instagram. With 250 million new pictures being added everyday, Instagram is like an undeveloped crude oil field. As such, a billion dollars seems like a bargain.
So Facebook's core business isn't all of a sudden photo sharing. It always was and always will be PI refining.


Posted in Social Media, Privacy, Big Data
A penny for your marketable thoughts?
Most people think that Apple's Siri is the coolest thing they've ever seen on a smart phone. It certainly is a milestone in practical human-machine interfaces, and will be widely copied. The combination of deep search plus natural language processing (NLP) plus voice recognition is dynamite.
And Siri also marks a new milestone in privacy invasion. I predict Siri will become the poster girl for PII piracy, the exemplar of the sly bargain for Personal Information at the heart of most social media.
If you haven't had the pleasure ... Siri is a wondrous new function built into the latest iPhone. It’s the state-of-the-art in artificial intelligence and NLP. You speak directly to Siri, ask her questions (yes, she's female) and tell her what to do with many of your other apps. Siri integrates with mail, text messaging, maps, search, weather, calendar and so on. Ask her "Will I need an umbrella in the morning?" and she'll look up the weather for you – after checking your calendar to see what city you’ll be in tomorrow. It's amazing.
Natural Language Processing is a fabulous idea of course. It radically improves the usability of smart phones, and even their safety with much improved hands-free operation.
An important technical detail is that NLP is very demanding on computing power. In fact it's beyond the capability of today's smart phones, even if each of them alone is more powerful than all of NASA's computers in 1969!. So all Siri's hard work is actually done on Apple's mainframe computers scattered around the planet. That is, all your interactions with Siri are sent into the cloud.
Imagine Siri was a human personal assistant. Imagine she's looking after your diary, placing calls for you, booking meetings, planning your travel, taking dictation, sending emails and text messages for you, reminding you of your appointments, even your significant other’s birthday. She's getting to know you all the while, learning your habits, your preferences, your personal and work-a-day networks.
And she's free!
Now, wouldn't the offer of a free human PA strike you as too good to be true?
Indeed it would. So realise this about Siri: she's continuously reporting back to Apple about your every move. If Apple were a PA placement agency, what they get in return for the free secretarial services is a full transcript of all you've said, everyone you've been in touch with, everything you've done. Apple won't say what they plan to do with all this data, how long they'll keep it, nor who they'll share it with. Apple's Privacy Policy (dated October 2011, accessed 12 March 2012) doesn't even mention Siri nor the collection of the voice-to-text data.
When you dictate your mails and text messages to Siri, you’re providing Apple with content that's usually off limits to carriers, phone companies and ISPs. Siri is an end run around telecommunicationss intercept laws.
Of course there are many, many examples of where free social media apps mask a commercial bargain. Face recognition is the classic case. It was first made available on photo sharing sites as a neat way to organise one’s albums, but then Facebook went further by inviting photo tags from users and then automatically identifying people in other photos on others' pages. What's happening behind the scenes is that Facebook is running its face recognition templates over the billions of photos in their databases (which were originally uploaded for personal use long before face recognition was deployed). Given their business model and their track record, we can be certain that Facebook is using face recognition to identify everyone they possibly can, and thence work out fresh associations between countless people and situations accidentally caught on camera. Combine this with image processing and visual search technology (like Google's "Goggles") and the big social media companies have an incredible new eye in the sky. They can work out what we're doing, when, where and with whom. Nobody will need to like expressly "like" anything anymore when Facebook can see what cars we're driving, what brands we're wearing, where we spend our vacations, what we're eating, what makes us laugh. Apple, Facebook and others have understandably invested hundreds of millions of dollars in image recognition start-ups and intellectual property; with these tools they convert the hitherto anonymous image collections in Picassa, Flickr and the like into content-addressable PII gold mines. It's the next frontier of Big Data.
Now, there wouldn't be much wrong with these sorts of arrangements if the social media corporations were up-front about them. In their Privacy Policies they should detail what Personal Information they are extracting and collecting from all the voice and image data; they should explain why they collect this information, what they plan to do with it, how long they will retain it, and how they promise to limit secondary usage. They should explain that biometrics technology allows them to generate brand new PII out of members' snapshots and utterances. And they should acknowledge that by rendering data identifiable, they become accountable in many places under privacy and data protection laws for its safekeeping as PII. It's just not good enough to vaguely reserve their rights to "use personal information to help us develop, deliver, and improve our products, services, content, and advertising". They should treat their customers -- and all those innocents about whom they collect PII indirectly -- with proper respect, and stop pretending that 'service improvement' is what they're up to.
Siri along with face recognition herald a radical new type of privatised surveillance, and on a breathtaking scale. While Facebook stealthily "x-ray" photo albums without consent, Apple now has even more intimate access to our daily routines and personal habits. And they don’t even pay as much as a penny for our thoughts.
As cool as Siri may be, I myself will decline to use any natural language processing while the software runs in the cloud, and while the service providers refuse to restrain their use of my voice data. I'll wait for NLP to be done on my device with my data kept private.
And I'd happily pay cold hard cash for that kind of app, instead of having an infomopoly embed itself in my personal affairs.
Posted in Social Media, Privacy, Language, Biometrics, Big Data, Social Networking
What stops Target telling you're pregnant?
Question: What stops Target from telling that you're pregnant?
Answer: In many parts the world, the law!
The recent New York Times feature How Companies Learn Your Secrets caused a stir. Investigative reporter Charles Duhigg details conversations he had with data analysts and statisticians about what marketing gold they can divine from shoppers' buying habits ... and how one department store then seemed to shut down the dialogue.
The case in point was the ability to statistically predict pregnancy. Duhigg and his contacts looked into the enormous business potential for retailers if they could work out from what they're buying that individual customers were in the early stages of pregnancy. One analyst said "We knew that if we could identify them in their second trimester, there's a good chance we could capture them for years". Insiders admitted to developing and testing a "pregnancy prediction" score but it remains unclear to what extent such tools are used in practice with real data.
This is pretty heady stuff, on the leading edge of Big Data analytics.
What kind of problem is this?
Charles Duhigg's NYT feature ends on a note of resignation, and I get the impression from scanning blog posts on this matter that many people -- especially in the largely unregulated United States -- are feeling powerless to do anything about this. Yet they should take heart from existing privacy law, at least in places like Australia with OECD-based data protection legislation, it's pretty clear for anyone who actually reads the rules, that for a department store to work out and record that someone is pregnant is likely be unlawful.
A look at how Australia regulates privacy
At state and federal level, Australia has several privacy acts and health records acts. For our purposes here, they're all much the same. And I repeat that the following analysis is likely to have parallels in many other countries. I will use the Victorian Health Records Act 2001 (the "Act") as a model; underlining in the quoted passages is added by me for emphasis.
Personal Information is defined in the Act as:
information or an opinion (including information or an opinion
forming part of a database), whether true or not, and whether
recorded in a material form or not, about an individual whose
identity is apparent, or can reasonably be ascertained
from the information or opinion
At this point, note that the definition is broad and unqualified by such abstract matters as data ownership. In the Australian legal system, privacy rights attach to any information whatsoever pertaining to an identifiable individual, whether that information is explicitly collected from the person, or generated automatically by Big Data processing.
Health Information is defined as, amongst other things:
(i) the physical, mental or psychological health
(at any time) of an individual; or
(ii) a disability (at any time) of an individual; or
(iii) an individual's expressed wishes about the
future provision of health services to him or her
The cornerstones of privacy in OECD-style data protection systems are Collection Limitation and Use Limitation. Here are the opening clauses of Victoria's Health Privacy Principle HPP 1 - Collection:
1.1 When health information may be collected
An organisation must not collect health information about an
individual unless the information is necessary for one or more
of its functions or activities and at least one of the following
applies -
(a) the individual has consented;
(b) the collection is required, authorised or permitted,
whether expressly or impliedly, by or under law;
(c) the information is necessary to provide a health service ...
Note that consent is required in advance of collecting health information, whereas in the case of regular Personal Information, organisations have more latitude to give notice of collection reasonably after the fact.
And here are the opening clauses of Health Privacy Principle HPP 2 - Use & Disclosure:
2.1 An organisation may use or disclose health information about
an individual for the primary purpose for which the information was
collected in accordance with HPP 1.1.
2.2 An organisation must not use or disclose health information about
an individual for a purpose (the secondary purpose) other than the
primary purpose for which the information was collected unless
at least one of the following paragraphs applies -
(a) both of the following apply -
(i) the secondary purpose is directly related to the primary purpose; and
(ii) the individual would reasonably expect the organisation to use or
disclose the information for the secondary purpose; or
(b) the individual has consented to the use or disclosure ...
HPP 1 goes on to sanction how individuals should be kept informed about the collection of health information about them:
How health information is to be collected
1.4 At or before the time (or, if that is not practicable,
as soon as practicable thereafter) an organisation collects
health information about an individual from the individual,
the organisation must take steps that are reasonable in the
circumstances to ensure that the individual is generally aware of -
(a) the identity of the organisation and how to contact it; and
(b) the fact that he or she is able to gain access to the
information; and
(c) the purposes for which the information is collected; and
(d) to whom (or the types of individuals or organisations to which)
the organisation usually discloses information of that kind; and
(e) any law that requires the particular information to be
collected; and
(f) the main consequences (if any) for the individual if all or
part of the information is not provided.
1.5 If an organisation collects health information about an
individual from someone else, it must take any steps that are
reasonable in the circumstances to ensure that the individual
is or has been made aware of the matters listed in HPP 1.4 except
to the extent that making the individual aware of the matters
would pose a serious threat to the life or health of any
individual or would involve the disclosure of information
given in confidence.
Conclusion: Don't give up on privacy!
On my reading of the Act, we can be sure of the following:
- If a department store mines its data on shopping habits, determines that a named woman is likely to be pregnant, and records that prediction in a database, then the store will have collected health information about her and is subject to health privacy legislation in several states (as well as the Sensitive Personal Information clauses of Australia's federal privacy law).
- If the department store has not obtained the customer's consent to having the state of her pregnancy being determined, then the store will have breached HPP 1.1.
- If the store uses information originally collected from customers to monitor their shopping habits to generate new information predicting their pregnancies, then it will have breached HPP 2.2.
- If the store has not informed the woman that they have predicted she is pregnant, then it will have breached HPP 1.5.
Many commentators fear that the march of technology outpaces the law, but I for one am more optimistic. For the most part, it seems our current information privacy law actually copes well with the sorts of business activities we find intuitively problematic. I am not a lawyer but it looks clearly unlawful to me if a department store in Australia were to purposefully works out its customers are pregnant. Technically, just recording that prediction even without acting upon it probably counts as a Collection of health information and as such it needs the consent of the customer.
The same legal principles apply -- with even more force -- in Europe. It remains to be seen whether information privacy can be better regulated in the US through the FIPPs or other mechanisms.
Public yet still private
Privacy is a notoriously slippery topic. Even the word "privacy" has eluded universally accepted definition. Yet information privacy (aka data protection) law is really pretty straightforward, even if the implications of these laws are counter-intuitive for some. A degree of ignorance of privacy law has led to some infamous missteps. Here I'm going to review data privacy law, and look at how some of the big Internet brands continue to misunderstand privacy technicalities, at their peril.
There can be endless arguments about the meaning of privacy. Not only is it intensely personal, it also ranges across philosophy, human rights, civil liberties and politics.
Sometimes people try to analyse privacy rights through the legal frameworks of copyright or even data ownership, but these are not fruitful approaches. Copyright of course is a thorny issue; intellectual property rights are controversial in cyberspace and they seem to only complicate privacy. As for "ownership", well philosophers are still working out what they can even mean for data.
Australia's Privacy Act, like most such information privacy and data protection law worldwide, neatly side steps the moral and philosophical minefields.
Paradoxically, the words "private" and "public" don't even figure in the Privacy Act. Instead the focus is on Personal Information -- namely any information or opinion about an individual where their identity is "apparent or can reasonably be ascertained" -- and how it is handled. Note that the definition captures a lot more than personal details expressly provided by forms and questionnaires; it includes any data at all associated with an individual.
Consultants often advise that privacy and security are different things. And so they are, but more even importantly, privacy is only partially related to confidentiality and secrecy. Privacy is really all about control. Paradoxically perhaps, anonymity is not necessary for privacy; neither does having details about oneself in the public domain mean that data escapes all privacy regulations. For information privacy, simply stated, is a state where organisations respect the knowledge they have about you, and are restrained in what they do with it.
All information privacy or data protection law (in jurisdictions that have it) centres on the following principles, amongst others:
― The Collection Principle means a business generally cannot gather (or acquire or even generate) Personal Information if it is not required for a defined business function, and without the individual's consent.
― The Use & Disclosure Principles mean that information gathered (or created) for one purpose cannot be used for unrelated secondary purposes without consent, nor can it be disclosed to unrelated parties.
― The Openness Principle requires organisations collecting Personal Information to clearly set out exactly what they collect and why, usually in the form of a documented Privacy Policy.
― The Access & Correction Principles mean that an individual usually has the right to be given access to all Personal Information held by a business about them, and to have any errors fixed.
Some of the implications may be surprising, especially for technologists.
Privacy law is blind to how information is collected. It doesn't matter how Personal Information comes to be in your business; even if Personal Information is generated internally from audit logs or evaluative processes, once you have it, you are deemed to have made a collection according to privacy law. Moreover, even if Personal Information is collected from the public domain, it is still subject to privacy law.
[Update Feb 2013: A couple of more recent cases have highlighted also the difference between anonymity/secrecy and privacy. In many places and especially Europe, privacy is much more about granting people control over how their Personal Information is used, than it is about keeping all information secret. Therefore when anonymity is occasionally lost, individuals still have rights and legal recourse should their information be abused. The best example is that European regulators found Facebook's facial recognition processes to breach the Collection Limitation principle and had Facebook shut it down. The lesson is: big data processes or biometrics may give technologists fabulous powers to re-identify anonymous or 'public' data but those powers cannot be used willy-nilly. Another potential test case is that of the 'DNA hacking' reported in early 2013 where bioinformaticians cleverly used genealogical data from public websites to re-identify anonymous DNA donors. And then we have Google Glass which will inevitably generate boundless identification of people and objects captured on video in your daily walk through life. "Boundless" that is if Google disregards the Collection principle. See also my recent post "The beginning of privacy". ]
An important recent case is Google's collection of wifi data from open home networks by StreetView cars. Some argue it's careless for people to not encrypt their wireless setups, but the fact is that data gathered by sniffing networks is subject to the Privacy Act if it relates to individuals that can be identified (and with Google's vast linked databases, working out identities is assumed to be within their powers). A person has not agreed to the exploitation of their information merely because they might be lax with their security.
Some say privacy law hasn't kept up with technology. For the most part, established principles-based information privacy law does work well in cyberspace, for it is fundamentally all about the rights of individuals to have some control over who knows what about them. Information privacy principles are a powerful and straightforward way to analyse personal rights even in dynamic and complicated settings like online social networking. So conventional information privacy law is being used in Germany and elsewhere to curtail the more excessive practices of Google (collection of personally identifiable wifi transmissions) and of Facebook (generation of biometric templates from photo tagging and re-use of those templates to identify people in images data).
Yet networking technology does challenge privacy principles. We all know why Facebook, Twitter, Google and LinkedIn offer such fantastic services for free: it's because they're generating vast commercial value from the network information and Big Data they're amassing. Information privacy law requires that individuals be informed as to why Personal Information is collected about them and how it's going to be used. But if sophisticated data analytics and ever increasing networks of information lead to discoveries that aren't apparent until critical mass is reached, then it's actually impossible to inform members up front about the precise collection purpose. Instead, businesses should share more of the spoils of social networking with their customers, who typically gladly opt in if properly rewarded for participating in what is still a great big experiment.
This fundamental clash with the Collection Principle is the only case I know of where technology really has outstripped privacy law.
Posted in Social Networking, Privacy, Culture, Big Data