The immovable opacity of AI meets the irresistible force of privacy

Analytics and big-data processes have been brought to heel by privacy laws. Artificial intelligence may be next.

The power of technology-neutral privacy principles

Large language models (LLMs) and generative AI are developing at ever-increasing speed, alarming many commentators because it’s now so hard to tell fact from fiction. Naturally there are calls for new regulations. But we should also be looking at how AI comes under the principles-based privacy regulations we already have.

I’ve written before about the “superpower” of orthodox data privacy laws. These laws are based on the idea of personal data, which is broadly defined to mean essentially any information which is associated with, or may be associated with, an identifiable natural person.

Data privacy laws such as Europe’s General Data Protection Regulation (GDPR), plus 162 national statutes, act to restrain the collection, use, and disclosure of personal data. Generally speaking, these laws are technology neutral: they are blind to the manner in which personal data is collected.

This means that when the outputs of algorithms are personally identifiable, they fall within the scope of privacy laws in most places around the world. If personal data appears in a database by any means whatsoever, then it may be deemed to have been collected.

Data privacy laws therefore apply to personal data that’s generated by algorithms, even when it’s untouched by human hands.

[Update 18 August 2023: We have removed a reference to Australia’s robodebt scandal. As a legal expert reminded us, robodebt involved a much simpler algorithm which didn’t involve AI, and in any event the lack of judgement was down to humans.]

Privacy laws shouldn’t come as a surprise

Time and time again the privacy implications of automated personal information flows seem to take technologists by surprise:

In 2011, German privacy regulators found that Facebook’s photo tag suggestions feature violated the law and called on the company to cease facial recognition and delete its biometric data. Facebook took the prudent approach of shutting down its facial recognition usage worldwide, and subsequently took many years to get it going again. See also my previous analysis with Anna Johnston which argues that tag suggestions are a form of personal data collection.
The counter-intuitive right to be forgotten (RTBF) first emerged as such in the 2014 European Court of Justice case Google Spain v AEPD and Mario Costeja González. The case was not about “forgetting” anything in general, but related specifically to de-indexing web search results. The narrow scope serves to highlight that personal data generated by algorithms, for that’s what search results are, is covered by privacy law. In my view, search results are not simple replicas of objective facts found in the public domain. They are the computed outcomes of complex big-data processes.

Earlier Lockstep articles have discussed how big-data processes breach data privacy principles when they mine changes in shopping habits to predict that a customer may be pregnant, or correlate public family tree data with anonymised genomic research data to re-identify study participants.

While technologists may presume (or hope) that synthetic personal data escapes privacy laws, the general public would expect there to be limits on how information about them is generated behind their backs by computers.

Is AI the next privacy target?

The legal reality is straightforward. If an information system comes to hold personal data, by any means, then the organisation in charge of that system has collected personal data and is subject to data privacy laws.

As we have seen, analytics and big-data processes have been brought to heel by privacy laws. Artificial intelligence may be next.

Taking responsibility for simulated humans

LLMs are enabling radically realistic simulations of humans and interpersonal situations, with exciting applications in social science, behaviour change modelling, human resources, healthcare, and more. But as with many modern neural networks, the behaviour of the systems themselves can be unpredictable.

A recent study revealed “simulacra” (software agents) built on ChatGPT spontaneously exchanging personal information with each other, without being scripted by the software’s authors. That is, the agents were gossiping!

A strange feature of the current AI wave is that enormously powerful technologies are being released into the public arena on the basis that they will lead to massive net gains — such as discovering cures for cancer or correcting climate change — before being thoroughly tested in the lab. Moreover, it seems widely acknowledged that nobody knows exactly how they work.

As Bill Gates says, AI is the most powerful technology seen in decades. How, then, is it acceptable that its leaders and entrepreneurs can’t tell us what’s going on under the hood?

While arguments rage over the ethics of current AI, well-established privacy law shows that AI’s leaders might have to take more interest in their creations’ inner workings.

When an LLM acquires knowledge about identifiable people, whether by deep learning or gossip, then that’s personal data, and the people running the model are accountable for it under data privacy rules. It doesn’t matter if the knowledge is distributed through an impenetrable neural network of parameters and weights buried in hidden layers.

Privacy law requires that any personal data created and held by an LLM must serve a clear purpose, and that purpose must be accessible to the individuals to whom the personal data relates. The collection must be proportionate to that purpose. Personal data created in an LLM must not be used or disclosed for unrelated purposes, and in some jurisdictions the individuals concerned have the right to have the data erased.

I am not a lawyer, but I don’t believe that the owner of a deep-learning system that holds personal data can excuse themselves from technology-neutral privacy law just because they don’t know exactly how the data got there. Nor can they get around the right to erasure by appealing to the weird and wonderful ways that knowledge is encoded in neural networks.

Synthetic faces might as well be real faces

Lifelike images and videos of people can now be produced by generative AI programs, leading to a wide range of harms including fraud, revenge porn, and blackmail. A regulatory response to this problem may already be embedded in existing privacy laws.

Since digital images are made of data, a synthetic image of an actual person plainly constitutes personal data.

For reference, the GDPR defines personal data as any information “relating to an identified or identifiable natural person”, and under the recently enacted California Privacy Rights Act (CPRA) personal information is that which “identifies, relates to, describes, is reasonably capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household” (emphasis added).

Therefore, data privacy laws restrain the very creation of synthetic facial images, as well as the use and disclosure of them. And some laws, such as Australia’s Privacy Act 1988, are even a little broader, with a definition of personal information explicitly including information that might not be true.

See also: “How are we to defend anonymity”, which provides some lateral thinking about how collection limitation rules could be used as a legal remedy against unauthorised re-identification.
