I've just completed a major new Constellation Research report looking at how today's privacy practices cope with Big Data. The report draws together my longstanding research on the counter-intuitive strengths of technology-neutral data protection laws, and melds it with my new Constellation colleagues' vast body of work in data analytics. The synergy is honestly exciting and illuminating.
Big Data promises tremendous benefits for a great many stakeholders but the potential gains are jeopardised by the excesses of a few. Some cavalier online businesses are propelled by a naive assumption that data in the "public domain" is up for grabs, and with that they often cross a line.
For example, there are apps and services now that will try to identify pictures you take of strangers in public, by matching them biometrically against data supersets compiled from social networking sites and other publically accessible databases. Many find such offerings quite creepy but they may be at a loss as to what to do about it, or even how to think through the issues objectively. Yet the very metaphor of data mining holds some of the clues. If, as some say, raw data is like crude oil, just waiting to be mined and exploited by enterprising prospecters, then surely there are limits, akin to mining permits?
Many think the law has not kept pace with technology, and that digital innovators are free to do what they like with any data they can get their hands on. But technologists repreatedly underestimate the strength of conventional data protection laws and regulations. The extraction of PII from raw data may be interpreted under technology neutral privacy principles as an act of Collection and as such is subject to existing statutes. Around the world, Google thus found they are not actually allowed to gather Personal Data that happens to be available in unencrypted Wi-Fi transmission as StreetView cars drive by homes and offices. And Facebook found they are not actually allowed to automatically identify people in photos through face recognition without consent. And Target probably would find, if they tried it outside the USA, that they cannot flag selected female customers as possibly pregnant by analysing their buying habits.
On the other hand, orthodox privacy policies and static user agreements do not cater for the way personal data can be conjured tomorrow from raw data collected today. Traditional privacy regimes require businesses to spell out what personally identifiable information (PII) they collect and why, and to restrict secondary usage. Yet with Big Data, with the best will in the world, a company might not know what data analytics will yield down the track. If mutual benefits for business and customer alike might be uncovered, a freeze-frame privacy arrangement may be counter-productive.
Thus the fit between data analytics and data privacy standards is complex and sometimes surprising. While existing laws are not to be underestimated, we do need something new. As far as I know it was Ray Wang in his Harvard Business Review blog who first called for a fresh privacy compact amongst users and businesses.
The spirit of data privacy is simply framed: organisations that know us should respect the knowledge they have, they should be open about what they know, and they should be restrained in what they do with it. In the Age of Big Data, let's have businesses respect the intelligence they extract from data mining, just as they should respect the knowledge they collect directly through forms and questionnaires.
I like the label "Big Privacy"; it is grandly optimistic, like "Big Data" itself, and at the same time implies a challenge to do better than regular privacy practices.
Ontario Privacy Commissioner Dr Ann Cavoukian writes about Big Privacy, describing it simply as "Privacy By Design writ large". But I think there must be more to it than that. Big Data is quantitatively but also qualitatively different from ordinary data analyis.
To summarise the basic elements of a Big Data compact:
- Respect and Restraint: In the face of Big Data’s temptations, remember that privacy is not only about what we do with PII; just as important is what we choose not to do.
- Super transparency: Who knows what lies ahead in Big Data? If data privacy means being open about what PII is collected and why, then advanced privacy means going further, telling people more about the business models and the sorts of results data mining is expected to return.
- Engage customers in a fair deal for PII: Information businesses ought to set out what PII is really worth to them (especially when it is extracted in non-obvious ways from raw data) and offer a fair "price" for it, whether in the form of "free" products and services, or explicit payment.
- Really innovate in privacy: There’s a common refrain that “privacy hampers innovation” but often that's an intellectually lazy cover for reserving the right to strip-mine PII. Real innovation lies in business practices which create and leverage PII while honoring privacy principles.
My report, "Big Privacy" Rises to the Challenges of Big Data may be downloaded from the Constellation Research website.
For the second time in as many months, a grave bug has emerged in core Internet security software. In February it was the "Goto Fail" bug in the Apple operating system iOS that left web site security inoperable; now we have "Heartbleed", a flaw that leaves many secure web servers in fact open to attackers sniffing memory contents looking for passwords and keys.
Who should care?
There is no shortage of advice on what to do if you're a user. And it's clear how to remediate the Heartbleed bug if you're a web administrator (a fix has been released). But what is the software fraternity going to do to reduce the incidence of these disastrous human errors? In my view, Goto Fail and Heartbleed are emblematic of chaotic software craftsmanship. It appears that goto statements are used with gay abandon throughout web software today, creating exactly the unmaintainable spaghetti code that the founders of Structured Programming warned us about in the 1970s. Testing is evidently lax; code inspection seems non-existent. The Heartbleed flaw is in a piece of widely used Open Source software, and was over-looked first by the programmer, and then by the designated inspector, and then it went unnoticed for two years in the wild.
What are the ramifications of Heartbleed?
"Heartbleed" is a flaw in an obscure low level feature of the "Transport Layer Security" (TLS) protocol. TLS has an optional feature dubbed "Heartbeat" which a computer connected in a secure session can use to periodically test if the other computer is still alive. Heartbeat involves sending a request message with some dummy payload, and getting back a response with duplicate payload. The bug in Heartbeat means the responding computer can be tricked into sending back a dump of 64 kiloytes of memory, because the payload length variable goes unchecked. (For the technically minded, this error is qualitatively similar to a buffer overload; see also the OpenSSL Project description of the bug). Being server memory used in security management, that random grab has a good chance of including sensitive TLS-related data like passwords, credit card numbers and even TLS session keys. The bug is confined to the OpenSSL security library, where it was introduced inadvertently as part of some TLS improvements in late 2011.
The flawed code is present in almost all Open Source web servers, or around 66% of all web servers worldwide. However not all servers on the Internet run SSL/TLS secure sessions. Security experts Netcraft run automatic surveys and have worked out that around 17% of all Internet sites would be affected by Heartbleed – or around half a million widely used addresses. These include many banks, financial services, government services, social media companies and employer extranets. An added complication is that the Heartbeat feature leaves no audit trail, and so a Heartbleed exploit is undetectable.
If you visit an affected site and start a secure ("padlocked") session, then an attacker that knows about Heartbleed can grab random pieces of memory from your session. Researchers have demonstrated that session keys can be retrieved, although it is said to be difficult. Nevertheless, Heartbleed has been described by some of the most respected and prudent commentators as catastrophic. Bruce Schneier for one rates its seriousness as "11 out of 10".
Should we panic?
No. The first rule in any emergency is "Don't Panic". But nevertheless, this is an emergency.
The risk of any individual having been harmed through Heartbleed is probably low, but the consequences are potentially grave (if for example your bank is affected). And soon enough, it will be simple and cheap to take action, so you will hear experts say 'it is prudent to assume you have been compromised' and to change your passwords.
However, you need to wait rather than rush into premature action. Until the websites you use have been fixed, changing passwords now may leave you more vulnerable, because it's highly likely that criminals are trying to exploit Heartbleed while they can. It's best to avoid using any secure websites for the time being. We should redouble the usual Internet precautions: check your credit card and bank statements (but not online for the time being!). Stay extra alert to suspicious looking emails not just from strangers but from your friends and colleagues too, for their cloud mail accounts might have been hacked. And seek out the latest news from your e-commerce sites, banks, government and so on. The Australian banks for instance were relatively quick to act; by April 10 the five biggest institutions confirmed they were safe.
Heartbleed for me is the last straw. I call it pathetic that mission critical code can harbour flaws like this. So for a start, in the interests of clarity, I will no longer use the term "Software Engineering". I've written a lot in the past about the practice and the nascent profession of programming but it seems we're just going backwards. I'm aware that calling programming a "craft" will annoy some people; honestly, I mean no offence to basket weavers.
I'm no merchant of doom. I'm not going to stop banking and shopping online (however I do refuse Internet facing Electronic Health Records, and I would not use a self-drive car). My focus is on software development processes and system security.
The modern world is increasingly dependent on software, so it passes understanding we still tolerate such ad hoc development processes.
The programmer responsible for the Heartbleed bug has explained that he made a number of changes to the code and that he "missed validating a variable" (referring to the unchecked length of the Heartbeat payload). The designated reviewer of the OpenSSL changes also missed that the length was not validated. The software was released into the wild in March 2012. It went unnoticed (well, unreported) until a few weeks ago and was rectified in an OpenSSL release on April 7.
I'd like to avoid apportioning individual blame, so I am not interested in the names of the programmer and the reviewer. But we have to ask: when so many application security issues boil down to overflow problems, why is it not second nature to watch out for bugs like Heartbleed? How did experienced programmers make such an error? Why was this flaw out in the wild for two years before it was spotted? I thought one of the core precepts of Open Source Software was that having many eyes looking over the code means that errors will be picked up. But code inspection seems not to be widely practiced anymore. There's not much point having open software if people aren't actually looking!
As an aside, criminal hackers know all about overflow errors and might be more motivated to find them than innocent developers. I fear that the Heartbleed overflow bug could have been noticed very quickly by hackers who pore over new releases looking for exactly this type of exploit, or equally by the NSA which is reported to have known about it from the beginning.
Where does this leave systems integrators and enterprise developers? Have they become accustomed to taking Open Source Software modules and building them in, without a whole lot of regression testing? There's a lot to be said for Free and Open Source Software (FOSS) but no enterprise can take "free" too literally; the total cost of development has to include reasonable specification, verification and testing of the integrated whole.
As discussed in the wake of Goto Fail, we need to urgently and radically lift coding standards.