For the second time in as many months, a grave bug has emerged in core Internet security software. In February it was the "Goto Fail" bug in the Apple operating system iOS that left web site security inoperable; now we have "Heartbleed", a flaw that leaves many secure web servers in fact open to attackers sniffing memory contents looking for passwords and keys.
Who should care?
There is no shortage of advice on what to do if you're a user. And it's clear how to remediate the Heartbleed bug if you're a web administrator (a fix has been released). But what is the software fraternity going to do to reduce the incidence of these disastrous human errors? In my view, Goto Fail and Heartbleed are emblematic of chaotic software craftsmanship. It appears that goto statements are used with gay abandon throughout web software today, creating exactly the unmaintainable spaghetti code that the founders of Structured Programming warned us about in the 1970s. Testing is evidently lax; code inspection seems non-existent. The Heartbleed flaw is in a piece of widely used Open Source software, and was over-looked first by the programmer, and then by the designated inspector, and then it went unnoticed for two years in the wild.
What are the ramifications of Heartbleed?
"Heartbleed" is a flaw in an obscure low level feature of the "Transport Layer Security" (TLS) protocol. TLS has an optional feature dubbed "Heartbeat" which a computer connected in a secure session can use to periodically test if the other computer is still alive. Heartbeat involves sending a request message with some dummy payload, and getting back a response with duplicate payload. The bug in Heartbeat means the responding computer can be tricked into sending back a dump of 64 kiloytes of memory, because the payload length variable goes unchecked. (For the technically minded, this error is qualitatively similar to a buffer overload; see also the OpenSSL Project description of the bug). Being server memory used in security management, that random grab has a good chance of including sensitive TLS-related data like passwords, credit card numbers and even TLS session keys. The bug is confined to the OpenSSL security library, where it was introduced inadvertently as part of some TLS improvements in late 2011.
The flawed code is present in almost all Open Source web servers, or around 66% of all web servers worldwide. However not all servers on the Internet run SSL/TLS secure sessions. Security experts Netcraft run automatic surveys and have worked out that around 17% of all Internet sites would be affected by Heartbleed – or around half a million widely used addresses. These include many banks, financial services, government services, social media companies and employer extranets. An added complication is that the Heartbeat feature leaves no audit trail, and so a Heartbleed exploit is undetectable.
If you visit an affected site and start a secure ("padlocked") session, then an attacker that knows about Heartbleed can grab random pieces of memory from your session. Researchers have demonstrated that session keys can be retrieved, although it is said to be difficult. Nevertheless, Heartbleed has been described by some of the most respected and prudent commentators as catastrophic. Bruce Schneier for one rates its seriousness as "11 out of 10".
Should we panic?
No. The first rule in any emergency is "Don't Panic". But nevertheless, this is an emergency.
The risk of any individual having been harmed through Heartbleed is probably low, but the consequences are potentially grave (if for example your bank is affected). And soon enough, it will be simple and cheap to take action, so you will hear experts say 'it is prudent to assume you have been compromised' and to change your passwords.
However, you need to wait rather than rush into premature action. Until the websites you use have been fixed, changing passwords now may leave you more vulnerable, because it's highly likely that criminals are trying to exploit Heartbleed while they can. It's best to avoid using any secure websites for the time being. We should redouble the usual Internet precautions: check your credit card and bank statements (but not online for the time being!). Stay extra alert to suspicious looking emails not just from strangers but from your friends and colleagues too, for their cloud mail accounts might have been hacked. And seek out the latest news from your e-commerce sites, banks, government and so on. The Australian banks for instance were relatively quick to act; by April 10 the five biggest institutions confirmed they were safe.
Lessons for the Software Craft
Heartbleed for me is the last straw. I call it pathetic that mission critical code can harbour flaws like this. So for a start, in the interests of clarity, I will no longer use the term "Software Engineering". I've written a lot in the past about the practice and the nascent profession of programming but it seems we're just going backwards. I'm aware that calling programming a "craft" will annoy some people; honestly, I mean no offence to basket weavers.
I'm no merchant of doom. I'm not going to stop banking and shopping online (however I do refuse Internet facing Electronic Health Records, and I would not use a self-drive car). My focus is on software development processes and system security.
The modern world is increasingly dependent on software, so it passes understanding we still tolerate such ad hoc development processes.
The programmer responsible for the Heartbleed bug has explained that he made a number of changes to the code and that he "missed validating a variable" (referring to the unchecked length of the Heartbeat payload). The designated reviewer of the OpenSSL changes also missed that the length was not validated. The software was released into the wild in March 2012. It went unnoticed (well, unreported) until a few weeks ago and was rectified in an OpenSSL release on April 7.
I'd like to avoid apportioning individual blame, so I am not interested in the names of the programmer and the reviewer. But we have to ask: when so many application security issues boil down to overflow problems, why is it not second nature to watch out for bugs like Heartbleed? How did experienced programmers make such an error? Why was this flaw out in the wild for two years before it was spotted? I thought one of the core precepts of Open Source Software was that having many eyes looking over the code means that errors will be picked up. But code inspection seems not to be widely practiced anymore. There's not much point having open software if people aren't actually looking!
As an aside, criminal hackers know all about overflow errors and might be more motivated to find them than innocent developers. I fear that the Heartbleed overflow bug could have been noticed very quickly by hackers who pore over new releases looking for exactly this type of exploit, or equally by the NSA which is reported to have known about it from the beginning.
Where does this leave systems integrators and enterprise developers? Have they become accustomed to taking Open Source Software modules and building them in, without a whole lot of regression testing? There's a lot to be said for Free and Open Source Software (FOSS) but no enterprise can take "free" too literally; the total cost of development has to include reasonable specification, verification and testing of the integrated whole.
As discussed in the wake of Goto Fail, we need to urgently and radically lift coding standards.
'The widely publicised and very serious "gotofail" bug in iOS7 took me back ...
Early in my career I spent seven years in a very special software development environment. I didn't know it at the time, but this experience set the scene for much of my understanding of information security two decades later. I was in a team with a rigorous software development lifecycle; we attained ISO 9001 certification way back in 1998. My company deployed 30 software engineers in product development, 10 of whom were dedicated to testing. Other programmers elsewhere independently wrote manufacture test systems. We spent a lot of time researching leading edge development methodologies, such as Cleanroom, and formal specification languages like Z.
We wrote our own real time multi-tasking operating system; we even wrote our own C compiler and device drivers! Literally every single bit of the executable code was under our control. "Anal" doesn't even begin to describe our corporate culture.
Why all the fuss? Because at Telectronics Pacing Systems, over 1986-1990, we wrote the code for the world's first software controlled implantable defibrillator, the Guardian 4210.
The team spent relatively little time actually coding; we were mostly occupied writing and reviewing documents. And then there were the code inspections. We walked through pseudo-code during spec reviews, and source code during unit validation. And before finally shipping the product, we inspected the entire 40,000 lines of source code. That exercise took five people two months.
For critical modules, like the kernel and error correction routines, we walked through the compiled assembly code. We took the time to simulate the step-by-step operation of the machine code using pen and paper, each team member role-playing parts of the microprocessor (Phil would pretend to be the accumulator, Lou the program counter, me the index register). By the end of it all, we had several people who knew the defib's software like the back of their hand.
And we had demonstrably the most reliable real time software ever written. After amassing several thousand implant-years, we measured a bug rate of less than one in 10,000 lines.
The implant software team had a deserved reputation as pedants. Over 25 person years, the average rate of production was one line of debugged C per team member per day. We were painstaking, perfectionist, purist. And argumentative! Some of our debates were excruciating to behold. We fought over definitions of “verification” and “validation”; we disputed algorithms and test tools, languages and coding standards. We were even precious about code layout, which seemed to some pretty silly at the time.
Yet 20 years later, purists are looking good.
Last week saw widespread attention to a bug in Apple's iOS operating system which rendered website security impotent. The problem arose from a single superfluous line of code – an extra goto statement – that nullified checking of SSL connections, leaving users totally vulnerable to fake websites. The Twitterverse nicknamed the flaw #gotofail.
There are all sorts of interesting quality control questions in the #gotofail experience.
- Was the code inspected? Do companies even do code inspections these days?
- The extra goto was said to be a recent change to the source; if that's the case, what regression testing was performed on the change?
- How are test cases selected?
- For something as important as SSL, are there not test rigs with simulated rogue websites to stress test security systems before release?
There seems to have been egregious shortcomings at every level : code design, code inspection, and testing.
A lot of attention is being given to the code layout. The spurious goto is indented in such a way that it appears to be part of a branch, but it is not. If curly braces were used religiously, or if an automatic indenting tool was applied, then the bug would have been more obvious (assuming that the code gets inspected). I agree of course that layout and coding standards are important, but there is a much more robust way to make source code clearer.
Beyond the lax testing and quality control, there is also a software-theoretic question in all this that is getting hardly any attention: Why are programmers using ANY goto statements at all?
I was taught at college and later at Telectronics to avoid goto statements at all cost. Yes, on rare occasions a goto statement makes the code more compact, but with care, a program can almost always be structured to be compact in other ways. Don't programmers care anymore about elegance in logic design? Don't they make efforts to set out their code in a rigorous structured manner?
The conventional wisdom is that goto statements make source code harder to understand, harder to test and harder to maintain. Kernighan and Ritchie - UNIX pioneers and authors of the classic C programming textbook - said the goto statement is "infinitely abusable" and it "be used sparingly if at all." Before them, one of programming's giants, Edsger Djikstra, wrote in 1968 that "The go to statement ... is too much an invitation to make a mess of one's program"; see Go To Statement Considered Harmful. The goto creates spaghetti code. The landmark structured programming language PASCAL doesn't even have a goto statement! At Telectronics our coding standard prohibited without exception gotos in all implantable software.
Hard to understand, hard to test and hard to maintain is exactly what we see in the flawed iOS7 code. The critical bug never would have happened if Apple too banned the goto.
Now, I am hardly going to suggest that fanatical coding standards and intellectual rigor are sufficient to make software secure (see also "Security Isn’t Secure). It's unlikely that many commercial developers will be able to cost-justify exhaustive code walkthroughs when millions of lines are involved even in the humble mobile phone. It’s not as if lives depend on commercial software.
Or do they?!
Let’s leave aside that vexed question for now and return to fundamentals.
The #gotofail episode will become a text book example of not just poor attention to detail, but moreover, the importance of disciplined logic, rigor, elegance, and fundamental coding theory.
A still deeper lesson in all this is the fragility of software. Prof Arie van Deursen nicely describes the iOS7 routine as "brittle". I want to suggest that all software is tragically fragile. It takes just one line of silly code to bring security to its knees. The sheer non-linearity of software – the ability for one line of software anywhere in a hundred million lines to have unbounded impact on the rest of the system – is what separates development from conventional engineering practice. Software doesn’t obey the laws of physics. No non-trivial software can ever fully tested, and we have gone too far for the software we live with to be comprehensively proof read. We have yet to build the sorts of software tools and best practice and habits that would merit the title "engineering".
I’d like to close with a philosophical musing that might have appealed to my old mentors at Telectronics. We have reached a sort of pinnacle in post-modernism where the real world has come to pivot precariously on pure text. It is weird and wonderful that engineers are arguing about the layout of source code – as if they are poetry critics.
We have come to depend daily on great obscure texts, drafted not by people we can truthfully call "engineers" but by a largely anarchic community we would be better of calling playwrights.
That is, information security is not intellectually secure. Almost every precept of orthodox information security is ready for a shake-up. Infosec practices are built on crumbling foundations.
UPDATE: I've been selected to speak on this topic at the 2014 AusCERT Conference - the biggest information security event in Australasia.
The recent tragic experience of data breaches -- at Target, Snapchat, Adobe Systems and RSA to name a very few -- shows that orthodox information security is simply not up to the task of securing serious digital assets. We have to face facts: no amount of today's conventional security is ever going to protect assets worth billions of dollars.
Our approach to infosec is based on old management process standards (which can be traced back to ISO 9000) and a ponderous technology neutrality that overly emphasises people and processes. The things we call "Information Security Management Systems" are actually not systems that any engineer would recognise but instead are flabby sets of documents and audit procedures.
"Continuous security improvement" in reality is continuous document engorgement.
Most ISMSs sit passively on shelves and share drives doing nothing for 12 months, until the next audit, when the papers become the centre of attention (not the actual security). Audit has become a sick joke. ISO 27000 and PCI assessors have the nerve to tell us their work only provides a snapshot, and if a breach occurs between visits, it's not their fault. In their words they admit therefore that audits do not predict performance between audits. While nobody is looking, our credit card numbers are about as secure as Schrodinger's Cat!
The deep problem is that computer systems have become so very complex and so very fragile that they are not manageable by traditional means. Our standard security tools, including Threat & Risk Assessment and hierarchical layered network design, are rooted in conventional engineering. Failure Modes & Criticality Analysis works well in linear systems, where small perturbations have small effects, but IT is utterly unlike this. The smallest most trivial omission in software or in a server configuration can have dire and unlimited consequences. It's like we're playing Jenga.
Update: Barely a month after I wrote this blog, we heard about the "goto fail" bug in the Apple iOS SSL routines, which resulted from one spurious line of code. It might have been more obvious to the programmer and/or any code reviewer had the code been indented differently or if curly braces were used rigorously.
Security needs to be re-thought from the ground up. We need some bigger ideas.
We need less rigid, less formulaic security management structures, to encourage people at the coal face to exercise their judgement and skill. We need straight talking CISOs with deep technical experience in how computers really work, and not 'suits' more focused on the C-suite than the dev teams. We have to stop writing impenetrable hierarchical security policies and SOPs (in the "waterfall" manner we recognised decades ago fails to do much good in software development). And we need to equate security with software quality and reliability, and demand that adequate time and resources be allowed for the detailed work to be done right.
If we can't protect credit card numbers today, we urgently need to do things differently, standing as we are on the brink of the Internet of Things.
I am speaking at next week's AusCERT security conference, on how to make privacy real for technologists. This is an edited version of my conference abstract.
Privacy by Design is a concept founded by the Ontario Privacy Commissioner Dr. Ann Cavoukian. Dubbed "PbD", it's basically the same good idea as designing in quality, or designing in security. It has caught on nicely as a mantra for privacy advocates worldwide. The trouble is, few designers or security professionals can tell what it means.
Privacy continues to be a bit of a jungle for security practitioners. It's not that they're uninterested in privacy; rather, it's rare for privacy objectives to be expressed in ways they can relate to. Only one of the 10 or 11 or more privacy principles we have in Australia is ever labelled "security" and even then, all it will say is security must be "reasonable" given the sensitivity of the Personal Information concerned. With this legalistic language, privacy is somewhat opaque to the engineering mind; security professionals naturally see it as meaning little more than encryption and maybe some access control.
To elevate privacy practice from the personal plane to the professional, we need to frame privacy objectives in a way that generates achievable design requirements. This presentation will showcase a new methodology to do this, by extending the familiar standardised Threat & Risk Assessment (TRA). A hybrid Privacy & Security TRA adds extra dimensions to the information asset inventory. Classically an information asset inventory accounts for the confidentiality, integrity and availability (C.I.A.) of each asset; the extended methodology goes further, to identify which assets represent Personal Information, and for those assets, lists privacy related attributes like consent status, accessibility and transparency. The methodology also broadens the customary set of threats to include over-collection, unconsented disclosure, incomplete responses to access requests, over-retention and so on.
The extended TRA methodology brings security and privacy practices closer together, giving real meaning to the goal of Privacy by Design. Privacy and security are sometimes thought to be in conflict, and indeed they often are. We should not sugar coat this; after all, systems designers are of course well accustomed to tensions between competing design objectives. To do a better job at privacy, security practitioners need new tools like the Security & Privacy TRA to surface the requirements in an actionable way.
The hybrid Threat & Risk Assessment
TRAs are widely practiced during requirements analysis stages of large information systems projects. There are a number of standards that guide the conduct of TRAs, such as ISO 31000. A TRA first catalogues all information assets controlled by the system, and then systematically explores all foreseeable adverse events that threaten those assets. Relative risk is then gauged, usually as a product of threat likelihood and severity, and the set of threats to be prioritised according to importance. Threat mitigations are then considered and the expected residual risks calculated. An especially good thing about a formal TRA is that it presents management with the risk profile to be expected after the security program is implemented, and fosters consciousness of the reality that finite risks always remain.
The diagram below illustrates a conventional TRA workflow (yellow), plus the extensions to cover privacy design (red). The important privacy qualities of Personal Information assets include Accessibility, Permissibility (to disclose), Sensitivity (of e.g. health information), Transparency (of the reasons for collection) and Quality. Typical threats to privacy include over-collection (which can be an adverse consequence of excessive event logging or diagnostics), over-disclosure, incompleteness of records furnished in response to access requests, and over-retention of PI beyond the prima facie business requirement. When it comes to mitigating privacy threats, security practitioners may be pleasantly surprised to find that most of their building blocks are applicable.
The hybrid Security-Privacy Threat & Risk Assessment will help ICT practitioners put Privacy by Design into practice. It helps reduce privacy principles to information systems engineering requirements, and surfaces potential tensions between security practices and privacy. ICT design frequently deals with competing requirements. When engineers have the right tools, they can deal properly with privacy.
I have come to believe that a systemic conceptual shortfall affects typical technologists' thinking about privacy. It may be that engineers tend to take literally the well-meaning slogan that "privacy is not a technology issue". I say this in all seriousness.
Online, we're talking about data privacy, or data protection, but systems designers tend to bring to work a spectrum of personal outlooks about privacy in the human sphere. Yet what matters is the precise wording of data privacy law, like Australia's Privacy Act. To illustrate the difference, here's the sort of experience I've had time and time again.
During the course of conducting a PIA in 2011, I spent time with the development team working on a new government database. These were good, senior people, with sophisticated understanding of information architecture. But they harboured restrictive views about privacy. An important clue was the way they referred to "private" information rather than Personal Information (or equivalently, Personally Identifiable Information, PII). After explaining that Personal Information is the operable term in Australian legislation, and reviewing its definition from the Privacy Act, we found that the team had failed to appreciate the extent of the PI in their system. They overlooked that most of their audit logs collect PI, albeit indirectly and automatically. Further, they had not appreciated that information about clients in their register provided by third parties was also PI (despite it being intuitively "less private" by virtue of originating from others). I attributed these blind spots to the developers' weak and informal frame of "private" information. Online and in data privacy law alike, things are very crisp. The definition of Personal Information -- namely any data relating to an individual whose identity is readily apparent -- sets a low bar, embracing a great many data classes and, by extension, informatics processes. It's a nice analytical definition that is readily factored into systems analysis. After the team got that, the PIA in question proceeded apace and we found and rectified several privacy risks that had gone unnoticed.
Here are some more of the many recurring misconceptions I've noticed over the past decade:
- "Personal" Information is sometimes taken to mean especially delicate information such as payment card details, rather than any information pertaining to an identifiable individual such as email addresses in many cases; an exchange between US data breach analyst Jake Kouns and me over the Epsilon incident in 2011 is revealing of a technologists' systemically narrow idea of PII;
- the act of collecting PI is sometimes regarded only in relation to direct collection from the individual concerned; technologists can overlook that PI provided by a third party to a data custodian is nevertheless being collected by the custodian, and they can fail to appreciate that generating PI internally, through event logging for instance, can also represent collection
- even if they are aware of points such as Australia's Access and Correction Principle, database administrators can be unaware that, technically, individuals requesting a copy of information held about them should also be provided with pertinent event logs; a non-trivial case where individuals can have a genuine interest in reviewing event logs is when they want to know if an organisation's staff have been accessing their records.
These instances, among many others in my experience working across both information security and privacy, show that ICT practitioners suffer important gaps in their understanding. Security professionals in particular may be forgiven for thinking that most legislated Privacy Principles are legal niceties irrelevant to them, for generally only one of the principles in any given set is overtly about security; see:
- no. 5 of the eight OECD Privacy Principles
- no. 4 of the five Fair Information Practice Principles in the US
- no. 8 of the ten Generally Accepted Privacy Principles of the US and Canadian accounting bodies,
- no. 4 of the ten old National Privacy Principles of Australia, and
- no. 11 of the 13 new Australian Privacy Principles (APPs).
Yet every one of the privacy principles is impacted by information technology and security practices; see Mapping Privacy requirements onto the IT function, Privacy Law & Policy Reporter, Vol. 10.1& 10.2, 2003. I believe the gaps in the privacy knowledge of ICT practitioners are not random but are systemic, probably resulting from privacy training for non-privacy professionals being ad hoc and not properly integrated with their particular world views.
To properly deal with data privacy, ICT practitioners need to have privacy framed in a way that leads to objective design requirements. Luckily there already exist several unifying frameworks for systematising the work of dev teams. One example that resonates strongly with data privacy practice is the Threat & Risk Assessment (TRA).
The TRA is an infosec requirements analysis tool, widely practiced in the public and private sectors. There are a number of standards that guide the conduct of TRAs, such as ISO 31000. A TRA is used to systematically catalogue all foreseeable adverse events that threaten an organisation's information assets, identify candidate security controls (considering technologies, processes and personnel) to mitigate those threats, and most importantly, determine how much should be invested in each control to bring all risks down to an acceptable level. The TRA process delivers real world management decisions, understanding that non zero risks are ever present, and that no organisation has an unlimited security budget.
I have found that in practice, the TRA exercise is readily extensible as an aid to Privacy by Design. A TRA can expressly incorporate privacy as an attribute of information assets worth protecting, alongside the conventional security qualities of confidentiality, integrity and availability ("C.I.A."). A crucial subtlety here is that privacy is not the same as confidentiality, yet many frequently conflate the two. A fuller understanding of privacy leads designers to consider the Collection, Use, Disclosure and Access & Correction principles, over and above confidentiality when they analyse information assets.
Lockstep continues to actively research the closer integration of security and privacy practices.
I'm an ex software "engineer" [I have reservations about that term] with some life experience of ultra high rel development practices. It's fascinating how much about software quality I learned in the 1980s and 90s is relevant to info sec today.
I've had a trip down memory lane triggered by Karen Sandler's presentation at LinuxConf12 in Ballarat http://t.co/xvUkkaGl and her paper "Killed by code".
The software in implantable defibrillators
I'm still working my way through Karen Sandler's materials. So this post is a work in progress.
What's really stark on first viewing of Karen's talk is the culture she experienced and how it differs from the implantable defib industry I knew in its beginnings 25 years ago.
Karen had an incredibly hard and very off-putting time getting the company that made her defib to explain their software. But when we started in this field, every single person in the company -- and many of our doctors -- would have been able to answer the question What software does this defib run on?: the answer was "ours". And moreover, the FDA were highly aware of software quality issues. The whole medical device industry was still on edge from the notorious Therac 25 episode, a watershed in software verification.
A personal story
I was part of the team that wrote the code for the world's first software controlled implantable cadrioverter/defibrillator (ICD).
In 1990 Telectronics (a tragic legend of Australian technology) released the model 4210, which was just the fourth or fifth ICD on the market (the first few being hard-wired devices from CPI Inc. and Telectronics). The computing technology was severely restricted by several design factors, most especially ultra low power consumption, and a very limited number of microprocessor vendors that would warrant their chips for use in medical devices. The 4210 defib used a semi-customised 8 bit micro-controller based on the 6502, and a 32 KB byte-organised SRAM chip that held the entire executable. The micro clocked at 128kHz, fully eight times slower than the near identical micro in the Apple II a decade earlier. The software had to be efficient, not only to ensure it could make some very tough real time rendezvous, but to keep the power consumption down; the micro consumed about 30% of the device's power over its nominal five year lifetime.
We wrote mostly in C, with some assembly coding for the kernel and some performance sensitive routines. The kernel was of our own design, multi-tasking, with hard real time performance requirements (in particular, for obvious reasons the system had to respond within tight specs to heart beat interrupts and we had to show we weren't ever going to miss an interrupt!) We also wrote the C compiler.
The 4210's software was 40,000 lines of C, developed by a team of 5-6 over several years; the total effort was 25 person-years. Some of the testing and pre-release validation is described in my blog post about coding being like play writing. The final code inspection involved a team of five working five-to-six hour days for two months, reading aloud and understanding every single line. When occasion called for checking assembly instructions, sometimes we took turns with pencil and paper pretending to be the accumulators, the index registers, the program counter and so on. No stone was left unturned.
The final walk-through was quite a personnel management challenge. One of the senior engineers (a genius who also wrote our kernel and compiler) lobbied for inspecting the whole executable because he didn't want to rely on the correctness of the compiler -- but that would have taken six months. So we compromised by walking through only the assembly code for the critical modules, like the tachycardia detector and the interrupt handlers.
I mentioned that the kernel and compiler were home-grown. So this meant that the company controlled literally every single bit of code running in its defibs. And I reiterate we had several individuals who knew the source code end to end.
By the way, these days I will come across developers in the smartcard industry who find it hard working on applets that are 5 or 10 kilobytes small. Compare say a digital signing applet with an ICD, with its cardiac monitoring algorithms, treatment algorithms, telemetry controller, data logging and operating system all squeezed into 32KB.
We amassed several thousand implant-years of experience with the 4210 before it was superseded. After release, we found two or three minor bugs, which we fixed with software upgrades. None would have caused a misfire, neither false positive or false negative.
Yes, for the upgrade we could write into the RAM over a proprietary telemetry protocol. In fact the main reason for the one major software upgrade in the field was to add error correction because after hundreds of device-years we noticed higher than expected bit flips from natural background radiation. That's a helluva story in itself. It was observed that had the code been in ROM, we couldn't have changed it but we wouldn't have had to change it for bit flips either.
Morals of the story
Anyway, some of the morals of the story so far:
- Software then was cool and topical, and the whole company knew how to talk about it. The real experts -- the dozen or so people in Sydney directly involved in the development -- were all well known worldwide by the executives, the sales reps, the field clinical engineers, and regulatory affairs. And we got lots of questions (in contrast to Karen Sandler's experience where all the caridologists and company people said nobody ever asked about the code).
- Everything about the software was controlled by the company: the operating system, the chip platform, the compiler, the telemetry protocol.
- We had a team of people that knew the code like the backs of their hands. Better in fact. It was reliable and, in hindsight, impregnable. Not that we worried about malware back in 1987-1990.
Where has software development gone?
So the sorts of issues that Karen Sandler is raising now, over two decades on, are astonishing to me on so many levels.
- Why would anyone decide to write life support software on someone else's platform?
- Why would they use wifi or Bluetooth for telemetry?
- And if the medical device companies cut corners in software development, one wonders what the defense industry is doing with their drone flight controllers and other "smart" weaponry with its countless millions of lines of opaque software.
The software-as-a-profession debate continues largely untouched by each generation's innovations in production methods. Most of us in the 90s thought that formal methods, reuse and enforced modularity would introduce to software some of the hallmarks of real engineering: predictability, repeatability, measurability and quality. Yet despite Objected Oriented methods and sophisticated CASE tools, many of the human traits of software-as-a-craft remain with us.
The "software crisis" – the systemic inability to estimate software projects accurately, to deliver what's promised, and to meet quality expectations – is over 40 years old. Its causes are multiple and subtle, and despite ample innovation in languages, tools and methodologies, debates continue over what ails the software industry. The latest skirmish is a provocative suggestion from Forrester analyst Mark Gualtieri that we shift the emphasis from independent Quality Assurance back onto developers’ own responsibilities, or abandon QA altogether. There has been a strong reaction! The general devotion to QA I think is aided and abetted by today’s widespread fashion for corporate governance.
But we should listen to radical ideas like Gualtieri’s, rather than maintain a slavish devotion to orthodoxies. We should recognise that software engineering is inherently different from conventional engineering, because software itself is a different kind of material. Its properties make it less amenable to regular governance.
Simply, software does not obey the laws of physics. Building skyscrapers, tunnels, dams and bridges is relatively predictable. You start with site surveys and foundations, erect a sturdy framework and all sorts of temporary formers, flesh out the structure, fill in the services like power and plumbing, do the fit-out, and finally take away all the scaffolding.
Specifications for real engineering projects don’t change much, even over several years for really big initiatives. And the engineering tools don't change at all.
Software is utterly unlike this. You can start writing software anywhere you like, and before the spec is signed off. There aren’t any raw materials to specify and buy, and no quantity surveyors to deal with. Metaphorically speaking, the plumbing can go in before the framework. Hell, you don't even need a framework! Nothing physical holds a software system up. Flimsy software, as a material, is indistinguishable from the best code. No laws of physics dictate that you start at the bottom and work your way slowly upwards, using a symbiosis of material properties and gravity to keep your construction stable and well-behaved as it grows. The process of real world engineering is thick with natural constraints that ensure predictability (just imagine how wobbly a house would be if you could lay the bricks from the top down) whereas software development processes are almost totally arbitrary, except for the odd stricture imposed by high level languages.
Real world systems are thoroughly compartmentalised naturally. If a bearing fails in an air-conditioning plant in the basement, it’s not going to affect the integrity of any of the floor plates. On the other hand, nothing physically decouples lines of code; a bug in one part of the program can impinge on almost any other part (which incidentally renders traditional failure modes and effects analysis impossible). We only modularise software by artificial means, like banning goto statements and self-modifying code.
Coding is fast and furious. In a single day, a programmer can create a system probably more complex than an airport that takes more than 10,000 person-years to build. And software development is tremendous creative fun. Let's be honest: it's why the majority of programmers chose their craft in the first place.
Ironically the rapidity of programming contributes significantly to software project overruns. We only use software in information systems because it's faster to make and easier to modify than wired logic. So the temptation is irresistible to keep specs fluid and to accommodate new requirements at any time. Famously, the differences between prototype, beta and production product are marginal and arbitrary. Management and marketing take advantage of this fact, and unfortunately software engineers themselves yield too readily to the attraction of the last minute tweak.
I suggest programming is more like play writing than engineering, and many programmers (especially the really good ones!) are just as manageable as poets.
In both software and play writing, structure is almost entirely arbitrary. Because neither obey the laws of physics, the structure of software and plays comes from the act of composition. A good software engineer will know their composition from end to end. But another programmer can always come along and edit the work, inserting their own code as they see fit. It is received wisdom in programming that most bugs arise from imprudent changes made to old code.
Messing with a carefully written piece of software is fraught with danger, just as it is with a finished play. I could take Hamlet for instance, and hack it as easily as I might hack an old program -- add a character or two, or a whole new scene -- but the entire internal logic of the play would almost certainly be wrecked. It would be “buggy”.
I was a software development manager for some years in the cardiac pacemaker industry. We developed the world’s first software controlled automatic implantable defibrillator. It had several tens of thousands of lines of C, developed at a rate of about one tested line of code per person per day. We quantified it as the most reliable real time software ever written at the time.
I believe the outstanding quality resulted from a handful of special grassroots techniques:
- We had independent software test teams that developed their own test cases and tools
- We did obsessive source code inspections on all units before integration. And in the end, before we shipped, we did an end-to-end walkthrough of the frozen software. It took six of us two months straight. So we had several people who knew the entire object intimately.
- We did early informal design reviews. As team leader, I favoured having my developers do a whiteboard presentation to the team of their early design ideas, no more than 48 hours after being given responsibility for a module. This helped prevent designers latching onto misconceptions at the formative stages.
- We took our time. I was concerned that the CASE tools we introduced in the mid 90s might make code rather too easy to trot out, so at the same time I set a new rule that developers had toturn their workstations off for a whole day once a week, and work with pen and paper.
- My internal coding standard included a requirement that when starting a new module, developers write their comments before they write their code, and their comments had to describe ‘why’ not ‘what’. Code is all syntax; the meaning and intent of any software can only be found in the natural language comments.
Code is so very unlike the stuff of other professions – soil and gravel, metals and alloys, nuts and bolts, electronics, even human flesh and blood - that the metaphor of engineering in the phrase “software engineering” may be dangerously misplaced. By coopting the term we have might have started out on the wrong foot, underestimating the fundamental challenge of forging a software profession. It won't be until software engineering develops the normative tools and standards, culture and patience of a true profession that the software crisis will turn around. And then corporate governance will have something to govern in software development.
Update September 2012
The recent discovery that junk DNA is not actually junk rather reinforces my long standing thesis, espoused below, that we don't know enough about how genes work to be able to validate genetic engineering artifacts by testing alone. I point out that computer programs are only validated by a mixture of testing, code inspection and theory, all of which is based on knowing how the code works at the instruction level. But we don't have a terribly complete picture of how genes interact. We always knew they were massively parallel, and now it turns out that junk DNA has some sort of role in gene expression across the whole of the genome, raising the combinatorial complexity enormously. This tells me that we have little idea how modifications at one point in the genome can impact the functioning at any number of other points (but it hints at an explanation as to why human beings are so much more complex than nematodes despite having only a relatively small number of additional raw genetic instructions).
And now there is news that a cow in New Zealand, genetically engineered in respect of one allergenic protein, was born with no tail. Now it's too early to be able to blame the GM for this oddity, but equally, the junk DNA finding surely undermines the confidence that any genetic engineer can have in predicting that their changes cannot have had unexpected and really unpredictable side effects.
Original post, 15 Jan 2011
As a software engineer years ago I developed a deep unease about genetic engineering and genetically modified organisms (GM). The software experience suggests to me that GM products cannot be verifiable given the state of our knowledge about how genes work. I’d like to share my thoughts.
Genetic engineering proponents seem to believe the entire proof of a GM pudding is in the eating. That is, if trials show that GM food is not toxic, then it must be safe, and there isn't anything else to worry about. The lesson I want others to draw from the still new discipline of software engineering is there is more to the verification of correctness in complex programs than tesing the end product.
Recently I’ve come across an Australian government-sponsored FAQ Arguments for and against gene technology (May 2010) that supposedly provides a balanced view of both sides of the GM debate. Yet it sweeps important questions under the rug.[At one point the paper invites readers to think about whether agriculture is natural. It’s a highly loaded question grounded in the soothing proposition that GM is simply an extension of the age old artificial selection that gave us wheat, Merinos and all those different potatoes. The question glosses over the fact that when genes recombine under normal sexual reproduction, cellular mechanisms constrain where each gene can end up, and most mutations are still-born. GM is not constrained; it jumps levels. It is quite unlike any breeding that has gone before.]
Genes are very frequently compared with computer software, for good reason. I urge that the comparison be examined more closely, so that lessons can be drawn from the long standing “Software Crisis”.
Each gene codes for a specific protein. That much we know. Less clear is how relatively few genes -- 20,000 for a nematode; 25,000 for a human being -- can specify an entire complex organism. Science is a long way from properly understanding how genes specify bodies, but it is clear that each genome is an immensely intricate ensemble of interconnected biochemical short stories. We know that genes interact with each other, turning each other on and off, and more subtly influencing how each is expressed. In software parlance, genetic codes are executed in a massively parallel manner. This combinatorial complexity is probably why I can share fully half of my genes with a turnip, and have an “executable file” in DNA that is only 20% longer than that of a worm, and yet I can be so incredibly different from those organisms.
If genomes are like programs then let’s remember they have been written achingly slowly over eons, to suit the circumstances of a species. Genomes are revised in a real world laboratory over billions of iterations and test cases, to a level of confidence that software engineers can’t even dream of. Brassica napus.exe (i.e. canola) is at v1000000000.1. Tinkering with isolated parts of this machinery, as if it were merely some sort of wiki with articles open to anyone to edit, could have consequences we are utterly unable to predict.
In software engineering, it is received wisdom that most bugs result from imprudent changes made to existing programs. Furthermore, editing one part of a program can have unpredictable and unbounded impacts on any other part of the code. Above all else, all but the very simplest software in practice is untestable. So mission critical software (like the implantable defibrillator code I used to work on) is always verified by a combination of methods, including unit testing, system testing, design review and painstaking code inspection. Because most problems come from human error, software excellence demands formal design and development processes, and high level programming languages, to preclude subtle errors that no amount of testing could ever hope to find.
How many of these software quality mechanisms are available to genetic engineers? Code inspection is moot when we don’t even know how genes normally interact with one another; how can we possibly tell by inspection if an artificial gene will interfere with the “legacy” code?
What about the engineering process? It seems to me that GM is akin to assembly programming circa 1960s. The state-of-the-art in genetic engineering is nowhere near even Fortran, let alone modern object oriented languages.
Can today’s genetic engineers demonstrate a rigorous verification regime, given the reality that complex software programs are inherently untestable?
We should pay much closer attention to the genes-as-software analogy. Some fear GM products because they are unnatural; others because they are dominated by big business and a mad rush to market. I simply say let’s slow down until we’re sure we know what we're doing.