Why do things go wrong?

Discussions about software safety often turn into arguments that are really about the underlying fault model of software.

In this post I would like to try to sketch such a fault model.

The most obvious class is logic faults: functionally, the software does not do what it is intended to do. We could also call this class algorithmic faults. Examples I see are division by zero, uninitialized variables and faults in e.g. state machines.

The next class, and a very bad one, is memory management faults. These are wild pointers (pointing to something invalid), dangling pointers (used after the object they pointed to is no longer valid), buffer overflows and misunderstandings about memory ownership.
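To make the buffer overflow case a bit more concrete, here is a minimal sketch of a copy helper that refuses to write past the end of the destination. The name copy_bounded and its signature are my own assumptions for illustration, not something from a particular library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy a C string into dst only if it fits,
   including the terminating '\0'. Returns false instead of overflowing. */
bool copy_bounded(char *dst, size_t dst_size, const char *src)
{
    if (dst == NULL || src == NULL || dst_size == 0) {
        return false;
    }

    size_t len = strlen(src);
    if (len >= dst_size) {
        /* An unchecked strcpy() would write past the end of dst here. */
        return false;
    }

    memcpy(dst, src, len + 1); /* +1 copies the terminator as well */
    return true;
}
```

With a four byte buffer, copy_bounded(buf, sizeof buf, "abc") succeeds, while "abcd" is rejected because the terminator would no longer fit.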

Recently, we also stumbled over a non-obvious issue caused by integer promotion. This represents another class of potential faults. It may be specific to C, but I guess that even a more type-safe language can have similar problems, e.g. when casting between types.
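This is not the issue we actually ran into, just a generic sketch of how integer promotion can surprise you in C. The function names are made up for illustration. The operand of ~ is promoted to int before the operator is applied, so the result is wider than the uint8_t you started with.

```c
#include <stdbool.h>
#include <stdint.h>

/* Intended meaning: "are all eight bits of mask set?"
   mask is promoted to int before ~ is applied, so for mask == 0xFF the
   result is ~0x000000FF (on a 32-bit int), which is not zero. */
bool all_bits_set_naive(uint8_t mask)
{
    return ~mask == 0;
}

/* Casting the result back to uint8_t discards the promoted upper bits,
   so the comparison does what the author meant. */
bool all_bits_set_fixed(uint8_t mask)
{
    return (uint8_t)~mask == 0;
}
```

The naive version returns false for every possible input, including 0xFF. Some compilers warn about a comparison that is always false here, which is one reason to take such warnings seriously.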

Introducing parallel processing (e.g. multi-threaded programming, use of multiple cores or even interrupt handling) creates two new classes of potential faults in software: data consistency problems and locking issues (livelock, deadlock).
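A data consistency problem is easiest to see as a lost update. The sketch below does not start real threads. It writes out one unlucky interleaving of two contexts by hand, so that the result is reproducible, and then shows the same two increments done as atomic read-modify-write operations with C11 <stdatomic.h>. The function names are assumptions for illustration only.

```c
#include <stdatomic.h>

/* Two contexts (threads, cores, or main code and an interrupt handler)
   both execute "counter = counter + 1". This is one possible
   interleaving, written sequentially. */
int lost_update_demo(void)
{
    int counter = 0;

    int a = counter;   /* context A reads 0 */
    int b = counter;   /* context B reads 0 before A has written */
    counter = a + 1;   /* A writes 1 */
    counter = b + 1;   /* B writes 1 and overwrites A's update */

    return counter;    /* one increment is lost */
}

/* With an atomic read-modify-write there is no window between the read
   and the write in which the other context could interfere. */
int atomic_update_demo(void)
{
    atomic_int counter = 0;

    atomic_fetch_add(&counter, 1);  /* context A */
    atomic_fetch_add(&counter, 1);  /* context B */

    return atomic_load(&counter);
}
```

A mutex around the increment would close the same window. The atomic variant is used here only because it keeps the sketch short.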

Two kinds of faults should not be considered in discussions on software faults: reliance on implementation-specific behavior (this can be prevented by static code analysis or, better, by simply not doing it!) and hardware faults (e.g. single event upsets). Software can detect such hardware faults, but they are not caused by software.

Maybe one day I will add a page here to describe these faults in detail and to think more about completeness.