In this article I would like to discuss the issue of creating stable code. Because this is a huge topic, enough to fill an entire book on its own, I will stick to the core ideas and keep the article brief.

There are two main paradigms in the pursuit of creating stable software. The first is based on a complex, unified system that is designed to have no memory leaks and no bugs, must run forever and must have a high capacity. The second is based on a model where processes are small and highly focused on doing only one task, and doing it extremely fast and effectively; memory leaks do not really matter and the processes do not run forever. They get started and die as they are needed, hence memory leaks and some types of bugs are not critical. Microsoft SQL Server is an example of the former, and the Apache web server on Unix is an example of the latter.

I am going to discuss some guidelines for writing stable software using the first paradigm, as it is in wide use and is the most difficult to implement correctly.

The first issue I want to discuss is that of exceptions and errors. But first, a definition of those terms is in order. An error refers to the origin of a problem found during the execution of a process which forces unexpected program flow, and an exception refers to the encapsulation of an error into a manageable exception object that can be thrown by the code that incurred the error and caught at a higher level which can deal with the problem. As an example, an error could be a division by zero operation; the system (or the software itself) will then throw an exception for this error so that it can be handled properly. The higher level exception handler might then report the error to the user, or do whatever is applicable.
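To make the distinction concrete, here is a minimal C++ sketch of that division by zero case (the function and message names are my own, for illustration only). The error is detected where it occurs and wrapped into an exception object; the higher level decides what to do about it:

#include <iostream>
#include <stdexcept>

// The error (division by zero) is detected here and encapsulated
// into an exception object that a higher level can catch.
double divide(double numerator, double denominator)
{
  if (denominator == 0.0)
    throw std::runtime_error("Division by zero");
  return numerator / denominator;
}

int main()
{
  try {
    std::cout << divide(10.0, 0.0) << std::endl;
  }
  catch (const std::runtime_error& e) {
    // The higher level exception handler reports the error to the user.
    std::cerr << "Error: " << e.what() << std::endl;
  }
  return 0;
}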

Sometimes it is useful to keep a list of objects in memory. An example would be a list of connection objects for TCP/IP connections to a server. When data is received, a function can enumerate this list and find the correct connection object by matching the socket address, for example, and then do whatever it needs to do on that connection object. In large systems it is better to use a methodology called reference counting instead of the classical create/destroy methods. This is the same principle on which COM objects work in the Windows environment. For a discussion of a custom-written object that is a descendant of TObject (Delphi/Kylix) and is based on reference counting, see Rudolph van Graan's web site. This object, called TLockableObject, exposes the same functionality and behavior as the TObject class, but adds reference counting to it.
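The essence of reference counting is easy to show in C++. This is my own minimal sketch, not the actual TLockableObject implementation: every holder of the object calls AddRef() when it takes a reference and Release() when it is done, and the object destroys itself when the last reference is released.

class TRefCounted {
private:
  int mRefCount;

protected:
  virtual ~TRefCounted() {}  // only Release() may destroy the object

public:
  TRefCounted() : mRefCount(1) {}  // the creator holds the first reference

  void AddRef() { ++mRefCount; }

  void Release()
  {
    if (--mRefCount == 0)
      delete this;  // last reference gone: the object frees itself
  }
};

A production implementation would of course have to make the increment and decrement thread safe (e.g. with interlocked operations), but the principle stays the same.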

Now, assume you have lots of TConnection objects, which are ultimately derived from TLockableObject. Assume that in the code handling new connections, you create a new connection object and add it to the list; the reference count will be 1. Let's say that later, in the code that handles data received from the port, this list is enumerated for the applicable connection object, which is then submitted to a kernel that will process the request asynchronously. Obviously we need to increment the reference count when we submit the connection object to the kernel, to make sure that it will not be freed on an unexpected disconnect event, with the kernel subsequently accessing a freed object (causing a memory read access violation). The reference count is now 2. Can you see that if the system is large and complex, one should not make any assumptions about the validity of the objects in the list? Each time the list is enumerated in search of a connection object, a check must be added to verify that the connection object is not nil (which would be the case when the reference count goes to 0 and the correct methods have been used to set the object's value to nil).

for (int i = 0; i < mConnectionList->count(); i++) {
  TConnection* vObj = mConnectionList->GetObject(i);
  if (vObj != NULL) {
    // then do something...
  }
  else {
    DoElse();
  }
}

Now, in much of the code I have reviewed and read, I saw the implementation of DoElse() effectively look like this:

for (int i = 0; i < mConnectionList->count(); i++) {
  TConnection* vObj = mConnectionList->GetObject(i);
  if (vObj != NULL) {
    // then do something...
  }
  else {
    continue;
  }
}

You may ask: "But what is wrong with that? If the connection object retrieved is NULL, we must just make sure we do not work with it." Well, the problem with that piece of code lies deeper than a casual inspection would reveal. At a company I used to work for, a system had been designed that managed lots of these types of connections and other objects, all being passed around using reference counting and stored in lists. It had very weird and inconsistent behaviors: sometimes a connection just would not show up in the connection list, sometimes it would not change state correctly, and so on. It also had a subtle memory leak that, under certain conditions, would be as bad as 200MB in a week's time. For a system designed to run forever, this is obviously totally unacceptable! The problem was found to be a combination of several different things, of which the above-mentioned issue was one.

As Rudolph explains on his web site, it is very important to evaluate the else condition. It makes you think deeper about what your application does, and should do, in case of unexpected behavior. In this case the easiest way out of the problem was to use a continue statement. In 99% of the cases this will work fine, and if there were no bugs or programming errors in the code, it would work 100% of the time because the else would never be executed. I call this method silent error suppression: handling error conditions in such a way that the system keeps on working in a transparent way; the error has been suppressed.

As any good developer will know, we humans cannot write bug-free code. In my humble opinion this is impossible, as the number of external influences is too great to cater for all conditions. An extreme example: you have crafted the perfect system. You have a for loop that does something. You assume (quite correctly) that the index will increment by one on each iteration. Now there is a power brownout in the system which runs your application. Some bits in memory are destroyed, among them, accidentally, one bit in the memory location that holds the index variable. Suddenly your variable is greater than the bound condition, and you access memory at an invalid array index. Boom! The program crashes. Okay, I agree, this is an extreme situation, and one should not try to cater for these types of situations simply because doing so would cause absolute code bloat and would not really be effective in any case. But the idea sticks: one cannot write 100% robust, stable and bug-free code.

The point I am trying to make is simple: because we cannot write bug-free code, one cannot assume that "it should work 100% of the time". It will work 99% of the time. It is quite possible that somewhere you made a mistake and freed the connection object twice. This loop would suppress the error, thereby keeping you from ever learning about the bug you introduced elsewhere. See how this methodology helps prevent you from writing stable, robust code?

I have a different attitude towards these kinds of situations. I would rewrite that loop as:

for (int i = 0; i < mConnectionList->count(); i++) {
  TConnection* vObj = mConnectionList->GetObject(i);
  if (vObj != NULL) {
    // then do something...
  }
  else {
    // adjacent string literals are concatenated by the compiler
    throw std::runtime_error("Internal error #123 encountered: "
                             "vObj = NULL in method XX::YY()");
  }
}

See how this causes the program to alert us to the error? A higher level layer could then inform the user about the condition and terminate the program gracefully. The bug can then be submitted to the developer and corrected. This is in contrast to the belief that this is a recoverable situation. In this case, it is not: if that object is nil, then although your program will seem to keep on running fine, it will be in an inconsistent state. If you scream at all inconsistent ("should never happen") situations, you will see much faster when a bug is present, and also have a good idea of where it affects the system. Note that I do not mean normal exceptional conditions: when parsing a file, for example, you might expect each record to be on the following line but encounter an empty line. This would most probably not indicate a failure, and skipping all blank lines, thereby silently suppressing this normal error condition, merely adds a bit of adaptability to your code.
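To illustrate the difference, here is a small sketch of that file parsing case (the ParseRecord function and the file format are hypothetical): a blank line is an expected condition and is silently skipped, while a malformed record indicates a genuine inconsistency and is escalated.

#include <fstream>
#include <stdexcept>
#include <string>

bool ParseRecord(const std::string& pLine);  // hypothetical, defined elsewhere

void ParseFile(const std::string& pFileName)
{
  std::ifstream vFile(pFileName.c_str());
  std::string vLine;

  while (std::getline(vFile, vLine)) {
    if (vLine.empty())
      continue;  // expected condition: silently skip blank lines

    if (!ParseRecord(vLine))
      // unexpected condition: scream, do not suppress
      throw std::runtime_error("Internal error: malformed record in " +
                               pFileName);
  }
}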

Another common mistake that novice (and sometimes experienced) developers make is illustrated in the following piece of code:

unsigned long factorial(int n)
{
  if (n <= 1)
    return 1;
  else
    return n * factorial(n - 1);
}

...
unsigned long n_fact;
int y;

y = g(x);

n_fact = factorial(y);

In this piece of code, we have a function that calculates the factorial of n, i.e. n!. We also have some code that assigns a value to y based on the function g(x), which can do anything as long as it returns an integer. As you can see, this code only behaves correctly for valid values of n, where valid is defined as an integer n >= 0; for negative n it silently returns 1, which is simply wrong. So we need to add some parameter validation. Let's change the main code to:

...

unsigned long n_fact;
int y;

y = g(x);

if (y >= 0)
  n_fact = factorial(y);
else
  throw std::runtime_error("The factorial of a negative number cannot be taken");

This code will now work and negative values of y will not break it. But there is a huge problem with this method, and this problem is the root cause of many unstable and buggy systems. The assumption made here is that you are the only user of the function factorial() and that you know how it works inside; therefore you know which values you may and may not pass to it. But as normally happens, those small functions we write turn out to be very useful and get reused very frequently. Later it is not only you (the author) who uses the function, but lots of other people as well. The problem is that they should not and do not have to know the details of how the factorial() function works; the only things they need to know are that it calculates n! and what happens on wrong input.

Can you see that everywhere this function is used, it would be necessary to add a check or two on the validity of the data before passing it to the factorial function? This is wrong. The validation checks should be an integral part of the smallest building blocks of a system, starting with those small functions. If I change the code like this:

#include <stdexcept>
#include <string>

// A simple maths exception class (a definition is assumed here; the
// original code presumes an existing emaths class)
class emaths : public std::runtime_error {
public:
  explicit emaths(const std::string& pMsg) : std::runtime_error(pMsg) {}
};

unsigned long factorial(int n)
{
  if (n < 0)
    throw emaths("n cannot be negative");
  else if (n <= 1)
    return 1;
  else
    return n * factorial(n - 1);
}

...

unsigned long n_fact;
int y;

y = g(x);

try {
  n_fact = factorial(y);
}
catch (emaths& E) {
  // Do something about the error
}

Immediately the burden of parameter validation is removed from the code which uses the function, and moved into the function itself, where it can be tested, debugged and reused. Now all a user needs to know about the function is that it returns the factorial of n for valid n, and throws an emaths exception on error. The factorial() function is now self-contained. Of course, an even better solution would have been to declare n to be of type unsigned instead...
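That last remark deserves a quick sketch: with an unsigned parameter type, the invalid input simply cannot be expressed, so the run-time check (and the exception) disappears altogether.

// With an unsigned parameter, negative input is ruled out by the
// type system, so no validation check is needed at all.
unsigned long factorial(unsigned int n)
{
  if (n <= 1)
    return 1;
  else
    return n * factorial(n - 1);
}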

Although I wanted to stay away from OS-specific and language-specific issues, I want to discuss a brilliant technique one can use in C++ to ensure the proper releasing of allocated memory for objects. But first, an Object Pascal example of how to do proper object allocation and destruction:

procedure DoSomething(const pRegKey: String);
var
  vRegistry: TRegistry;

begin
  vRegistry := TRegistry.Create(pRegKey);
  if (vRegistry = nil) then
    raise Exception.Create('Exception in DoSomething() - vRegistry = nil');

  try
    // Do something that could cause an exception to be thrown
  finally
    vRegistry.Free;
  end;
end;

In the above function one can always be sure that vRegistry will be freed, even if an exception occurs. This is made possible by the try-finally construct of Object Pascal. C++ does not have this construct. A first attempt in C++ might look like this:

void DoSomething(const std::string& pRegKey)
{
  TRegistry* vRegistry = new TRegistry(pRegKey);
    // new will throw an exception (std::bad_alloc) on failure

  try {
    // Do something that could cause an exception to be thrown
  }
  catch (...) {
    delete vRegistry;   // free the object before re-throwing
    throw;
  }
  delete vRegistry;     // normal path: free the object
}

If the code in the try block throws an exception, the exception is caught, the vRegistry object is freed and the exception is re-thrown. If no exception occurs, the vRegistry object is still freed at the end of the function. But this method is verbose and error prone. There is a very elegant solution to this problem specific to C++ (and any language with deterministic destruction): it is called Resource Acquisition Is Initialization (RAII), and is described by Bjarne Stroustrup, the creator of C++. The aforementioned example can be rewritten as:

class TReg {
private:
  TRegistry* mReg;

  // Prevent copying: a copy would cause a double delete
  // (C++98 idiom: declare but do not define)
  TReg(const TReg&);
  TReg& operator=(const TReg&);

public:
  TReg(const std::string& pRegKey) : mReg(new TRegistry(pRegKey)) {}
  ~TReg() {delete mReg;}
  operator TRegistry*() {return mReg;}
  TRegistry* operator ->() {return mReg;}
};

void DoSomething(const std::string& pRegKey)
{
  TReg vRegistry(pRegKey);

  try {
    // Do something that could cause an exception to be thrown
  }
  catch (...) {
    // Handle exception; vRegistry is freed automatically either way
  }
}

Can you see that in this code we create an instance of TReg, then use it and simply forget about it? We are allowed to do this because vRegistry is now a local variable that will be destroyed when the function returns, and when an object is destroyed, its destructor is always called.

To step through the code, start with the line TReg vRegistry(pRegKey);. On that line, a new object of type TReg is instantiated with the parameter pRegKey being passed to the constructor. In the constructor, a new instance of the actual TRegistry class is created. The two overloaded operators allow one to use vRegistry as if it were an instance of TRegistry; hence, if TRegistry had a method called ReadKey(), I could use the following syntax in the function DoSomething(): vRegistry->ReadKey();. Finally, whenever the function returns (no matter whether due to normal flow or an exception), the vRegistry object is destroyed because it goes out of scope. When that happens, the destructor is called automatically and the instance of TRegistry is freed. See how this method brilliantly solves the problem of acquiring and releasing resources? Here is a general template that you may use in your code:

namespace RAiI {
  template<class T> class TResource {
    private:
      T* mData;
      int mCount;  // 0 = single object, otherwise array size

      // Prevent copying: a copy would cause a double delete
      TResource(const TResource&);
      TResource& operator=(const TResource&);

    public:
      TResource() : mData(new T), mCount(0) {}
      TResource(int n) : mData(NULL), mCount(n)
        {if (n == 0) mData = new T; else mData = new T[n];}
      ~TResource() {if (mCount == 0) delete mData; else delete [] mData;}
      operator T*() {return mData;}
      T* operator ->() {return mData;}
  };
}

...

// Use it as:
RAiI::TResource<TResourceToProtect> vResource;
// or...
RAiI::TResource<TResourceToProtect> vResource(vSizeOfArray);
// and access the member functions and data members of
// TResourceToProtect with the -> operator
vResource->SomeMemberFunctionOfTResourceToProtect();