        const LogSource &src;
    };

The MutexLock constructor and destructor are implemented as follows:

    MutexLock::MutexLock(pthread_mutex_t& aKey, const LogSource& source)
        : BaseLock(aKey, source), theKey(aKey), src(source)
    {
        pthread_mutex_lock(&theKey);
    #if defined(DEBUG)
        cout << "MutexLock " << &aKey << " created at " << src.file()
             << " line " << src.line() << endl;
    #endif
    }

    MutexLock::~MutexLock() // Destructor
    {
        pthread_mutex_unlock(&theKey);
    #if defined(DEBUG)
        // The constructor argument aKey is out of scope here; use the member.
        cout << "MutexLock " << &theKey << " destroyed at " << src.file()
             << " line " << src.line() << endl;
    #endif
    }

The MutexLock implementation makes use of a LogSource object that has not been discussed yet. The LogSource object is meant to capture the filename and source code line number where the object was constructed. When logging errors and trace information it is often necessary to specify the location of the information source. A C programmer would use a (char *) for the filename and an int for the line number. Our developers chose to encapsulate both in a LogSource object. Again, we had a do-nothing base class followed by a more useful derived class:

    class BaseLogSource
    {
    public:
        BaseLogSource() {}
        virtual ~BaseLogSource() {}
    };

    class LogSource : public BaseLogSource
    {
    public:
        LogSource(const char *name, int num) : filename(name), lineNum(num) {}
        ~LogSource() {}

        const char *file() const;
        int line() const;
    private:
        const char *filename; // const: initialized from the __FILE__ string literal
        int lineNum;
    };

The LogSource object was created and passed as an argument to the MutexLock object constructor. The LogSource object captured the source file and line number at which the lock was fetched. This information may come in handy when debugging deadlocks.
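The class declaration above leaves file() and line() undefined. A minimal sketch of how they might be completed follows, assuming they are simple accessors. The HERE macro and the capturesCallSite() helper are hypothetical names introduced only for this illustration; they are not part of the original code.

```cpp
#include <cassert>
#include <cstring>

class BaseLogSource {
public:
    BaseLogSource() {}
    virtual ~BaseLogSource() {}
};

class LogSource : public BaseLogSource {
public:
    LogSource(const char *name, int num) : filename(name), lineNum(num) {}

    // Assumed definitions: plain accessors for the captured location.
    const char *file() const { return filename; }
    int line() const { return lineNum; }

private:
    const char *filename;  // points at the string literal produced by __FILE__
    int lineNum;
};

// Hypothetical helper macro (an assumption, not in the original code).
// It expands at the call site, so it records the caller's file and line.
#define HERE LogSource(__FILE__, __LINE__)

// Checks that a LogSource built with HERE records the line it was written on.
inline bool capturesCallSite() {
    const int expected = __LINE__ + 1;  // the line number of the next line
    LogSource src = HERE;
    return src.line() == expected && std::strlen(src.file()) > 0;
}
```

Because __FILE__ and __LINE__ are expanded by the preprocessor, the location has to be captured at the call site; a constructor with default arguments could not supply it.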
Imagine that sharedCounter was an integer variable accessible to multiple threads and needing serialization. We provided mutual exclusion by inserting a lock object into the local scope:

    {
        MutexLock myLock(theKey, LogSource(__FILE__, __LINE__));
        sharedCounter++;
    }

The creation of the MutexLock and LogSource objects triggered the invocations of their respective base class constructors as well. This short fragment invoked a number of constructors:

• BaseLogSource
• LogSource
• BaseLock
• MutexLock

After the sharedCounter variable was incremented, we encountered the end of the scope that triggers the four corresponding destructors:

• MutexLock
• BaseLock
• LogSource
• BaseLogSource

All told, the protection of the shared resource had cost us eight constructors and destructors. The tension between reuse and performance is a topic that keeps popping up. It would be interesting to find out what the cost would be if we abandoned all these objects and developed a hand-crafted version that does exactly what we need and nothing else. Namely, it will just lock around the sharedCounter update:

    {
        pthread_mutex_lock(&theKey);
        sharedCounter++;
        pthread_mutex_unlock(&theKey);
    }

By inspection alone, you can tell that the latter version is more efficient than the former one. Our object-based design had cost us additional instructions. Those instructions were entirely dedicated to construction and destruction of objects. Should we worry about those instructions? That depends on the context; if we are in a performance-critical flow, we might. In particular, additional instructions become significant if the total cost of the computation is small and the fragment that executes those instructions is called often enough. It is the ratio of wasted instructions to the total instruction count of the overall computation that we care about.
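The count of eight calls can be checked with a small trace. The following is only a sketch: the classes are stand-ins that record their own names and lock no real mutex, and the events log, the traceGuardedIncrement() helper, and the named local `where` are assumptions made for the illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Event log for the sketch; each constructor and destructor appends its name.
static std::vector<std::string> events;

struct BaseLogSource {
    BaseLogSource() { events.push_back("BaseLogSource"); }
    virtual ~BaseLogSource() { events.push_back("~BaseLogSource"); }
};

struct LogSource : BaseLogSource {
    LogSource(const char *, int) { events.push_back("LogSource"); }
    ~LogSource() { events.push_back("~LogSource"); }
};

struct BaseLock {
    BaseLock() { events.push_back("BaseLock"); }
    virtual ~BaseLock() { events.push_back("~BaseLock"); }
};

struct MutexLock : BaseLock {
    explicit MutexLock(const LogSource &) { events.push_back("MutexLock"); }
    ~MutexLock() { events.push_back("~MutexLock"); }
};

// Runs the guarded increment once and returns the recorded call sequence.
inline std::vector<std::string> traceGuardedIncrement() {
    events.clear();
    int sharedCounter = 0;
    {
        LogSource where(__FILE__, __LINE__);  // named local: lives until scope exit
        MutexLock myLock(where);
        sharedCounter++;
    }
    (void)sharedCounter;         // silence unused-variable warnings
    assert(events.size() == 8);  // four constructors plus four destructors
    return events;
}
```

With a named LogSource, the recorded sequence matches the two lists above. If the LogSource is passed as a temporary argument instead, the language destroys it at the end of the declaration statement, so its two destructors run before the increment; the total is still eight calls.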
The MutexLock fragment described above was taken out of a gateway implementation that routed data packets from one communication adapter to another. It was a critical path that consisted of roughly 5,000 instructions. The MutexLock object was used a few times on that path. That amounted to enough instruction overhead to make up 10% of the overall cost, which was significant.

If we are going to use C++ and OO in a performance-critical application, we cannot afford such a luxury. Before we present a C++ based fix, we would like to quickly point out an obvious design overkill. If the critical section is as simple as a one-statement integer increment, why do we need all this object machinery? The advantages to using lock objects are

• Maintenance of complex routines containing multiple return points.
• Recovery from exceptions.
• Polymorphism in locking.
• Polymorphism in logging.

All those advantages were not extremely important in our case. The critical section had a clearly defined single exit point and the integer increment operation was not going to throw an exception. The polymorphism in locking and logging was also something we could easily live without. Interestingly, as this code segment reveals, developers are actually doing this in practice, which indicates that the cost of object construction and destruction is seriously overlooked.

So what about a complex routine where the use of the lock object actually makes sense? We would still like to reduce its cost. First let's consider the LogSource object. That piece of information had cost us four function calls: base and derived class constructors and destructors. This is a luxury we cannot afford in this context. Often, when C++ performance is discussed, inlining is offered as a cure. Although inlining could help here, it does not eliminate the problem. In the best-case scenario, inlining will eliminate the function call overhead for all four constructors and destructors. Even then, the LogSource object still imposes some performance overhead. First, it is an extra argument to the MutexLock constructor. Second, there is the assignment of the LogSource reference member of MutexLock. Furthermore, when the LogSource object is created, some additional instructions are required to set up its virtual table pointer.

In a critical performance path, a common sense trade-off is called for. You trade away marginal functionality for valuable performance. The LogSource object has to go. In a constructor, the assignment of a member data field costs a small number of instructions even in the case of a built-in type. The cost per member data field may not be much but it adds up. It grows with the number of data members that are initialized by the constructor.
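The virtual table pointer mentioned above can be made visible by comparing object sizes. This is a rough sketch under stated assumptions: PlainSource, BaseSource, and DerivedSource are hypothetical stand-ins for the LogSource design, and the size difference reflects how typical compilers lay out polymorphic objects, not a guarantee of the language standard.

```cpp
#include <cassert>
#include <type_traits>

// Stand-alone record of a source location: no virtual functions.
struct PlainSource {
    const char *filename;
    int lineNum;
};

// The same data behind a do-nothing polymorphic base, as in the original design.
struct BaseSource {
    virtual ~BaseSource() {}
};

struct DerivedSource : BaseSource {
    const char *filename;
    int lineNum;
};

// True when the polymorphic version occupies more storage than the plain one.
// Common ABIs store a hidden vtable pointer in every polymorphic object, and
// each constructor in the hierarchy has to initialize it.
inline bool polymorphicVersionIsLarger() {
    return sizeof(DerivedSource) > sizeof(PlainSource);
}
```

The extra pointer is the visible part of the cost. The instructions that store it during construction are the part that matters on a critical path.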
The fact that the code using the LogSource object was enclosed in an #ifdef DEBUG bracket provides further evidence that using this object was not essential. The DEBUG compile flag was used only during development testing; the code that was shipped to customers was compiled with DEBUG turned off. When executing in a production environment, we paid the price imposed by the LogSource object, but never actually used it. This was pure overhead. The LogSource should have been completely eliminated by careful #ifdef of all remnants of it. That would include elimination of the reference member of MutexLock as well as the constructor argument. The partial #ifdef of the LogSource object was an example of sloppy development. This is not terribly unusual; it is just that your chances of getting away with sloppy programming in C++ are slim.

The next step is to eliminate the BaseLock root of the lock class hierarchy. In the case of BaseLock, it doesn't contribute any data members and, with the exception of the constructor signature, does not provide any meaningful interface. The contribution of BaseLock to the overall class design is debatable. Even if inlining takes care of the call overhead, the virtual destructor of BaseLock imposes the cost of setting the virtual table pointer in the MutexLock object. Saving a single assignment may not be much, but every little bit helps. Inlining the remaining MutexLock constructor and destructor will eliminate the remaining two function calls.

The combination of eliminating the LogSource class, the BaseLock class, and inlining the MutexLock constructor and destructor will significantly cut down the instruction count. It will generate code that is almost as efficient as hand-coded C. The compiler-generated code with the inlined MutexLock will be equivalent to something like the following pseudocode:

    {
        MutexLock::theKey = key;
        pthread_mutex_lock(&MutexLock::theKey);
        sharedCounter++;
        pthread_mutex_unlock(&MutexLock::theKey);
    }
The above C++ code fragment is almost identical to hand-coded C, and we assumed it would be just as efficient. If that is the case, then the object lock provides the added power of C++ without loss of efficiency. To validate our assumption, we tested three implementations of mutex locks:

• Direct calls to pthread_mutex_lock() and pthread_mutex_unlock()
• A standalone mutex object that does not inherit from a base class
• A mutex object derived from a base class

In the first test we simply surrounded the shared resource with a pair of pthread_mutex_lock() and pthread_mutex_unlock() calls:

    int main() // Version 1
    {
        ...
        // Start timing here
        for (i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&mutex);
            sharedCounter++;
            pthread_mutex_unlock(&mutex);
        }
        // Stop timing here
        ...
    }

In Version 2 we used a lock object, SimpleMutex, using the constructor to lock and the destructor to unlock:

    int main() // Version 2
    {
        ...
        // Start timing here
        for (i = 0; i < 1000000; i++) {
            SimpleMutex m(mutex);
            sharedCounter++;
        }
        // Stop timing here
        ...
    }

SimpleMutex was implemented as follows:

    class SimpleMutex // Version two. Standalone lock class.
    {
    public:
        SimpleMutex(pthread_mutex_t& lock) : myLock(lock) {acquire();}
        ~SimpleMutex() {release();}
    private:
        int acquire() {return pthread_mutex_lock(&myLock);}
        int release() {return pthread_mutex_unlock(&myLock);}

        pthread_mutex_t& myLock;
    };

Inheritance was added in Version 3:

    class BaseMutex // Version 3. Base class.
    {
    public:
        BaseMutex(pthread_mutex_t& lock) {}
        virtual ~BaseMutex() {}
    };

    class DerivedMutex : public BaseMutex // Version 3.
    {
    public:
        DerivedMutex(pthread_mutex_t& lock)
            : BaseMutex(lock), myLock(lock) {acquire();}
        ~DerivedMutex() {release();}
    private:
        int acquire() {return pthread_mutex_lock(&myLock);}
        int release() {return pthread_mutex_unlock(&myLock);}

        pthread_mutex_t& myLock;
    };

In the test loop we replaced SimpleMutex with the DerivedMutex:

    int main() // Version 3
    {
        ...
        // Start timing here
        for (i = 0; i < 1000000; i++) {
            DerivedMutex m(mutex);
            sharedCounter++;
        }
        // Stop timing here
        ...
    }

The timing results of running a million iterations of the test loop validated our assumption. Versions 1 and 2 executed in 1.01 seconds. Version 3, however, took 1.62 seconds. In Version 1 we invoked the mutex calls directly; you cannot get more efficient than that. The moral of the story is that using a standalone object did not exact any performance penalty at all. The constructor and destructor were inlined by the compiler and this implementation achieved maximum efficiency. We paid a significant price, however, for inheritance. The inheritance-based lock object (Version 3) degraded performance by roughly 60% (Figure 2.1).

Figure 2.1. The cost of inheritance in this example.
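The test programs above mark where timing starts and stops but do not show how the time was taken. One possible way to fill that in is sketched below, assuming a C++11 compiler and std::chrono::steady_clock. The timeGuardedIncrements() helper is a hypothetical name, and this is not necessarily how the original measurements were made.

```cpp
#include <cassert>
#include <chrono>
#include <pthread.h>

// Standalone lock object in the style of Version 2.
class SimpleMutex {
public:
    explicit SimpleMutex(pthread_mutex_t &lock) : myLock(lock) {
        pthread_mutex_lock(&myLock);
    }
    ~SimpleMutex() { pthread_mutex_unlock(&myLock); }

private:
    pthread_mutex_t &myLock;
};

// Performs `iterations` guarded increments of `counter` and returns the
// elapsed wall-clock time in seconds.
inline double timeGuardedIncrements(pthread_mutex_t &mutex, long &counter,
                                    long iterations) {
    const auto start = std::chrono::steady_clock::now();  // start timing here
    for (long i = 0; i < iterations; i++) {
        SimpleMutex m(mutex);
        counter++;
    }
    const auto stop = std::chrono::steady_clock::now();   // stop timing here
    return std::chrono::duration<double>(stop - start).count();
}
```

Absolute times depend on the machine, the compiler, and the optimization level, so the figures quoted above should not be expected to reproduce. The comparison between versions is only meaningful when all of them are built with the same settings.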