There is also the option of rolling your own version even if a particular service is already available in a library. Libraries are often designed with flexibility and reusability in mind. Often, flexibility and reusability trade off with performance. If, for some critical code fragment, you choose to put performance considerations above the other two, it might be reasonable to override a library service with your own home-grown implementation. Applications are so diverse in their specific needs, it is hard to design a library that will be the perfect solution for everybody, everywhere, all the time.

• Compiler optimizations

Simply a more descriptive name than "miscellaneous," this category includes all those small coding tricks that don't fit in the other coding categories, such as loop unrolling, lifting constant expressions out of loops, and similar techniques for elimination of computational redundancies. Most compilers will perform many of those optimizations for you. But you cannot count on any specific compiler to perform a specific optimization. One compiler may unroll a loop twice, another will unroll it four times, and yet another compiler will not unroll it at all. For ultimate control, you have to take coding matters into your own hands.
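A minimal sketch of those two tricks (the function and data here are invented purely for illustration) shows a loop-invariant strlen() call lifted out of a loop, and the loop then unrolled by hand:

    #include <cstring>

    // Before: strlen(s) is loop-invariant, but a compiler that cannot
    // prove that will re-evaluate it on every iteration.
    void scale(int *v, const char *s)
    {
        for (std::size_t i = 0; i < std::strlen(s); i++) {
            v[i] *= 2;
        }
    }

    // After: the invariant is lifted out, and the loop is unrolled twice
    // to halve the loop-control overhead (this sketch assumes an even length).
    void scaleFast(int *v, const char *s)
    {
        const std::size_t len = std::strlen(s);
        for (std::size_t i = 0; i < len; i += 2) {
            v[i]     *= 2;
            v[i + 1] *= 2;
        }
    }

A good optimizer may produce scaleFast() from scale() on its own; the point is that you cannot rely on it.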
Our Goal

Many books and articles have extolled the virtues of C++ as a language supporting the OO paradigm. C++ is positioned as the latest cure for the software crisis. For those not familiar with the term, the software crisis is our current inability to develop code that is simple enough to be understood, maintained, and extended by a mere mortal, yet powerful enough to provide solutions to complex problems [CE95]. Developers who migrate from other structured languages to C++ have been bombarded with information pertaining to the use of C++ in creating highly flexible and reusable code that will lend itself nicely to easy maintenance and extension.

One important issue, however, has received little attention: run-time efficiency. We will examine the relevant performance topics from the perspective of C++ programming. After reading this book you should emerge with a clear understanding of the common C++ performance pitfalls and how to avoid them without compromising the clarity and simplicity of your design. In fact, the high-performance solution is frequently also the simplest solution. This book should also help developers produce C++ code as efficient as its C counterpart while still benefiting from the extended features of C++ and the inherent superiority of the OO paradigm.

A famous physicist once said that an expert is one who has made all possible mistakes in a very narrow field. Although making mistakes is a good way to learn, learning from the mistakes of others (the authors, in this case) is even better.

A secondary goal of this book is to construct a one-stop shop for C++ performance issues. As a C++ developer, the answers to your performance concerns are not readily available. They are scattered over a long list of books and magazine articles that address different pieces of this puzzle. You would have to research this topic and put it all together yourself. Not many developers are going to do that. We are too busy. It will be helpful to have a one-stop shop that focuses entirely on the important topic of C++ performance.

Software Efficiency: Does It Matter?

In an era where processor speed doubles every 18 months (Moore's law), do we really need to worry about software efficiency?
The fact is that regardless of phenomenal advances in chip technology, software efficiency is still a prominent concern. In 1971, the Intel 4004 was the first commercial processor to fit on a single chip. It was named a microprocessor. Since then, microprocessor technology has embarked on a 25-year streak of doubling processor speed every 18 months. Today's microprocessors are tens of thousands of times faster than the Intel 4004. If processor speed was the answer to inefficient software, the issue would have been resolved and long forgotten. Yet, software efficiency is still a concern with most development organizations. Why?
Imagine that you are trying to sell your product, say a Web application server, to a Fortune 500 company. They need 600 transactions per second to run their business online. Your application server can support only 60 transactions per second before running out of steam on the latest and greatest server hardware. If the customer is to use your software, they need to string together a cluster of at least 10 servers to reach their 600-transaction-per-second goal, raising the cost of your solution in terms of hardware, software licenses, network administration, and maintenance. To make matters worse, the customer has invited two of your competitors to pitch their own solutions. If a competitor has a more efficient implementation, they will need less hardware to deliver the required performance, and they will offer a cheaper solution. The speed of the processor is a constant in this situation: the software vendors in this story compete over the same hardware. It is often the case that the most efficient solution wins the bid.

You also must examine how processing speed compares to communication speed. If we can transmit data faster than the computer can generate it, then the computer (processor plus software) is the new bottleneck. The limits imposed by physics might soon put the brakes on the fantastic growth of processor speed [Lew1]. Not so for communication speed. Like processing speed, communication speed has enjoyed phenomenal growth. Back in 1970, 4800 bits per second was considered high-speed communication. Today, hundreds of megabits per second is common. The end of the road for communication speed is nowhere in sight [Lew2]. Optical communication technology does not seem to have show-stopping technological roadblocks that will threaten progress in the near future. Several research labs are already experimenting with 100-gigabit-per-second all-optical networking. The biggest obstacle currently is not of a technical nature; it is the infrastructure. High-speed networking necessitates the rewiring of the information society from copper cables to optical fiber. This campaign is already underway.

Communication adapters are already faster than the computing devices attached to them. Emerging network technologies such as 100 Mbps LAN adapters and high-speed ATM switches make computer speed critical. In the past, inefficient software has been masked by slow links. Popular communication protocols such as SNA and TCP/IP could easily overwhelm a 16 Mbps token ring adapter, leaving software performance bottlenecks undetected. Not so with 100 Mbps FDDI or Fast Ethernet. If 1,000 instructions are added to the protocol's send/receive path, they may not degrade throughput on a token ring connection because the protocol implementation can still pump data faster than the token ring can consume it. But an extra 1,000 instructions show up instantly as degraded throughput on a Fast Ethernet adapter. Today, very few computers are capable of saturating a high-speed link, and it is only going to get more difficult. Optical communication technology is now surpassing the growth rate of microprocessor speed. The computer (processor plus software) is quickly becoming the new bottleneck, and it's going to stay that way.

To make a long story short, software performance is important and always will be. This one is not going away. As processor and communication technology march on, they redefine what "fast" means.
They give rise to a new breed of bandwidth- and cycle-hungry applications that push the boundaries of technology. You never have enough horsepower. Software efficiency now becomes even more crucial than before. Whether the growth of processor speed is coming to an end or not, it will definitely trail communication speed. This puts the efficiency burden on the software. Further advances in execution speed will depend heavily on the efficiency of the software, not just the processor.

Terminology

Before moving on, here are a few words to clarify the terminology. "Performance" can stand for various metrics, the most common ones being space efficiency and time efficiency. Space efficiency seeks to minimize the use of memory in a software solution. Likewise, time efficiency seeks to minimize the use of processor cycles. Time efficiency is often represented in terms of response time and throughput. Other metrics include compile time and executable size.

The rapidly falling price of memory has moved the topic of space efficiency for its own sake to the back burner. Desktop PCs with plenty of RAM (Random Access Memory) are common. Corporate customers
are not that concerned about space issues these days. In our work with customers we have encountered concerns with run-time efficiency for the most part. Since customers drive requirements, we will adopt their focus on time efficiency. From here on, we will restrict performance to its time-efficiency interpretation. Generally we will look at space considerations only when they interfere with run-time performance, as in caching and paging.

In discussing time efficiency, we will often mention the terms "pathlength" and "instruction count" interchangeably. Both stand for the number of assembler language instructions generated by a fragment of code. In a RISC architecture, if a code fragment exhibits a reasonable "locality of reference" (i.e., cache hits), the ratio between instruction counts and clock cycles will approximate one. On CISC architectures it may average two or more, but in any event, poor instruction counts always indicate poor execution time, regardless of processor architecture. A good instruction count is necessary but not sufficient for high performance. Consequently, it is a crude performance indicator, but still useful. It will be used in conjunction with time measurements to evaluate efficiency.

Organization of This Book

We start the performance tour close to home with a real-life example. Chapter 1 is a war story of C++ code that exhibited atrocious performance, and what we did to resolve it. This example will drive home some performance lessons that might very well apply to diverse scenarios.

Object-oriented design in C++ might harbor a performance cost. This is what we pay for the power of OO support. The significance of this cost, the factors affecting it, and how and when you can get around it are discussed in Chapters 2, 3, and 4.

Chapter 5 is dedicated to temporaries. The creation of temporary objects is a C++ feature that catches new C++ programmers off guard. C programmers are not used to the C compiler generating significant overhead "under the covers." If you aim to write high-efficiency C++, it is essential that you know when temporaries are generated by the C++ compiler and how to avoid them.
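As a quick preview of that theme, the following sketch (the log() function is hypothetical, invented only for illustration) shows two innocent-looking calls that each manufacture a temporary string object:

    #include <string>

    void log(std::string msg);   // hypothetical: takes its argument by value

    void example(const std::string &a, const std::string &b)
    {
        log(a + b);      // operator+ materializes a temporary string,
                         // which then initializes the by-value parameter
        log("done");     // a temporary std::string is built from the literal
    }

Declaring the parameter as const std::string& would at least eliminate the extra copy into msg; Chapter 5 explores when such temporaries arise and how to avoid them.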
Memory management is the subject of Chapters 6 and 7. Allocating and deallocating memory on the fly is expensive. Functions such as new() and delete() are designed to be flexible and general. They deal with variable-sized memory chunks in a multithreaded environment. As such, their speed is compromised. Oftentimes, you are in a position to make simplifying assumptions about your code that will significantly boost the speed of memory allocation and deallocation. These chapters will discuss several simplifying assumptions that can be made and the efficient memory managers that are designed to leverage them.
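For a taste of what such assumptions buy, here is a minimal sketch (a toy, not one of the memory managers developed in Chapters 6 and 7) that assumes fixed-size blocks and a single thread. Freed blocks are recycled through a free list, so the common case is a couple of pointer operations instead of a trip through the general-purpose heap:

    #include <cstddef>

    class FixedPool {
        struct Node { Node *next; };   // freed blocks are linked through here
    public:
        explicit FixedPool(std::size_t blockSize)
            : blockSize_(blockSize < sizeof(Node) ? sizeof(Node) : blockSize),
              freeList_(0) {}

        void *allocate()
        {
            if (freeList_) {                    // fast path: reuse a freed block
                Node *n = freeList_;
                freeList_ = n->next;
                return n;
            }
            return ::operator new(blockSize_);  // slow path: hit the real heap
        }

        void deallocate(void *p)                // constant time: push on the list
        {
            Node *n = static_cast<Node *>(p);
            n->next = freeList_;
            freeList_ = n;
        }

    private:
        std::size_t blockSize_;
        Node *freeList_;
    };

(A destructor that returns the pooled blocks to the heap is omitted for brevity.)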
Inlining is probably the second most popular performance tip, right after passing objects by reference. It is not as simple as it sounds. The inline keyword, just like register, is just a hint that the compiler often ignores. Situations in which inline is likely to be ignored and other unexpected consequences are discussed in Chapters 8, 9, and 10.

Performance, flexibility, and reuse seldom go hand-in-hand. The Standard Template Library is an attempt to buck that trend and to combine these three into a powerful component. We will examine the performance of the STL in Chapter 11.

Reference counting is a technique often used by experienced C++ programmers. You cannot dedicate a book to C++ performance without coverage of this technique, discussed in Chapter 12.

Software performance cannot always be salvaged by a single "silver bullet" fix. Performance degradation is often a result of many small local inefficiencies, each of which is insignificant by itself. It is the combination that results in a significant degradation. Over the years, while resolving many performance bugs in various C++ products, we have come to identify certain bugs that seem to float to the surface
frequently. We divided the list into two sets: coding and design inefficiencies. The coding set contains "low-hanging fruit": small-scale, local coding optimizations you can perform without needing to understand the overall design. In Chapter 13 we discuss various items of that nature.

The second set contains design optimizations that are global in nature. Those optimizations modify code that is spread across the source code, and are the subject of Chapter 14.

Chapter 15 covers scalability issues, unique performance considerations present in a multiprocessor environment that we don't encounter on a uniprocessor. This chapter discusses design and coding issues aimed at exploiting parallelism. This chapter will also provide some help with the terminology and concepts of multithreaded programming and synchronization. We refer to thread synchronization concepts in several other places in the book. If your exposure to those concepts is limited, Chapter 15 should help level the playing field.

Chapter 16 takes a look at the underlying system. Top-notch performance also necessitates a rudimentary understanding of underlying operating systems and processor architectures. Issues such as caching, paging, and threading are discussed here.
Chapter 1. The Tracing War Story

Every software product we have ever worked on contained tracing functionality in one form or another. Any time your source code exceeds a few thousand lines, tracing becomes essential. It is important for debugging, maintaining, and understanding execution flow of nontrivial software. You would not expect a trace discussion in a performance book, but the reality is, on more than one occasion, we have run into severe performance degradation due to poor implementations of tracing. Even slight inefficiencies can have a dramatic effect on performance.

The goal of this chapter is not necessarily to teach proper trace implementation, but to use the trace vehicle to deliver some important performance principles that often surface in C++ code. The implementation of trace functionality runs into typical C++ performance obstacles, which makes it a good candidate for performance discussion. It is simple and familiar. We don't have to drown you in a sea of irrelevant details in order to highlight the important issues. Yet, simple or not, trace implementations drive home many performance issues that you are likely to encounter in any random fragment of C++ code.

Many C++ programmers define a simple Trace class to print diagnostic information to a log file. Programmers can define a Trace object in each function that they want to trace, and the Trace class can write a message on function entry and function exit. The Trace objects will add extra execution overhead, but they will help a programmer find problems without using a debugger. If your C++ code happens to be embedded as native code in a Java program, using a Java debugger to trace your native code would be a challenge.
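One minimal shape such a class might take is sketched below (an illustration only, not necessarily the implementation discussed in this chapter): the constructor logs function entry and the destructor logs function exit, so a single local object brackets the whole function.

    #include <cstdio>
    #include <string>

    class Trace {
    public:
        Trace(const std::string &name) : name_(name)    // entry message
        {
            std::printf("Enter %s\n", name_.c_str());
        }
        ~Trace()                                        // exit message
        {
            std::printf("Exit %s\n", name_.c_str());
        }
        void debug(const std::string &msg)              // arbitrary message
        {
            std::printf("%s: %s\n", name_.c_str(), msg.c_str());
        }
    private:
        std::string name_;
    };

    void updateAccount()           // hypothetical traced function
    {
        Trace t("updateAccount");  // prints "Enter" here, "Exit" on return
        // ... real work ...
    }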
The most extreme form of trace performance optimization would be to eliminate the performance cost altogether by embedding trace calls inside #ifdef blocks:

    #ifdef TRACE
    Trace t("myFunction"); // Constructor takes a function name argument
    t.debug("Some information message");
    #endif

The weakness of the #ifdef approach is that you must recompile to turn tracing on and off. This is definitely something your customers will not be able to do unless you jump on the free software bandwagon and ship them your source code. Alternatively, you can control tracing dynamically by communicating with the running program. The Trace class implementation could check the trace state prior to logging any trace information:

    void Trace::debug(string &msg)
    {
        if (traceIsActive) {
            // log message here
        }
    }

We don't care about performance when tracing is active. It is assumed that tracing will be turned on only during problem determination. During normal operation, tracing would be inactive by default, and we expect our code to exhibit peak performance. For that to happen, the trace overhead must be minimal. A typical trace statement will look something like

    t.debug("x = " + itoa(x)); // itoa() converts an int to ascii

This typical statement presents a serious performance problem. Even when tracing is off, we still must create the string argument that is passed in to the debug() function. This single statement hides substantial computation:
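To make that concrete, here is a hedged expansion (it assumes itoa() returns a string object, which is what allows the + to compile at all, and a string class that behaves like std::string):

    #include <string>

    std::string itoa(int x);    // assumed helper: converts an int to a string

    void traced(Trace &t, int x)
    {
        // What t.debug("x = " + itoa(x)); does even when tracing is off:
        std::string digits = itoa(x);       // construct a string of digits
        std::string msg = "x = " + digits;  // allocate and copy to concatenate
        t.debug(msg);                       // only inside debug() do we learn
                                            // that tracing is inactive
    }   // msg and digits are destroyed here, though nothing was logged

All of that construction, copying, and destruction is pure overhead whenever tracing is inactive.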