by payload can quickly consume huge amounts of memory. For example, on a fully loaded 1 Gbps link, this naive approach could generate a 1 GByte table in less than 10 seconds. Memory consumption can be reduced considerably by indexing the table using a fixed size hash of the packet payload instead of the full payload. After a certain hash value has repeated x − 1 times, the next packet with this hash is reported. In the absence of collisions, the associated content will have appeared exactly x times. By selecting a hash function with suitably large range (e.g., 32 or 64 bits) the collision probability can be minimized. Assuming a 16 byte hash table entry and an average packet size of 500 bytes, this algorithm would take over 4 minutes to generate the same 1 GByte table.

Memory efficiency can be improved further by observing that identifying prevalent content is isomorphic to the well-studied problem of identifying high-bandwidth flows, frequently called “heavy hitters” [13, 10]. By modifying the definition of “flow” to reflect content fingerprints instead of the (srcip, dstip, srcport, dstport, protocol) tuple used for flow analysis, heavy-hitter approximation algorithms can be used to find prevalent content using comparatively small amounts of memory. Our prototype uses multi-stage filters with conservative update to dramatically reduce the memory footprint of the problem (see Figure 2 for a general description and [13, 10] for a thorough analysis). While simple, we believe this notion of using a content signature as a “flow identifier” on which to maintain counters is a powerful technique.2

An important modification is to append the destination port and protocol to the content before hashing. Since worms typically target a particular service (they are designed to exploit a vulnerability in that service), this will not impact the ability to track worm traffic, but it can effectively exclude large amounts of prevalent content not generated by worms (i.e., potential false positives).3 For example, if two users on the same network both download the Yahoo home page, they will receive many packets with identical payloads. However, traffic sent from the Web server will be directed to a so-called “ephemeral” port selected by each client. Since these ports are selected independently, adding them to the hash input will generally differentiate these different clients even when the content being carried is identical.

2 We are not the first to use hashing techniques to analyze the content makeup of network traffic. Snoeren et al. and Duffield et al. both use hashing to match packet observations across a network [38, 7], and both Spring et al. and Muthitacharoen et al. use Rabin fingerprints for compressing content sent over a network [40, 27].

3 Note that it is possible for this assumption to be violated under unusual circumstances. In particular, the Witty worm exploited promiscuous network devices and only required a fixed source port to exploit its vulnerability; the destination port was random [6]. Catching this worm required us to maintain an additional matching table in which the source port is appended to the hash output instead.
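To make the content-prevalence stage concrete, the sketch below shows a multi-stage filter with conservative update keyed on the payload with the destination port and protocol appended. It is a minimal illustration only: the names (MultiStageFilter, content_key), the stage and counter sizes, the threshold, and the use of SHA-256 slices as per-stage hashes are assumptions made for this example, not details of the prototype (see Figure 2 and [13, 10] for the actual structure).

    import hashlib

    class MultiStageFilter:
        """Minimal sketch of a multi-stage filter with conservative update.

        Each stage is an array of counters indexed by an independent hash of
        the key.  A key's estimated count is the minimum over its counters;
        with conservative update only the counters currently at that minimum
        are incremented, which limits over-counting caused by collisions.
        """

        def __init__(self, num_stages=4, counters_per_stage=1 << 16, threshold=3):
            self.num_stages = num_stages
            self.size = counters_per_stage
            self.threshold = threshold
            self.stages = [[0] * counters_per_stage for _ in range(num_stages)]

        def _indices(self, key: bytes):
            # One counter index per stage, taken from disjoint slices of a digest.
            digest = hashlib.sha256(key).digest()
            return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size
                    for i in range(self.num_stages)]

        def update(self, key: bytes) -> bool:
            """Record one occurrence of key; return True once it looks prevalent."""
            idx = self._indices(key)
            low = min(self.stages[s][i] for s, i in enumerate(idx))
            for s, i in enumerate(idx):
                if self.stages[s][i] == low:      # conservative update
                    self.stages[s][i] += 1
            return low + 1 >= self.threshold

    def content_key(payload: bytes, dst_port: int, protocol: int) -> bytes:
        # Append destination port and protocol so that identical content sent to
        # independently chosen ephemeral ports (e.g., web server replies) does
        # not register as prevalent worm content.
        return payload + dst_port.to_bytes(2, "big") + bytes([protocol])

    # Example: the same (content, port 80, TCP) key seen three times trips the filter.
    mf = MultiStageFilter(threshold=3)
    for _ in range(3):
        prevalent = mf.update(content_key(b"...invariant worm bytes...", 80, 6))
    print(prevalent)   # True once the estimated repetition count reaches the threshold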
So far, we have only discussed content at the whole packet granularity. While this is sufficient for detecting most existing worms, it is easy to envision worms for which the invariant content is a string smaller than a single packet or for which the invariant content occurs at different offsets in each instance.

However, detecting common strings of at least a minimum length is computationally complex. Instead we address the related, yet far easier, problem of detecting repeating strings with a small fixed length β. As with full packet contents, storing individual substrings can require exorbitant memory and computational resources. Instead, we use a method similar to the one proposed by Manber for finding similar files in a large file system [20]. We compute a variant of Rabin fingerprints for all possible substrings of a certain length [32]. As these fingerprints are polynomials, they can be computed incrementally while retaining the property that two equal substrings will generate the same fingerprint, no matter where they are in a packet. However, each packet with a payload of s bytes has s − β + 1 strings of length β, so the number of memory references used per packet is still substantially greater than that consumed by a single per-packet hash. In Section 5.3, we describe a technique called value sampling to considerably reduce memory references.
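The incremental property is what makes per-substring fingerprinting affordable at all. As a simplified stand-in for the Rabin fingerprint variant used in the system [32], the sketch below uses a Rabin-Karp style polynomial rolling hash over a prime modulus; the constants (β = 40, the base, the modulus) and the function name are illustrative assumptions. Each successive length-β fingerprint is derived from the previous one in constant time, regardless of where the substring falls in the packet.

    BETA = 40                    # substring length beta; illustrative value
    MOD = (1 << 61) - 1          # large prime modulus keeps collisions rare
    BASE = 257                   # polynomial base

    def substring_fingerprints(payload: bytes, beta: int = BETA):
        """Yield a fingerprint for every length-beta substring of payload."""
        if len(payload) < beta:
            return
        top = pow(BASE, beta - 1, MOD)       # weight of the byte leaving the window
        fp = 0
        for byte in payload[:beta]:          # fingerprint of the first window
            fp = (fp * BASE + byte) % MOD
        yield fp
        for i in range(beta, len(payload)):  # slide the window one byte at a time
            fp = ((fp - payload[i - beta] * top) * BASE + payload[i]) % MOD
            yield fp

    # A packet with s bytes of payload yields s - beta + 1 fingerprints,
    # each of which is then checked against the prevalence filter.
    print(sum(1 for _ in substring_fingerprints(bytes(1000))))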
5.2 Estimating address dispersion

While content prevalence is the key metric for identifying potential worm signatures, address dispersion is critical for avoiding false positives among this set. Without this additional test a system could not distinguish between a worm and a piece of content that frequently occurs between two computers, for example a mail client that repeatedly sends the same user name as it checks for new mail on the mail server.

To quantify address dispersion one must count the distinct source IP addresses and destination IP addresses associated with each piece of content suspected of being generated by a worm (note that this is different from the previous content prevalence problem, which only requires estimating the repetitions of each distinct string). While one could simply count source-destination address pairs, counting the source and destination addresses independently allows finer distinctions to be made. For example, mail messages sent to a popular mailing list are associated with many source-destination address pairs, but with only two sources: the mail server of the original sender of the message and the mail server running the list.

While it is possible to count IP addresses exactly using a simple list or hash table, more efficient solutions are needed if there are many pieces of content suspected of being generated by worms. Our solution is to trade off some precision in these counters for dramatic reductions in memory requirements. Our first approach was to appropriate the direct bitmap data structure originally developed for approximate flow counting [46, 11]. Each content source is hashed to a bitmap, the corresponding bit is set, and an alarm is raised when the number of bits set exceeds a threshold. For example, if the dispersion threshold T is 30, the source address is hashed into a bitmap of 32 bits and an alarm is raised if the number of bits set crosses 20 (the value 20 is calculated analytically to account for hash collisions). This approach has minimal memory requirements, but in exchange it loses the ability to estimate the actual values of each counter, which is important for measuring the rate of infection or prioritizing alerts. While other techniques such as probabilistic counting [12] and multiresolution bitmaps [11] can provide accurate counts, they require significantly more memory. For example, a multiresolution bitmap requires 512 bits to count to 1 million.

Instead, we have invented a counting algorithm that leverages the fact that address dispersion continuously increases during an outbreak. Using this observation we devise a new, compact data structure, called a scaled bitmap, that accurately estimates address dispersion using five times less memory than existing algorithms.

The scaled bitmap achieves this reduction by subsampling the range of the hash space. For example, to count up to 64 sources using 32 bits, one might hash sources into a space from 0 to 63 yet only set bits for values that hash between 0 and 31, thus ignoring half of the sources. At the end of a fixed measurement interval, this subsampling is adjusted by scaling the resulting count to estimate the true count (a factor of two in the previous example). Generalizing, we track a continuously increasing count by simply increasing this scaling factor whenever the bitmap is filled. For example, the next configuration of the bitmap might map one quarter of the hash space to a 32 bit bitmap and scale the resulting count by four. This allows the storage of the bitmap to remain constant across an enormous range of counts.

However, once the bitmap is scaled to a new configuration, the addresses that were active throughout the previous configuration are lost, and adjusting for this bias directly can lead to double counting. To minimize these errors, the final scaled bitmap algorithm, shown in Figure 3, uses multiple bitmaps (numbmps = 3 in this example), each mapped to progressively smaller and smaller portions of the hash space. To calculate the count, the estimated numbers of sources hashing to each bitmap are added, and then this sum is divided by the fraction of the hash space covered by all the bitmaps. When the bitmap covering the largest portion of the hash space has too many bits set to be accurate, it is advanced to the next configuration by recycling it: the bitmap is reset and then mapped to the next slice of the hash space (Figure 4). Consequently, each bitmap covers half the hash space covered by its predecessor.

    UpdateBitmap(IP)
      code = Hash(IP)
      level = CountLeadingZeroes(code)
      bitcode = FirstBits(code << (level+1))
      if (level ≥ base and level < base+numbmps)
        SetBit(bitcode, bitmaps[level-base])
        if (level == base and CountBitsSet(bitmaps[0]) == max)
          NextConfiguration()
        endif
      endif

    ComputeEstimate(bitmaps, base)
      numIPs = 0
      for i = 0 to numbmps-1
        numIPs = numIPs + b ln(b/CountBitsNotSet(bitmaps[i]))
      endfor
      correction = 2(2^base − 1)/(2^numbmps − 1) · b ln(b/(b − max))
      return numIPs · 2^base/(1 − 2^−numbmps) + correction

Figure 3: A scaled bitmap uses numbmps bitmaps of size b bits each. The bitmaps cover progressively smaller portions of the hash space. When the bitmap covering the largest portion of the hash space gets too full to be accurate (the number of bits set reaches max), we advance to the next configuration by “recycling” the bitmap (see Figure 4). To compute an estimate of the number of distinct IP addresses, we multiply an estimate of the number of addresses that mapped to the bitmaps by the inverse of the fraction of the hash space covered by the bitmaps. A correction is added to the result to account for the IP addresses that were active in earlier configurations, while the current bitmaps were not in use at their present levels.
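For concreteness, the following is a runnable transcription of the pseudocode in Figure 3. The parameter choices (a 32-bit hash, b = 64 bits per bitmap, numbmps = 3, recycling threshold max = 48) and the class and method names are illustrative assumptions for this sketch, not the system's actual settings.

    import hashlib
    import math

    class ScaledBitmap:
        """Sketch of the scaled bitmap of Figure 3 (illustrative parameters).

        bitmaps[i] holds addresses whose hash has exactly base+i leading zero
        bits, so each bitmap covers half the slice of hash space covered by its
        predecessor.  When bitmaps[0] gets too full, the configuration advances
        ("recycling"): base grows by one and the fullest bitmap is reset and
        reused for the next, smaller slice.
        """

        W = 32                                  # hash width in bits

        def __init__(self, b=64, numbmps=3, max_bits=48):
            self.b = b                          # bits per bitmap
            self.bbits = b.bit_length() - 1     # log2(b), used to pick a bit
            self.numbmps = numbmps
            self.max = max_bits                 # "too full" threshold
            self.base = 0
            self.bitmaps = [0] * numbmps        # each bitmap stored as an int bit-set

        def update(self, ip: str):
            code = int.from_bytes(hashlib.sha256(ip.encode()).digest()[:4], "big")
            level = self.W - code.bit_length()  # CountLeadingZeroes(code)
            if self.base <= level < self.base + self.numbmps:
                # FirstBits(code << (level+1)): bits after the leading 1 pick a bit.
                bitcode = (code << (level + 1)) & ((1 << self.W) - 1)
                bit = bitcode >> (self.W - self.bbits)
                self.bitmaps[level - self.base] |= 1 << bit
                if level == self.base and bin(self.bitmaps[0]).count("1") >= self.max:
                    # NextConfiguration(): recycle the fullest bitmap.
                    self.base += 1
                    self.bitmaps = self.bitmaps[1:] + [0]

        def estimate(self) -> float:
            num_ips = sum(self.b * math.log(self.b / (self.b - bin(bm).count("1")))
                          for bm in self.bitmaps)
            correction = (2 * (2 ** self.base - 1) / (2 ** self.numbmps - 1)
                          * self.b * math.log(self.b / (self.b - self.max)))
            return num_ips * 2 ** self.base / (1 - 2 ** -self.numbmps) + correction

    # Example: feed distinct addresses and read back an approximate count.
    sb = ScaledBitmap()
    for i in range(5000):
        sb.update("10.0.%d.%d" % (i // 256, i % 256))
    print(int(sb.estimate()))   # approximate number of distinct addresses seen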
The first bitmap, the one covering the largest portion of the hash space, is the most important in computing the estimate, but the other bitmaps provide “memory” for counts that are still small and serve to minimize the previously mentioned biases. Consequently, not much correction is needed when these bitmaps become the most important. Formally, we can prove that the maximum ratio between the bias of the algorithm and the number of active addresses is 2/(2^numbmps − 1) [36].

Overall, this new technique allows us to count sources and destinations quite accurately using only 3 bitmaps, with roughly 5 times less memory than previously known techniques [12, 11]. This is critical for practical scaling because it reduces the system’s sensitivity to the effectiveness of the low-pass filter provided by the content prevalence test.

5.3 CPU scaling

Using multistage filters to detect content prevalence and scaled bitmaps to estimate address dispersion decreases memory usage and limits the amount of processing. However, each payload string still requires significant processing. In our prototype implementation (detailed in Section 6), the CPU can easily manage processing each packet payload as a single string, but when applying Rabin fingerprints, the processing of every substring of length β can overload the CPU during high traffic load. For example, a packet with 1,000 bytes of payload and β = 40 requires processing 960 Rabin fingerprints. While computing the Rabin fingerprints themselves incurs overhead, it is the three orders of magnitude