IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 28, NO. 11, NOVEMBER 2017

Efficient Data Center Flow Scheduling Without Starvation Using Expansion Ratio

Sheng Zhang, Member, IEEE, Zhuzhong Qian, Member, IEEE, Hao Wu, and Sanglu Lu, Member, IEEE

Abstract—Existing data center transport protocols are usually based on the Processor Sharing (PS) policy and/or the Shortest Remaining Processing Time (SRPT) policy. PS divides link bandwidth equally between competing flows, so it fails to achieve the optimal average flow completion time (FCT). SRPT prioritizes flows that have the shortest remaining processing time and provides near-optimal average FCT, but it may cause long flows to suffer unfair delays, or even starve them. In fact, these two types of policies represent two directions in the design space: PS prefers fairness (in terms of starvation freedom) while SRPT favors efficiency (in terms of average FCT). In this paper, we propose a novel metric, expansion ratio, which enables us to strike a balance between SRPT and PS. We design MERP, which achieves efficient flow scheduling without starvation. MERP takes care of both average and tail FCTs by minimizing the expansion ratio of competing flows in a lexicographic manner. MERP controls the sending rate of competing flows via synchronized virtual deadlines and routes flows in a downstream-aware manner that reacts quickly to link failures. We evaluate MERP using extensive NS2-based simulations. Results show that, under various traffic loads, MERP reduces the tail FCT significantly with a negligible increase of average FCT compared with pFabric, and MERP reduces the average FCT notably compared with ECMP and CONGA when link failures occur.

Index Terms—Data center, flow completion time, efficiency, starvation freedom, expansion ratio

The authors are with the State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210008, China. E-mail: {sheng, qzz, sanglu}@nju.edu.cn, wuhao@dislab.nju.edu.cn. Manuscript received 4 Apr. 2016; revised 8 Apr. 2017; accepted 15 May 2017. Date of publication 19 May 2017; date of current version 11 Oct. 2017. (Corresponding author: Sheng Zhang.) Recommended for acceptance by H. Jin. Digital Object Identifier no. 10.1109/TPDS.2017.2706290.

1 INTRODUCTION

Today's data centers usually organize online interactive services with the partition-aggregate pattern. The completion time of a large task largely depends on the flow completion times (FCT) of the flows generated in this process. In current data centers, the FCTs of these flows fluctuate dramatically [1]; flows may experience a mean FCT more than twice their theoretical minimum [2], [3]. This is unacceptable, since the response to user requests for those interactive services must be quick enough: even fractions of a second can have an important influence on user experience, which in turn impairs the business revenue of cloud service providers [4], [5].

To achieve low latency, many transport protocols have been proposed in recent years, among which two main techniques are Shortest Remaining Processing Time (SRPT) and Processor Sharing (PS).

SRPT-based policies [2], [6] focus on minimizing the average FCT by prioritizing flows that have the shortest remaining processing time, and they achieve near-optimal average FCT. However, applying SRPT-based policies has several problems. First, long flows are completely ignored by SRPT-based protocols [7]. When short flows contribute the majority of the overall traffic, SRPT could even unfairly "starve" long flows, which probably increases the response time of partition-aggregate-based online services. This is because, in such services, there are flows with various sizes, and the overall response time is often dominated by the "lagger". In this sense, some online services may achieve better performance with a short tail FCT rather than a short average FCT. The second concern with SRPT is that a malicious service could preempt bandwidth resources by simply splitting large flows into many small ones. If all services split their large flows into small ones, none of them achieves better performance than without splitting.

PS-based policies [8], [9], [10] reduce the queueing length at switch ports through novel congestion detection and rate control mechanisms. These approaches usually divide link bandwidth equally among competing flows, and thus short flows are often blocked by long flows even though they could finish earlier. Therefore, the average FCT of PS-based policies is larger than that of SRPT-based policies [2], which implies that we cannot simultaneously achieve the minimum average and the minimum tail FCTs.

We observe that these two types of policies represent two extremes in the design space: SRPT favors efficiency (in terms of average FCT) while PS prefers fairness (in terms of starvation freedom). To strike a balance between them, in this paper we propose a novel metric, the expansion ratio, which is the ratio of the actual FCT of a flow to its optimal FCT. Under this definition, if two flows are delayed on the same link by the same amount of time, then the expansion ratio of the shorter flow is larger than that of the other one; and the expansion ratio of a flow increases as its waiting time increases. These properties enable us to design MERP, a distributed protocol that takes care of both average and tail FCTs.

The rate control in MERP decides the transmitting order of concurrent flows through minimizing the maximum
expansion ratio of them. MERP senders insert some flow states (e.g., flow size and expected sending rate) into each packet header, and these states are updated by intermediate MERP switches. Each MERP switch maintains a flow table that contains the state information (e.g., virtual deadline) of the current active flows that pass through the switch. Switches synchronize their decisions through the expected sending rate in each packet header. The flow with the earliest virtual deadline always monopolizes link resources; the other flows periodically send detecting packets to check whether they can start transmission, and the detecting rate is also effectively controlled by MERP. Extensive NS2-based evaluations show that, compared with pFabric [2], MERP reduces the expansion ratio and tail FCT by more than 15 and 12 percent, respectively, with only a negligible increase of average FCT.

The route control in MERP consists of a novel two-stage routing strategy, which first selects a subset of next-hop ports that minimize the expansion ratio of the newly arriving flow and then chooses one of them to maximize the sending rate of the tail flow of a path. NS2-based evaluations show that, in multi-path data center topologies (e.g., Fat-tree [11]), compared with ECMP and CONGA, MERP reduces the average FCT by up to 31 percent when there are link failures.

We summarize our contributions as follows.

● We propose the expansion ratio metric, which enables fast flow completion and starvation freedom.

● We design MERP, which controls the sending rates of competing flows through synchronizing virtual deadlines stored on intermediate switches, and routes flows in a downstream-aware manner that reacts quickly to link failures.

● We evaluate MERP using extensive NS2-based simulations that confirm our claims.

We now continue by motivating our work in Section 2. Section 3 gives an overview of MERP. We present rate control in Section 4 and route control in Section 5.
We evaluate MERP in Section 6. Before concluding the paper in Section 8, we present related work in Section 7.

2 MOTIVATION

We start by presenting an example to demonstrate the drawbacks of SRPT and PS (Section 2.1). We then introduce the new metric and the design intuition of MERP (Section 2.2).

2.1 Drawbacks of SRPT and PS

Consider the example shown in Table 1, where three flows (A, B, and C) share a common link.

SRPT. SRPT seeks to minimize the average FCT [12]: the flow with the smallest remaining size is always transmitted first. As shown in Fig. 1a, a short flow (e.g., A) monopolizes the link resources until it is completed or preempted by another, shorter flow. When flow A finishes, although flow C arrived earlier than flow B, flow B occupies the link resources, since it has a smaller remaining size than flow C. The average FCT of SRPT is (3 + 3 + 10)/3 = 16/3 ≈ 5.33.

Although SRPT can achieve near-optimal performance in terms of average FCT [2], it may starve long flows. Consider Fig. 1a: if there were a large number of flows with a size of 3 (i.e., similar to flow A), then under SRPT, flow C would be delayed until all these short flows completed. The performance of long flows may be hurt because they are scheduled later when competing with short flows. Unfortunately, many online services follow the partition-aggregate pattern, in which constructing the final response requires all intermediate results from smaller tasks. These intermediate results are flows with different sizes. Hence, the user-perceived latency depends on the tail FCT as well as the average FCT [13]. Therefore, SRPT may starve long flows and thus impair user experience.

PS. PS focuses on achieving fairness between competing flows. If we apply PS to the above example, the switch divides the outgoing link bandwidth equally among the three flows, as shown in Fig. 1b. PS can achieve fairness between competing flows with no need to know their sizes in advance.
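The schedules of Figs. 1a and 1b can be replayed with a small event-stepping sketch. This is our own illustration (the helper names are ours), assuming a unit-capacity link and an idealized fluid model for PS:

```python
# Replay the Table 1 flows under SRPT and fluid PS on a unit-capacity link.
def srpt_fcts(flows):
    """flows: {name: (arrival, size)}. Returns {name: FCT} under SRPT."""
    remaining = {f: s for f, (a, s) in flows.items()}
    t, fct = 0.0, {}
    while remaining:
        active = [f for f in remaining if flows[f][0] <= t]
        if not active:                         # idle: jump to the next arrival
            t = min(flows[f][0] for f in remaining)
            continue
        f = min(active, key=lambda x: remaining[x])   # smallest remaining size
        # run f until it finishes or the next flow arrives (possible preemption)
        future = [flows[g][0] - t for g in remaining if flows[g][0] > t]
        step = min([remaining[f]] + future)
        t += step
        remaining[f] -= step
        if remaining[f] <= 1e-9:
            fct[f] = t - flows[f][0]
            del remaining[f]
    return fct

def ps_fcts(flows):
    """Fluid processor sharing: all active flows split the link equally."""
    remaining = {f: s for f, (a, s) in flows.items()}
    t, fct = 0.0, {}
    while remaining:
        active = [f for f in remaining if flows[f][0] <= t]
        if not active:
            t = min(flows[f][0] for f in remaining)
            continue
        rate = 1.0 / len(active)
        # advance to the next event: some flow finishes or a new flow arrives
        finish = min(remaining[f] / rate for f in active)
        future = [flows[g][0] - t for g in remaining if flows[g][0] > t]
        step = min([finish] + future)
        t += step
        for f in active:
            remaining[f] -= rate * step
            if remaining[f] <= 1e-9:
                fct[f] = t - flows[f][0]
                del remaining[f]
    return fct

flows = {"A": (0, 3), "B": (3, 3), "C": (0, 4)}   # Table 1
print(srpt_fcts(flows))   # FCTs: A = 3, B = 3, C = 10 (average 16/3)
print(ps_fcts(flows))     # FCTs: A = 7.5, B = 7, C = 9.5 (average 8)
```

Under SRPT the sketch yields FCTs of 3, 3, and 10 (average 16/3); under PS it yields 7.5, 7, and 9.5 for flows A, B, and C (average 8), matching Fig. 1.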
However, the average FCT of PS may be far from the minimum. For example, the average FCT of PS in Fig. 1b is (7.5 + 9.5 + 7)/3 = 8, which is much larger than that of SRPT. Therefore, PS may hurt flow completion efficiency.

In summary, current flow scheduling policies can hardly achieve flow completion efficiency (in terms of average FCT) and starvation freedom simultaneously, although these are two primary goals in many applications such as web servers [14]. In this paper, we propose to schedule flows using the expansion ratio metric, and we design the MERP protocol.

TABLE 1
Motivating Example

Flow ID    Arrival Time    Size
A               0            3
B               3            3
C               0            4

Fig. 1. Motivating example. (a) SRPT transmits the flow with the smallest remaining size first. (b) PS shares bandwidth between concurrent flows. (c) MERP schedules flows according to their dynamic expansion ratios.

2.2 Expansion Ratio and Design Intuition

MERP is intended to minimize the average FCT without starving long flows. To realize this, MERP should mimic SRPT while guaranteeing starvation freedom. Therefore, MERP should be able to:

● make short flows transmit data before long flows, since SRPT prioritizes the flow with the smallest remaining size and has near-optimal average FCT;
● guarantee that long flows have a chance to transmit data when they have waited for a sufficiently long time.

MERP uses the expansion ratio to schedule flows, which implies that the expansion ratio of a flow should satisfy the following two conditions: (i) if two flows are delayed on the same link by the same amount of time, then the expansion ratio of the shorter flow is larger than that of the other one; and (ii) the expansion ratio of a flow increases as its waiting time increases. The formal definition of the expansion ratio of a flow is as follows:

Definition 1 (Expansion ratio of a flow). Given a flow scheduling strategy S and a flow f, let fct(S, f) be the actual FCT of f under S, and let fct*(f) be the optimal FCT of f, i.e., its FCT when it monopolizes all link resources immediately after it arrives. The expansion ratio of f under S is defined as

    ER(S, f) = fct(S, f) / fct*(f).    (1)

For example, in Fig. 1a, we have ER(SRPT, A) = 3/3 = 1 and ER(SRPT, C) = 10/4 = 2.5; in Fig. 1b, we have ER(PS, A) = 7.5/3 = 2.5 and ER(PS, C) = 9.5/4 = 2.375.

Let us examine whether Eq. (1) satisfies the above two conditions. Denote the size of flow f by f.size, and the capacity of the link of interest by C. Suppose flow f is delayed by d time before it can monopolize the link; then we have

    ER(S, f) = fct(S, f) / fct*(f) = (d + f.size/C) / (f.size/C) = Cd/f.size + 1.    (2)

We see that, given two flows, when C and d are fixed, the flow with the smaller size has the larger expansion ratio; when C and f.size are fixed, the expansion ratio of any flow increases as d increases. Thus, Eq. (1) indeed satisfies the two conditions.

We now show how MERP schedules competing flows based on this metric. We first extend the definition of the expansion ratio to a set of flows.

Definition 2 (Expansion ratio of a set of flows).
Given a scheduling strategy S and a set of n flows f_1, f_2, ..., f_n, the expansion ratio of these flows is defined as

    ER(S, f[1, n]) = max_{i ∈ [1, n]} ER(S, f_i).    (3)

That is, the expansion ratio of a set of flows is the maximum expansion ratio over the flows in the set. In fact, given a set of n competing flows, MERP tries to minimize ER(S, f[1, n]) (rather than the average FCT, as SRPT does). Equivalently,

    MERP = arg min_S ER(S, f[1, n]).    (4)

Take the scenario in Table 1 as an example. If we use SRPT, then the expansion ratios of flows A, B, and C are 3/3 = 1, 3/3 = 1, and 10/4 = 2.5, respectively; the expansion ratio of the set is 2.5. Can we do better? If we use MERP, which tries to minimize the maximum expansion ratio, the optimal scheduling is shown in Fig. 1c, and the expansion ratio of the set is 7/3. We find that, although flow C has a larger size than flow B, it monopolizes the link bandwidth before flow B does, which cannot happen in SRPT-based policies. We note that the average FCT of MERP is 5 2/3, which is only slightly higher than that achieved by SRPT.

With this example, we see that MERP can gracefully navigate the tradeoff between SRPT and PS. Under MERP, short flows are typically scheduled before long flows, as in SRPT; but as time goes on, the expansion ratio of a long flow increases, and in order to minimize the maximum expansion ratio, MERP may prefer long flows that have waited for a sufficiently long time over short flows that have waited only a short while. In doing so, MERP can effectively prevent any starvation. Therefore, MERP achieves efficient flow scheduling without starvation.

3 MERP OVERVIEW

In this section, we give an overview of MERP by presenting its packet priorities and framework.

We define 9 packet types in MERP: SYN, DET, DATA, and FIN for MERP senders, and SYNACK, DETACK, BOOST, DATAACK, and FINACK for MERP receivers. SYN/SYNACK packets are used in the handshaking phase to establish a connection.
DET/DETACK packets are used by the senders and receivers of postponed flows to detect the network status. DATA/DATAACK packets are used to transmit application-layer data. BOOST packets are used by a postponed flow's receiver to quickly notify the corresponding sender that it can transmit DATA packets immediately. FIN/FINACK packets are used to close a connection.

We have 3 different packet priorities, as shown in Table 2. To reduce the connection setup delay of each flow and the waiting delay of each postponed flow, SYN, SYNACK, and BOOST packets have the highest priority. DATA, DATAACK, FIN, and FINACK packets have middle priority. To enable postponed flows to quickly perceive the network status, MERP allows each postponed flow to send detecting packets at a high rate, which is effectively controlled by MERP. However, this may result in significant bandwidth overhead, especially when there are massive numbers of flows on the same link. Therefore, when the buffer of a switch overflows, MERP should discard the least important packets. Which kinds of packets are the least important? Obviously, DET and DETACK packets, so they have the lowest priority.

TABLE 2
MERP Packet Priorities

Packet type                     Priority
SYN, SYNACK, BOOST              high
DATA, DATAACK, FIN, FINACK      middle
DET, DETACK                     low

MERP provides route control as well as rate control. Fig. 2 shows the framework of MERP. Each MERP switch maintains a flow table that contains the state information of the current active flows that pass through it. Different packet types trigger different actions/algorithms on MERP switches. Note that the five types of ACK packets do not trigger actions on switches.

The rate control in MERP controls the sending rate of each flow. MERP decides the transmitting order of competing flows through minimizing the maximum expansion
ratio of them. Since different network links have different contentions and capacities, this heterogeneity must be reconciled among switches, which MERP achieves by inserting flow states into packet headers (e.g., pkt.asr and pkt.esr in Fig. 2). MERP also reserves some bandwidth (controlled by Algorithm 3) for postponed flows to detect current network status, and flows with different virtual deadlines have different detecting rates. The details can be found in Section 4.

The route control in MERP is responsible for selecting a path for each flow in the handshaking phase. When a SYN packet arrives, MERP first selects a subset of next-hops as candidates by minimizing the expansion ratio of this flow, then chooses one of the candidate paths by maximizing the expected sending rate of the tail flow on that path. In multipath data center topologies, ECMP is the dominant protocol for load balancing. Compared with ECMP, the two-stage route control of MERP has several desirable properties: first, it is immune to hash collisions; second, it is not sensitive to the flow size distribution; and lastly, it reacts quickly to link failures. The details can be found in Section 5.

4 RATE CONTROL

In this section, we assume the routing path of each flow is known ahead of time and focus on the rate control component. We first introduce the MERP sender (Section 4.1), receiver (Section 4.2), and switch (Section 4.3), then briefly summarize the section with discussions (Section 4.4). We present the route control component in the next section.

4.1 MERP Sender

MERP is implemented as a distributed protocol, so it must keep the order in which competing flows are scheduled consistent across switches. How? MERP utilizes the flow states inserted in packet headers to synchronize decisions across switches.
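To illustrate this header-based synchronization, the following Python sketch shows the min-esr update a switch applies to each passing SYN, DATA, or DET packet (detailed later in Algorithm 1): the packet header and the switch's flow table converge on the bottleneck's expected sending rate, from which the virtual deadline is recomputed. Representing the packet and the flow-table entry as dicts is our own simplification for illustration.

```python
def sync_esr(pkt, flow):
    """Propagate the minimum expected sending rate between a packet header
    and a switch's flow-table entry, then recompute the virtual deadline."""
    esr = min(pkt["esr"], flow["esr"])   # adopt the bottleneck's view
    pkt["esr"] = esr                     # carried downstream in the header
    flow["esr"] = esr                    # remembered by this switch
    flow["vdl"] = flow["aT"] + flow["size"] / esr   # vdl = aT + size / esr

pkt = {"esr": 0.5}                                   # rate seen upstream
flow = {"esr": 0.8, "aT": 0.0, "size": 2.0, "vdl": None}
sync_esr(pkt, flow)
print(flow["esr"], flow["vdl"])   # 0.5 4.0
```

Because every switch on the path applies the same min operation, all switches end up using the smallest esr seen anywhere along the path, which is what makes their virtual deadlines consistent.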
The MERP sender inserts four flow states into each MERP packet header: flow size (size), actual sending rate (asr), expected sending rate (esr), and detecting rate threshold (Q). Similar to previous studies [2], [6], [15], [16], [17], [18], we assume flow sizes are known ahead of time at the transport layer. The actual sending rate is updated by intermediate switches; a MERP sender sends DATA or DET packets at this rate. It is initialized to the capacity of the link adjacent to the sender. The expected sending rate is also updated by intermediate switches, but it is used by the switches themselves to synchronize their flow-scheduling decisions. It is initialized to infinity and is not used by any MERP sender to adjust its sending rate of DATA or DET packets. We explain how asr and esr are updated in Section 4.3. The detecting rate threshold is used by switches to limit the sending rate of DET packets.

A MERP sender tries to establish a connection by sending SYN packets. If the asr in the SYNACK or DETACK packets is larger than Q, the sender transmits DATA packets; otherwise it transmits DET packets; both are transmitted at a rate of asr. When receiving a BOOST packet, the sender immediately starts to transmit DATA packets. The connection is terminated by FIN packets. Note that MERP is preemptive, i.e., a flow may be postponed before it completes its transmission.

Retransmission is controlled by the retransmission timeout (RTO). We adopt a similar approach to estimating the round-trip time (RTT) and RTO as in TCP [19]: an exponentially weighted moving average over sampled RTTs is used as the current RTT, and the RTO is set to the current RTT plus four times the RTT deviation.

4.2 MERP Receiver

When receiving a SYN/DATA/FIN packet, the receiver copies asr and esr from the received packet into the corresponding ACK packet, and sends the ACK packet.
When receiving a DET packet, if the asr in the DET packet is larger than Q, a BOOST packet is sent back instead of a DETACK packet; otherwise, the receiver does the same as for a SYN/DATA/FIN packet.

4.3 MERP Switch

To minimize the maximum expansion ratio of competing flows, MERP must synchronize rate control decisions among MERP switches in a distributed way. To achieve this, we design three main components in each MERP switch:

Distributed rate controller (DRC): this component decides the actions a MERP switch takes when a packet arrives at the switch.

Virtual deadline generator (VDG): whenever a MERP switch receives a SYN packet (i.e., a new flow arrives), VDG is invoked to update the virtual deadlines of all flows.

Rate adjustment (RateAdj): whenever a SYN, DATA, or DET packet leaves a MERP switch, RateAdj is invoked to update the actual sending rate in the packet header.

4.3.1 Distributed Rate Controller

Each MERP switch maintains a flow state table T that stores the ID (id), flow size (size), remaining flow size (rs), arrival time (aT), virtual deadline (vdl), and expected sending rate (esr) of each flow f. When a packet arrives at a switch, if the buffer (a priority queue) is not full, MERP adds the packet to the queue; otherwise, MERP discards the lowest-priority packet in the queue and then adds the new packet. For dequeueing, MERP always takes out the highest-priority packet.

Algorithm 1 gives the details of the Distributed Rate Controller. When a switch receives a packet, if the packet type is FIN, we delete the corresponding flow states from the state table T. If it is a SYN packet, we update the virtual deadlines of all flows using Algorithm 2.

Fig. 2. MERP framework.
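The enqueue/dequeue policy just described can be sketched as a small bounded priority buffer. This is a simplified model for illustration, not the paper's switch implementation; on overflow it drops the overall least-important packet, and it dequeues strictly by the three MERP priorities of Table 2.

```python
import heapq
import itertools

# Priority classes from Table 2: smaller number = more important.
PRIO = {"SYN": 0, "SYNACK": 0, "BOOST": 0,
        "DATA": 1, "DATAACK": 1, "FIN": 1, "FINACK": 1,
        "DET": 2, "DETACK": 2}

class SwitchBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []                   # min-heap keyed by (priority, seq)
        self.seq = itertools.count()     # FIFO tie-break within a priority

    def enqueue(self, pkt_type):
        heapq.heappush(self.heap, (PRIO[pkt_type], next(self.seq), pkt_type))
        if len(self.heap) > self.capacity:
            victim = max(self.heap)      # lowest priority (newest among ties)
            self.heap.remove(victim)     # drop the least important packet
            heapq.heapify(self.heap)

    def dequeue(self):
        return heapq.heappop(self.heap)[2]   # highest priority first

buf = SwitchBuffer(capacity=2)
for p in ["DET", "DATA", "SYN"]:
    buf.enqueue(p)                       # overflow: the DET packet is dropped
print(buf.dequeue(), buf.dequeue())      # SYN DATA
```

Note how the drop rule and the dequeue rule together implement the text above: DET/DETACK traffic is the first to be sacrificed under congestion, while SYN/SYNACK/BOOST packets always cut ahead.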
Algorithm 1. Distributed Rate Controller, DRC(pkt ∈ f)
Input: flow state table T
1: if pkt.type = FIN then
2:   T.delete(f)
3:   return
4: end if
5: if pkt.type = SYN then
6:   VDG(pkt)   ▷ Algorithm 2
7: end if
8: if pkt.type = SYN/DATA/DET then
9:   pkt.esr ← min{pkt.esr, T.f.esr}
10:  T.f.esr ← pkt.esr
11:  T.f.vdl ← T.f.aT + T.f.size / T.f.esr
12:  RateAdj(pkt)   ▷ Algorithm 3
13: end if
14: return

Algorithm 2. Virtual Deadline Generator, VDG(pkt ∈ f)
Input: flow state table T, link capacity C, and current time cT
Output: new flow state table T′
1: f.aT ← cT
2: T.insert(f)
3: tRestSize ← Σ_{f_i ∈ T} f_i.rs
4: T′ ← ∅
5: while T ≠ ∅ do
6:   minExpRatio ← ∞
7:   for each flow f_i ∈ T do
8:     expRatio ← ((cT − f_i.aT) · C + tRestSize) / f_i.size
9:     if expRatio < minExpRatio then
10:      minExpRatio ← expRatio
11:      lastFlow ← f_i
12:    end if
13:  end for
14:  T.delete(lastFlow)
15:  lastFlow.vdl ← cT + tRestSize / C
16:  lastFlow.esr ← lastFlow.size / (lastFlow.vdl − lastFlow.aT)
17:  T′.insert(lastFlow)
18:  tRestSize ← tRestSize − lastFlow.rs
19: end while
20: return T′

However, different network links along the path between a sender and a receiver may have heterogeneous capacities; hence, a flow may have different virtual deadlines on different switches. We leverage esr to make these virtual deadlines consistent, and lines 8-13 show the trick: whenever there is a SYN, DATA, or DET packet, we update the esr in the packet header and the esr in the flow state table T to the minimum of the two, recalculate the virtual deadline vdl, and update the actual sending rate asr using Algorithm 3.

4.3.2 Virtual Deadline Generator

Each MERP switch updates the flow state table using the virtual deadline generator whenever a new flow arrives (indicated by a SYN packet). The objective of VDG is to minimize the maximum expansion ratio of a set of flows. Let us first analyse what the expansion ratio of flow f_i looks like. Without loss of generality, we assume there are n flows passing through the switch of interest, F = {f_1, f_2, ..., f_n}.
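As a concrete rendering of Algorithm 2, the following Python sketch peels flows off back-to-front: the flow whose expansion ratio would be smallest if scheduled last is fixed as the last flow, its virtual deadline and expected sending rate are set, and the process repeats on the rest. Representing table entries as dicts with the fields aT, size, and rs is our own choice for illustration.

```python
def vdg(flows, C, cT):
    """Greedy back-to-front ordering of Algorithm 2 (VDG).
    flows: dicts with aT (arrival time), size, rs (remaining size)."""
    rest = sum(f["rs"] for f in flows)          # line 3: total remaining size
    pending = list(flows)
    ordered = []                                # T', filled back-to-front
    while pending:
        # lines 6-13: smallest "if scheduled last" expansion ratio, Eq. (5)
        last = min(pending,
                   key=lambda f: ((cT - f["aT"]) * C + rest) / f["size"])
        pending.remove(last)                    # line 14
        last["vdl"] = cT + rest / C             # line 15
        last["esr"] = last["size"] / (last["vdl"] - last["aT"])  # line 16
        ordered.append(last)
        rest -= last["rs"]                      # line 18
    return ordered[::-1]                        # earliest virtual deadline first

flows = [{"aT": 0.0, "size": s, "rs": s} for s in (1.0, 2.0, 4.0)]
order = vdg(flows, C=1.0, cT=0.0)
print([f["vdl"] for f in order])    # [1.0, 3.0, 7.0]
```

With simultaneous arrivals, the resulting order coincides with SRPT (shortest first); the difference appears once long flows have waited, since the (cT − aT)·C term in the ratio grows with waiting time and eventually promotes them.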
Note that VDG does not need to know this set of flows ahead of time. Denote by B_i the set of flows scheduled before f_i, and by A_i the set of flows scheduled after f_i. Then we can use {B_i, f_i, A_i} to represent a scheduling order. Let the link capacity be C and the current time be cT. According to Eq. (2), the expansion ratio of f_i with respect to a scheduling order {B_i, f_i, A_i} can be represented as

ER({B_i, f_i, A_i}, f_i) = [(cT − f_i.aT) + Σ_{f_j ∈ B_i} f_j.rs / C + f_i.rs / C] / (f_i.size / C)
                         = [(cT − f_i.aT) · C + Σ_{f_j ∈ B_i ∪ {f_i}} f_j.rs] / f_i.size.  (5)

Algorithm 3. Rate Adjustment, RateAdj(pkt ∈ f)
1: if f.vdl is the smallest then
2:   pkt.asr ← min{α · C, pkt.asr}
3: else
4:   if f.vdl is the second smallest then
5:     f_k ← the flow with the smallest virtual deadline
6:     if f_k.rs < α · C · RTT then
7:       pkt.asr ← min{α · C, pkt.asr}   ▷ early start
8:     else
9:       pkt.asr ← min{(1 − α) · C / (n − 1), Q, pkt.asr}
10:    end if
11:  else
12:    pkt.asr ← min{(1 − α) · C / (n − 1), Q, pkt.asr}
13:  end if
14: end if

It is not hard to see that ER({B_i, f_i, A_i}, f_i) is maximized when B_i = F \ {f_i} and A_i = ∅. That is, when the set of flows is fixed, the expansion ratio of a flow is maximized if it is scheduled as the last flow, which also means the maximum expansion ratio of these n flows is bounded by

max_{i ∈ {1,2,...,n}} ER({F \ {f_i}, f_i, ∅}, f_i).  (6)

We then have an optimal algorithm, shown in Algorithm 2, that determines the scheduling order of a set of flows leading to the minimum maximum expansion ratio. The basic idea is to find the flow that, if scheduled last, has the smallest expansion ratio. Specifically, for each flow, we compute its expansion ratio assuming it is the last flow; the flow with the smallest such ratio is set as the last flow. By doing so, we reduce the problem size by one, and we repeat this process on the remaining (n − 1) flows. Here are some brief notes on Algorithm 2.
Line 3 gives the total remaining size of all flows; lines 6-13 determine the last flow, i.e., the one that incurs the smallest expansion ratio; line 8 follows from Eq. (5); line 15 updates the virtual deadline of the last flow, which will be used to determine the actual sending rate of each flow (see Algorithm 3); line 16 updates the expected sending rate, which is used to synchronize virtual deadlines among different switches (see lines 8-13 in Algorithm 1). Denote by t(n) the running time of Algorithm 2 with respect to the number of