2009 15th International Conference on Parallel and Distributed Systems A Decentralized Storage Scheme for Multi-Dimensional Range Queries over Sensor Networks Lei Xie,Lijun Chen,Daoxu Chen,Li Xie State Key Laboratory of Novel Software Technology Department of Computer Science,Nanjing University,Nanjing,China Email:xielei@dislab.nju.edu.cn,[chenlj.cdx,xieli @nju.edu.cn Abstract-This paper presents the design of a decentralized efficiently handle the event insertions and range query pro- storage scheme to support multi-dimensional range queries over cessing in an optimized approach.Contributions of this paper sensor networks.We build a distributed k-d tree based index are listed as follows: structure over sensor network,so as to efficiently map high dimensional event data to a two-dimensional space of sensors 1.We construct the k-d tree based index structure according while preserving the proximity of events.We propose a dynamic to event and range query distributions at sensor data level programming based methodology to control the granularity of instead of the two-dimensional sensor locations,thus we the index tree in an optimized approach,and an optimized can well utilize data level information to facilitate the event routing scheme for range query processing to achieve best energy efficiency.The simulation results demonstrate the efficiency of the insertion and query processing. design. 2.We reduce the problem of optimal routing for sub-queries to the Euclidean Steiner tree problem,which is NP-hard,and I.INTRODUCTION propose corresponding approximation algorithm to solve it. The rest of the paper is organized as follows.Section II In most sensor network applications,data or events are reviews the related work.Section IlI provides preliminaries continuously collected,forwarded and stored in sink nodes upon which the design theory can be built.The details of for further usage.Recently a Data Centric Storage (DCS) the proposed design methodology are given in Section IV. scheme is proposed to store event data in specific sensor nodes In Section V,we present the multi-dimensional event storage according to the "event type",leveraging the"in-network stor- and query processing mechanism.Section VI presents the age"capability of sensor networks.The "event type"refers to simulation results.Finally,we summarize the paper in Section certain pre-defined constellations of attribute values,however, VII. earlier DCS schemes like GHT [1]conventionally deal with event types with unique attribute and only support "exact II.RELATED WORKS match"point queries.As sensor applications are evolving. Data centric storage schemes [2]mainly utilize a mapping it is mandatory for these event data to be represented by function such as geographic hash function to map event in- several different attributes,such as temperature,humidity, formation to specified geographic points,and leverage routing light,and barometric pressure,which may be referred to as algorithms like GPSR [1]to route messages to corresponding multi-dimensional events.One natural way to query for events storage nodes,these research works include GHT [1].DIMEN- of interest is to use multi-dimensional range queries over SIONs [3]and DIFS [4].GHT is the first work which utilizes these attributes.For example,users may impose queries like a geographic hash table as the mapping function.Based on "List all events that have temperatures between 50.F and 60.F, the primitives in GHT,DIMENSIONs is proposed to support and light levels between 10 and 20".However conventional multi-resolution storage and query processing,which allows DCS schemes are inadequate for managing and processing drill-down search for features within sensor network,while these multi-dimensional events,because processing a query DIFS allows range queries on a single key in addition to over multi-dimensional events is much more complex than other operations.And these above storage schemes are mainly processing a query on single-dimensional events.We need to suitable for processing one-dimensional events. figure out how to efficiently map these high dimensional events The basic problem that this paper addresses-multidimen- to a two-dimensional space of sensors while preserving the sional range queries-is typically solved in database systems proximity of events. using indexing techniques.Among these techniques k-d tree[5] Therefore an efficient,decentralized storage framework is as one of the well-known multi-dimensional binary search essential to facilitate multi-dimensional queries over sensor trees,is proposed to support associative searching works over networks.In this paper we propose a framework which can multi-dimensional data attributes.DIM6]draws its inspiration IEEE 1521-9097/09$26.0002009IEEE 166 Φcomputer DOI10.1109/ICPADS.2009.24 society Authorized licensed use limited to:Nanjing University.Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore.Restrictions apply
A Decentralized Storage Scheme for Multi-Dimensional Range Queries over Sensor Networks Lei Xie, Lijun Chen, Daoxu Chen, Li Xie State Key Laboratory of Novel Software Technology Department of Computer Science, Nanjing University, Nanjing, China Email: xielei@dislab.nju.edu.cn, {chenlj,cdx,xieli}@nju.edu.cn Abstract—This paper presents the design of a decentralized storage scheme to support multi-dimensional range queries over sensor networks. We build a distributed k-d tree based index structure over sensor network, so as to efficiently map high dimensional event data to a two-dimensional space of sensors while preserving the proximity of events. We propose a dynamic programming based methodology to control the granularity of the index tree in an optimized approach, and an optimized routing scheme for range query processing to achieve best energy efficiency. The simulation results demonstrate the efficiency of the design. I. INTRODUCTION In most sensor network applications, data or events are continuously collected, forwarded and stored in sink nodes for further usage. Recently a Data Centric Storage (DCS) scheme is proposed to store event data in specific sensor nodes according to the “event type”, leveraging the “in-network storage” capability of sensor networks. The “event type” refers to certain pre-defined constellations of attribute values, however, earlier DCS schemes like GHT [1] conventionally deal with event types with unique attribute and only support “exact match” point queries. As sensor applications are evolving, it is mandatory for these event data to be represented by several different attributes, such as temperature, humidity, light, and barometric pressure, which may be referred to as multi-dimensional events. One natural way to query for events of interest is to use multi-dimensional range queries over these attributes. For example, users may impose queries like “List all events that have temperatures between 50.F and 60.F, and light levels between 10 and 20”. However conventional DCS schemes are inadequate for managing and processing these multi-dimensional events, because processing a query over multi-dimensional events is much more complex than processing a query on single-dimensional events. We need to figure out how to efficiently map these high dimensional events to a two-dimensional space of sensors while preserving the proximity of events. Therefore an efficient, decentralized storage framework is essential to facilitate multi-dimensional queries over sensor networks. In this paper we propose a framework which can efficiently handle the event insertions and range query processing in an optimized approach. Contributions of this paper are listed as follows: 1. We construct the k-d tree based index structure according to event and range query distributions at sensor data level instead of the two-dimensional sensor locations, thus we can well utilize data level information to facilitate the event insertion and query processing. 2. We reduce the problem of optimal routing for sub-queries to the Euclidean Steiner tree problem, which is NP-hard, and propose corresponding approximation algorithm to solve it. The rest of the paper is organized as follows. Section II reviews the related work. Section III provides preliminaries upon which the design theory can be built. The details of the proposed design methodology are given in Section IV. In Section V, we present the multi-dimensional event storage and query processing mechanism. Section VI presents the simulation results. Finally, we summarize the paper in Section VII. II. RELATED WORKS Data centric storage schemes [2] mainly utilize a mapping function such as geographic hash function to map event information to specified geographic points, and leverage routing algorithms like GPSR [1] to route messages to corresponding storage nodes, these research works include GHT [1], DIMENSIONs [3] and DIFS [4]. GHT is the first work which utilizes a geographic hash table as the mapping function. Based on the primitives in GHT, DIMENSIONs is proposed to support multi-resolution storage and query processing, which allows drill-down search for features within sensor network, while DIFS allows range queries on a single key in addition to other operations. And these above storage schemes are mainly suitable for processing one-dimensional events. The basic problem that this paper addresses - multidimensional range queries - is typically solved in database systems using indexing techniques. Among these techniques k-d tree[5] as one of the well-known multi-dimensional binary search trees, is proposed to support associative searching works over multi-dimensional data attributes. DIM[6] draws its inspiration 2009 15th International Conference on Parallel and Distributed Systems 1521-9097/09 $26.00 © 2009 IEEE DOI 10.1109/ICPADS.2009.24 166 Authorized licensed use limited to: Nanjing University. Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore. Restrictions apply
from the k-d tree index structure,and embeds the k-d tree at 0.25 and 0.75 which divide the ranges to [0,0.25),[0.25 into sensor networks in a decentralized approach.Furthermore, 0.5),[0.5,0.75),and [0.75,1).They repeat this procedure KDDCS [7]presents a load-balanced storage scheme over until it reaches the corresponding leaf node.As an example. sensor networks based on k-d tree for multi-dimensional range consider event E =(0.3,0.8),for this event,the zone code queries.POOL[8]addresses some weakness of DIM such over the aforementioned example k-d tree is code(ZE)=011. as hot spot and scalability problems,and further proposes When answering a range query,DIM finds all zones whose a storage scheme to group index nodes together as a data value ranges overlap with the query range and sends the pool to preserve the neighborhood property of nearby multi- query to index nodes of those zones.For example,consider dimensional data,however,their group mechanism are only a range query(0.3,0.7]0.5,0.d,[0.4,0.8,0.7,0.9]),the based on the greatest and the second greatest attribute values, storage regions of zone code 010,011,110,1111 will be which has not been proved to be efficient enough for high involved. dimensional data processing. Building upon this,DIM achieves energy efficiency by forcing events whose attribute values are "close"to be stored III.PRELIMINARIES at the same or nearby nodes.However,some performance A.Motivation issues still remain to be challenged for the traditional DIM: First,DIM constructs the k-d tree structure only based on the two-dimensional sensor locations,without leveraging the estimated distributions of various data attributes and range queries,thus it may cause inefficiency for non-uniform event 110 010 01 insertions and corresponding range queries,which are actually 品 conventional for sensor network applications.Second,DIM assigns a unique zone for each sensor node,hence as the network size increases,the zones have to be further split into smaller zones,more zones of sensors are inevitably involved 100 101 in the query processing,which cannot be energy efficient (a)DIM's zone tree (b)The zone node and boundaries of DIM for range query processing.Therefore how to control the granularity of the k-d tree based index structure is crucial Fig.1.Example of DIM(Distributed Index for Multi-dimensional data for energy efficiency issue.Third,DIM has not considered an optimized approach to route those split sub-queries to specified storage areas,since the query diffusion cost dominates the It is essential to construct an index structure to map overall energy consumption for multi-dimensional range query the multi-dimensional event data to two-dimensional sensors. processing,it is of utmost importance to devise an optimized Obviously a centralized index for multi-dimensional range routing scheme for query forwarding. queries is not feasible for energy-efficiency reasons of sen- Therefore in this paper we try to address the problems raised sor networks.DIM [6]is the first work to support multi- above and we propose a framework which can efficiently dimensional range queries in sensor networks,it utilizes a handle the event insertions and range query processing in an classical database index,the k-d tree [5],and embeds such optimized approach.Our storage framework is also based on an index in a sensor network.DIM divides sensor nodes into the k-d tree and we mainly focus on achieving the optimized separate zones and each zone contains one exact sensor node granularity of k-d tree structure over sensor networks for best as the index node.Fig.1(a)depicts an example of DIM's k-d energy efficiency,and we further investigate into the optimized tree for zones and Fig.I(b)shows the corresponding zones scheme for range query forwarding which dominates the over the sensor network.Each intermediate node over this overall energy consumption. k-d tree based index structure denotes the split parameters for further branching,which include the specified attribute B.Modelling Event Data Distribution Ai(1 i<m)from the attribute list (A1,...,Am),and Consider there exist k attributes A1,A2,...,Ak for sensor associative splitting ranges for Ai.Each leaf node represents data,for each attribute Ai,we have minimum value min(Vi) the event data within specified multi-dimensional ranges. and maximum value max(Vi),we normalize attribute A;'s We introduce the principle of k-d tree to process multi- value into the region [0,1]by calculating: dimensional attribute data with the following example:For = Vi-min(Vi) attributes (A1,...,Am),assume that the maximum depth of the max(V))-min(V (1) k-d tree is k,k is a multiple of m,and this value of k is known to every node.DIM assigns a k bit zone code to an event as And we represent the data distribution over attribute A;as follows.For i between I and m,if Ai<0.5,the i-th bit of Pi(r),which is exactly a probability density function (pdf) the zone code is assigned 0,else 1.For i between m+1 and over the region [0,1].thus we have 2m,if Ai-m <0.25 or Ai-m E 0.5,0.75),the i-th bit of the zone is assigned 0,else 1,because the next level divisions are Pi(x)dx =1 (2) 167 Authorized licensed use limited to:Nanjing University.Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore.Restrictions apply
from the k-d tree index structure, and embeds the k-d tree into sensor networks in a decentralized approach. Furthermore, KDDCS [7] presents a load-balanced storage scheme over sensor networks based on k-d tree for multi-dimensional range queries. POOL[8] addresses some weakness of DIM such as hot spot and scalability problems, and further proposes a storage scheme to group index nodes together as a data pool to preserve the neighborhood property of nearby multidimensional data, however, their group mechanism are only based on the greatest and the second greatest attribute values, which has not been proved to be efficient enough for high dimensional data processing. III. PRELIMINARIES A. Motivation Fig. 1. Example of DIM (Distributed Index for Multi-dimensional data It is essential to construct an index structure to map the multi-dimensional event data to two-dimensional sensors. Obviously a centralized index for multi-dimensional range queries is not feasible for energy-efficiency reasons of sensor networks. DIM [6] is the first work to support multidimensional range queries in sensor networks, it utilizes a classical database index, the k-d tree [5], and embeds such an index in a sensor network. DIM divides sensor nodes into separate zones and each zone contains one exact sensor node as the index node. Fig.1(a) depicts an example of DIM’s k-d tree for zones and Fig.1(b) shows the corresponding zones over the sensor network. Each intermediate node over this k-d tree based index structure denotes the split parameters for further branching, which include the specified attribute Ai(1 ≤ i ≤ m) from the attribute list hA1, ..., Ami, and associative splitting ranges for Ai . Each leaf node represents the event data within specified multi-dimensional ranges. We introduce the principle of k-d tree to process multidimensional attribute data with the following example: For attributes hA1, ..., Ami, assume that the maximum depth of the k-d tree is k, k is a multiple of m, and this value of k is known to every node. DIM assigns a k bit zone code to an event as follows. For i between 1 and m, if Ai < 0.5, the i-th bit of the zone code is assigned 0, else 1. For i between m + 1 and 2m, if Ai−m < 0.25 or Ai−m ∈ [0.5, 0.75), the i-th bit of the zone is assigned 0, else 1, because the next level divisions are at 0.25 and 0.75 which divide the ranges to [0, 0.25), [0.25, 0.5), [0.5, 0.75), and [0.75, 1). They repeat this procedure until it reaches the corresponding leaf node. As an example, consider event E = (0.3, 0.8), for this event, the zone code over the aforementioned example k-d tree is code (ZE) = 011. When answering a range query, DIM finds all zones whose value ranges overlap with the query range and sends the query to index nodes of those zones. For example, consider a range query h[0.3, 0.7], [0.5, 0.6], [0.4, 0.8], [0.7, 0.9]i, the storage regions of zone code 010, 011, 110, 1111 will be involved. Building upon this, DIM achieves energy efficiency by forcing events whose attribute values are “close” to be stored at the same or nearby nodes. However, some performance issues still remain to be challenged for the traditional DIM: First, DIM constructs the k-d tree structure only based on the two-dimensional sensor locations, without leveraging the estimated distributions of various data attributes and range queries, thus it may cause inefficiency for non-uniform event insertions and corresponding range queries, which are actually conventional for sensor network applications. Second, DIM assigns a unique zone for each sensor node, hence as the network size increases, the zones have to be further split into smaller zones, more zones of sensors are inevitably involved in the query processing, which cannot be energy efficient for range query processing. Therefore how to control the granularity of the k-d tree based index structure is crucial for energy efficiency issue. Third, DIM has not considered an optimized approach to route those split sub-queries to specified storage areas, since the query diffusion cost dominates the overall energy consumption for multi-dimensional range query processing, it is of utmost importance to devise an optimized routing scheme for query forwarding. Therefore in this paper we try to address the problems raised above and we propose a framework which can efficiently handle the event insertions and range query processing in an optimized approach. Our storage framework is also based on the k-d tree and we mainly focus on achieving the optimized granularity of k-d tree structure over sensor networks for best energy efficiency, and we further investigate into the optimized scheme for range query forwarding which dominates the overall energy consumption. B. Modelling Event Data Distribution Consider there exist k attributes A1, A2, ..., Ak for sensor data, for each attribute Ai , we have minimum value min(Vi) and maximum value max(Vi), we normalize attribute Ai’s value into the region [0, 1] by calculating: V ′ i = Vi − min(Vi) max(Vi) − min(Vi) (1) And we represent the data distribution over attribute Ai as pi(x), which is exactly a probability density function (pdf) over the region x ∈ [0, 1], thus we have Z 1 0 pi(x)dx = 1 (2) 167 Authorized licensed use limited to: Nanjing University. Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore. Restrictions apply
we can also define a discrete form of pi(r)as follows, IV.DESIGN OF DECENTRALIZED STORAGE SCHEME FOR we equally divide the region 0,1 into m sub-regions,and MULTI-DIMENSIONAL RANGE OUERIES leverage the histogram method to calculate the probability pi.j,(j=1,...,m)and we have: 01D0 ∑P=1 (3) j=1 Hence consider an index node over the k-d tree with multi- dimensional range[(a1,31),(a2,2),·,(ak,3k小,where a and B;respectively denote the lower bound and upper bound of specified range of attribute Ai,we have 0<ai<Bi<1, assume the above multi-dimensional range is associated with a two-dimensional spacial region Sj,therefore we can calculate (a)The K-d Tree based index structure (b)The corresponding zone node and the probability that a random event data falls into this region boundanes Si as: Fig.2.Decentralized Storage Scheme for Multi-Dimensional Range Queries ProbE(Sj)= Pi(x)dx (4) We visualize the deployment field of a sensor network as Assume the overall event frequency is fe,thus the frequency a number of n x n equal sized grid cells.As DIM mainly for which the specific storage region Sj is accessed is fe. applies for sparse sensor network where some zones may not Probe(Si). have any sensor node,in this paper we consider the situation C.Modelling Range Query Distribution where the node density is high enough for each cell to have at Unlike event distributions over attributes,which are mu- least one storage node candidates,these cells are nominated tually exclusive over each attribute Ai,the range query dis- as the storage units of the DCS scheme,at the center of each tributions over attributes are overlapped because each query cell one exact sensor node is selected as the storage node. is corresponding to a range over the attribute dimensions. A.Indexing Storages over Cells Users can specify an k dimensional range query in the form like ([L,U],[L2,U2],....[L,U])over k attributes,where We construct the index structure based on k-d tree over the 0≤L:≤U≤l,(l≤i≤k),for partial range query which cell-based storage deployment area,on one hand,each index does not include attribute A,,we can set Li=0,U;=1.Thus node over the index structure represents a multi-dimensional we utilize another method to depict the distributions:Given a range for the sensor data,on the other hand,it also associates training set of queries Q,we have Q=(Q1,Q2,...,QN}, with a specified spacial region over the two-dimensional where query Qi has a query range over each attribute Ai, deployment area.The definition of the index structure is con- which is (Li,Ui),hence we define an indicator variable Bi sistent with DIM,except that we utilize the cell as the storage for a specified range (i,Bi)of attribute A:as follows: unit instead of sensor node.each index node corresponds to a spacial region which covers one or more cells.Fig.2 shows an 了1if(a,3)n(Li,U)≠p B;0 otherwise (5) example of our decentralized storage scheme to support multi- dimensional range queries.Fig.2(a)depicts the k-d tree based And we further calculate the query frequency fi for the index structure and Fig.2(b)represents the corresponding zone specified range (i,Bi)over attribute Ai: code and boundaries.The black nodes denote the storage fi=∑B regions specified by the k-d tree and those cell(s)contained (6) in the corresponding regions become(s)the replicate storage HQ,∈Q cells,which means each cell will store the same copy of the Thus the specified attribute range will be queried with proba- multi-dimensional ranged data specified by the index node bility pi=fi/N. over the k-d tree. Assume query distributions over various attributes are mutu- ally independent,given a multi-dimensional data region which B.Controlling Granularity for k-d Tree based Index Structure is associated with a two-dimensional spacial region Sj,we can Now we investigate into the k-d tree based index structure, finally get the probability for the specified storage region S; as already mentioned in Section I,we note that controlling to be queried as: the granularity of the index nodes is very crucial for energy efficiency of multi-dimensional event insertion and range Proba(S)=Πn (7) query diffusion.If we split these sensor data into very refined =1 zones according to multi-dimensional data attributes,thus to Assume the overall query frequency is fo,thus the frequency resolve a conventional range query,we have to split it into for which the specified storage region Sj is queried is fa many sub-range queries for various zones,apparently it cannot Probo(Sj). be energy efficient due to the large number of routing cost for 168 Authorized licensed use limited to:Nanjing University.Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore.Restrictions apply
we can also define a discrete form of pi(x) as follows, we equally divide the region [0, 1] into m sub-regions, and leverage the histogram method to calculate the probability pi,j ,(j = 1, ..., m) and we have: Xm j=1 pi,j = 1 (3) Hence consider an index node over the k-d tree with multidimensional range [(α1, β1),(α2, β2), · · · ,(αk, βk)], where αi and βi respectively denote the lower bound and upper bound of specified range of attribute Ai , we have 0 ≤ αi < βi ≤ 1, assume the above multi-dimensional range is associated with a two-dimensional spacial region Sj , therefore we can calculate the probability that a random event data falls into this region Sj as: P robE (Sj ) = Y k i=1 Z βi αi pi(x)dx (4) Assume the overall event frequency is fe, thus the frequency for which the specific storage region Sj is accessed is fe · P robE (Sj ). C. Modelling Range Query Distribution Unlike event distributions over attributes, which are mutually exclusive over each attribute Ai , the range query distributions over attributes are overlapped because each query is corresponding to a range over the attribute dimensions. Users can specify an k dimensional range query in the form like h[L1, U1], [L2, U2], ..., [Lk, Uk]i over k attributes, where 0 ≤ Li ≤ Ui ≤ 1,(1 ≤ i ≤ k), for partial range query which does not include attribute Ai , we can set Li = 0, Ui = 1. Thus we utilize another method to depict the distributions: Given a training set of queries Q, we have Q = {Q1, Q2, ..., QN }, where query Qj has a query range over each attribute Ai , which is (Li , Ui), hence we define an indicator variable Bi for a specified range (αi , βi) of attribute Ai as follows: Bi = 1 if (αi , βi) ∩ (Li , Ui) 6= φ 0 otherwise (5) And we further calculate the query frequency fi for the specified range (αi , βi) over attribute Ai : fi = X ∀Qj∈Q Bi (6) Thus the specified attribute range will be queried with probability pi = fi/N. Assume query distributions over various attributes are mutually independent, given a multi-dimensional data region which is associated with a two-dimensional spacial region Sj , we can finally get the probability for the specified storage region Sj to be queried as: P robQ(Sj ) = Y k i=1 pi (7) Assume the overall query frequency is fq, thus the frequency for which the specified storage region Sj is queried is fq · P robQ(Sj ). IV. DESIGN OF DECENTRALIZED STORAGE SCHEME FOR MULTI-DIMENSIONAL RANGE QUERIES Fig. 2. Decentralized Storage Scheme for Multi-Dimensional Range Queries We visualize the deployment field of a sensor network as a number of n × n equal sized grid cells. As DIM mainly applies for sparse sensor network where some zones may not have any sensor node, in this paper we consider the situation where the node density is high enough for each cell to have at least one storage node candidates, these cells are nominated as the storage units of the DCS scheme, at the center of each cell one exact sensor node is selected as the storage node. A. Indexing Storages over Cells We construct the index structure based on k-d tree over the cell-based storage deployment area, on one hand, each index node over the index structure represents a multi-dimensional range for the sensor data, on the other hand, it also associates with a specified spacial region over the two-dimensional deployment area. The definition of the index structure is consistent with DIM, except that we utilize the cell as the storage unit instead of sensor node, each index node corresponds to a spacial region which covers one or more cells. Fig.2 shows an example of our decentralized storage scheme to support multidimensional range queries. Fig.2(a) depicts the k-d tree based index structure and Fig.2(b) represents the corresponding zone code and boundaries. The black nodes denote the storage regions specified by the k-d tree and those cell(s) contained in the corresponding regions become(s) the replicate storage cells, which means each cell will store the same copy of the multi-dimensional ranged data specified by the index node over the k-d tree. B. Controlling Granularity for k-d Tree based Index Structure Now we investigate into the k-d tree based index structure, as already mentioned in Section I, we note that controlling the granularity of the index nodes is very crucial for energy efficiency of multi-dimensional event insertion and range query diffusion. If we split these sensor data into very refined zones according to multi-dimensional data attributes, thus to resolve a conventional range query, we have to split it into many sub-range queries for various zones, apparently it cannot be energy efficient due to the large number of routing cost for 168 Authorized licensed use limited to: Nanjing University. Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore. Restrictions apply
query forwarding from query node to these various storage where Dauer(S)denotes the expected distance from an nodes.Based on the above observation,keeping these ranges arbitrary query node to the nearest storage node in the storage of data to be hosted in a relatively small number of zones region.And we can simply deduce that Dquery(Si)equals to may be more efficient,but obviously the extreme situation to Devent(Si). keep all ranges of data in one unique zone (with replicated storage scheme)is also not energy efficient,because although the query forwarding cost can be effectively reduced,large amount of energy will be drained for routing the same copy (X, (X,Y) of sensor data to all replicate storages.Based on the above analysis,we must find an optimized storage strategy to control ● the granularity for the index structure as a tradeoff between the two extremum situations to achieve best energy efficiency. Therefore to construct an optimized k-d tree based index Y (X..Y) structure over the sensor network,if we build the k-d tree ● from the root,then at each splitting step of the k-d tree,we have two optional strategies:one strategy is to conduct further splitting for the specified zones,another strategy is to stop Fig.3.Estimate the value for Dquery(Si)or Devent(i) further splitting and thus apply the replicate storage scheme to all cells covered by the specified zones.The most crucial We estimate the expectations of Dquery(Si)or Devent(Si) thing to construct the index structure is how to decide the as follows:for any event/query imposed inside the replicate storage region S;,as it only needs to find a nearest storage optimized strategy at each splitting step along the k-d tree,so as to achieve the best energy efficiency. node inside Si,the expected distance is less than half the cell size,which is d/2;for any event/query imposed outside To solve the above optimization problem,we start by the replicate storage region Si,as Fig.3 shows,we calculate first analyzing the energy cost for event insertion and query the distance according to its quadrant relative to Si,by processing over a specified storage region.In this paper we utilize the Euclidean Distance as the metric of energy summarizing the average distance from arbitrary points to consumption,which means the routing cost between two nodes the storage region,thus we can estimate the expectations of is represented by the Euclidean distance between them.For in- Dquery(Si)or Devent(Si). sertion of event,we leverage the double ruling scheme[9][10] As mentioned before,if we construct a k-d tree based index to distribute the event information to all relevant replicate structure by starting from the root,for each index node at the storage nodes in the storage region.For one specified storage splitting step,we have two optional strategies to choose:to region,the event node first routes event message to the nearest apply the replicate storage scheme over Si or conduct further storage node inside the region,and further forwards it both splitting?We know that the energy cost for replicate storage horizontally and vertically to all storage nodes covered by scheme over Si is: the storage region.Thus the expected energy cost for event Erep(Sa)=fe·ProbE(Si)·Eevent((Si) distribution over the specified storage region Si is: +fq·Probo(S)·Equery(S) (11) Eevent(Si)=Devent(Si)+d.(N(Si)-1) (8) The energy cost Erep(Si)includes the overall event inser- where Devent(Si)denotes the expected distance from an tion cost,as we know the event access frequency for Si arbitrary event node to the nearest storage node in storage is fe.Probe(Si)and the energy cost for each insertion is region Si,and d is the cell size,N(Si)denotes the number Eevent(Si),and it also includes the overall query diffusion of storage nodes inside Si,as each storage node inside Si cost with query access frequency faProb(S)and energy receives exactly one message except the first accessed storage cost for each query forwarding Eqer(S). node,thus the diffusion cost inside S;is d.(N(S;)-1). And consider the k-d tree based index structure,let us Assume the root of the k-d tree is at level 0,and the k-d tree denote Tichild(Si)as the zone associated with left child of indexes the cell unit at level h,(apparently h should be an Si,and Trchid(Si)as the zone associated with its right child, even number)),thus at level i(0≤i≤h),we have: and we utilize E(Si)to represent the optimized energy cost N(Si)=2h-i (9) for region Si,thus the energy cost for further splitting can be represented as: And for query diffusion,we only need to reach one of those specified replicate storage nodes,an intuitive strategy is for Esplit(Si)=E(Tichila(Si))+E(Trchita(Si))(12) the query node to find the nearest storage node in the storage region Si,later we will introduce an optimized routing scheme Therefore the optimized energy cost for region Si is of the for query diffusion,thus the expected energy cost for query following form: diffusion over the specified storage region Si is: E(Si)=min{Erep(Si),E(Tichila(Si))+E(Trchild(Si))} Equery(Si)=Dquery(Si) (10) (13) 169 Authorized licensed use limited to:Nanjing University.Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore.Restrictions apply
query forwarding from query node to these various storage nodes. Based on the above observation, keeping these ranges of data to be hosted in a relatively small number of zones may be more efficient, but obviously the extreme situation to keep all ranges of data in one unique zone (with replicated storage scheme) is also not energy efficient, because although the query forwarding cost can be effectively reduced, large amount of energy will be drained for routing the same copy of sensor data to all replicate storages. Based on the above analysis, we must find an optimized storage strategy to control the granularity for the index structure as a tradeoff between the two extremum situations to achieve best energy efficiency. Therefore to construct an optimized k-d tree based index structure over the sensor network, if we build the k-d tree from the root, then at each splitting step of the k-d tree, we have two optional strategies: one strategy is to conduct further splitting for the specified zones, another strategy is to stop further splitting and thus apply the replicate storage scheme to all cells covered by the specified zones. The most crucial thing to construct the index structure is how to decide the optimized strategy at each splitting step along the k-d tree, so as to achieve the best energy efficiency. To solve the above optimization problem, we start by first analyzing the energy cost for event insertion and query processing over a specified storage region. In this paper we utilize the Euclidean Distance as the metric of energy consumption, which means the routing cost between two nodes is represented by the Euclidean distance between them. For insertion of event, we leverage the double ruling scheme[9][10] to distribute the event information to all relevant replicate storage nodes in the storage region. For one specified storage region, the event node first routes event message to the nearest storage node inside the region, and further forwards it both horizontally and vertically to all storage nodes covered by the storage region. Thus the expected energy cost for event distribution over the specified storage region Si is: Eevent(Si) = Devent(Si) + d · (N(Si) − 1) (8) where Devent(Si) denotes the expected distance from an arbitrary event node to the nearest storage node in storage region Si , and d is the cell size, N(Si) denotes the number of storage nodes inside Si , as each storage node inside Si receives exactly one message except the first accessed storage node, thus the diffusion cost inside Si is d · (N(Si) − 1). Assume the root of the k-d tree is at level 0, and the k-d tree indexes the cell unit at level h, (apparently h should be an even number), thus at level i (0 ≤ i ≤ h), we have: N(Si) = 2h−i (9) And for query diffusion, we only need to reach one of those specified replicate storage nodes, an intuitive strategy is for the query node to find the nearest storage node in the storage region Si , later we will introduce an optimized routing scheme for query diffusion, thus the expected energy cost for query diffusion over the specified storage region Si is: Equery(Si) = Dquery(Si) (10) where Dquery(Si) denotes the expected distance from an arbitrary query node to the nearest storage node in the storage region. And we can simply deduce that Dquery(Si) equals to Devent(Si). Fig. 3. Estimate the value for Dquery(Si) or Devent(Si) We estimate the expectations of Dquery(Si) or Devent(Si) as follows: for any event/query imposed inside the replicate storage region Si , as it only needs to find a nearest storage node inside Si , the expected distance is less than half the cell size, which is d/2; for any event/query imposed outside the replicate storage region Si , as Fig.3 shows, we calculate the distance according to its quadrant relative to Si , by summarizing the average distance from arbitrary points to the storage region, thus we can estimate the expectations of Dquery(Si) or Devent(Si). As mentioned before, if we construct a k-d tree based index structure by starting from the root, for each index node at the splitting step, we have two optional strategies to choose: to apply the replicate storage scheme over Si or conduct further splitting? We know that the energy cost for replicate storage scheme over Si is: Erep(Si) = fe · P robE (Si) · Eevent(Si) +fq · P robQ(Si) · Equery(Si) (11) The energy cost Erep(Si) includes the overall event insertion cost, as we know the event access frequency for Si is fe · P robE(Si) and the energy cost for each insertion is Eevent(Si), and it also includes the overall query diffusion cost with query access frequency fq · P robQ(Si) and energy cost for each query forwarding Equery(Si). And consider the k-d tree based index structure, let us denote Tlchild(Si) as the zone associated with left child of Si , and Trchild(Si) as the zone associated with its right child, and we utilize E(Si) to represent the optimized energy cost for region Si , thus the energy cost for further splitting can be represented as: Esplit(Si) = E(Tlchild(Si)) + E(Trchild(Si)) (12) Therefore the optimized energy cost for region Si is of the following form: E(Si) = min{Erep(Si), E(Tlchild(Si)) + E(Trchild(Si))} (13) 169 Authorized licensed use limited to: Nanjing University. Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore. Restrictions apply
Based on the above analysis we note that the optimized k-d Algorithm 1 Constructing k-d Tree based Index Structure with tree structure satisfies the following property:for an optimum Optimized Granularity k-d tree structure,its sub-tree structures are also optimum 1:Initially construct a k-d tree as a full binary tree with Therefore this optimization problem can be solved by dynamic maximum height of h,each leaf node denotes a unit cell. programming that works from the bottom to the top along the 2:for each leaf node associated with region Si do tree. 3: E(S)=fe·ProbE(S)Eevent(Si)+fg·Probo(Sa) C.Constructing Optimized Index Structure based on Dynamic Equery(Si) Programming 4:end for We construct the k-d tree from bottom to top,which is in 5:for all remaining nodes Sij with corresponding children post-order.Assume the maximum height of the k-d tree is nodes S;and Sj,in post-order do h,initially consider a full k-d tree with all leaf nodes at the 6. costl is associated with the replicate storage scheme hth level,each leaf node correspond to a unit cell,thus the for region Si.j} number of nodes over the k-d tree is n=2h+1-1.For each cost1=fe·ProbE(S,)·Eevent(Sii)+fg· two nodes i and j within one sub-tree,we denote Si and S;as ProbQ(S,)·Equery(S,i) cost2 is associated with the split storage scheme for their respective regions,and we denote Si.;as the super-region region Si.j} composed of the two regions,which is corresponding to their parent node.We will try to merge two split nodes along the cost2=E(S)+E(S)】 10: tree if the energy cost for replicate storage scheme is lower E(Si.j)=min(cost1,cost2) 11: if cost1<cost2 then than the sum of each split nodes. 12 Algorithm I depicts the pseudo code of this dynamic Apply the replicate storage scheme to region Sij, programming based algorithm.Assume that the n nodes in merge two split nodes Si and Sj into one node Si.j the k-d tree are labelled using the post-order.A table E[l...n] for the k-d tree 13: end if is used to hold the minimum energy cost of each node which 14:end for corresponds to a specific storage region Si,here for ease of presentation we denote it as E(Si).In the algorithm,line 3 computes the energy cost E(Si)for all leave nodes and lines 6-10 compute the minimum energy cost E(Si.j)for As described in Section IV.when the event message reaches the remaining internal nodes in post-order.As there're O(n) the zone,we utilize the Double Ruling scheme [9][10]to entries of energy cost to compute according to all the nodes distribute the event information to all relevant replicate storage over the initial full k-d tree,and at each step one exact entry is nodes. computed,the time complexity of Algorithm I is O(n),where B.Resolving and Routing Range Queries n is the number of nodes over the initial full k-d tree. For implementation of optimized index structure construc- As a range query corresponds to a region over the tion,first the sink calculates the event data distribution and multi-dimensional data space,thus a conventional range range query distribution according to a training set by lever- query may cover several zones over the k-d tree based aging the methodology proposed in Section III,and further index structure.For example,consider a range query calculates optimized parameters for the optimized k-d tree (0.3,0.7,[0.5,0.6,0.4,0.8,0.7,0.9j)over the example k- d tree depicted in Fig.2(a),the storage regions of zone code based on dynamic programming,then it broadcasts the index structure information to the sensor network,after receiving 110.111,010.0111 will be involved.Therefore how to route this information,those selected storage nodes of each cell these sub-range queries to those specified storage nodes in will confirm their storage roles which includes corresponding an energy efficient approach is the problem we are trying to storage schemes and the multi-dimensional ranges of data they address in this section.As the corresponding query replies can will store,and each sensor node will know where to store/find be sent back to query node along the query routing paths,thus here we only consider the query diffusion cost.for the overall the relevant ranges of data according to the locally stored index cost is actually twice the query diffusion cost. structure. Basically there are three strategies to route these sub- V.EVENT STORAGE AND QUERY PROCESSING queries.The first strategy is to route sub-queries separately MECHANISMS to corresponding zones by selecting nearest storage nodes for A.Inserting Events to Cells the query node.As Fig.4(a)shows,for each relevant storage region Si associated with sub-query gi,the query node selects As an event data is a point value over the multi-dimensional a nearest storage node from those replicate storages,and data space,it corresponds to a unique zone over the de- respectively sends a sub-query message to the corresponding ployment area.When a sensor node generates an event data it will check the locally stored index structure to find the storage node.In this scheme,no routing paths are shared. The second strategy leverages Double Ruling based routing corresponding zone,then it leverages the GPSR[1]routing approach.As Fig.4(b)shows,the query node first imposes scheme to forward the event message to that specified region. a routing path horizontally,and at specific positions of the 170 Authorized licensed use limited to:Nanjing University.Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore.Restrictions apply
Based on the above analysis we note that the optimized k-d tree structure satisfies the following property: for an optimum k-d tree structure, its sub-tree structures are also optimum. Therefore this optimization problem can be solved by dynamic programming that works from the bottom to the top along the tree. C. Constructing Optimized Index Structure based on Dynamic Programming We construct the k-d tree from bottom to top, which is in post-order. Assume the maximum height of the k-d tree is h, initially consider a full k-d tree with all leaf nodes at the hth level, each leaf node correspond to a unit cell, thus the number of nodes over the k-d tree is n = 2h+1 − 1. For each two nodes i and j within one sub-tree, we denote Si and Sj as their respective regions, and we denote Si,j as the super-region composed of the two regions, which is corresponding to their parent node. We will try to merge two split nodes along the tree if the energy cost for replicate storage scheme is lower than the sum of each split nodes. Algorithm 1 depicts the pseudo code of this dynamic programming based algorithm. Assume that the n nodes in the k-d tree are labelled using the post-order. A table E[1...n] is used to hold the minimum energy cost of each node which corresponds to a specific storage region Si , here for ease of presentation we denote it as E(Si). In the algorithm, line 3 computes the energy cost E(Si) for all leave nodes and lines 6-10 compute the minimum energy cost E(Si,j ) for the remaining internal nodes in post-order. As there’re O(n) entries of energy cost to compute according to all the nodes over the initial full k-d tree, and at each step one exact entry is computed, the time complexity of Algorithm 1 is O(n), where n is the number of nodes over the initial full k-d tree. For implementation of optimized index structure construction, first the sink calculates the event data distribution and range query distribution according to a training set by leveraging the methodology proposed in Section III, and further calculates optimized parameters for the optimized k-d tree based on dynamic programming, then it broadcasts the index structure information to the sensor network, after receiving this information, those selected storage nodes of each cell will confirm their storage roles which includes corresponding storage schemes and the multi-dimensional ranges of data they will store, and each sensor node will know where to store/find the relevant ranges of data according to the locally stored index structure. V. EVENT STORAGE AND QUERY PROCESSING MECHANISMS A. Inserting Events to Cells As an event data is a point value over the multi-dimensional data space, it corresponds to a unique zone over the deployment area. When a sensor node generates an event data, it will check the locally stored index structure to find the corresponding zone, then it leverages the GPSR[1] routing scheme to forward the event message to that specified region. Algorithm 1 Constructing k-d Tree based Index Structure with Optimized Granularity 1: Initially construct a k-d tree as a full binary tree with maximum height of h, each leaf node denotes a unit cell. 2: for each leaf node associated with region Si do 3: E(Si) = fe ·P robE(Si)·Eevent(Si)+fq ·P robQ(Si)· Equery(Si) 4: end for 5: for all remaining nodes Si,j with corresponding children nodes Si and Sj , in post-order do 6: {cost1 is associated with the replicate storage scheme for region Si,j} 7: cost1 = fe · P robE(Si,j ) · Eevent(Si,j ) + fq · P robQ(Si,j ) · Equery(Si,j ) 8: {cost2 is associated with the split storage scheme for region Si,j} 9: cost2 = E(Si) + E(Sj ) 10: E(Si,j ) = min(cost1, cost2) 11: if cost1 < cost2 then 12: Apply the replicate storage scheme to region Si,j , merge two split nodes Si and Sj into one node Si,j for the k-d tree. 13: end if 14: end for As described in Section IV, when the event message reaches the zone, we utilize the Double Ruling scheme [9][10] to distribute the event information to all relevant replicate storage nodes. B. Resolving and Routing Range Queries As a range query corresponds to a region over the multi-dimensional data space, thus a conventional range query may cover several zones over the k-d tree based index structure. For example, consider a range query h[0.3, 0.7], [0.5, 0.6], [0.4, 0.8], [0.7, 0.9]i over the example kd tree depicted in Fig.2(a), the storage regions of zone code 110,111, 010, 0111 will be involved. Therefore how to route these sub-range queries to those specified storage nodes in an energy efficient approach is the problem we are trying to address in this section. As the corresponding query replies can be sent back to query node along the query routing paths, thus here we only consider the query diffusion cost, for the overall cost is actually twice the query diffusion cost. Basically there are three strategies to route these subqueries. The first strategy is to route sub-queries separately to corresponding zones by selecting nearest storage nodes for the query node. As Fig.4(a) shows, for each relevant storage region Si associated with sub-query qi , the query node selects a nearest storage node from those replicate storages, and respectively sends a sub-query message to the corresponding storage node. In this scheme, no routing paths are shared. The second strategy leverages Double Ruling based routing approach. As Fig.4(b) shows, the query node first imposes a routing path horizontally, and at specific positions of the 170 Authorized licensed use limited to: Nanjing University. Downloaded on July 11,2010 at 07:42:17 UTC from IEEE Xplore. Restrictions apply