IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS,VOL 26,NO.9,SEPTEMBER 2015 2421 Efficient Protocols for Collecting Histograms in Large-Scale RFID Systems Lei Xie,Member,IEEE,Hao Han,Member,IEEE,Qun Li,Member,IEEE, Jie Wu,Fellow,IEEE,and Sanglu Lu,Member,IEEE Abstract-Collecting histograms over RFID tags is an essential premise for effective aggregate queries and analysis in large-scale RFID-based applications.In this paper we consider an efficient collection of histograms from the massive number of RFID tags, without the need to read all tag data.In order to achieve time efficiency,we propose a novel,ensemble sampling-based method to simultaneously estimate the tag size for a number of categories.We first consider the problem of basic histogram collection,and propose an efficient algorithm based on the idea of ensemble sampling.We further consider the problems of advanced histogram collection,respectively,with an iceberg query and a top-k:query.Efficient algorithms are proposed to tackle the above problems such that the qualified/unqualified categories can be quickly identified.This ensemble sampling-based framework is very flexible and compatible to current tag-counting estimators,which can be efficiently leveraged to estimate the tag size for each category. Experiment results indicate that our ensemble sampling-based solutions can achieve a much better performance than the basic estimation/identification schemes. Index Terms-Algorithms,RFID,time efficiency,histogram INTRODUCTION Win have lep me prv in a book store,etc.Collecting histogram can be used to illustrate the tag population belonging to each category,and spaces in increasingly large numbers.In applications like determine whether the number of tags in a category is warehouse monitoring,the items are attached with RFID above or below any desired threshold.By setting this tags,and are densely packed into boxes.As the maximum threshold,it is easy to find popular merchandise and control scanning range of a UHF RFID reader is usually 6-10 m,the stock,e.g.,automatically signaling when more products overall number of tags within this three-dimensional space need to be put on the shelf.Furthermore,the histogram can can be up to tens of thousands in a dense deployment sce- be used for approximate answering of aggregate queries nario,as envisioned in [1],[2],[3].Many tag identification [121,[131,as well as preprocessing and mining association protocols [41,[5],[61,[7],[8]are proposed to uniquely iden- rules in data mining [14].Therefore,collecting histograms tify the tags one by one through anti-collision schemes. over RFID tags is an essential premise for effective queries However,in a number of applications,only some useful sta-and analysis in conventional RFID-based applications. tistical information is essential to be collected,such as the Fig.1 shows an example for collecting histogram over the overall tag size [2],[91,[10],popular categories [11]and the RFID tags deployed in the application scenarios. histogram.In particular,histograms capture distribution While dealing with a large scale deployment with thou- statistics in a space-efficient fashion.In some applications, sands of tags,the traditional tag identification scheme is not such as a grocery store or a shipping portal,items are cate- suitable for histogram collection,since the scanning time is gorized according to some specified metrics,such as types proportional to the number of tags,which can be in the of merchandize,manufacturers,etc.A histogram is used to order of several minutes.As the overall tag size grows, illustrate the number of items in each category. reading each tag one by one can be rather time-consuming, In practice,tags are typically attached to objects belong- which is not scalable at all.As in most applications,the tags ing to different categories,e.g.,different brands and models are frequently moving into and out of the effective scanning of clothes in a large clothing store,different titles of books area.In order to capture the distribution statistics in time,it is essential to sacrifice some accuracy so that the main distri- L.Xie and S.Lu are with the State Key Laboratory for Novel Software bution can be obtained within a short time window-in the Technology,Nanjing University,China. order of several seconds.Therefore,we seek to propose an E-mail:(Ixie,sangluj@nju.edu.cn. estimation scheme to quickly count the tag sizes of each cat- H.Han and Q.Li are with the Department of Computer Science,College of William and Mary,Williamsburg,VA.E-mail:(hhan,liqun@cs.wm.edu. egory while achieving the accuracy requirement. ● I.Wu is with the Department of Computer Information and Sciences, In most cases,the tag sizes of various categories are sub- Temple University.E-mail:jiewu@temple.edu. ject to some skewed distribution with a "long tail",such as Manuscript received 10 Mar.2014;revised 12 Aug.2014;accepted 4 Sept. the Gaussian distribution.The long tail represents a large 2014.Date of publication 10 Sept.2014;date of current version 7 Aug.2015. number of categories,each of which occupies a rather small Recommended for acceptance by S.Guo. For information on obtaining reprints of this article,please send e-mail to: percentage among the total categories.While handling the reprints@ieee.org,and reference the Digital Object Identifier below. massive number of tags,in the order of several thousands, Digital Object Identifier no.10.1109/TPDS.2014.2357021 the overall number of categories in the long tail could be in 1045-2192014EEE Personal mission See http://www.ieee.org/publi
Efficient Protocols for Collecting Histograms in Large-Scale RFID Systems Lei Xie, Member, IEEE, Hao Han, Member, IEEE, Qun Li, Member, IEEE, Jie Wu, Fellow, IEEE, and Sanglu Lu, Member, IEEE Abstract—Collecting histograms over RFID tags is an essential premise for effective aggregate queries and analysis in large-scale RFID-based applications. In this paper we consider an efficient collection of histograms from the massive number of RFID tags, without the need to read all tag data. In order to achieve time efficiency, we propose a novel, ensemble sampling-based method to simultaneously estimate the tag size for a number of categories. We first consider the problem of basic histogram collection, and propose an efficient algorithm based on the idea of ensemble sampling. We further consider the problems of advanced histogram collection, respectively, with an iceberg query and a top-k query. Efficient algorithms are proposed to tackle the above problems such that the qualified/unqualified categories can be quickly identified. This ensemble sampling-based framework is very flexible and compatible to current tag-counting estimators, which can be efficiently leveraged to estimate the tag size for each category. Experiment results indicate that our ensemble sampling-based solutions can achieve a much better performance than the basic estimation/identification schemes. Index Terms—Algorithms, RFID, time efficiency, histogram Ç 1 INTRODUCTION WITH the rapid proliferation of RFID-based applications, RFID tags have been deployed into pervasive spaces in increasingly large numbers. In applications like warehouse monitoring, the items are attached with RFID tags, and are densely packed into boxes. As the maximum scanning range of a UHF RFID reader is usually 6-10 m, the overall number of tags within this three-dimensional space can be up to tens of thousands in a dense deployment scenario, as envisioned in [1], [2], [3]. Many tag identification protocols [4], [5], [6], [7], [8] are proposed to uniquely identify the tags one by one through anti-collision schemes. However, in a number of applications, only some useful statistical information is essential to be collected, such as the overall tag size [2], [9], [10], popular categories [11] and the histogram. In particular, histograms capture distribution statistics in a space-efficient fashion. In some applications, such as a grocery store or a shipping portal, items are categorized according to some specified metrics, such as types of merchandize, manufacturers, etc. A histogram is used to illustrate the number of items in each category. In practice, tags are typically attached to objects belonging to different categories, e.g., different brands and models of clothes in a large clothing store, different titles of books in a book store, etc. Collecting histogram can be used to illustrate the tag population belonging to each category, and determine whether the number of tags in a category is above or below any desired threshold. By setting this threshold, it is easy to find popular merchandise and control stock, e.g., automatically signaling when more products need to be put on the shelf. Furthermore, the histogram can be used for approximate answering of aggregate queries [12], [13], as well as preprocessing and mining association rules in data mining [14]. Therefore, collecting histograms over RFID tags is an essential premise for effective queries and analysis in conventional RFID-based applications. Fig. 1 shows an example for collecting histogram over the RFID tags deployed in the application scenarios. While dealing with a large scale deployment with thousands of tags, the traditional tag identification scheme is not suitable for histogram collection, since the scanning time is proportional to the number of tags, which can be in the order of several minutes. As the overall tag size grows, reading each tag one by one can be rather time-consuming, which is not scalable at all. As in most applications, the tags are frequently moving into and out of the effective scanning area. In order to capture the distribution statistics in time, it is essential to sacrifice some accuracy so that the main distribution can be obtained within a short time window–in the order of several seconds. Therefore, we seek to propose an estimation scheme to quickly count the tag sizes of each category while achieving the accuracy requirement. In most cases, the tag sizes of various categories are subject to some skewed distribution with a “long tail”, such as the Gaussian distribution. The long tail represents a large number of categories, each of which occupies a rather small percentage among the total categories. While handling the massive number of tags, in the order of several thousands, the overall number of categories in the long tail could be in L. Xie and S. Lu are with the State Key Laboratory for Novel Software Technology, Nanjing University, China. E-mail: {lxie, sanglu}@nju.edu.cn. H. Han and Q. Li are with the Department of Computer Science, College of William and Mary, Williamsburg, VA. E-mail: {hhan, liqun}@cs.wm.edu. J. Wu is with the Department of Computer Information and Sciences, Temple University. E-mail: jiewu@temple.edu. Manuscript received 10 Mar. 2014; revised 12 Aug. 2014; accepted 4 Sept. 2014. Date of publication 10 Sept. 2014; date of current version 7 Aug. 2015. Recommended for acceptance by S. Guo. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPDS.2014.2357021 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 9, SEPTEMBER 2015 2421 1045-9219 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information
2422 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS,VOL.26.NO.9.SEPTEMBER 2015 Histogram RFID-based Application Scenario 2 RELATED WORK 10 In RFID systems,a reader needs to receive data from multi- ple tags,while the tags are unable to self-regulate their radio 10 threshold transmissions to avoid collisions;then,a series of slotted =50 ALOHA-based anti-collision protocols [1],[4],[5],[6],[7], [8],[16],[17]are designed to efficiently identify tags in RFID systems.In order to deal with the collision problems in CI C2 C3 C4 C5 C6 C7 C8 C9 C10 multi-reader RFID systems,scheduling protocols for reader Category ID activation are explored in the literature [18],[19].Recently, Fig.1.An example of collecting histogram over RFID tags a number of polling protocols [201,[211,[22]are proposed, aiming to collect information from battery-powered active hundreds.Therefore,by separately estimating the tag sizes tags in an energy efficient approach. Recent research is focused on the collection of statistical over each category,a large number of query cycles and slots information over the RFID tags [2],[9],[10],[11],[23],[24], are required.Besides,in applications like the iceberg query [25],[26],[27].The authors mainly consider the problem of and the top-k query,only those major categories are essen- tial to be addressed.In this situation,the separate estimate estimating the number of tags without collecting the tag IDs. Murali et al.provide very fast and reliable estimation mech- approach will waste a lot of scanning time over those minor categories in the long tail.Therefore,a novel scheme is anisms for tag quantity in a more practical approach [9].Li essential to quickly collect the histograms over the massive et al.study the RFID estimation problem from the energy angle [23].Their goal is to reduce the amount of energy that RFID tags.In this paper,we propose a series of protocols to tackle the problem of efficient histogram collection.The is consumed by the tags during the estimation procedure. main contributions of this paper are listed as follows(a pre- Shahzad et al.propose a new scheme for estimating tag pop- liminary version of this work appeared in [151): ulation size called average run based tag estimation(ART) [2].Chen et al.aim to gain deeper and fundamental insights 1) To the best of our knowledge,we are the first to con- in RFID counting protocols [271,they manage to design sider the problem of collecting histograms and its near-optimal protocols that are more efficient than existing applications (i.e.,iceberg query and top-k query) ones and simultaneously simpler than most of them.Liu over RFID tags,which is a fundamental premise for et al.investigate efficient distributed query processing in answering aggregate queries and data mining over large RFID-enabled supply chains [28].Liu et al.propose a RFID-based applications. novel solution to fast count the key tags in anonymous RFID 2) In order to achieve time efficiency,we propose a systems [29].Luo et al.tackle an interesting problem,called novel,ensemble sampling (ES)-based method to multigroup threshold based classification [25],which is to simultaneously estimate the tag size for a number of determine whether the number of objects in each group is categories.This framework is very flexible and com- above or below a prescribed threshold value.Sheng et al. patible to current tag-counting estimators,which can consider the problem of identifying popular categories of be efficiently leveraged to estimate the tag size for RFID tags out of a large collection of tags [11],while the set each category.While achieving time-efficiency,our of category IDs are supposed to be known.Different from solutions are completely compatible with current the previous work,in this paper,our goal is to collect the his- industry standards,i.e.,the EPCglobal C1G2 stand- tograms for all categories over RFID tags in a time-efficient ards,and do not require any tag modifications. approach,without any priori knowledge of the categories. 3) In order to tackle the histogram collection with a Specifically,we respectively consider the basic histogram filter condition,we propose an effective solution collection problem,the iceberg query problem,and the top-k for the iceberg query problem.By considering the query problem in regard to collecting histograms in large- population and accuracy constraint,we propose an scale RFID systems.We aim to propose a flexible and com- efficient algorithm to wipe out the unqualified cat- patible framework for current tag-counting estimators based egories in time,especially those categories in the on slotted ALOHA protocol,which can be efficiently lever- long tail.We further present an effective solution aged to estimate the tag size for each category. to tackle the top-k query problem.We use ensemble sampling to quickly estimate the threshold corre- 3 PRELIMINARY sponding to the kth largest category,and reduce it 3.1 The Framed Slotted ALOHA Protocol to the iceberg query problem. In the Class 1 Gen 2 standard,the RFID system leverages the The remainder of the paper is as follows.Sections 2 framed slotted ALOHA protocol to resolve the collisions for tag and 3 present the related work and RFID preliminary, identification.When a reader wishes to read a set of tags,it respectively.We formulate our problem in Section 4,and first powers up and transmits a continuous wave to ener- present our ensemble sampling-based method for the gize the tags.It then initiates a series of frames,varying the basic histogram collection in Section 5.We further present number of slots in each frame to best accommodate the our solutions for the iceberg query and the top-k query,number of tags.Each frame has a number of slots and each respectively,in Sections 6 and 7.In Section 8,we provide active tag will reply in a randomly selected slot per frame. performance analysis in time-efficiency.The performance After all tags are read,the reader powers down.We refer to evaluation is in Section 9,and we conclude in Section 10. the series of frames between power down periods as a
hundreds. Therefore, by separately estimating the tag sizes over each category, a large number of query cycles and slots are required. Besides, in applications like the iceberg query and the top-k query, only those major categories are essential to be addressed. In this situation, the separate estimate approach will waste a lot of scanning time over those minor categories in the long tail. Therefore, a novel scheme is essential to quickly collect the histograms over the massive RFID tags. In this paper, we propose a series of protocols to tackle the problem of efficient histogram collection. The main contributions of this paper are listed as follows (a preliminary version of this work appeared in [15]): 1) To the best of our knowledge, we are the first to consider the problem of collecting histograms and its applications (i.e., iceberg query and top-k query) over RFID tags, which is a fundamental premise for answering aggregate queries and data mining over RFID-based applications. 2) In order to achieve time efficiency, we propose a novel, ensemble sampling (ES)-based method to simultaneously estimate the tag size for a number of categories. This framework is very flexible and compatible to current tag-counting estimators, which can be efficiently leveraged to estimate the tag size for each category. While achieving time-efficiency, our solutions are completely compatible with current industry standards, i.e., the EPCglobal C1G2 standards, and do not require any tag modifications. 3) In order to tackle the histogram collection with a filter condition, we propose an effective solution for the iceberg query problem. By considering the population and accuracy constraint, we propose an efficient algorithm to wipe out the unqualified categories in time, especially those categories in the long tail. We further present an effective solution to tackle the top-k query problem. We use ensemble sampling to quickly estimate the threshold corresponding to the kth largest category, and reduce it to the iceberg query problem. The remainder of the paper is as follows. Sections 2 and 3 present the related work and RFID preliminary, respectively. We formulate our problem in Section 4, and present our ensemble sampling-based method for the basic histogram collection in Section 5. We further present our solutions for the iceberg query and the top-k query, respectively, in Sections 6 and 7. In Section 8, we provide performance analysis in time-efficiency. The performance evaluation is in Section 9, and we conclude in Section 10. 2 RELATED WORK In RFID systems, a reader needs to receive data from multiple tags, while the tags are unable to self-regulate their radio transmissions to avoid collisions; then, a series of slotted ALOHA-based anti-collision protocols [1], [4], [5], [6], [7], [8], [16], [17] are designed to efficiently identify tags in RFID systems. In order to deal with the collision problems in multi-reader RFID systems, scheduling protocols for reader activation are explored in the literature [18], [19]. Recently, a number of polling protocols [20], [21], [22] are proposed, aiming to collect information from battery-powered active tags in an energy efficient approach. Recent research is focused on the collection of statistical information over the RFID tags [2], [9], [10], [11], [23], [24], [25], [26], [27]. The authors mainly consider the problem of estimating the number of tags without collecting the tag IDs. Murali et al. provide very fast and reliable estimation mechanisms for tag quantity in a more practical approach [9]. Li et al. study the RFID estimation problem from the energy angle [23]. Their goal is to reduce the amount of energy that is consumed by the tags during the estimation procedure. Shahzad et al. propose a new scheme for estimating tag population size called average run based tag estimation (ART) [2]. Chen et al. aim to gain deeper and fundamental insights in RFID counting protocols [27], they manage to design near-optimal protocols that are more efficient than existing ones and simultaneously simpler than most of them. Liu et al. investigate efficient distributed query processing in large RFID-enabled supply chains [28]. Liu et al. propose a novel solution to fast count the key tags in anonymous RFID systems [29]. Luo et al. tackle an interesting problem, called multigroup threshold based classification [25], which is to determine whether the number of objects in each group is above or below a prescribed threshold value. Sheng et al. consider the problem of identifying popular categories of RFID tags out of a large collection of tags [11], while the set of category IDs are supposed to be known. Different from the previous work, in this paper, our goal is to collect the histograms for all categories over RFID tags in a time-efficient approach, without any priori knowledge of the categories. Specifically, we respectively consider the basic histogram collection problem, the iceberg query problem, and the top-k query problem in regard to collecting histograms in largescale RFID systems. We aim to propose a flexible and compatible framework for current tag-counting estimators based on slotted ALOHA protocol, which can be efficiently leveraged to estimate the tag size for each category. 3 PRELIMINARY 3.1 The Framed Slotted ALOHA Protocol In the Class 1 Gen 2 standard, the RFID system leverages the framed slotted ALOHA protocol to resolve the collisions for tag identification. When a reader wishes to read a set of tags, it first powers up and transmits a continuous wave to energize the tags. It then initiates a series of frames, varying the number of slots in each frame to best accommodate the number of tags. Each frame has a number of slots and each active tag will reply in a randomly selected slot per frame. After all tags are read, the reader powers down. We refer to the series of frames between power down periods as a Fig. 1. An example of collecting histogram over RFID tags 2422 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 9, SEPTEMBER 2015
XIE ET AL.:EFFICIENT PROTOCOLS FOR COLLECTING HISTOGRAMS IN LARGE-SCALE RFID SYSTEMS 2423 TABLE 1 inter-cyde duration The Average Time Interval for Various Slots after QueryRep command after Query command empty slot 0.9ms 1.7ms singleton slot 4.1ms 5.1ms collision slot 1.3ms 2.2ms specified set of tags,conventionally,the reader should issue 100 200 60 multiple query cycles over the same set of tags and take the (a)The raw signal data for 5 continuous cycles average of the estimates.The inter-cycle overhead consists of the time between cycles when the reader is powered down,and the carrier time used to power the tags before beginning communication.According to the experiment results in [30],which are conducted in realistic settings, these times are 40 ms and 3 ms respectively,while the aver- age time interval per slot is 1~2 ms. We have further measured the time interval for various slots and the inter-cycle duration with the USRP N210 plat- form.In our experiments,we use the Alien-9900 reader and Alien-9611 linear antenna with a directional gain of 6 dB. The The RFID tags used are Alien 9640 general-purpose tags (b)The raw signal data for one frame which support the EPC C1G2 standards.We use Alien Fig.2.The captured raw signal data of the interrogation between the reader to continuously read 13 tags for 100 query cycles.We reader and the tag use USRP N210 as a sniffer to capture the physical signals. Fig.2 shows an example of the captured raw signal data of Query Cycle.Note that,within each frame,tags may choose the interrogation between the reader and the tag.According the same slot,which causes multiple tags to reply during a to the realistic experiment results in this setting,the average slot.Therefore,within each frame there exist three kinds of intervals for various slots are summarized in Table 1.It is slots:(1)the empty slot where no tag replies;(2)the single found that,in most cases,the slot is started with a QueryRep slot where only one tag replies;and (3)the collision slot command,then the average interval for empty slots is where multiple tags reply. 0.9 ms per slot,the average interval for singleton slots is 4.1 In regard to the tag ID,each tag has a unique 96-bit ID in ms per slot,and the average interval for collision slots is 1.3 its EPC memory,where the first s binary bits can be ms per slot;when a slot happens to be the first slot of a regarded as the category ID(1<s<96).According to the frame,the slot is started with a Query command,then the C1G2 standard,for each Query Cycle,the reader is able to average interval for empty slots is 1.7 ms per slot,the aver- select the tags in a specified category by sending a Select age interval for singleton is 5.1 ms per slot,and the average command with an s-bit mask in the category ID field.If mul- interval for collision slots is 2.2 ms per slot.By measuring tiple categories need to be selected,the reader can provide the time intervals between two adjacent query cycles,it is multiple bit masks in the Select command. found that the average interval for inter-cycle duration is 28.3 ms.Note that if the powered-down interval is not long 3.2 Basic Tag Identification versus enough,it is possible that some surrounding tags will main- the Estimation Scheme tain the former state for the inventoried flag with their local Assume that there are n tags in total,and that it takes s;slots residual power,which causes them to keep silent in the to uniquely identify n tags.It is known that for each query upcoming query cycle. round,when the frame size f is equal to the remaining Therefore,since the average inter-cycle duration(28.3 ms) number of tags,the proportion of singleton slots inside the is much larger than the average time interval of conventional frame is maximized;then,the efficiency is"=.Hence,the slots (empty slot:0.9 ms,singleton slot:4.1 ms,collision slot: essential number of slots is s(1).n=n.e. 1.3 ms),the inter-cycle duration must be taken into account when considering overall reading performance.It is obvious Therefore,assume that it takes se slots to estimate the tag that reading a large number of tags per cycle amortizes the size for each category with a certain accuracy.If we want cost of inter-cycle overhead,resulting in lower per tag read- the estimation scheme to achieve a better reading perfor- ing time,while for small tag sets the inter-cycle overhead is mance than the basic tag identification method,then we significant.It is essential to sufficiently reduce the inter-cycle need se.lesili,where le and li are the sizes of the bit overhead when we design a solution and set the correspond- strings transmitted during the estimation and identification ing parameters for RFID systems. phases,respectively. 3.3 The Impact of the Inter-Cycle Overhead 4 PROBLEM FORMULATION The MAC protocol for the C1G2 system is based on slotted Suppose there are a large number of tags in the effective ALOHA.In order to accurately estimate the size of a scanning area of the RFID reader,the RFID system conforms
Query Cycle. Note that, within each frame, tags may choose the same slot, which causes multiple tags to reply during a slot. Therefore, within each frame there exist three kinds of slots: (1) the empty slot where no tag replies; (2) the single slot where only one tag replies; and (3) the collision slot where multiple tags reply. In regard to the tag ID, each tag has a unique 96-bit ID in its EPC memory, where the first s binary bits can be regarded as the category ID (1 <s< 96). According to the C1G2 standard, for each Query Cycle, the reader is able to select the tags in a specified category by sending a Select command with an s-bit mask in the category ID field. If multiple categories need to be selected, the reader can provide multiple bit masks in the Select command. 3.2 Basic Tag Identification versus the Estimation Scheme Assume that there are n tags in total, and that it takes si slots to uniquely identify n tags. It is known that for each query round, when the frame size f is equal to the remaining number of tags, the proportion of singleton slots inside the frame is maximized; then, the efficiency is ns f ¼ 1 e . Hence, the essential number of slots is si ¼ Pþ1 i¼0 ð1 1 e Þ i n ¼ n e. Therefore, assume that it takes se slots to estimate the tag size for each category with a certain accuracy. If we want the estimation scheme to achieve a better reading performance than the basic tag identification method, then we need se le si li, where le and li are the sizes of the bit strings transmitted during the estimation and identification phases, respectively. 3.3 The Impact of the Inter-Cycle Overhead The MAC protocol for the C1G2 system is based on slotted ALOHA. In order to accurately estimate the size of a specified set of tags, conventionally, the reader should issue multiple query cycles over the same set of tags and take the average of the estimates. The inter-cycle overhead consists of the time between cycles when the reader is powered down, and the carrier time used to power the tags before beginning communication. According to the experiment results in [30], which are conducted in realistic settings, these times are 40 ms and 3 ms respectively, while the average time interval per slot is 1 2 ms. We have further measured the time interval for various slots and the inter-cycle duration with the USRP N210 platform. In our experiments, we use the Alien-9900 reader and Alien-9611 linear antenna with a directional gain of 6 dB. The RFID tags used are Alien 9640 general-purpose tags which support the EPC C1G2 standards. We use Alien reader to continuously read 13 tags for 100 query cycles. We use USRP N210 as a sniffer to capture the physical signals. Fig. 2 shows an example of the captured raw signal data of the interrogation between the reader and the tag. According to the realistic experiment results in this setting, the average intervals for various slots are summarized in Table 1. It is found that, in most cases, the slot is started with a QueryRep command, then the average interval for empty slots is 0.9 ms per slot, the average interval for singleton slots is 4.1 ms per slot, and the average interval for collision slots is 1.3 ms per slot; when a slot happens to be the first slot of a frame, the slot is started with a Query command, then the average interval for empty slots is 1.7 ms per slot, the average interval for singleton is 5.1 ms per slot, and the average interval for collision slots is 2.2 ms per slot. By measuring the time intervals between two adjacent query cycles, it is found that the average interval for inter-cycle duration is 28.3 ms. Note that if the powered-down interval is not long enough, it is possible that some surrounding tags will maintain the former state for the inventoried flag with their local residual power, which causes them to keep silent in the upcoming query cycle. Therefore, since the average inter-cycle duration (28.3 ms) is much larger than the average time interval of conventional slots (empty slot: 0.9 ms, singleton slot: 4.1 ms, collision slot: 1.3 ms), the inter-cycle duration must be taken into account when considering overall reading performance. It is obvious that reading a large number of tags per cycle amortizes the cost of inter-cycle overhead, resulting in lower per tag reading time, while for small tag sets the inter-cycle overhead is significant. It is essential to sufficiently reduce the inter-cycle overhead when we design a solution and set the corresponding parameters for RFID systems. 4 PROBLEM FORMULATION Suppose there are a large number of tags in the effective scanning area of the RFID reader, the RFID system conforms Fig. 2. The captured raw signal data of the interrogation between the reader and the tag TABLE 1 The Average Time Interval for Various Slots after QueryRep command after Query command empty slot 0.9 ms 1.7 ms singleton slot 4.1 ms 5.1 ms collision slot 1.3 ms 2.2 ms XIE ET AL.: EFFICIENT PROTOCOLS FOR COLLECTING HISTOGRAMS IN LARGE-SCALE RFID SYSTEMS 2423
2424 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS,VOL.26.NO.9.SEPTEMBER 2015 to EPCglobal C1G2 standards,i.e.,the slotted ALOHA- algorithm to work,we only require the tags to comply based anti-collision scheme [4],[6]is used in the system with the current C1G2 standards:each tag has a unique model.The objective is to collect the histogram over RFID 96-bit ID in its EPC memory,where the first s binary bits tags according to some categorized metric,e.g,the type of are regarded as the category ID(1<s 96).According to merchandise,while the present set of category IDs cannot the C1G2 standard,the reader is able to select the tags in be predicted in advance.As we aim at a dynamic environ-a specified category by sending a Select command with an ment where the tags may frequently enter and leave the s-bit mask in the category ID field.If multiple categories scanning area,a time-efficient strategy must be proposed.need to be selected,the reader can provide multiple bit Therefore,the specified accuracy can be relaxed in order to masks in the Select command. quickly collect the histogram.Assume that the overall tag size is n,there exist m categories C=[C1,C2....,Cm},and 5 USE ENSEMBLE SAMPLING TO the actual tag size for each category is n,n2,...,nm. COLLECT HISTOGRAMS In the Basic Histogram Collection,the RFID system needs to collect the histogram for all categories.Due to the inherent When collecting the histograms over a large number of cate- inaccurate property for RFID systems,users can specify the gories,the objective is to minimize the overall scanning accuracy requirement for the histogram collection.Suppose time while the corresponding accuracy/population con- the estimated tag size for category Ci(1<i<m)is ni,then straints are satisfied.Two straightforward solutions are the following accuracy constraint should be satisfied: summarized as follows:(1)Basic Tag Identification:The histo- gram is collected by uniquely identifying each tag from the rm-nl≤e·n≥1-B accuracy constraint. (1) massive tag set and putting it into the corresponding cate- gories,thus the accuracy is 100 percent,and (2)Separate The accuracy constraint illustrates that,given the exact tag Counting:As the category IDs cannot be predicted in size ni for a specified category,the estimated tag sizen advance,the tree traversal method [31]is used to obtain the should be in an confidence interval of width 2e.ni,i.e., with probability greater than 1-B.For category IDs.Then,the reader sends a Select command to the tags,and it activates the tags in the specified category by example,if e=0.1,B=0.05,then in regard to a category providing a bit mask over the category ID in the command. with tag size ni=100,the estimated tag size ni should be According to the replies from the specified tags,the estima- within the range [90,110]with probability greater than tors such as [9],[24],[32]can be used to estimate the tag size 95 percent. for each category.As the rough tag size for each category In the Iceberg Query Problem,only those categories with a cannot be predicted in advance,a fixed initial frame size is tag size over a specified threshold t are essential to be illus- used for each category. trated in the histogram,while the accuracy requirement is Both the above two solutions are not time-efficient.In satisfied.As the exact tag size n;for category Ci is unknown, regard to the basic tag identification,uniquely identifying then,given the estimated value of tag size m,it is possible to each tag in the massive set is not scalable,for as the tag size have false negative error and false positive error in verifying grows into a huge number,the scanning time can be an the population constraint.Therefore,it is essential to guar- unacceptable value.In regard to the separated counting,the antee that the false negative/positive rate is below B,that is: reader needs to scan each category with at least one query cycle,even if the category is a minor category,which is not Pr[i<tn:≥t<B, (2) necessarily addressed in the iceberg query and the top-k query.As the number of categories m can be fairly large, Pr≥tn贴<<B. (3) e.g.,in the order of hundreds,the Select command and the fixed initial frame size for each category,as well as the In the Top-k Query Problem,we use the definition of the inter-cycle overhead among a large number of query cycles, probabilistic threshold top-k query (PT-Topk query),i.e.,in make the overall scanning time rather large. regard to the tag size,only the set of categories where each Therefore,we consider an ensemble sampling-based esti- takes a probability of at least 1-B to be in the top-k list are mation scheme as follows:select a certain number of catego- illustrated in the histogram,while the accuracy requirement ries and issue a query cycle,obtain the empty/singleton/ is satisfied.Much like the iceberg query problem,as the collision slots,and then estimate the tag size for each of the exact tag size ni for category Ci is unknown,then,given the categories according to the sampling in the singleton slots. estimated value of tag size ni,it is possible to have false neg- In this way,the ensemble sampling is more preferred than ative error and false positive error in verifying the popula- the separate counting in terms of reading performance. tion constraint,the following constraint must be satisfied: Since more tags are involved in one query cycle,more slots Pr[C;is regarded out of top-k list C;top-k list]<B,(4)amortize the cost of inter-cycle overhead,the Select com- mand,as well as the fixed initial frame size.Thus,the over- PrC,is regarded in top-k listC top-k list]<B.(5)all scanning time can be greatly reduced. In this paper,we aim to propose a series of novel solu-5.1 The Estimator ES tions to tackle the above problems while satisfying the In the slotted ALOHA-based protocol,besides the empty slots following properties:(1)Time-efficient.(2)Simple for the and the collision slots,the singleton slots can be obtained.In tag side in the protocol.(3)Complies with the EPCglobal the ensemble sampling-based estimation,according to the C1G2 standards.Therefore,in order for the proposed observed statistics of the empty/singleton/collision slots,we
to EPCglobal C1G2 standards, i.e., the slotted ALOHAbased anti-collision scheme [4], [6] is used in the system model. The objective is to collect the histogram over RFID tags according to some categorized metric, e.g, the type of merchandise, while the present set of category IDs cannot be predicted in advance. As we aim at a dynamic environment where the tags may frequently enter and leave the scanning area, a time-efficient strategy must be proposed. Therefore, the specified accuracy can be relaxed in order to quickly collect the histogram. Assume that the overall tag size is n, there exist m categories C ¼ fC1; C2; ... ; Cmg, and the actual tag size for each category is n1; n2; ... ; nm. In the Basic Histogram Collection, the RFID system needs to collect the histogram for all categories. Due to the inherent inaccurate property for RFID systems, users can specify the accuracy requirement for the histogram collection. Suppose the estimated tag size for category Cið1 i mÞ is nbi, then the following accuracy constraint should be satisfied: Pr½jnbi nij ni 1 b accuracy constraint: (1) The accuracy constraint illustrates that, given the exact tag size ni for a specified category, the estimated tag size nbi should be in an confidence interval of width 2 ni, i.e., nbi ni 2 ½1 ; 1 þ with probability greater than 1 b. For example, if ¼ 0:1; b ¼ 0:05, then in regard to a category with tag size ni ¼ 100, the estimated tag size nbi should be within the range ½90;110 with probability greater than 95 percent. In the Iceberg Query Problem, only those categories with a tag size over a specified threshold t are essential to be illustrated in the histogram, while the accuracy requirement is satisfied. As the exact tag size ni for category Ci is unknown, then, given the estimated value of tag size nbi, it is possible to have false negative error and false positive error in verifying the population constraint. Therefore, it is essential to guarantee that the false negative/positive rate is below b, that is: Pr½nbi < tjni t < b; (2) Pr½nbi tjni < t < b: (3) In the Top-k Query Problem, we use the definition of the probabilistic threshold top-k query (PT-Topk query), i.e., in regard to the tag size, only the set of categories where each takes a probability of at least 1 b to be in the top-k list are illustrated in the histogram, while the accuracy requirement is satisfied. Much like the iceberg query problem, as the exact tag size ni for category Ci is unknown, then, given the estimated value of tag size nbi, it is possible to have false negative error and false positive error in verifying the population constraint, the following constraint must be satisfied: Pr½Ci is regarded out of top-k listj Ci 2 top-k list < b; (4) Pr½Ci is regarded in top-k listj Ci 2= top-k list < b: (5) In this paper, we aim to propose a series of novel solutions to tackle the above problems while satisfying the following properties: (1) Time-efficient. (2) Simple for the tag side in the protocol. (3) Complies with the EPCglobal C1G2 standards. Therefore, in order for the proposed algorithm to work, we only require the tags to comply with the current C1G2 standards: each tag has a unique 96-bit ID in its EPC memory, where the first s binary bits are regarded as the category ID (1 <s< 96). According to the C1G2 standard, the reader is able to select the tags in a specified category by sending a Select command with an s-bit mask in the category ID field. If multiple categories need to be selected, the reader can provide multiple bit masks in the Select command. 5 USE ENSEMBLE SAMPLING TO COLLECT HISTOGRAMS When collecting the histograms over a large number of categories, the objective is to minimize the overall scanning time while the corresponding accuracy/population constraints are satisfied. Two straightforward solutions are summarized as follows: (1) Basic Tag Identification: The histogram is collected by uniquely identifying each tag from the massive tag set and putting it into the corresponding categories, thus the accuracy is 100 percent, and (2) Separate Counting: As the category IDs cannot be predicted in advance, the tree traversal method [31] is used to obtain the category IDs. Then, the reader sends a Select command to the tags, and it activates the tags in the specified category by providing a bit mask over the category ID in the command. According to the replies from the specified tags, the estimators such as [9], [24], [32] can be used to estimate the tag size for each category. As the rough tag size for each category cannot be predicted in advance, a fixed initial frame size is used for each category. Both the above two solutions are not time-efficient. In regard to the basic tag identification, uniquely identifying each tag in the massive set is not scalable, for as the tag size grows into a huge number, the scanning time can be an unacceptable value. In regard to the separated counting, the reader needs to scan each category with at least one query cycle, even if the category is a minor category, which is not necessarily addressed in the iceberg query and the top-k query. As the number of categories m can be fairly large, e.g., in the order of hundreds, the Select command and the fixed initial frame size for each category, as well as the inter-cycle overhead among a large number of query cycles, make the overall scanning time rather large. Therefore, we consider an ensemble sampling-based estimation scheme as follows: select a certain number of categories and issue a query cycle, obtain the empty/singleton/ collision slots, and then estimate the tag size for each of the categories according to the sampling in the singleton slots. In this way, the ensemble sampling is more preferred than the separate counting in terms of reading performance. Since more tags are involved in one query cycle, more slots amortize the cost of inter-cycle overhead, the Select command, as well as the fixed initial frame size. Thus, the overall scanning time can be greatly reduced. 5.1 The Estimator ES In the slotted ALOHA-based protocol, besides the empty slots and the collision slots, the singleton slots can be obtained. In the ensemble sampling-based estimation, according to the observed statistics of the empty/singleton/collision slots, we 2424 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 26, NO. 9, SEPTEMBER 2015
XIE ET AL.:EFFICIENT PROTOCOLS FOR COLLECTING HISTOGRAMS IN LARGE-SCALE RFID SYSTEMS 2425 can use estimators in [9],[24],[32]etc.to estimate the overall Proof.See Appendix B,available in the online supple- tag size.Then,according to the response in each singleton slot, mental material. the category ID is obtained from the first s bits in the tag ID. Based on the sampling from the singleton slots,the tag size for each category can be estimated.The reason is as follows: 5.2.2 Reducing the Variance through Repeated Tests Assume that there exists m categories C1,C2,...,Cm,the As the frame size for each query cycle has a maximum overall tag size is n,and the tag size for each category is value,by estimating from the ensemble sampling within m,n2,...,nm.We define an indicator variable Xij to denote only one query cycle,the estimated tag size may not be whether one tag of category Ci selects a slot j inside the accurate enough for the accuracy constraint.In this situa- frame with the size f.We set Xij=1 if only one tag of cate- tion,multiple query cycles are essential to reduce the vari- gory Ci selects the slot j,and Xi.j=0 otherwise.Moreover, ance through repeated tests.Suppose the reader issues l we use Pr[Xij=1]to denote the probability that only one query cycles over the same set of categories,in regard to a tag of category Ci selects the slot j,then, specified category Ci,by utilizing the weighted statistical averaging method,the averaged tag size=n.k; here k= and respectively denote the esti If we use n.i to denote the number of singleton slots mated tag size and variance for each cycle k.Then,the vari- selected by tags of category Ci,thus n=then, ance ofs∑克 the expected value Therefore,according to the accuracy constraint in the problem formulation,we rely on the following theorem to express this constraint in the form of the variance. Theorem 2.Suppose the variance of the averaged tag size ni is o.The accuracy constraint is satisfied for a specified cate- Furthermore,let n,denote the number of singleton slots,the expected value E(n)=(1"n.Then,Thus 8onyC,as long as≤(z-e2.n,Zi-is the1-号per centile for the standard normal distribution. we can approximate the tag size of category Ci as follows: Proof.See Appendix C,available in the online supplemental 元,=n.元 material. ▣ (6) ns According to Theorem 2,we can verify if the accuracy Here,n is the estimated value of the overall tag size.Let constraint is satisfied for each category through directly d=hem元=:元 checking the variance against the threshold (If 1-B=95%,then Z1-/2=1.96. 5.2 Accuracy Analysis 5.2.1 Accuracy of the ES Estimator 5.2.3 Property of the Ensemble Sampling In the ensemble sampling-based estimation,since the esti- According to Theorem 1,the normalized variance of the SE mators such as [9],[24],[32]can be utilized for estimating estimator=产is equivalent to,k=告·=+ the overall number of tags,we use 8 to denote the variance +2e-.Leta=ne出,b=+ng-.Then,the nor- -(eP+n-1) of n.We have the property in Lemma 1. e+n-1 n-(eP+n-1) malized variance A:=a.+b.Since the SE estimator can Lemma 1.The number of singleton slots ns and the number of utilize any estimator like [9],[24],[32]to estimate the overall singleton slots nsi selected by the tags of category Ci,respec- tag size,then,without loss of generality,if we use the esti- tively,have the following expectations: mator in [9],we can prove that a<0 for any value of n >0,f>0.The following theorem shows this property in )=(-n+号((1-)22-m the normalized variance. E(m)=(1-》·n+号.((1-)·(m-m) Theorem 3.<for any walue of nf>0. Proof.See Appendix D,available in the online supplemen- Proof.See Appendix A,which can be found on the Computer tal material. Society Digital Library at http://doi.ieeecomputersociety. org/10.1109/TPDS.2014.2357021. This property applies to any estimator with variance 0 smaller than 8o in ZE,which simply estimates the overall We rely on the following theorem to illustrate the accu-tag size based on the observed number of empty slots. racy of the estimator SE. According to Theorem 3,in order to satisfy the accuracy Theorem 1.Let 8;represent the variance of the estimator SE, cosit,we should ensureAsforall the load factor p=then, values of f,it infers that the larger the value ni is,the faster it will be for the specified category to satisfy the accuracy con- n5eP+mi-1.(6+m2)-m 6= (7) straint.On the contrary,the smaller the value n;is,the slower n e+n-1 it will be for the specified category to satisfy the accuracy
can use estimators in [9], [24], [32] etc. to estimate the overall tag size. Then, according to the response in each singleton slot, the category ID is obtained from the first s bits in the tag ID. Based on the sampling from the singleton slots, the tag size for each category can be estimated. The reason is as follows: Assume that there exists m categories C1; C2; ... ; Cm, the overall tag size is n, and the tag size for each category is n1; n2; ... ; nm. We define an indicator variable Xi;j to denote whether one tag of category Ci selects a slot j inside the frame with the size f. We set Xi;j ¼ 1 if only one tag of category Ci selects the slot j, and Xi;j ¼ 0 otherwise. Moreover, we use Pr½Xi;j ¼ 1 to denote the probability that only one tag of category Ci selects the slot j, then, Pr½Xi;j ¼ 1 ¼ 1 f 1 1 f n1 ni: If we use ns;i to denote the number of singleton slots selected by tags of category Ci, thus ns;i ¼ Pf j¼1 Xi;j, then, the expected value Eðns;iÞ ¼ X f j¼1 Pr½Xi;j ¼ 1 ¼ 1 1 f n1 ni: Furthermore, let ns denote the number of singleton slots, the expected value EðnsÞ¼ð1 1 fÞ n1 n. Then, Eðns;iÞ EðnsÞ ¼ ni n . Thus we can approximate the tag size of category Ci as follows: nbi ¼ ns;i ns n: b (6) Here, nb is the estimated value of the overall tag size. Let abi ¼ ns;i ns , then nbi ¼ abi nb. 5.2 Accuracy Analysis 5.2.1 Accuracy of the ES Estimator In the ensemble sampling-based estimation, since the estimators such as [9], [24], [32] can be utilized for estimating the overall number of tags, we use d to denote the variance of nb. We have the property in Lemma 1. Lemma 1. The number of singleton slots ns and the number of singleton slots ns;i selected by the tags of category Ci, respectively, have the following expectations: Eðn2 s Þ ¼ 1 1 f n1 n þ f1 f 1 2 f n2 ðn2 nÞ; Eðn2 s;iÞ ¼ 1 1 f n1 ni þ f1 f 1 2 f n2 ðn2 i niÞ: 8 >< >: Proof. See Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety. org/10.1109/TPDS.2014.2357021. tu We rely on the following theorem to illustrate the accuracy of the estimator SE. Theorem 1. Let di represent the variance of the estimator SE nbi, the load factor r ¼ n f, then, di ¼ ni n er þ ni 1 er þ n 1 ðd þ n2 Þ n2 i : (7) Proof. See Appendix B, available in the online supplemental material. tu 5.2.2 Reducing the Variance through Repeated Tests As the frame size for each query cycle has a maximum value, by estimating from the ensemble sampling within only one query cycle, the estimated tag size may not be accurate enough for the accuracy constraint. In this situation, multiple query cycles are essential to reduce the variance through repeated tests. Suppose the reader issues l query cycles over the same set of categories, in regard to a specified category Ci, by utilizing the weighted statistical averaging method, the averaged tag size nbi ¼ Pl k¼1 vk nci;k; here vk ¼ 1 d P i;k l k¼1 1 di;k , nci;k and di;k respectively denote the estimated tag size and variance for each cycle k. Then, the variance of nbi is s2 i ¼ P 1 l k¼1 1 di;k . Therefore, according to the accuracy constraint in the problem formulation, we rely on the following theorem to express this constraint in the form of the variance. Theorem 2. Suppose the variance of the averaged tag size nbi is s2 i . The accuracy constraint is satisfied for a specified category Ci, as long as s2 i ð Z1b=2 Þ 2 n2 i , Z1b=2 is the 1 b 2 percentile for the standard normal distribution. Proof. See Appendix C, available in the online supplemental material. tu According to Theorem 2, we can verify if the accuracy constraint is satisfied for each category through directly checking the variance against the threshold ð Z1b=2 Þ 2 n2 i . If 1 b ¼ 95%, then Z1b=2 ¼ 1:96. 5.2.3 Property of the Ensemble Sampling According to Theorem 1, the normalized variance of the SE estimator i ¼ di ni is equivalent to i ¼ dner þ n er þ n 1 ni n þ ðd þ n2Þðer 1Þ nðer þ n 1Þ . Let a ¼ d ner þ n er þ n 1 , b ¼ ðd þ n2Þðer 1Þ nðer þ n 1Þ . Then, the normalized variance i ¼ a ni n þ b. Since the SE estimator can utilize any estimator like [9], [24], [32] to estimate the overall tag size, then, without loss of generality, if we use the estimator in [9], we can prove that a < 0 for any value of n > 0;f> 0. The following theorem shows this property in the normalized variance. Theorem 3. d ner þ n er þ n 1 < 0 for any value of n > 0;f> 0. Proof. See Appendix D, available in the online supplemental material. tu This property applies to any estimator with variance smaller than d0 in ZE, which simply estimates the overall tag size based on the observed number of empty slots. According to Theorem 3, in order to satisfy the accuracy constraint, we should ensure i ð Z1b=2 Þ 2 ni. As a < 0 for all values of f, it infers that the larger the value ni is, the faster it will be for the specified category to satisfy the accuracy constraint. On the contrary, the smaller the value ni is, the slower it will be for the specified category to satisfy the accuracy XIE ET AL.: EFFICIENT PROTOCOLS FOR COLLECTING HISTOGRAMS IN LARGE-SCALE RFID SYSTEMS 2425