Efficiently Collecting Histograms Over RFID Tags

Lei Xie†, Hao Han‡, Qun Li‡, Jie Wu§, and Sanglu Lu†
†State Key Laboratory for Novel Software Technology, Nanjing University, China
‡Department of Computer Science, College of William and Mary, USA
§Department of Computer Information and Sciences, Temple University, USA
Email: †lxie@nju.edu.cn, ‡{hhan,liqun}@cs.wm.edu, §jiewu@temple.edu, †sanglu@nju.edu.cn

Abstract—Collecting histograms over RFID tags is an essential premise for effective aggregate queries and analysis in large-scale RFID-based applications. In this paper we consider efficient collection of histograms from a massive number of RFID tags, without the need to read all tag data. We first consider the problem of basic histogram collection, and propose an efficient algorithm based on the idea of ensemble sampling. We further consider the problems of advanced histogram collection, respectively, with an iceberg query and a top-𝑘 query. Efficient algorithms are proposed to tackle the above problems such that the qualified/unqualified categories can be quickly identified. Experimental results indicate that our ensemble sampling-based solutions achieve much better performance than the basic estimation/identification schemes.

Index Terms—Algorithms; RFID; Time efficiency; Histogram

I. INTRODUCTION

With the rapid proliferation of RFID-based applications, RFID tags have been deployed into pervasive spaces in increasingly large numbers. In applications like warehouse monitoring, the items are attached with RFID tags and densely packed into boxes. As the maximum scanning range of a UHF RFID reader is usually 6-10 m, the overall number of tags within this three-dimensional space can be up to tens of thousands in a dense deployment scenario, as envisioned in [1-3]. Many identification protocols have been proposed to uniquely identify the tags through anti-collision schemes. However, in a number of applications, only some useful statistical information needs to be collected, such as the overall tag size [2, 4], popular categories [5], and the histogram. In particular, histograms capture distribution statistics in a space-efficient fashion.

In practice, tags are typically attached to objects belonging to different categories, e.g., different brands and models of clothes in a large clothing store, different titles of books in a book store, etc. Collecting histograms can be used to illustrate the tag population belonging to each category, and to determine whether the number of tags in a category is above or below any desired threshold. By setting this threshold, it is easy to find popular merchandise and control stock, e.g., automatically signaling when more products need to be put on the shelf. Furthermore, the histogram can be used for approximate answering of aggregate queries, as well as for preprocessing and mining association rules in data mining [6]. Therefore, collecting histograms over RFID tags is an essential premise for effective queries and analysis in conventional RFID-based applications.

When dealing with a large-scale deployment with thousands of tags, the traditional tag identification scheme is not suitable for histogram collection, since the scanning time is proportional to the number of tags, and can be on the order of several minutes. Moreover, in most applications the tags are frequently moving into and out of the effective scanning area.
In order to capture the distribution statistics in time, it is essential to sacrifice some accuracy so that the main distribution can be obtained within a short time window, on the order of several seconds. Therefore, we seek to propose an estimation scheme that quickly counts the tag size of each category while achieving the accuracy requirement.

In most cases, the tag sizes of the various categories are subject to some skewed distribution with a “long tail”, such as the Gaussian distribution. The long tail represents a large number of categories, each of which occupies only a rather small percentage of the total population. When handling massive tag sets on the order of several thousands, the overall number of categories in the long tail can be in the hundreds. Therefore, separately estimating the tag size of each category requires a large number of query cycles and slots. Besides, in applications like the iceberg query and the top-𝑘 query, only the major categories need to be addressed. In this situation, the separate-estimation approach wastes a lot of scanning time on the minor categories in the long tail. Therefore, a novel scheme is essential to quickly collect the histograms over the massive RFID tags. In this paper, we propose a series of protocols to tackle the problem of efficient histogram collection. We make the following contributions.

1) To the best of our knowledge, we are the first to consider the problem of collecting histograms over RFID tags, which is a fundamental premise for queries and analysis in RFID-based applications. In order to achieve time efficiency, we propose a novel, ensemble sampling-based method to simultaneously estimate the tag sizes for a number of categories. This framework is very flexible and compatible with current tag-counting estimators, which can be efficiently leveraged to estimate the tag size for each category. 2) In order to tackle histogram collection with a filter condition, we propose an effective solution for the iceberg query problem. By considering the population and accuracy constraints, we propose an efficient algorithm to wipe out the unqualified categories in time, especially those categories in the long tail. We further present an effective solution
to tackle the top-𝑘 query problem. We use ensemble sampling to quickly estimate the threshold corresponding to the 𝑘-th largest category, and reduce the problem to the iceberg query problem. 3) While achieving time efficiency, our solutions are completely compatible with current industry standards, i.e., the EPCglobal C1G2 standards, and do not require any modification to the tags.

II. RELATED WORK

In RFID systems, a reader needs to receive data from multiple tags, while the tags are unable to self-regulate their radio transmissions to avoid collisions; hence, a series of slotted ALOHA-based anti-collision protocols have been designed to efficiently resolve collisions in RFID systems. In order to deal with the collision problems in multi-reader RFID systems, scheduling protocols for reader activation are explored in [7, 8]. Recently, a number of polling protocols [9, 10] have been proposed, aiming to collect information from battery-powered active tags in an energy-efficient manner.

Recent research has focused on the collection of statistical information over RFID tags [2, 4, 5, 11-13]. These works mainly consider the problem of estimating the number of tags without collecting the tag IDs. In [4], Murali et al. provide very fast and reliable estimation mechanisms for tag quantity in a practical setting. In [2], Shahzad et al. propose a new scheme for estimating tag population size called Average Run based Tag estimation. In [5], Sheng et al. consider the problem of identifying popular categories of RFID tags out of a large collection of tags, where the set of category IDs is assumed to be known. In this paper, our goal is to collect the histograms for all categories over RFID tags in a time-efficient manner, without any a priori knowledge of the categories, including the number of categories and the category IDs.

III. PRELIMINARY

A. The framed slotted ALOHA protocol

In the Class 1 Gen 2 standard, the RFID system leverages the framed slotted ALOHA protocol to resolve collisions during tag identification. When a reader wishes to read a set of tags, it first powers up and transmits a continuous wave to energize the tags. It then initiates a series of frames, varying the number of slots in each frame to best accommodate the number of tags. Each frame has a number of slots, and each active tag replies in a randomly selected slot per frame. After all tags are read, the reader powers down. We refer to the series of frames between power-down periods as a Query Cycle. Note that, within each frame, tags may choose the same slot, which causes multiple tags to reply during a slot. Therefore, within each frame there exist three kinds of slots: (1) the empty slot, where no tag replies; (2) the singleton slot, where exactly one tag replies; and (3) the collision slot, where multiple tags reply.

In regard to the tag ID, each tag has a unique 96-bit ID in its EPC memory, where the first 𝑠 binary bits can be regarded as the category ID (1 < 𝑠 < 96). According to the C1G2 standard, for each Query Cycle, the reader is able to select the tags in a specified category by sending a Select command with an 𝑠-bit mask in the category ID field. If multiple categories need to be selected, the reader can provide multiple bit masks in the Select command.
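To make the category-ID convention concrete, the following minimal sketch shows how the leading 𝑠 bits of a 96-bit EPC can be interpreted as a category ID. The helper names and the EPC values are our own illustrations; the actual Select mask encoding is defined by the C1G2 air protocol, and we only mimic its bit-mask semantics here.

    # Minimal sketch: derive the category ID (first s bits) of a 96-bit EPC.
    # Assumes the EPC is given as a 24-digit hex string; names are illustrative.
    def category_id(epc_hex: str, s: int) -> int:
        assert len(epc_hex) * 4 == 96 and 1 < s < 96
        epc = int(epc_hex, 16)      # 96-bit EPC as an integer
        return epc >> (96 - s)     # keep the leading s bits

    def select_mask(category: int, s: int) -> str:
        # Bit string a reader could place in the Select command's mask field.
        return format(category, '0{}b'.format(s))

    # Example: group tags by category using s = 16.
    epcs = ["3035EB63B0C0C04000000001", "3035EB63B0C0C04000000002"]
    groups = {}
    for e in epcs:
        groups.setdefault(category_id(e, 16), []).append(e)
    # Both example EPCs share their first 16 bits, so they fall in one group.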
B. The Impact of the Inter-cycle Overhead

The MAC protocol for the C1G2 system is based on slotted ALOHA. In order to accurately estimate the size of a specified set of tags, conventionally, the reader should issue multiple query cycles over the same set of tags and take the average of the estimates. The inter-cycle overhead consists of the time between cycles when the reader is powered down, and the carrier time used to power the tags before beginning communication. According to the experiment results in [14], which were conducted in realistic settings, these times are 40 ms and 3 ms, respectively, while the average time interval per slot is 1-2 ms. We have further measured the time interval for various slots with the USRP N210 platform: the average interval for empty slots is 1.6 ms per slot, and the average interval for singleton/collision slots is 5.1 ms per slot. Note that if the powered-down interval is not long enough, it is possible that some surrounding tags will maintain the former state of the inventoried flag with their local residual power, which causes them to keep silent in the upcoming query cycle. Therefore, the inter-cycle duration must be taken into account when considering overall reading performance. It is obvious that reading a larger number of tags per cycle amortizes the cost of the inter-cycle overhead, resulting in a lower per-tag reading time, while for small tag sets the inter-cycle overhead is significant.
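To make the amortization argument concrete, the back-of-envelope sketch below plugs in the timing figures quoted above (40 ms power-down plus 3 ms carrier time per cycle, 1.6 ms per empty slot, 5.1 ms per singleton/collision slot). It is a rough illustration, not a full timing model of the C1G2 link.

    # Rough per-tag reading cost of one query cycle, using the measured timings.
    T_OVERHEAD = 40.0 + 3.0   # ms: power-down interval + energizing carrier time

    def cycle_time_ms(n_empty: int, n_busy: int) -> float:
        # n_busy counts singleton and collision slots together.
        return T_OVERHEAD + 1.6 * n_empty + 5.1 * n_busy

    # Small tag set: 10 busy slots, 6 empty -> overhead dominates.
    small = cycle_time_ms(6, 10) / 10        # ~10.4 ms per busy slot
    # Large tag set: 1000 busy slots, 600 empty -> overhead is amortized.
    large = cycle_time_ms(600, 1000) / 1000  # ~6.1 ms per busy slot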
IV. PROBLEM FORMULATION

Suppose there are a large number of tags in the effective scanning area of the RFID reader, and the RFID system conforms to the EPCglobal C1G2 standards, i.e., the slotted ALOHA-based anti-collision scheme is used in the system model. The objective is to collect the histogram over RFID tags according to some categorization metric, e.g., the type of merchandise, while the present set of category IDs cannot be predicted in advance. As we aim at a dynamic environment where the tags may frequently enter and leave the scanning area, a time-efficient strategy must be proposed. Therefore, the specified accuracy can be relaxed in order to quickly collect the histogram. Assume that the overall tag size is 𝑛, there exist 𝑚 categories 𝐶 = {𝐶1, 𝐶2, ..., 𝐶𝑚}, and the actual tag size for each category is 𝑛1, 𝑛2, ..., 𝑛𝑚.

In the Basic Histogram Collection, the RFID system needs to collect the histogram for all categories. Due to the inherently inaccurate nature of RFID systems, users can specify the accuracy requirement for the histogram collection. Suppose the estimated tag size for category 𝐶𝑖 (1 ≤ 𝑖 ≤ 𝑚) is 𝑛ˆ𝑖; then the following accuracy constraint should be satisfied:

𝑃𝑟[∣𝑛ˆ𝑖 − 𝑛𝑖∣ ≤ 𝜖 ⋅ 𝑛𝑖] ≥ 1 − 𝛽 (accuracy constraint). (1)

The accuracy constraint illustrates that, given the exact tag size 𝑛𝑖 for a specified category, the estimated tag size 𝑛ˆ𝑖 should be in a confidence interval of width 2𝜖 ⋅ 𝑛𝑖, i.e., 𝑛ˆ𝑖/𝑛𝑖 ∈ [1 − 𝜖, 1 + 𝜖], with a probability greater than 1 − 𝛽. For example, if 𝜖 = 0.1 and 𝛽 = 0.05, then in regard to a category with tag size 𝑛𝑖 = 100, the estimated tag size 𝑛ˆ𝑖 should be within the range [90, 110] with a probability greater than 95%.

In the Iceberg Query Problem, only those categories with a tag size over a specified threshold 𝑡 are essential to be illustrated in the histogram, while the accuracy requirement is satisfied. As the exact tag size 𝑛𝑖 for category 𝐶𝑖 is unknown, then, given the estimated value of tag size 𝑛ˆ𝑖, it is possible to have false negative errors and false positive errors in verifying the population constraint. Therefore, it is essential to guarantee that the false negative/positive rate is below 𝛽, that is:

𝑃𝑟[𝑛ˆ𝑖 < 𝑡 ∣ 𝑛𝑖 ≥ 𝑡] < 𝛽, (2)
𝑃𝑟[𝑛ˆ𝑖 ≥ 𝑡 ∣ 𝑛𝑖 < 𝑡] < 𝛽. (3)

In the Top-𝑘 Query Problem, we use the definition of the probabilistic threshold top-𝑘 query (PT-Topk query), i.e., in regard to the tag size, only the set of categories where each takes a probability of at least 1 − 𝛽 to be in the top-𝑘 list are illustrated in the histogram, while the accuracy requirement is satisfied. Much like the iceberg query problem, as the exact tag size 𝑛𝑖 for category 𝐶𝑖 is unknown, then, given the estimated value of tag size 𝑛ˆ𝑖, it is possible to have false negative errors and false positive errors in verifying the population constraint; the following constraints must be satisfied:

𝑃𝑟[𝐶𝑖 is regarded out of top-𝑘 list ∣ 𝐶𝑖 ∈ top-𝑘 list] < 𝛽, (4)
𝑃𝑟[𝐶𝑖 is regarded in top-𝑘 list ∣ 𝐶𝑖 ∉ top-𝑘 list] < 𝛽. (5)

In this paper, we aim to propose a series of novel solutions to tackle the above problems, while satisfying the following properties: (1) time-efficient; (2) simple for the tag side in the protocol; (3) compliant with the EPCglobal C1G2 standards.
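As a quick illustration of constraint (1), the snippet below empirically checks whether a collection of repeated estimates 𝑛ˆ𝑖 for one category satisfies the accuracy constraint. It is only a sketch under the assumption that repeated, roughly unbiased estimates are available; the simulated Gaussian estimator is our stand-in, not part of the protocol.

    import random

    def meets_accuracy(estimates, n_i, eps, beta):
        # Empirical check of Pr[|n_hat - n_i| <= eps * n_i] >= 1 - beta.
        hits = sum(abs(e - n_i) <= eps * n_i for e in estimates)
        return hits / len(estimates) >= 1 - beta

    # Example matching the text: n_i = 100, eps = 0.1, beta = 0.05, so the
    # estimate should fall in [90, 110] with probability at least 95%.
    random.seed(1)
    samples = [random.gauss(100, 4) for _ in range(10000)]  # unbiased, sigma = 4
    print(meets_accuracy(samples, 100, 0.1, 0.05))  # True: 10 = 2.5 sigma (~98.8%)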
V. USE ENSEMBLE SAMPLING TO COLLECT HISTOGRAMS

When collecting the histograms over a large number of categories, the objective is to minimize the overall scanning time while the corresponding accuracy/population constraints are satisfied. Two straightforward solutions are summarized as follows: (1) Basic Tag Identification: the histogram is collected by uniquely identifying each tag from the massive tag set and putting it into the corresponding category, thus the accuracy is 100%; and (2) Separate Counting: the histogram is collected by estimating the number of tags in each category one by one. Specifically, for each category, the reader broadcasts a Select command to activate the tags in the specified category. According to the replies from the specified tags, estimators like [4][12][15] can be used to estimate the tag size for each category. As the rough tag size for each category cannot be predicted, a fixed initial frame size is used for each category.

Neither of the above two solutions is time-efficient. In regard to basic tag identification, uniquely identifying each tag in the massive set is not scalable, for as the tag size grows into a huge number, the scanning time becomes unacceptable. In regard to separate counting, the reader needs to scan each category with at least one query cycle, even if the category is a minor category, which is not necessarily addressed in the iceberg query or the top-𝑘 query. As the number of categories 𝑚 can be fairly large, e.g., on the order of hundreds, the Select command and the fixed initial frame size for each category, as well as the inter-cycle overhead among a large number of query cycles, make the overall scanning time rather large.

Therefore, we consider an ensemble sampling-based estimation scheme as follows: select a certain number of categories and issue a query cycle, obtain the empty/singleton/collision slots, and then estimate the tag size for each of the categories according to the sampling in the singleton slots. In this way, ensemble sampling is preferable to separate counting in terms of reading performance. Since more tags are involved in one query cycle, more slots amortize the cost of the inter-cycle overhead, the Select command, as well as the fixed initial frame size. Thus, the overall scanning time can be greatly reduced.

A. The Estimator SE

In the slotted ALOHA-based protocol, besides the empty slots and the collision slots, the singleton slots can be obtained. In the ensemble sampling-based estimation, according to the observed statistics of the empty/singleton/collision slots, we can use the estimators in [4][12][15], etc., to estimate the overall tag size. Then, according to the response in each singleton slot, the category ID is obtained from the first 𝑠 bits in the tag ID. Based on the sampling from the singleton slots, the tag size for each category can be estimated. The reason is as follows:

Assume that there exist 𝑚 categories 𝐶1, 𝐶2, ..., 𝐶𝑚, the overall tag size is 𝑛, and the tag size for each category is 𝑛1, 𝑛2, ..., 𝑛𝑚. We define an indicator variable 𝑋𝑖,𝑗 to denote whether exactly one tag of category 𝐶𝑖 selects slot 𝑗 inside the frame with size 𝑓. We set 𝑋𝑖,𝑗 = 1 if only one tag of category 𝐶𝑖 selects slot 𝑗, and 𝑋𝑖,𝑗 = 0 otherwise. Moreover, we use 𝑃𝑟[𝑋𝑖,𝑗 = 1] to denote the probability that only one tag of category 𝐶𝑖 selects slot 𝑗; then,

𝑃𝑟[𝑋𝑖,𝑗 = 1] = (1/𝑓) ⋅ (1 − 1/𝑓)^(𝑛−1) ⋅ 𝑛𝑖.

If we use 𝑛𝑠,𝑖 to denote the number of singleton slots selected by tags of category 𝐶𝑖, thus 𝑛𝑠,𝑖 = Σ_{𝑗=1}^{𝑓} 𝑋𝑖,𝑗, then the expected value is

𝐸(𝑛𝑠,𝑖) = Σ_{𝑗=1}^{𝑓} 𝑃𝑟[𝑋𝑖,𝑗 = 1] = (1 − 1/𝑓)^(𝑛−1) ⋅ 𝑛𝑖.

Furthermore, let 𝑛𝑠 denote the number of singleton slots; its expected value is 𝐸(𝑛𝑠) = (1 − 1/𝑓)^(𝑛−1) ⋅ 𝑛. Then, 𝐸(𝑛𝑠,𝑖)/𝐸(𝑛𝑠) = 𝑛𝑖/𝑛. Thus we can approximate the tag size of category 𝐶𝑖 as follows:

𝑛ˆ𝑖 = (𝑛𝑠,𝑖/𝑛𝑠) ⋅ 𝑛ˆ. (6)

Here, 𝑛ˆ is the estimated value of the overall tag size. Let 𝛼ˆ𝑖 = 𝑛𝑠,𝑖/𝑛𝑠; then 𝑛ˆ𝑖 = 𝛼ˆ𝑖 ⋅ 𝑛ˆ.
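The following sketch illustrates the SE estimator of Eq. (6): given the category IDs observed in the singleton slots and an estimate 𝑛ˆ of the overall tag size, it apportions 𝑛ˆ according to the singleton-slot shares. The n_hat argument stands in for whichever tag-counting estimator [4][12][15] is plugged in; the slot values are invented for the example.

    from collections import Counter

    def se_estimate(singleton_categories, n_hat):
        """Eq. (6): n_hat_i = (n_s_i / n_s) * n_hat.

        singleton_categories: category ID observed in each singleton slot.
        n_hat: estimate of the overall tag size from any existing estimator.
        """
        n_s = len(singleton_categories)         # number of singleton slots
        counts = Counter(singleton_categories)  # n_s_i per category
        return {c: (n_s_i / n_s) * n_hat for c, n_s_i in counts.items()}

    # Example: 60 singleton slots shared by three categories.
    slots = [0x35] * 30 + [0x2A] * 20 + [0x11] * 10
    print(se_estimate(slots, n_hat=240))  # {53: 120.0, 42: 80.0, 17: 40.0}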
B. Accuracy Analysis

1) Accuracy of the SE Estimator: In the ensemble sampling-based estimation, since estimators such as [4][12][15] can be utilized for estimating the overall number of tags, we use 𝛿 to denote the variance of 𝑛ˆ. We rely on the following theorem to illustrate the accuracy of the estimator SE.

Theorem 1: Let 𝛿𝑖 represent the variance of the estimator SE 𝑛ˆ𝑖, and let the load factor be 𝜌 = 𝑛/𝑓; then,

𝛿𝑖 = (𝑛𝑖/𝑛) ⋅ ((𝑒^𝜌 + 𝑛𝑖 − 1)/(𝑒^𝜌 + 𝑛 − 1)) ⋅ (𝛿 + 𝑛²) − 𝑛𝑖². (7)

Proof: See Appendix A.

2) Reducing the Variance through Repeated Tests: As the frame size for each query cycle has a maximum value, by estimating from the ensemble sampling within only one query cycle, the estimated tag size may not be accurate enough for the accuracy constraint. In this situation, multiple query cycles are essential to reduce the variance through repeated tests. Suppose the reader issues 𝑙 query cycles over the same set of categories; in regard to a specified category 𝐶𝑖, by utilizing the weighted statistical averaging method, the averaged tag size is 𝑛ˆ𝑖 = Σ_{𝑘=1}^{𝑙} 𝜔𝑘 ⋅ 𝑛ˆ𝑖,𝑘, where 𝜔𝑘 = (1/𝛿𝑖,𝑘)/Σ_{𝑘=1}^{𝑙}(1/𝛿𝑖,𝑘); here 𝑛ˆ𝑖,𝑘 and 𝛿𝑖,𝑘 respectively denote the estimated tag size and the variance for each cycle 𝑘. Then, the variance of 𝑛ˆ𝑖 is 𝜎𝑖² = 1/Σ_{𝑘=1}^{𝑙}(1/𝛿𝑖,𝑘).

Therefore, according to the accuracy constraint in the problem formulation, we rely on the following theorem to express this constraint in the form of the variance.

Theorem 2: Suppose the variance of the averaged tag size 𝑛ˆ𝑖 is 𝜎𝑖². The accuracy constraint is satisfied for a specified category 𝐶𝑖 as long as 𝜎𝑖² ≤ (𝜖/𝑍1−𝛽/2)² ⋅ 𝑛𝑖², where 𝑍1−𝛽/2 is the 1 − 𝛽/2 percentile of the standard normal distribution.

Proof: See Appendix B.

According to Theorem 2, we can verify whether the accuracy constraint is satisfied for each category by directly checking the variance against the threshold (𝜖/𝑍1−𝛽/2)² ⋅ 𝑛𝑖². If 1 − 𝛽 = 95%, then 𝑍1−𝛽/2 = 1.96.

3) Property of the Ensemble Sampling: According to Theorem 1, the normalized variance of the SE estimator, 𝜆𝑖 = 𝛿𝑖/𝑛𝑖, is equivalent to 𝜆𝑖 = ((𝛿 − 𝑛 ⋅ 𝑒^𝜌 + 𝑛)/(𝑒^𝜌 + 𝑛 − 1)) ⋅ (𝑛𝑖/𝑛) + ((𝛿 + 𝑛²)(𝑒^𝜌 − 1))/(𝑛 ⋅ (𝑒^𝜌 + 𝑛 − 1)). Let 𝑎 = (𝛿 − 𝑛 ⋅ 𝑒^𝜌 + 𝑛)/(𝑒^𝜌 + 𝑛 − 1) and 𝑏 = ((𝛿 + 𝑛²)(𝑒^𝜌 − 1))/(𝑛 ⋅ (𝑒^𝜌 + 𝑛 − 1)). Then, the normalized variance is 𝜆𝑖 = 𝑎 ⋅ (𝑛𝑖/𝑛) + 𝑏. Since the SE estimator can utilize any estimator like [4][12][15] to estimate the overall tag size, then, without loss of generality, if we use the estimator in [4], we can prove that 𝑎 < 0 for any value of 𝑛 > 0, 𝑓 > 0. This property applies to any estimator with variance smaller than 𝛿0 in ZE, which simply estimates the overall tag size based on the observed number of empty slots.

According to Theorem 2, in order to satisfy the accuracy constraint, we should ensure that 𝜆𝑖 ≤ (𝜖/𝑍1−𝛽/2)² ⋅ 𝑛𝑖. As 𝑎 < 0 for all values of 𝑓, this infers that the larger the value of 𝑛𝑖, the faster the specified category satisfies the accuracy constraint. On the contrary, the smaller the value of 𝑛𝑖, the slower the specified category satisfies the accuracy constraint. This occurs during the ensemble sampling when the major categories occupy most of the singleton slots, while the minor categories cannot obtain enough samplings in the singleton slots for an accurate estimation of the tag size.
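A small sketch of the repeated-test machinery: the weighted statistical average over 𝑙 query cycles and the Theorem 2 stopping test, with 𝑍1−𝛽/2 obtained from the standard normal quantile. The per-cycle estimates and variances are assumed to come from the SE estimator above; since the true 𝑛𝑖 is unknown at run time, the check plugs in the current estimate, which is our simplification.

    from statistics import NormalDist

    def combine_cycles(estimates, variances):
        # Weighted statistical averaging: w_k proportional to 1 / delta_i_k.
        inv = [1.0 / d for d in variances]
        n_hat = sum(w * e for w, e in zip(inv, estimates)) / sum(inv)
        var = 1.0 / sum(inv)   # sigma_i^2 of the averaged estimate
        return n_hat, var

    def accuracy_reached(n_hat, var, eps=0.1, beta=0.05):
        # Theorem 2: var <= (eps / Z_{1-beta/2})^2 * n_i^2, using n_hat for n_i.
        z = NormalDist().inv_cdf(1 - beta / 2)   # 1.96 for beta = 0.05
        return var <= (eps / z) ** 2 * n_hat ** 2

    n_hat, var = combine_cycles([96.0, 104.0, 99.0], [60.0, 80.0, 70.0])
    print(accuracy_reached(n_hat, var))  # variance ~23.0, threshold ~25.7 -> True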
C. Compute the Optimal Granularity for Ensemble Sampling

As indicated in the above analysis, during a query cycle of the ensemble sampling, in order to achieve the accuracy requirement for all categories, the essential scanning time mainly depends on the category with the smallest tag size, as the other categories must still be involved in the query cycle until this category achieves the accuracy requirement. Therefore, we use the notion group to define a set of categories involved in a query cycle of the ensemble sampling. Hence, each cycle of ensemble sampling should be applied over an appropriate group, such that the variance of the tag sizes for the involved categories is not too large. In this way, all categories in the same group achieve the accuracy requirement with very close finishing times. In addition, according to Eq. (7), as the number of categories in the ensemble sampling increases, the load factor 𝜌 increases, and the achieved accuracy for each involved category is reduced. Therefore, it is essential to compute an optimal granularity for the group in regard to the reading performance. Suppose there exist 𝑚 categories in total; the objective is to divide them into 𝑑 (1 ≤ 𝑑 ≤ 𝑚) groups for ensemble sampling, such that the overall scanning time is minimized while achieving the accuracy requirement.

For a specified group, in order for all involved categories to satisfy the accuracy requirement, it is essential to compute the required frame size for the category with the smallest tag size, say 𝑛𝑖. Let 𝑡𝑖 = (𝜖/𝑍1−𝛽/2)² ⋅ 𝑛𝑖; then, according to Theorem 2, we can compute the essential frame size 𝑓 such that 𝜆𝑖(𝑓) ≤ 𝑡𝑖. Assume that the inter-cycle overhead is 𝜏𝑐, and the average time interval per slot is 𝜏𝑠. Therefore, if 𝑓 ≤ 𝑓𝑚𝑎𝑥, then the total scanning time is 𝑇 = 𝑓 ⋅ 𝜏𝑠 + 𝜏𝑐. Otherwise, if the final estimate is the average of 𝑟 independent experiments, each with an estimator variance of 𝜆𝑖(𝑓𝑚𝑎𝑥), then the variance of the average is 𝜆𝑖(𝑓𝑚𝑎𝑥)/𝑟. Hence, if we want the final variance to be 𝑡𝑖, then 𝑟 should be 𝜆𝑖(𝑓𝑚𝑎𝑥)/𝑡𝑖, and the total scanning time is 𝑇 = (𝑓𝑚𝑎𝑥 ⋅ 𝜏𝑠 + 𝜏𝑐) ⋅ 𝑟.

We propose a dynamic programming-based algorithm to compute the optimal granularity for ensemble sampling. Assume that currently there are 𝑚 categories, ranked in non-increasing order of the estimated tag size, e.g., 𝐶1, 𝐶2, ..., 𝐶𝑚. We need to cut the ranked categories into one or more continuous groups for ensemble sampling. In regard to a single group consisting of the categories from 𝐶𝑖 to 𝐶𝑗, we define 𝑡(𝑖, 𝑗) as the essential scanning time for ensemble sampling, which is computed in the same way as the aforementioned 𝑇. Furthermore, we define 𝑇(𝑖, 𝑗) as the minimum overall scanning time over the categories from 𝐶𝑖 to 𝐶𝑗 among the various grouping strategies. Then, the recursive expression of 𝑇(𝑖, 𝑗) is shown in Eq. (8):

𝑇(𝑖, 𝑗) = min_{𝑖≤𝑘≤𝑗} {𝑡(𝑖, 𝑘) + 𝑇(𝑘 + 1, 𝑗)} if 𝑖 < 𝑗; 𝑇(𝑖, 𝑗) = 𝑡(𝑖, 𝑖) if 𝑖 = 𝑗. (8)

In Eq. (8), the value of 𝑇(𝑖, 𝑗) is obtained by enumerating each possible combination of 𝑡(𝑖, 𝑘) and 𝑇(𝑘 + 1, 𝑗), and then taking the minimum value of 𝑡(𝑖, 𝑘) + 𝑇(𝑘 + 1, 𝑗). By solving the overlapping subproblems in 𝑇(𝑖, 𝑗), the optimization problem is then reduced to computing the value of 𝑇(1, 𝑚).
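A compact sketch of the dynamic program in Eq. (8), memoizing the intervals 𝑇(𝑖, 𝑗). The group_time callback plays the role of 𝑡(𝑖, 𝑗) and is assumed to be computed from the frame-size analysis above; the toy cost used in the example is invented purely to exercise the recursion (indices are 0-based, whereas Eq. (8) is 1-based).

    from functools import lru_cache

    def optimal_grouping(sizes, group_time):
        """Eq. (8): T(i, j) = min over i <= k <= j of t(i, k) + T(k+1, j).

        sizes: estimated tag sizes ranked in non-increasing order.
        group_time(i, j): essential scanning time t(i, j) for one group,
        e.g., f * tau_s + tau_c or its repeated-test variant.
        """
        m = len(sizes)

        @lru_cache(maxsize=None)
        def T(i, j):
            if i > j:                       # empty interval costs nothing
                return 0.0, ()
            best, cuts = float('inf'), ()
            for k in range(i, j + 1):       # enumerate the first group [i, k]
                rest, rest_cuts = T(k + 1, j)
                cost = group_time(i, k) + rest
                if cost < best:
                    best, cuts = cost, ((i, k),) + rest_cuts
            return best, cuts

        return T(0, m - 1)                  # minimum time and chosen groups

    # Toy cost: a fixed per-cycle overhead plus a term that grows when a group
    # spans many categories or ends at a small (slow-to-estimate) category.
    demo = [100, 80, 75, 41, 35, 30, 20, 15, 12, 8]
    cost, groups = optimal_grouping(demo, lambda i, j: 50 + 400 * (j - i + 1) / demo[j])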
For example, suppose there is a set of tags with 10 categories; these categories are ranked in non-increasing order of the estimated tag size, say, {100, 80, 75, 41, 35, 30, 20, 15, 12, 8}, and they are finally divided into 3 groups for ensemble sampling according to the dynamic programming, i.e., {100, 80, 75}, {41, 35, 30}, and {20, 15, 12, 8}. In this way, the tag size of each category inside one group is close to the others. During the ensemble sampling, all categories in the same group can achieve the accuracy requirement with very close finishing times; very few slots are wasted waiting for the, comparatively speaking, minor categories. On the other hand, these categories are put together with an appropriate granularity for ensemble sampling, so as to sufficiently amortize the fixed time cost of each query cycle.

D. The Ensemble Sampling-based Algorithm

In Algorithm 1, we propose an ensemble sampling-based algorithm for the basic histogram collection. In the beginning, as the overall number of tags 𝑛 cannot be predicted, in order to accommodate a large operating range up to 𝑛¯, we need to set the initial frame size 𝑓 by solving 𝑓𝑒^(−𝑛¯/𝑓) = 5, as suggested in [4]. Then, during each cycle of ensemble sampling, we find the category with the largest population 𝜐 in the singleton slots, and set a threshold 𝑛𝑠,𝑖 > 𝜐 ⋅ 𝜃 (0 < 𝜃 < 1) to filter out those minor categories which occasionally occupy a small number of singleton slots. For example, suppose it is observed from the singleton slots that the numbers of slots occupied by the various categories are as follows: {35, 25, 10, 5, 3, 1}; if 𝜃 is set to 0.1, then the categories with the number of slots equal to 3 and 1 are filtered out from the next ensemble sampling, as illustrated in the sketch after Algorithm 1. Therefore, during the ensemble sampling, we can avoid estimating the tag sizes of those minor categories with a rather large variance. Then, the involved categories are further divided into smaller groups based on the dynamic programming. Therefore, as the major categories are estimated and wiped out from the set 𝑅 phase by phase, all categories, including the relatively minor categories, can be accurately estimated in terms of tag size. The query cycles continue to be issued until no singleton slots or collision slots exist.

Algorithm 1 Algorithm for histogram collection
1: Initialize the set 𝑅 to all tags. Set 𝑙 = 1.
2: while 𝑛𝑠 ∕= 0 ∧ 𝑛𝑐 ∕= 0 do
3: If 𝑙 = 1, compute the initial frame size 𝑓 by solving 𝑓𝑒^(−𝑛¯/𝑓) = 5. Otherwise, compute the frame size 𝑓 = 𝑛ˆ. If 𝑓 > 𝑓𝑚𝑎𝑥, set 𝑓 = 𝑓𝑚𝑎𝑥.
4: Set 𝑆 to ∅. Select the tags in 𝑅 and issue a query cycle with the frame size 𝑓; get 𝑛0, 𝑛𝑐, 𝑛𝑠. Find the category with the largest population 𝜐 in the singleton slots. For each category which appears in the singleton slots with population 𝑛𝑠,𝑖 > 𝜐 ⋅ 𝜃 (𝜃 is constant, 0 < 𝜃 < 1), add it to the set 𝑆.
5: Estimate the tag size 𝑛𝑖 for each category 𝐶𝑖 ∈ 𝑆 using the SE estimator. Compute the variance 𝛿′𝑖 for each category 𝐶𝑖 ∈ 𝑆 according to Eq. (7).
6: Rank the categories in 𝑆 in non-increasing order of the tag size. Divide the set 𝑆 into groups 𝑆1, 𝑆2, ..., 𝑆𝑑 according to the dynamic programming-based method.
7: for each 𝑆𝑗 ∈ 𝑆 (1 ≤ 𝑗 ≤ 𝑑) do
8: For each category 𝐶𝑖 ∈ 𝑆𝑗, compute the frame size 𝑓𝑖 from 𝛿𝑖 by solving 1/(1/𝛿′𝑖 + 1/𝛿𝑖) ≤ (𝜖/𝑍1−𝛽/2)² ⋅ 𝑛ˆ𝑖².
9: Obtain the maximum frame size 𝑓 = max_{𝐶𝑖∈𝑆𝑗} 𝑓𝑖. If 𝑓 < 𝑓𝑚𝑎𝑥, select all categories in 𝑆𝑗 and issue a query cycle with frame size 𝑓. Otherwise, select all categories in 𝑆𝑗 and issue 𝑟 query cycles with the frame size 𝑓𝑚𝑎𝑥. Wipe out the categories with satisfied accuracy after each query cycle.
10: Estimate the tag size 𝑛ˆ𝑖 for each category 𝐶𝑖 ∈ 𝑆𝑗, and illustrate them in the histogram.
11: end for
12: 𝑛ˆ = 𝑛ˆ − Σ_{𝐶𝑖∈𝑆} 𝑛ˆ𝑖. 𝑅 = 𝑅 − 𝑆. 𝑆 = ∅. 𝑙 = 𝑙 + 1.
13: end while
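A sketch of the filtering rule in step 4 of Algorithm 1, reproducing the worked example above ({35, 25, 10, 5, 3, 1} with 𝜃 = 0.1); the category labels are invented for illustration.

    def filter_minor(singleton_counts, theta=0.1):
        # Keep category i only if n_s_i > v * theta, where v is the largest
        # per-category singleton-slot count observed in this query cycle.
        v = max(singleton_counts.values())
        return {c: k for c, k in singleton_counts.items() if k > v * theta}

    counts = {'A': 35, 'B': 25, 'C': 10, 'D': 5, 'E': 3, 'F': 1}
    print(filter_minor(counts))  # drops E and F: 3 <= 3.5 and 1 <= 3.5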
VI. ENSEMBLE SAMPLING FOR THE ICEBERG QUERY

A. Motivation

In some applications, the users only pay attention to the major categories with tag sizes above a certain threshold 𝑡, while the minor categories are not necessarily addressed. Then, the iceberg query [16] is utilized to filter out those categories below the threshold 𝑡 in terms of tag size. In this situation, the separate counting scheme is especially unsuitable, since most of the categories are not within the scope of concern and can be wiped out together immediately. According to the definition in the problem formulation, three constraints for the iceberg query must be satisfied: the first constraint is the accuracy constraint, while the second and third constraints are the population constraints. In regard to the accuracy constraint, we have demonstrated in Theorem 2 that it can be expressed in the form of the variance constraint.

In regard to the population constraints, the second constraint infers that, in the results of the iceberg query, the false negative probability should be no more than 𝛽, while the third constraint infers that the false positive probability should be no more than 𝛽. We rely on the following theorem to express the population constraints in another equivalent form.

Theorem 3: The two population constraints, 𝑃𝑟[𝑛ˆ𝑖 < 𝑡 ∣ 𝑛𝑖 ≥ 𝑡] < 𝛽 and 𝑃𝑟[𝑛ˆ𝑖 ≥ 𝑡 ∣ 𝑛𝑖 < 𝑡] < 𝛽, are satisfied as long as the standard deviation of the averaged tag size satisfies 𝜎𝑖 ≤ ∣𝑛𝑖 − 𝑡∣/Φ⁻¹(1 − 𝛽), where Φ(𝑥) is the cumulative distribution function of the standard normal distribution.

Due to lack of space, we omit the proof of Theorem 3; the detailed proof is given in [17].

In order to better illustrate the inherent principle, Fig. 1 shows an example of the histogram with the 1 − 𝛽 confidence interval annotated; the 𝑦-axis denotes the estimated tag size for each category. In order to accurately verify the population constraint, it is required that the variance of the estimated tag size be small enough. Note that when the 1 − 𝛽 confidence interval