Occupancy data analytics and prediction: A case study

Xin Liang a,b, Tianzhen Hong b,*, Geoffrey Qiping Shen a

a Department of Building and Real Estate, Hong Kong Polytechnic University, Hong Kong, China
b Building Technology and Urban Systems Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA

* Corresponding author.
E-mail addresses: xin.c.liang@connect.polyu.hk (X. Liang), thong@lbl.gov (T. Hong).

Building and Environment 102 (2016) 179-192
Contents lists available at ScienceDirect
Journal homepage: www.elsevier.com/locate/buildenv
http://dx.doi.org/10.1016/j.buildenv.2016.03.027
0360-1323/© 2016 Elsevier Ltd. All rights reserved.

Article info

Article history: Received 6 January 2016; Received in revised form 12 March 2016; Accepted 25 March 2016; Available online 28 March 2016

Keywords: Occupancy prediction; Occupant presence; Data mining; Machine learning

Abstract

Occupants are a critical impact factor of building energy consumption. Numerous previous studies emphasized the role of occupants and investigated the interactions between occupants and buildings. However, a fundamental problem, how to learn occupancy patterns and predict occupancy schedules, has not been well addressed due to the highly stochastic activities of occupants and insufficient data. This study proposes a data mining based approach for occupancy schedule learning and prediction in office buildings. The proposed approach first recognizes the patterns of occupant presence by cluster analysis, then learns the schedule rules by decision tree, and finally predicts the occupancy schedules based on the induced rules. A case study was conducted in an office building in Philadelphia, U.S. Based on one year of observed data, the validation results indicate that the proposed approach significantly improves the accuracy of occupancy schedule prediction. The proposed approach only requires simple input data (i.e., the time series data of the number of occupants entering and exiting a building), which is available in most office buildings. Therefore, this approach is practical for facilitating occupancy schedule prediction, building energy simulation and facility operation.

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Buildings are responsible for a large share of energy consumption and greenhouse gas (GHG) emissions around the world. In the United States (U.S.), buildings consume approximately 40% of the total primary energy [1]; in Europe, the share is also about 40% [2]. In the last few decades, building energy consumption has continued to increase, especially in developing countries. In China, building energy consumption increased by more than 10% annually [3]. Large-scale commercial buildings have high energy use intensity, which can be up to 300 kWh/m² and 5-15 times that of residential buildings [4]. Office buildings accounted for approximately 17% of the energy use in the U.S. commercial building sector [5]. Therefore, office buildings play an important role in total energy consumption around the world.

Occupant behavior is considered a critical impact factor of energy consumption in office buildings. Numerous previous studies emphasize the role that occupants play in influencing the energy consumption in buildings and the energy savings expected if occupant behavior were changed [6-8]. Masoso and Grobler [7] indicated that more energy is used during non-working hours (56%) than during working hours (44%), mainly due to occupants leaving lights and equipment on at the end of the day. Other studies showed that different occupant behaviors can affect more than 40% of energy consumption in office buildings [9,10]. Azar and Menassa [6] suggested that energy conservation events, which improve energy saving behaviors, can save 16% of electricity in the building.

Occupant behavior is likewise a critical impact factor of energy simulation and prediction for office buildings. Numerous simulation models and platforms have been developed and are widely used to predict building energy consumption during the design, operation and retrofit phases. However, the differences between real energy consumption and estimated values are typically more than 30% [11]. In some extreme cases, the difference can reach 100% [12]. The International Energy Agency's Energy in the Buildings and Communities Program (EBC) Annex 53: “Total Energy Use in Buildings: Analysis & Evaluation Methods” identified six driving factors of energy use in buildings: (1) climate, (2) building envelope, (3) building energy and services systems, (4) indoor design criteria, (5) building operation and maintenance, and (6) occupant behavior. While the first five factors have been well addressed, the uncertainty of occupant presence and the variation of occupant behavior are considered the main reasons for prediction deviations
[12,13].

Owing to the significant impacts on energy consumption and prediction in buildings, a number of studies focused on the occupant's energy use characteristics, which are defined as the presence of occupants in the building and the actions they take (or do not take) that influence energy consumption [14]. D'Oca and Hong [15] observed and identified the patterns of window opening and closing behavior in an office building. Zhou et al. [16] analyzed lighting behavior in large office buildings based on a stochastic model. Zhang et al. [17] simulated occupant movement and light and equipment use behavior together with agent-based models. Sun et al. [18] investigated the impact of overtime working on energy consumption in an office building. Azar and Menassa [6] showed the education and learning effect on energy saving behavior, and discussed the impact of energy conservation promotion on energy saving.

Before modelling the occupant's energy use characteristics, there is a more essential research question: how to identify the pattern of occupant presence and predict the occupancy schedule? Without an answer to this question, models of the occupant's energy use characteristics lack a solid foundation. However, due to the highly stochastic activities of occupants and insufficient data, it is difficult to observe and predict occupant presence. Previous studies did not pay enough attention to occupancy schedules and this question has not been well addressed. In general, three typical methods were applied to model occupant presence in previous studies. The first method uses fixed schedules. Occupants are categorized into several groups (e.g., early bird, timetable complier and flexible worker), then each group is assigned a specific schedule [17]. Combining the schedules of each group proportionally generates the schedule of the whole building. The second method assumes that occupant presence follows a certain probability distribution.
The distribution can be a Poisson distribution [16], a binomial distribution [18], or a uniform or triangular distribution [19]. The occupancy schedule is then obtained by generating virtual occupants that follow the chosen distribution. The third method analyzes practical observation data. D'Oca and Hong [8] observed 16 private offices with single or dual occupancy and Wang et al. [20] observed 35 offices with single occupancy.

Although these methods had advantages and improved occupancy schedule modeling, there are still some limitations: (1) the assumptions are not solid. Occupancy schedules are highly stochastic, so it is inappropriate to simply assume that occupants belong to a certain group or follow a certain distribution; (2) the previous research emphasized summarizing rules of occupant presence, but less attention has been paid to predicting schedules in the future. The results are not practical if they cannot guide future work; (3) the resulting schedules lack validation with real data; (4) the observed data mainly covered a single office or a few offices, so the data are limited and the results may be biased if applied to the whole building.

To bridge the aforementioned research gaps, this study proposes a data mining based approach to learning and predicting the occupancy schedule for the whole building. Data mining can be defined as: “The analysis of large observation data sets to find unsuspected relationships and to summarize the data in novel ways so that owners can fully understand and make use of the data” [21]. Data mining methods have significant advantages in revealing underlying patterns of data and have been widely used in various research and industry fields, such as marketing, biology, engineering and social science [22]. However, the applications of data mining to occupancy schedules and building energy consumption are still underdeveloped.
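As a point of comparison for the data mining approach, the first traditional method reviewed above (fixed schedules per occupant group, combined proportionally) amounts to a weighted average of group schedules. The following is a minimal sketch under stated assumptions: the group names come from the text, but the schedule values, the group shares, the sampling hours and the function name `combine_schedules` are hypothetical and chosen only for illustration.

```python
# Hypothetical presence fractions (0..1) for three occupant groups, sampled at
# 6:00, 9:00, 12:00, 15:00, 18:00 and 21:00. All numbers are illustrative only.
GROUP_SCHEDULES = {
    "early_bird":         [0.6, 1.0, 0.8, 1.0, 0.2, 0.0],
    "timetable_complier": [0.0, 1.0, 0.5, 1.0, 0.5, 0.0],
    "flexible_worker":    [0.0, 0.5, 0.5, 1.0, 1.0, 0.5],
}

# Hypothetical share of the building population in each group (must sum to 1).
GROUP_SHARES = {
    "early_bird": 0.2,
    "timetable_complier": 0.6,
    "flexible_worker": 0.2,
}


def combine_schedules(schedules, shares):
    """Return the whole-building schedule as the share-weighted sum of group schedules."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("group shares must sum to 1")
    n_steps = len(next(iter(schedules.values())))
    return [
        sum(shares[group] * schedules[group][t] for group in schedules)
        for t in range(n_steps)
    ]


building_schedule = combine_schedules(GROUP_SCHEDULES, GROUP_SHARES)
```

The sketch makes the first limitation listed above visible: every day receives the same schedule, so day-to-day variation in presence is fixed entirely by the assumed groups and shares.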
Some previous studies applied data mining methods to discover the pattern of occupant behavior [15,23,24], and others focused on interactions between occupants and energy consumption [8,25,26]. These studies demonstrated the strong power of data mining methods in recognizing patterns of occupant behavior and energy consumption, but occupancy schedule learning and prediction still needs exploration.

The aim of this study is to present a new approach for occupancy schedule learning and prediction in office buildings by using data mining based methods. The process of this study includes recognizing the patterns of occupant presence, summarizing the rules of the recognized patterns and finally predicting the occupancy schedules. This study hypothesizes that the patterns and rules identified by the proposed data mining approach are correct, namely that they represent the true characteristics of the occupancy data. This hypothesis is validated by comparing the prediction accuracy of the proposed method with that of the traditional methods. If the accuracy of the prediction results is improved, it indicates that the hypothesis is true.

This model only needs a few types of inputs, typically the time series data of the number of occupants entering and exiting a building. Another advantage of this model is that it allows for relatively simple operations, excluding probability distribution fitting and other complex mathematical processing. This means the method adapts well to practical projects. The results of this study are critical to provide insight into the pattern of occupant presence, facilitate energy simulation and prediction, and improve energy saving operation and retrofit.

2. Methodology

2.1. Framework of occupancy schedule learning and prediction

Traditional methods of transforming data to knowledge normally used statistical tests, regression and curve fitting by a certain probability distribution.
These methods are effective when the data are small in volume, accurate and standardized. However, as the volume of data has grown exponentially in recent years, these methods have become slow and expensive. More seriously, when there are considerable missing or outlying data, or when the data format is inconsistent (e.g., different time steps, or a mix of numbers and words), these methods cannot be applied or cannot produce satisfactory results. Data mining is an emerging method which can process big data and unstructured data effectively and robustly. Machine learning, as a main method of data mining, is specifically good at identifying patterns and inducing rules. Since this study involves a large volume of data and aims to induce rules of occupancy schedules, data mining is selected as the research method.

Data mining, which is also named knowledge discovery in databases (KDD), is a relatively young and interdisciplinary field of computer science. It is the process of discovering new patterns from large data sets, involving methods at the intersection of pattern recognition, machine learning, artificial intelligence, cloud architecture, and data visualization [27]. Normally, the process of KDD involves six steps: (1) Data selection; (2) Data cleaning and preprocessing; (3) Data transformation; (4) Data mining; (5) Data interpretation and evaluation; and (6) Knowledge extraction [8].

This study proposes a data mining based approach to discover occupancy schedule patterns and extrapolate the occupancy schedule from observed big data streams of a building. The framework of this proposed method includes six steps, illustrated in Fig. 1.

Step 1: problem framing. The first step is to clarify the problem definition, boundary, assumptions and key metric of success. The research problem is defined as how to predict the occupancy schedule from historical observed data. The scope of this study focuses on the schedule prediction for weekdays in office buildings.
The key metric of success is the similarity of prediction results to the observed data.

Step 2: data acquisition and preparation. The second step is to
acquire, harmonize, rescale, clean and format data. Due to the failure of sensors and other interference factors, the raw data may contain missing data, erroneous data and unstructured data. Before data mining, the raw data should be pre-processed to obtain valid data. In this study, the missing data are removed from the data set. Statistical methods (i.e., box plot and mean value) are used to investigate the characteristics of the data before data mining.

Step 3: methodology selection. Data mining involves various kinds of methods. Different methods target problems at different levels. Appropriate methods should be selected according to the specific problem and data source. In this study, a machine learning method is adopted to discover patterns of occupant presence, while rule induction is used to summarize rules within the patterns. Software selection is essential to analyze data. MATLAB 2015 and RapidMiner 6.5, run on a standard PC with Windows 7, are used to perform the data processing and data mining, respectively. RapidMiner is open source software with a visualized interface and modularized operation for analytics and data mining. Due to its flexibility and accessibility, RapidMiner has been widely used in industry and academia.

Step 4: learning. This step is to discover the patterns of occupancy schedule and abstract the rules within the patterns. Clustering and decision tree are applied for pattern recognition and rule induction respectively. The details of the processes and results of each step are illustrated in the learning phase in Fig. 2.

Step 5: prediction. The observed data are split into a training set and a test set. The training set is used to train the model and identify the rules, shown in the predicting phase in Fig. 2. Based on the identified patterns and rules of occupant presence, the occupancy schedule can be predicted.

Step 6: validation. This step is to compare the prediction result to the test data set, shown in the validating phase in Fig. 2.
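To make the pattern recognition part of Step 4 more concrete, the sketch below clusters daily occupancy profiles by Euclidean distance. It is only an illustration under stated assumptions: the study itself uses RapidMiner, whereas this is plain Python; k-means is assumed here as the clustering algorithm; and the profiles, the function names `kmeans` and `nearest_pattern`, and all parameter values are hypothetical. The rule induction part of Step 4 (decision tree) is not shown.

```python
import math
import random

# Hypothetical daily profiles: fraction of occupants present at five times of day.
# The first three days resemble a "full day" pattern, the last three an "early leave" one.
DAILY_PROFILES = [
    [0.10, 0.90, 0.80, 0.90, 0.60],
    [0.12, 0.88, 0.78, 0.92, 0.58],
    [0.08, 0.92, 0.82, 0.88, 0.62],
    [0.10, 0.80, 0.50, 0.30, 0.10],
    [0.12, 0.78, 0.52, 0.28, 0.12],
    [0.08, 0.82, 0.48, 0.32, 0.08],
]


def euclidean(a, b):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_pattern(profile, centroids):
    """Index of the centroid (pattern) closest to the given profile."""
    return min(range(len(centroids)), key=lambda c: euclidean(profile, centroids[c]))


def kmeans(profiles, k, iterations=50, seed=0):
    """Plain k-means: returns one cluster label per profile and the k centroids."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct observed profiles.
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each day joins its nearest centroid.
        new_labels = [nearest_pattern(p, centroids) for p in profiles]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean profile of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


labels, centroids = kmeans(DAILY_PROFILES, k=2)
```

The cluster labels produced this way play the role of the identified patterns, and `nearest_pattern` assigns a new day to one of them. Cluster numbering depends on the random initialisation, so only the grouping is meaningful, not the label values.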
[Fig. 1. Framework of the proposed method for occupancy schedule learning and predicting. Each step is listed with its methods/tools and its outcome: (1) Problem framing: literature review, expert interview; outcome: problem statement, assumption and key metrics. (2) Acquire and prepare data: acquire, harmonize, rescale, clean and format data; outcome: valid data. (3) Methodology selection: identify problem solving approaches and software; outcome: selected approaches and software tools. (4) Learning: machine learning, rule induction; outcome: patterns and rules of occupancy schedule. (5) Prediction: prediction method based on occupancy pattern; outcome: results of occupancy presence prediction. (6) Validation: compare prediction results to observed data; outcome: effect of the proposed method.]

The more similar the two sets are, the better the method is. To quantitatively validate the proposed method, several metrics can be
applied to measure similarity between prediction results and observed data, including mean, median, bias, RMSE (root mean squared error) and RTE (relative total error). The details of the metrics and validation will be introduced in Section 3.5.

2.2. Machine learning

Machine learning is an important method of data mining [27], which allows computers to learn from and make predictions on data via observation, experience, analysis and self-training [27,28]. It operates by building a model to make data-driven predictions or decisions, rather than following strictly static program instructions [29].

There are two types of machine learning, namely supervised learning and unsupervised learning [30]. The former refers to traditional learning methods that use training data, i.e., a known labeled data set of inputs and outputs. In a standard supervised learning problem, training samples $(X, Y) = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ are provided for an unknown function $Y = F(X)$. $X$ denotes the "input" variables, also called input features, and $Y$ denotes the "output" or target variables that the model tries to predict. The $x_i$ values are typically vectors of the form $(x_{i1}, x_{i2}, \ldots, x_{in})$, whose components are the features of $x_i$, such as weight, color, shape and so on. The notation $x_{ij}$ refers to the $j$-th feature of $x_i$. The goal of supervised learning is to learn a general rule $F(X)$ that maps inputs $X$ to outputs $Y$, as shown in Fig. 3(a). Typical supervised learning algorithms include regression, Bayesian statistics and decision trees.

Unsupervised learning refers to methods in which no labels are given to the learning algorithm, leaving it on its own to find structure in its input. In unsupervised learning, there is no "output" $Y$ to train the function $F(X)$. The goal of unsupervised learning is to discover hidden patterns in the input data $X$ from its own features, as shown in Fig. 3(b). In reality, a priori information about the outputs is unavailable for numerous problems.
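To make the two settings concrete, the following is a minimal Python sketch, not the study's implementation. The toy feature vectors, the label names `full_day` and `half_day`, and the helper names `fit_stump` and `two_groups` are all hypothetical. The supervised part learns a single-split rule (a depth-one stand-in for a decision tree) from labeled pairs $(x_i, y_i)$; the unsupervised part splits the same inputs into two groups by distance to a group mean, without seeing any labels.

```python
from statistics import mean

# Hypothetical toy data: each x_i = (x_i1, x_i2) is a feature vector,
# here (arrival hour, hours present); each y_i is a made-up label.
X = [(8.0, 9.0), (8.5, 8.5), (9.0, 8.0), (13.0, 4.0), (13.5, 3.5), (14.0, 3.0)]
Y = ["full_day", "full_day", "full_day", "half_day", "half_day", "half_day"]


def fit_stump(X, Y):
    """Supervised: learn a one-split rule F(x) from labeled pairs (x_i, y_i)."""
    best = None
    for j in range(len(X[0])):                    # candidate feature index j
        values = sorted({x[j] for x in X})
        for lo, hi in zip(values, values[1:]):    # thresholds at midpoints
            t = (lo + hi) / 2
            left = [y for x, y in zip(X, Y) if x[j] <= t]
            right = [y for x, y in zip(X, Y) if x[j] > t]
            l_lab = max(set(left), key=left.count)    # majority label per side
            r_lab = max(set(right), key=right.count)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    _, j, t, l_lab, r_lab = best
    return lambda x: l_lab if x[j] <= t else r_lab


def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def two_groups(X, iterations=10):
    """Unsupervised: group the x_i by similarity, using no labels at all."""
    centers = [X[0], X[-1]]                       # naive deterministic start
    assign = []
    for _ in range(iterations):
        # Assign every sample to its nearest center.
        assign = [min(range(2), key=lambda k: dist2(x, centers[k])) for x in X]
        # Move each center to the mean of the samples assigned to it.
        for k in range(2):
            members = [x for x, a in zip(X, assign) if a == k]
            if members:                           # keep old center if group is empty
                centers[k] = tuple(mean(col) for col in zip(*members))
    return assign


F = fit_stump(X, Y)
print([F(x) for x in X])   # labels predicted by the learned rule
print(two_groups(X))       # group index per sample, found without labels
```

The supervised function needs `Y` to score candidate rules, whereas the grouping function never touches `Y`; that difference is the whole distinction between the two settings.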
Therefore, unsupervised learning has recently been widely used to solve this kind of problem.

This study uses both supervised learning and unsupervised learning, in two steps. At the beginning, the occupancy schedule data have no labels, so the unsupervised learning method (i.e., clustering) is applied to identify patterns of occupant presence from the features of the data. After that, the presence data have labels, which are the identified patterns. Then, the supervised learning method (i.e., decision tree) is applied to induce rules based on the labeled data.

Fig. 2. Processes of the proposed method and results.

2.2.1. Cluster analysis

Cluster analysis is a typical unsupervised machine learning method, which aims to group data into a few cohesive clusters [31]. The criterion of clustering is the similarity among samples. Samples should have high similarity within the same cluster but low similarity across different clusters. Similarity is normally measured by distance: the shorter the distance between samples, the more similar the samples are. There are various distance definitions, including the Euclidean distance, the Chebyshev distance, the Hamming distance, the dynamic time warping distance and the correlation distance [32]. An appropriate distance type should be selected according to the specific problem. For example, the Euclidean distance is commonly used for direct geometrical distance. The correlation distance is good at triangle similarity. Dynamic time warping is commonly used for the similarity of time-shifted sequences. This study compares three kinds of distances, shown in Fig. 10, and selects the Euclidean distance due to its best
performance.

There are various clustering models, and for each of these models, different algorithms can be given [33]. Typical cluster models include connectivity-based models (e.g., hierarchical clustering), centroid-based models (e.g., k-means clustering), distribution-based models (e.g., Gaussian distribution fitting) and density-based models (e.g., density-based spatial clustering of applications with noise) [34]. Among the numerous clustering algorithms, k-means clustering is the most commonly used. It is defined as follows.

1. Initialize cluster centroids $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n$.
2. Repeat until convergence: {
   For every $i$, set
   $$c_i = \arg\min_j \lVert x_i - \mu_j \rVert \tag{1}$$
   For every $j$, set
   $$\mu_j = \frac{\sum_{i=1}^{m} a_i x_i}{\sum_{i=1}^{m} a_i}, \qquad a_i = \begin{cases} 1 & \text{if } c_i = j \\ 0 & \text{if } c_i \neq j \end{cases} \tag{2}$$
   }

In the k-means algorithm, $k$ (a parameter of the algorithm) is the preset number of clusters. The cluster centroids $\mu_j$ represent the positions of the centers of the clusters. Step 1 initializes the cluster centroids, randomly or by a specific method. Step 2 finds the optimal cluster centroids and the samples assigned to them. Two operations are performed iteratively until convergence in this step. The first assigns each training sample $x_i$ to the closest cluster centroid $\mu_j$, as shown in Eq. (1). The second moves each cluster centroid $\mu_j$ to the mean of the points assigned to it, as shown in Eq. (2).

The appropriate clustering algorithm for a particular problem needs to be chosen experimentally, since there is no defined "best" clustering algorithm [33]. The most appropriate algorithm for a given problem can be selected by its performance. The performance of an algorithm can be measured by how well defined its clusters are, namely the ratio of intra-cluster distance to inter-cluster distance. The Davies-Bouldin index (DBI) is used to evaluate different methods in this study. This index is defined in Eq. (3).

$$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right) \tag{3}$$

where $n$ is the number of clusters, $c_i$ is the centroid of cluster $i$, $\sigma_i$ is the average distance of all elements in cluster $i$ to centroid $c_i$, and $d(c_i, c_j)$ is the distance between centroids $c_i$ and $c_j$. A lower DBI value means lower intra-cluster distances (higher intra-cluster similarity) and higher inter-cluster distances (lower inter-cluster similarity). Therefore, the clustering algorithm with the smallest DBI is considered the best algorithm based on this criterion.

Fig. 3. Mechanism of machine learning: (a) supervised learning; (b) unsupervised learning.

2.2.2. Decision tree learning

This study uses a decision tree to induce the rules of occupant presence. Decision tree learning is a typical supervised machine learning algorithm in data mining [35]. It uses a tree-like structure to model the rules and their possible consequences. A main advantage of the decision tree method is that it can represent the rules