Copyright © 2018 Pearson Education, Inc.

Data reduction as it applies to variable selection is more complex, because some variables must be selected for study and the others discarded. This selection is typically done by experts in the field.

Section 2.5 Review Questions

1. What is the relationship between statistics and business analytics?

Statistics can be used as a part of business analytics, either to help generate reports or as a presentation format.

2. What are the main differences between descriptive and inferential statistics?

Descriptive statistics is all about describing the sample data on hand, and inferential statistics is about drawing inferences or conclusions about the characteristics of the population.

3. List and briefly define the central tendency measures of descriptive statistics.

Measures of centrality are the mathematical methods by which we estimate or describe the central positioning of a given variable of interest. A measure of central tendency is a single numerical value that describes a set of data by identifying or estimating the central position within the data. The arithmetic mean (or simply mean or average) is the sum of all the values/observations divided by the number of observations in the data set. The median is the center value of a given data set: the number in the middle of the data after it has been sorted in order of magnitude (either ascending or descending). The mode is the observation that occurs most frequently (the most frequent value in the data set).

4. List and briefly define the dispersion measures of descriptive statistics.

Measures of dispersion are the mathematical methods used to estimate or describe the degree of variation in a given variable of interest. The range is the difference between the largest and the smallest values of a given variable in the data set. Variance measures how far the data points in a given data set deviate from the mean; it is calculated as the average of the squared deviations from the mean.
The standard deviation is a measure of the spread of values within a set of data. It is calculated by taking the square root of the variance. Mean absolute deviation is calculated by taking the absolute values of the differences between each data point and the mean and averaging them. Quartiles help us identify spread within a subset of the data. A quartile is a quarter of the number of data points given in a data set. Quartiles are determined by first sorting the data and then splitting the sorted data into four disjoint smaller data sets.

5. What is a box-and-whiskers plot? What types of statistical information does it represent?

The box-and-whiskers plot is a graphical illustration of several descriptive statistics about a given data set. The box plot shows the centrality, the dispersion, and the minimum and maximum ranges.

6. What are the two most commonly used shape characteristics to describe a data distribution?

Skewness is a measure of asymmetry in a distribution of data that has a unimodal structure (only one peak exists in the distribution). Kurtosis is another measure used to characterize the shape of a unimodal distribution; it describes the peaked (tall and skinny) nature of the distribution.

Section 2.6 Review Questions

1. What is regression, and what statistical purpose does it serve?

Regression is a relatively simple statistical technique to model the dependence of a variable (response or output variable) on one (or more) explanatory (input) variables.

2. What are the commonalities and differences between regression and correlation?

Correlation makes no a priori assumption of whether one variable is dependent on the other(s) and is not concerned with the relationship between variables; instead it gives an estimate of the degree of association between the variables. On the other hand, regression attempts to describe the dependence of a response variable on one (or more) explanatory variables, where it implicitly assumes that there is a one-way causal effect from the explanatory variable(s) to the response variable, regardless of whether the path of effect is direct or indirect. Also, although correlation is interested in the low-level relationships between two variables,
regression is concerned with the relationships between all explanatory variables and the response variable.

3. What is OLS? How does OLS determine the linear regression line?

The ordinary least squares (OLS) method aims to minimize the sum of squared residuals and leads to a mathematical expression for the estimated parameters (coefficients) of the regression line.

4. List and describe the main steps to follow in developing a linear regression model.

First, perform a quick assessment of the data through the use of a scatter plot and/or correlations. Next, perform model fitting by transforming the data into a more usable format and estimating any needed parameters. Third, perform model assessment by testing the assumptions and evaluating the fit. Finally, if these steps show that regression is warranted, deploy the model and calculate the regression.

5. What are the most commonly pronounced assumptions for linear regression?

The most commonly pronounced assumptions for linear regression include linearity, independence of errors, normality of errors, constant variance, and the absence of multicollinearity among the explanatory variables.

6. What is logistic regression? How does it differ from linear regression?

Logistic regression is a very popular, statistically sound, probability-based classification algorithm that employs supervised learning. It differs from linear regression in one major point: its output (response variable) is a class as opposed to a numerical variable.

7. What is time series? What are the main forecasting techniques for time series data?

A time series is a sequence of values of the variable of interest observed at successive points in time. Time series forecasting is the use of mathematical modeling to predict future values of the variable of interest based on previously observed values. Commonly used techniques are averaging methods such as the moving average and exponential smoothing.

Section 2.7 Review Questions

1. What is a report? What are reports used for?

A report is any communication artifact prepared with the specific intention of conveying information in a presentable form to whoever needs it, whenever and wherever they may need it. It is typically a document that contains information (usually derived from data and personal experiences) organized in a narrative, graphic, and/or tabular form, prepared periodically (recurring) or on an as-required (ad hoc) basis, referring to specific time periods, events, occurrences, or subjects.
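
As a small illustration of information organized in tabular form, the sketch below builds a plain-text summary report from the central tendency and dispersion measures defined in Section 2.5. It is only a sketch under stated assumptions: the function name summary_report and the sample figures are hypothetical, the sample (n - 1) form of the variance is one possible choice, and the quartiles follow the default method of Python's statistics.quantiles (Python 3.8 or later).

```python
import statistics


def summary_report(values):
    """Return descriptive statistics for a list of numbers as a dict."""
    n = len(values)
    mean = statistics.mean(values)
    # Quartile cut points; the default "exclusive" method is an assumption here.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        # Sample variance (divides by n - 1); pvariance() would divide by n.
        "variance": statistics.variance(values),
        # Standard deviation is the square root of the variance.
        "std_dev": statistics.stdev(values),
        # Mean absolute deviation: average absolute distance from the mean.
        "mean_abs_dev": sum(abs(x - mean) for x in values) / n,
        "q1": q1,
        "q3": q3,
    }


if __name__ == "__main__":
    # Hypothetical weekly sales figures, used only for illustration.
    weekly_sales = [12, 15, 15, 18, 20, 22, 24]
    for name, value in summary_report(weekly_sales).items():
        print(f"{name:<14}{value:>10.2f}")
```

Running the script prints one row per measure, which is the kind of tabular output a recurring report might contain.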
2. What is a business report? What are the main characteristics of a good business report?

A business report is a written document that contains information regarding business matters. Business reporting (also called enterprise reporting) is an essential part of the larger drive toward improved managerial decision making and organizational knowledge management. The foundation of these reports is various sources of data coming from both inside and outside the organization. Creation of these reports involves ETL (extract, transform, and load) procedures in coordination with a data warehouse and then using one or more reporting tools. While reports can be distributed in print form or via e-mail, they are typically accessed via a corporate intranet. Primary characteristics of a good business report include clarity, brevity, completeness, and correctness.

3. Describe the cyclic process of management and comment on the role of business reports.

The cyclic process of management, as illustrated in Figure 2.1, involves these steps: data acquisition leads to information generation, which leads to decision making, which leads to business process management. Perhaps the most critical task in this cyclic process is the reporting (i.e., information generation): converting data from different sources into actionable information.

4. List and describe the three major categories of business reports.

There are a wide variety of business reports, which for managerial purposes can be grouped into three major categories: metric management reports, dashboard-type reports, and balanced scorecard-type reports. Metric management reports involve outcome-oriented metrics based on service level agreements and/or key performance indicators. Dashboard-type reports present a range of performance indicators on one page, with both static/predefined elements and customizable widgets and views. Balanced scorecard reports present an integrated view of a company's health and include financial, customer, business process, and learning/growth perspectives.

5. What are the main components of a business reporting system?

A business reporting system includes several components. One is the online transaction processing system (ERP, POS, etc.) that records transactions. A second is a data supply that takes recorded events and transactions and delivers them to the reporting system. Next comes an ETL component that ensures quality and performs necessary transformations prior to loading the data into a data store. Then there is the data storage itself (such as a data warehouse). Business logic converts the data into the reporting outputs. Publication distributes or hosts the reports for end users. And finally, assurance provides a quality control check on the reports and their dissemination.
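
The components listed in question 5 can be pictured as stages of a small pipeline. The following is a toy sketch only, not a description of any real reporting product: the record layout, the function names (etl, business_logic, publish, assure), and the sample transactions are all hypothetical, and an in-memory list stands in for the data warehouse.

```python
from collections import defaultdict

# Hypothetical raw records, standing in for what an OLTP system (ERP, POS)
# would record and the data supply would deliver.
RAW_TRANSACTIONS = [
    {"store": "north", "amount": "120.50"},
    {"store": "south", "amount": "80.00"},
    {"store": "north", "amount": "not-a-number"},  # bad record
    {"store": "south", "amount": "45.25"},
]


def etl(raw_rows):
    """ETL: validate and transform raw rows before loading them."""
    clean = []
    for row in raw_rows:
        try:
            clean.append({"store": row["store"], "amount": float(row["amount"])})
        except (KeyError, ValueError):
            continue  # quality check: skip rows that cannot be parsed
    return clean


def business_logic(stored_rows):
    """Business logic: aggregate stored rows into total sales per store."""
    totals = defaultdict(float)
    for row in stored_rows:
        totals[row["store"]] += row["amount"]
    return dict(totals)


def publish(totals):
    """Publication: render the report as plain-text lines for end users."""
    return [f"{store}: {total:.2f}" for store, total in sorted(totals.items())]


def assure(totals, stored_rows):
    """Assurance: check that report totals reconcile with the stored data."""
    stored_sum = sum(row["amount"] for row in stored_rows)
    return abs(sum(totals.values()) - stored_sum) < 1e-9


if __name__ == "__main__":
    data_store = etl(RAW_TRANSACTIONS)  # in-memory stand-in for a warehouse
    totals = business_logic(data_store)
    if assure(totals, data_store):
        print("\n".join(publish(totals)))
```

Keeping each stage as a separate function mirrors the separation of components in the answer above; in practice each stage would be a distinct system rather than a function call.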
Section 2.8 Review Questions

1. What is data visualization? Why is it needed?

Data visualization, perhaps more appropriately called "information visualization," is the use of visual representations to explore, make sense of, and communicate data. It is closely related to the fields of information graphics, scientific visualization, and statistical graphics. What is portrayed in visualizations is the information (aggregations, summarizations, and contextualization) and not the data. Companies and individuals increasingly rely on data to make good decisions. Because data is so voluminous, there is a need for visual tools that help people understand it.

2. What are the historical roots of data visualization?

Predecessors to data visualization date back to the second century AD. Today's most popular visual forms date back a few centuries. Geographical exploration, mathematics, and popularized history spurred the creation of early maps, graphs, and timelines as far back as the 1600s. The now familiar line and bar charts date back to the late 1700s. Charles Joseph Minard used visualizations to graphically portray the losses suffered by Napoleon's army in the Russian campaign of 1812. The 1900s saw the rise of a more formal, empirical attitude toward visualization, which tended to focus on aspects such as color, value scales, and labeling. In the 2000s the Internet emerged as a new medium for visualization and added interactivity to previously static graphics.

3. Carefully analyze Charles Joseph Minard's graphical portrayal of Napoleon's march. Identify and comment on all of the information dimensions captured in this ancient diagram.

In this graphic Minard managed to simultaneously represent several data dimensions, including the size of the army, direction of movement, geographic locations, outside temperature, etc. He did this in an artistic and informative manner. The background of the image is a map depicting the location of battles. There is a thick lighter band that shows the size of Napoleon's army at each position, and a dark lower one that depicts the retreat. A line at the bottom depicts temperatures at each position in time and space.

4. Who is Edward Tufte? Why do you think we should know about his work?

Edward Tufte is a statistician whose website chronicles many historical data visualizations, including Minard's graphic of Napoleon's defeat. His work can bring insights into how to follow best practices for information visualization.

5. What do you think is the "next big thing" in data visualization?

The future of data/information visualization is very hard to predict. We can only extrapolate from what has already been invented: more three-dimensional