3.2 Critiquing aid

After recommended items are computed and displayed to the user, the critical concern becomes how to aid users in providing critiques on these items. As introduced before, there are principally two types of critiquing aids: the system-suggested critiquing approach, which generates and proposes a limited set of critiques for users to select from, and the user-initiated critiquing approach, which does not offer precomputed critiques but allows users to create and compose critiques on their own.

The user-initiated method is more flexible in supporting various critique forms. For example, in the ExampleCritiquing interface, users can choose to make a similarity-based critique (e.g., "find some cameras similar to this one"), a quality-based critique (e.g., "find a similar camera, but cheaper"), or even a quantity-based critique (e.g., "find something similar to this camera, but at least $100 cheaper"). The system-suggested critiquing approach is limited in this respect, given that it is the system, not the user, that determines the critique's form. In fact, FindMe and DynamicCritiquing only suggest quality-based critiques (e.g., "cheaper," "bigger," or "Different Manufacturer, Lower Resolution and Cheaper"), which were viewed as a compromise between the detail provided by value elicitation and the ease of feedback associated with preference-based methods (Smyth and McGinty 2003; McCarthy et al. 2005c).

With reference to the DynamicCritiquing interface, the critiquing aid can contain two sub-components: unit critiquing (on a single feature) and compound critiquing (on multiple features simultaneously), respectively termed UC and CC in what follows. Each sub-component can be in either a system-suggested or a user-initiated style. For example, the UC in FindMe (Burke et al. 1997) is system-suggested (e.g., "cheaper", "bigger"), whereas in DynamicCritiquing it is more user-initiated, since users can choose which feature to critique and how to critique it. The CC support in DynamicCritiquing, however, is purely system-suggested, because a limited set of compound critiques is proposed for users to select from (usually three suggestions, as shown in Fig. 2).

In the ExampleCritiquing interface, both UC and CC are supported in the user-initiated way. Specifically, the user can improve or compromise one feature at a time and leave the others unchanged (i.e., a unit critique), or combine any set of unit critiques into a compound critique.

Therefore, considering the degree of user control, the user-initiated method should allow for a higher level, given that control lies largely in the hands of users; in the system-suggested critiquing approach, by contrast, users can only "select" critiques, not "create" them. However, it is hard to assert which method would perform better in improving real users' decision performance and subjective attitudes.
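To make the distinction between unit and compound critiques, and between critiques the system suggests versus those the user composes, more concrete, the following Python sketch shows one possible way to represent and apply such critiques. It is a minimal illustration only: the Product type, class names, and example critiques are our own assumptions and do not correspond to the implementation of any of the systems discussed above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A product is modeled as a simple feature dictionary, e.g.
# {"price": 399, "resolution": 10.0, "zoom": 3.0, "manufacturer": "Canon"}.
Product = Dict[str, object]

@dataclass
class UnitCritique:
    """A critique on a single feature, e.g. 'cheaper' or 'at least $100 cheaper'."""
    feature: str
    predicate: Callable[[object, object], bool]  # compares candidate value vs. reference value

    def satisfied_by(self, candidate: Product, reference: Product) -> bool:
        return self.predicate(candidate[self.feature], reference[self.feature])

@dataclass
class CompoundCritique:
    """A conjunction of unit critiques on several features simultaneously."""
    parts: List[UnitCritique]

    def satisfied_by(self, candidate: Product, reference: Product) -> bool:
        return all(p.satisfied_by(candidate, reference) for p in self.parts)

def apply_critique(critique, reference: Product, catalog: List[Product]) -> List[Product]:
    """Return the catalog items that satisfy the critique relative to the reference item."""
    return [item for item in catalog if critique.satisfied_by(item, reference)]

# A quality-based unit critique: "find a similar camera, but cheaper".
cheaper = UnitCritique("price", lambda cand, ref: cand < ref)
# A quantity-based unit critique: "at least $100 cheaper".
much_cheaper = UnitCritique("price", lambda cand, ref: cand <= ref - 100)
# A compound critique combining two unit critiques.
cheaper_and_lower_res = CompoundCritique([
    cheaper,
    UnitCritique("resolution", lambda cand, ref: cand < ref),
])
```

Under such a representation, a system-suggested approach would mine critiques like cheaper_and_lower_res from the current item set and present a short list of them for the user to pick from, whereas a user-initiated approach lets the user assemble the component critiques herself.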
Table 1 summarizes all of the discussed variables, using DynamicCritiquing and ExampleCritiquing as examples to show their typical values.

Table 1 Summary of control variables in a critiquing-based recommender system and the main differences between DynamicCritiquing and ExampleCritiquing with respect to these aspects

                                              Critiquing coverage                  Critiquing aid
                                              NIR              NCR                 UC               CC
  DynamicCritiquing (McCarthy et al. 2005c)   Single item      Single item         User-initiated   System-suggested
  ExampleCritiquing (Chen and Pu 2006)        k items (k = 7)  k items (k = 7)     User-initiated   User-initiated

  NIR = number of initial recommendations; NCR = number of recommended items after each critiquing;
  UC = unit critiquing; CC = compound critiquing.

4 User evaluation framework

We have conducted a series of three user trials in order to understand the effect of these variables on users' actual decision behavior and subjective perceptions. The first trial was a comparative user study of the two typical applications, DynamicCritiquing and ExampleCritiquing, with the purpose of identifying which one would perform more effectively. In the second trial, we modified the two systems so that they differed on only one dimension, the critiquing aid, in order to observe that single element's influence. The third trial measured users' performance in a hybrid critiquing system in which the two types of critiquing aids, system-suggested and user-initiated, were combined on the same screen. By combining the results from these three trials, we expected to reveal the effects of the different independent variables on users' decision performance and quality.

Before carrying out these experiments, it was therefore necessary to first define the concrete dependent variables we were to measure. We have established an evaluation framework intended to cover all of the key criteria. In fact, identifying the appropriate criteria for evaluating the true benefits of a recommender system is a challenging issue. Related work has primarily focused on users' objective interaction effort, such as their interaction sessions (McCarthy et al. 2005a,b,c) and task completion time, while placing less emphasis on the actual decision accuracy users can eventually achieve and how much cognitive effort users perceive themselves to exert. In fact, the accuracy-effort model has long been studied in the domain of classical decision theories (Payne et al. 1993; Spiekermann and Parachiv 2002), and it has been broadly accepted that both factors are important in determining the fundamental user benefits of a decision support tool, since the system's ideal goal should be to enable its users to obtain a high level of decision accuracy with a low amount of effort (Häubl and Trifts 2000).

In addition, a recommender system's ability to increase user trust and to convince users of its recommendations, such as which camera to purchase, is also a crucial factor, and is particularly meaningful when the system is applied in the e-commerce environment. Two main trust-inspired behavioral intentions (called trusting intentions) are intention to purchase, indicating whether the system could stimulate its users to purchase a product, and intention to return, referring to whether the system could prompt users to return to it for future use so that a long-term relationship is established (Grabner-Kräuter and Kaluscha 2003).

Motivated by these requirements, we have classified the measures into three categories of dependent variables in our evaluation framework: decision accuracy, decision effort, and trusting intentions (see Fig. 4).
Fig. 4 User evaluation framework for critiquing-based recommender systems: decision accuracy (objective accuracy, perceived accuracy), decision effort (objective effort, perceived effort), and trusting intentions (intention to purchase, intention to return)

4.1 Decision accuracy

The foremost criterion for evaluating a recommender system should be the decision accuracy it enables users to eventually achieve. If a user can target her ideal choice with the system, it means that the system assisted her in reaching 100% decision accuracy. In our experiments, we not only measured the objective accuracy that a participant may obtain, but also her subjectively perceived accuracy (i.e., confidence in choice).

Objective accuracy

The objective accuracy was quantitatively measured by the fraction of participants who switched to a different, better option than the one chosen with the system, when they were asked to view all alternatives in the database. This procedure is known as the switching task, and it has been applied in practice by researchers in marketing science to measure consumers' decision quality (Häubl and Trifts 2000). A lower switching fraction means that the system supports higher decision accuracy, since most users stood by the choice they made with it. Conversely, a higher switching fraction implies that the recommender is not very capable of guiding users to locate what they truly want. For expensive products, inaccurate tools may cause both financial damage and emotional burden to a decision maker.

Perceived accuracy

Besides objective accuracy, we also measured the degree of accuracy users subjectively perceived while using the system, also called decision confidence (Pu and Kumar 2004). The confidence judgment may potentially impact users' perception of the system's competence and even their intention to purchase the chosen product. This variable was quantitatively assessed by asking subjects to respond to a statement (e.g., "I am confident that the product I just 'purchased' is really the best choice for me") on a 5-point Likert scale ranging from "strongly disagree" to "strongly agree" (see Table 2).
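As a concrete illustration of how the two accuracy measures described in this subsection could be computed from raw study data, here is a minimal Python sketch; the record layout, field names, and sample values are hypothetical and serve the example only.

```python
from statistics import mean

# Hypothetical per-participant records: the choice made with the system,
# the choice made after inspecting all alternatives (the switching task),
# and the confidence rating on a 5-point Likert scale (1-5).
participants = [
    {"chosen_with_system": "camera_A", "final_choice": "camera_A", "confidence": 4},
    {"chosen_with_system": "camera_B", "final_choice": "camera_C", "confidence": 3},
    {"chosen_with_system": "camera_D", "final_choice": "camera_D", "confidence": 5},
]

def switching_fraction(records):
    """Objective accuracy: fraction of participants who switched to a different
    option after viewing all alternatives (lower is better)."""
    switched = sum(1 for r in records if r["final_choice"] != r["chosen_with_system"])
    return switched / len(records)

def mean_confidence(records):
    """Perceived accuracy: average agreement with the confidence statement (higher is better)."""
    return mean(r["confidence"] for r in records)

print(switching_fraction(participants))  # approx. 0.33: one of three participants switched
print(mean_confidence(participants))     # 4.0
```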
Table 2 Questions to measure subjective perceptions (each responded to on a 5-point Likert scale from "strongly disagree" to "strongly agree")

  Perceived decision accuracy:
    I am confident that the product I just "purchased" is really the best choice for me.
  Perceived effort:
    I easily found the information I was looking for.
    Looking for a product using this interface required too much effort (reverse scale).
  Intention to purchase:
    I would purchase the product I just chose if given the opportunity.
  Intention to return:
    If I had to search for a product online in the future and an interface like this was available, I would be very likely to use it.
    I don't like this interface, so I would not use it again (reverse scale).

4.2 Decision effort

According to the accuracy-effort framework (Payne et al. 1993), another important criterion is the amount of decision effort users expended in making their choice with the system. As with decision accuracy, we not only measured how much objective effort users actually consumed, but also their perceived cognitive effort, which we hoped would indicate the amount of subjective effort people exerted.

Objective effort

The objective effort includes two dimensions: task time and interaction effort. The task time is the total time a subject spent from the moment she started using the system until she made her final choice. The interaction effort mainly considers the number of interaction cycles (e.g., critiquing cycles) a user was involved in. These two variables have been widely used as the main measurements in related work evaluating recommender systems (McCarthy et al. 2005b,c).

Perceived effort

Perceived effort refers to the psychological cognitive cost of information processing. It represents the ease with which the subject can perform the task of obtaining and processing the relevant information in order to arrive at her decision. Since it is a subjective variable, two unified scale items (e.g., "I easily found the information I was looking for") were used to quantify its value (see Table 2 for the concrete questions).

4.3 Accuracy and effort

The objective and subjective assessments of both decision accuracy and decision effort can not only show their respective values, but also allow us to understand how the two concepts are interrelated.
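The "reverse scale" annotation in Table 2 means that disagreement with the item indicates a favorable perception, so its score must be inverted before averaging. The following is a small Python sketch of how the two perceived-effort items might be combined; the function names, sample responses, and the objective-effort record are hypothetical illustrations, not the scoring procedure of the original studies.

```python
from statistics import mean

LIKERT_MAX = 5  # 5-point scale: 1 = "strongly disagree", 5 = "strongly agree"

def reverse_score(response: int) -> int:
    """Invert a reverse-scaled item on a 1-5 scale (5 becomes 1, 4 becomes 2, and so on)."""
    return LIKERT_MAX + 1 - response

def perceived_effort_score(easy_found: int, too_much_effort: int) -> float:
    """Combine the two perceived-effort items from Table 2 into one score.
    The second item is reverse-scaled, so a higher combined score means
    the interface was perceived as requiring less effort."""
    return mean([easy_found, reverse_score(too_much_effort)])

# One hypothetical participant: agrees (4) that information was easy to find,
# disagrees (2) that the interface required too much effort.
print(perceived_effort_score(easy_found=4, too_much_effort=2))  # prints 4 (low perceived effort)

# Objective effort for the same participant could simply be logged as:
objective_effort = {"task_time_seconds": 312, "critiquing_cycles": 5}
```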