Proceedings of the 40th Hawaii International Conference on System Sciences - 2007
1530-1605/07 $20.00 © 2007 IEEE
Supporting distributed scientific collaboration: Implications for designing the CiteSeer collaboratory

Umer Farooq*, Craig H. Ganoe*, John M. Carroll*, and C. Lee Giles+
* Computer Supported Collaboration and Learning Lab, Center for HCI
+ The Intelligent Systems Research Lab
College of Information Sciences and Technology
The Pennsylvania State University, University Park, PA 16802 USA
{ufarooq, cganoe, jcarroll, giles}@ist.psu.edu

Abstract

It is unclear if and how collaboratories have enhanced distributed scientific collaboration. Furthermore, little is known in the way of design strategies to support such collaboration. Based on a survey and follow-up interviews with CiteSeer users, we present four novel implications for designing the CiteSeer collaboratory. First, visualize query-based social networks to identify scholarly communities of interest. Second, provide online collaborative tool support for upstream stages of scientific collaboration. Third, support activity awareness for staying cognizant of online scientific activities. Fourth, use notification systems to convey scientific activity awareness.

1. Introduction

Scientific communities have traditionally formed around key intellectual resources such as collections of books, or special equipment such as cyclotrons [24]. In the past, one of the greatest obstacles to the formation and sustained vitality of scientific communities was the fact that members had to be co-located with their shared resources and with one another.

Today, face-to-face scientific collaboration is increasingly being augmented by online interactions. Collaboratories (laboratories for collaboration) enable large-scale scientific endeavors through Internet technologies. Through such environments, scientists can share key intellectual resources that allow colleagues located anywhere to access, view, manipulate, and have discussions about these artifacts [8, 16].

Although collaboratories have the potential to enhance distributed scientific collaboration, there is little empirical evidence that they do. This can be partially attributed to the fact that most collaboratories have been built as one-off, handcrafted projects and have thus been accepted as the status quo [7, 20].

Furthermore, little is known in the way of design strategies to support distributed scientific collaboration. This is because only a few collaboratories have been evaluated from this angle (e.g., [19, 21]), resulting in just a handful of basic design issues and heuristics related to general collaborative experiences in collaboratories (e.g., [7, 19]). In this paper, we present more specific design strategies geared toward scientific activities in distributed collaboration.

We report the first phase (requirements) of our research investigation to design a collaboratory around an existing, large-scale digital library of scientific literature in computing, namely CiteSeer (http://citeseer.ist.psu.edu). Based on a survey and follow-up interviews with CiteSeer users, we present four implications for design in order to support distributed scientific collaboration. These implications are novel in that they extend current literature in the domains of Human Computer Interaction (HCI) and Computer Supported Cooperative Work (CSCW).

2. Related work

The US, through its National Science Foundation (NSF), has been involved in collaboratory initiatives. A collaboratory is a “center without walls, in which the nation’s researchers can perform their research without regard to geographical location--interacting with colleagues, accessing instrumentation, sharing data and computational resource, and accessing information in digital libraries” [26]. The challenges and opportunities in creating collaboratories and their interfaces relate directly to many aspects of HCI and CSCW.
As a result of collaboratory development and HCI/CSCW research converging, a special issue of ACM Interactions was published in 1998, comprising four key articles that offered an in-depth look at collaboratories. An online list of collaboratories is also available [20].

Because the literature on collaboratories is so vast and scattered, it is appropriate to summarize the major
findings from prior work. In 2002, Finholt [7] wrote a retrospective article in which he outlined a number of design issues related to collaboratory development. More recently in 2006, Olson and colleagues [19] attempted to propose a theory of remote collaboration based largely on their experience with collaboratory development. These two articles represent the state-of-the-art in designing collaboratories to support distributed scientific collaboration. We have codified the major findings from these sources in Table 1.

Table 1. Major findings from prior work.

| Summary | Source |
| Tightly coupled tasks require co-location. | [18] |
| Collaboration “readiness” and “technology readiness” are essential factors for success. | [17] |
| Status misalignment can hamper communication between scientists. | [7] |
| Lack of common ground and trust can hinder collaboration. | [19] |
| Management, planning, and decision-making are critical processes to provide support for. | [19] |
| Allow flexibility to users for identifying new uses or functionality of tools. | [19] |

3. Background of CiteSeer

Our study context is CiteSeer [9]: a search engine and digital library of literature in the computer and information science (CIS) disciplines that is a free resource providing access to the full-text of nearly 700,000 academic papers, and over 10 million citations. CiteSeer currently receives over half a million hits a day and is accessed by 150 countries and 200,000 unique machines monthly.

It is traditional practice in the CIS scientific community to make research documents available at the time they are first written through technical reports series managed by various laboratories. More recently, this practice has been transferred to the Web. CiteSeer actively and automatically harvests these documents and builds searchable and indexable collections, promoting creative scientific discovery and reuse.
Even though search engines such as Google actively index CiteSeer, users come to CiteSeer for information such as citation counts and domain dependent citation links not provided by Google or Google Scholar.

4. Methods

Because CiteSeer has a large number of globally distributed users, we chose to administer an online survey. Broadly, we wanted to gain insight into the kinds of activities CiteSeer users would like to collaborate on and possible socio-technical issues during such collaboration.

4.1. Recruitment and participants

The survey was made available on CiteSeer’s web site. Thus, this was an opportunity sample: participants were CiteSeer users willing to take the survey. No compensation was provided to survey respondents. In this paper, we report results from the survey as administered over two weeks (November 17-30, 2005). A total of 301 CiteSeer users responded to the survey.

4.2. Survey design

We report our results based on 23 survey questions organized into three broad sections. (1) Professional interaction: seven questions related to how users would like to collaborate with others and what issues they might face. (2) CiteSeer use: seven questions related to CiteSeer usage behavior. (3) Background information: nine questions related to demographics of CiteSeer users.

The questions were predominantly a mix of selection among pre-defined categories (e.g., age ranges) and ratings on 7-point Likert scales (e.g., engagement in a specific activity on a scale of “Never” to “Very often”); few free-text opportunities were provided (e.g., academic background). Based on pilot testing, the survey required 10-15 minutes to complete.

4.3. Data collection and analysis

Most survey questions solicited numerical responses. Analysis of this data was done using SPSS. Because we included multi-part questions in the survey, it was important to check the reliability of the scales.
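
A common way to check the internal consistency of a multi-item scale is Cronbach's alpha. The following is a minimal sketch in Python of how such a coefficient can be computed from raw ratings. It is not the SPSS procedure used in the study; the function name `cronbach_alpha` and the toy ratings are assumptions made purely for illustration.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a multi-item scale.

    responses: list of rows, one row per respondent,
    each row holding one numeric rating per item.
    """
    k = len(responses[0])                      # number of items in the scale
    items = list(zip(*responses))              # column view: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Made-up ratings from three respondents on a two-item scale.
toy = [[1, 2], [2, 1], [3, 3]]
print(round(cronbach_alpha(toy), 2))  # 0.67
```

Values above roughly 0.7 are conventionally read as acceptable internal consistency, which is the threshold referred to in this section.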
The scales on the multi-part questions had good internal consistency, with all Cronbach alpha coefficients reported above 0.7.

We wanted to probe user responses in more detail in order to complement the quantitative data. The second to last question asked for any type of qualitative feedback from participants (e.g., related to CiteSeer); 94 participants responded. The last survey question asked participants if they were willing to be interviewed via email.

We contacted 66 of these participants and got 22 responses. We asked four questions in the email interview: (1) Which criteria would you find most important for collaborating with CiteSeer users? (2) Which online collaborative activities would be most valuable to you? (3) Which activities would you like to
stay most aware of? (4) What would be the best way for you to stay aware of these activities?

5. Survey results

Before reporting the results, we characterize the survey respondents with respect to their demographics and patterns of CiteSeer usage.

Of the responses we received, 42% were graduate students. Males (89%) outnumbered females. More than half the respondents (52%) were in the age range of 21-30 years.

42% of the respondents had a master’s degree. The sample as a whole was relatively highly educated, with 32% having a doctorate degree. Because CiteSeer is a digital library primarily for the CIS disciplines, it was consistent that 79% of the respondents had at least a computer science background.

The survey respondents represented a relatively core group of CiteSeer users. Their mean (M) use of CiteSeer was 3.7 years (SD=1.7). Almost half (45%) had downloaded more than 100 papers from CiteSeer. 40% said they use CiteSeer once or twice per week.

We present the results under the following three themes: (1) Potential collaborators; (2) Online collaborative activities; and (3) Awareness issues.

5.1. Potential collaborators

We wanted to understand with whom CiteSeer users would collaborate online. Participants were asked to rate how often they would like to interact remotely with others on a scale of 1 (Never) to 7 (Very often) based on six items: (1) Who are looking for similar types of papers as I am; (2) Who read my papers; (3) Whose papers I read; (4) Who cite my papers; (5) Whom I cite in my papers; (6) Who cite similar papers as I do.

The six items were rated relatively high with all means above 4 (Sometimes): 4.76, 5.03, 5.10, 5.00, 4.97, and 4.65 respectively. Because the quantitative data is inconclusive, it is unclear which of the six criteria will be most useful to match potential collaborators. Qualitative data prioritizes some of these criteria.
For example, people are likely to collaborate with those who look for similar papers and read each other’s papers. Reading similar papers is an indicator of people working in the same area, as one respondent suggests:

“Important criteria: users who are reading the same and similar papers as me. Since we are reading the same papers, we are working in the exact same sub-area.”

It seems plausible that someone who looks for similar papers as another person also cites similar papers. In this case, potential collaborators can share common ideas that focus on the papers they look for or cite. One interview respondent expressed this view:

“[I want to] collaborate with CiteSeer users who are looking for similar papers as [me] and who cite similar papers as [I do]…the reason is I can save more time to find a good paper worth reading and can touch more ideas in my research area by collaboration.”

A concern in matching people based on readings or citations is the use of personal, sensitive information. Surprisingly, no one indicated that using personal information would be an issue. On the contrary, one interview respondent suggested that people’s web sites could be used to identify potential collaborators:

“[For] connecting users with common interests…focus on researchers’ home pages, because almost everyone I have seen from academia gives a links page...”

One interview respondent provided an insight into how matching potential collaborators can also facilitate opportunistic collaboration outside of one’s research area and expertise:

“[An] important aspect to collaboration is to facilitate ‘serendipitous’ interaction. As it is said, it’s not what you don’t know, it’s what you don’t know that you don’t know. This is closely related to the discovery of cross domain knowledge and expertise.”

The quantitative part of the survey did not probe users about the representation of social matching.
As indicated by many survey respondents, social networks are appropriate for depicting meaningful social structures in CiteSeer:

“I think it would be great if I could get a CiteSeer page with a ‘network’ diagram…and ‘related’ strong links and more remote links clearly shown.”

5.2. Online collaborative activities

We wanted to know what kinds of collaborative activities CiteSeer users would like supported. Participants were asked to rate how often they currently interact with others on a scale of 1 (Never) to 7 (Very often) based on four items: (1) Strengthen social connections; (2) Brainstorm new ideas; (3) Plan joint projects; (4) Write joint papers.

In general, respondents rated all items moderately high with all means above 4 (Sometimes): 4.28, 4.71, 4.32, and 4.06 respectively. Participants were also asked how difficult they would find these activities to achieve remotely on a scale of 1 (Very easy) to 7 (Very difficult). Responses indicate that CiteSeer users found these distributed collaborative activities to be on the difficult side of neutral (4), with respective means of 4.40, 4.36, 4.53, and 4.47.

One interpretation of these results is that CiteSeer users moderately engage in these types of collaborative
activities. However, remote collaboration is perceived as somewhat difficult.

Qualitative data elaborates on the kinds of online activities that CiteSeer users would like supported and gives reasons for not supporting other activities that they perceive as difficult. Overwhelmingly, online discussion forums were the most popular type of distributed collaborative activity, as indicated by one of many respondents:

“I’d love to participate in forums or discussions about my field, to see what is going on, and what other people think.”

Discussions can also be a valuable source for new ideas. The following interview respondent indicated that discussions can enable brainstorming:

“[I would be interested in] brainstorming new ideas related to online discussions.”

Given that CiteSeer users collaborate with others in collaborative planning and writing endeavors, these activities should be supported online. However, our interview respondents are not inclined to use such collaborative features. One interview respondent said:

“Writing new papers and planning projects don’t seem like activities people would actually do through a science portal.”

This respondent’s view was corroborated by others who thought that current ways (e.g., email) of achieving such joint endeavors would suffice:

“I think the online discussions and brainstorming could be useful. For paper writing and project planning, I’d imagine that the team would be cohesive and we’d just use email or a wiki to coordinate.”

Trust and privacy are obvious factors that hamper distributed collaboration. One respondent said:

“Collaboration is based on mutual trust, and it cannot be gained easily via an Internet site.
Also, the question of privacy comes to my mind: one would not be willing to share his preliminary ideas to an unknown audience.”

The difficulty of establishing trust and privacy is exacerbated when potentially valuable ideas, which form the basis of scientific discovery, cannot be shared due to institutional constraints, or are shared and unethically misused. For example, legal issues can hinder distributed collaboration, as indicated by the following interview respondent:

“Some people will, no doubt, wish to be ‘silent participants’ [in online collaboration] due to legal intellectual property issues.”

5.3. Awareness issues

We wanted to understand awareness issues in online collaboration and general CiteSeer use. Participants were asked to rate their level of agreement on how difficult they find it to stay aware of CiteSeer resources on a scale of 1 (Strongly disagree) to 7 (Strongly agree) based on four items: (1) Recent papers published in my area; (2) Who reads my papers; (3) New colleagues who are working in my area; (4) Who cites my papers. Results suggest that staying aware was generally difficult, as at least 50% of all respondents rated all items toward the agreement side of the scale.

A one-way within-subjects ANOVA was conducted with the awareness resource as the independent variable with four levels (the response items) and level of difficulty (rating from 1 to 7) as the dependent variable. The Levene test was significant at the 0.001 level, so the assumption of homogeneity of variance was violated. Therefore, both Brown-Forsythe and Welch F-ratios are reported. The ANOVA was significant, with F(3, 594.44) = 22.68 (p<.0005) and F(3, 1057.04) = 22.08 (p<.0005) respectively. We computed a contrast test between the first item (recent papers published in my area) and the other three items combined. Results indicate that the first item was rated significantly lower, with F(1, 472.07) = 37.27 (p<.0005).
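To make the Welch F-ratio reported above more concrete, the following is a minimal sketch of how such a statistic can be computed when group variances are unequal. It treats the four items as independent groups, which is what the Welch statistic assumes; the function name `welch_anova` and the ratings are hypothetical illustrations and are not the survey data. Only the Python standard library is used, so no p-value is computed.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA.

    `groups` is a list of lists of numeric ratings, one list per item.
    Each group needs at least two values and a non-zero variance.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Weight each group by n / s^2 so that noisier groups count for less.
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total_w = sum(weights)
    grand_mean = sum(w * m for w, m in zip(weights, means)) / total_w

    # Weighted between-group variability.
    between = sum(w * (m - grand_mean) ** 2
                  for w, m in zip(weights, means)) / (k - 1)
    # Correction term that also determines the denominator degrees of freedom.
    lam = sum((1 - w / total_w) ** 2 / (n - 1)
              for w, n in zip(weights, sizes))
    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2


# Made-up 1-7 ratings for the four awareness items (NOT the survey data).
ratings = [
    [3, 4, 2, 5, 3, 4],  # (1) recent papers published in my area
    [6, 5, 7, 6, 4, 6],  # (2) who reads my papers
    [5, 6, 4, 6, 5, 7],  # (3) new colleagues working in my area
    [5, 4, 6, 7, 5, 6],  # (4) who cites my papers
]
f_stat, df1, df2 = welch_anova(ratings)
print(f"Welch F({df1}, {df2:.2f}) = {f_stat:.2f}")
```

The fractional denominator degrees of freedom produced by this correction are the reason the reported F-ratios carry non-integer values such as 1057.04.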
Thus, CiteSeer users find it less difficult to stay aware of recently published papers in their area, perhaps because this is done traditionally (through subscriptions to journals and conference attendance).

Although our quantitative questions only asked about the difficulty in staying aware, qualitative data suggests that awareness of CiteSeer resources and activities of CiteSeer users around those resources is important. An interview respondent said:

“[The most interesting awareness feature is] providing statistics on your own papers (readers, citations).”

Staying aware of new colleagues in one’s research area is also important to keep abreast of potential collaborators, their activities, and their research focus. An interview respondent said:

“I’d like to know who has started a new discussion thread related to my area of interest, because I want to be aware what is going on outside my lab, and what other researchers are thinking or focusing on.”

Qualitative data also suggests that mining historical activities in CiteSeer to provide influence patterns and impact assessment of intellectual resources can enrich awareness information. An interview respondent indicated the relevance of history for awareness and how it can also inform future impact of a discipline:

“It’s always important to be aware of new research efforts starting up that are synergistic or disruptive relative to your own. You might consider online ‘analytics’ that give people some idea of where activity is centered and where it is going…It could tell you if, for example, interest in a discipline is ‘dying down’ or ‘ramping up’.”

In addition to historical information, supporting awareness of future activities is important to stay cognizant of current information. For instance,
CiteSeer users want to be notified when a specific event has taken place, as indicated below:

“I would find it more important to know when a paper was entered into CiteSeer that cited one of my papers; that would be a strong signal that I might have interest in it.”

An important facet of awareness is how it will be conveyed. Many interview respondents indicated the usefulness of Really Simple Syndication (RSS) feeds:

“[I] definitely [want] RSS: it isn’t intrusive (I get information when I want), information can be easily [and] automatically processed, [and] I can get information in whatever way I want (as emails, in my aggregator, in my browser, …).”

In addition to how awareness information can be conveyed, respondents indicated different types of information they would like to stay aware of. One respondent wanted to know about “hot topics” (implying popular topics) being discussed in forums. In another example, a respondent was interested in papers for a specified area of interest (e.g., using keywords) or those that cite his/her work:

“Features that would be useful are alerts when new articles are posted that either contain keywords or cite work I am interested in to keep abreast of what’s new in my field.”

Even though there are traditional ways of staying aware of new papers, using features to refine such awareness (e.g., through keywords) seems desirable.

6. Implications for design

Several of our results suggest specific strategies to support distributed scientific collaboration. The four implications for design are the following. (1) Visualize query-based social networks to identify scholarly communities of interest. (2) Provide online collaborative tool support for upstream stages of scientific collaboration. (3) Support activity awareness to stay cognizant of online, asynchronous, and long-term scientific activities. (4) Use notification systems to convey scientific activity awareness peripheral to users’ primary task.
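As one illustration of implication (4), the keyword and citation alerts that respondents asked for in Section 5.3 could be matched and rendered as an RSS item roughly as follows. This is a sketch under stated assumptions, not a description of CiteSeer itself: the record layout, the function names `match_alerts` and `to_rss`, and the placeholder URL are all hypothetical.

```python
import xml.etree.ElementTree as ET


def match_alerts(paper, subscriptions):
    """Return {user: [reasons]} for users whose alert criteria the paper meets."""
    text = (paper["title"] + " " + paper["abstract"]).lower()
    alerts = {}
    for user, sub in subscriptions.items():
        # Keyword alerts: a subscribed phrase appears in the title or abstract.
        reasons = [f"keyword: {kw}" for kw in sub["keywords"]
                   if kw.lower() in text]
        # Citation alerts: the new paper cites one of the user's own papers.
        reasons += [f"cites: {pid}" for pid in paper["cites"]
                    if pid in sub["own_papers"]]
        if reasons:
            alerts[user] = reasons
    return alerts


def to_rss(user, paper, reasons):
    """Render one alert as a minimal RSS 2.0 document (returned as a string)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Alerts for {user}"
    ET.SubElement(channel, "link").text = "http://example.org/alerts"  # placeholder
    ET.SubElement(channel, "description").text = "New papers matching your interests"
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = paper["title"]
    ET.SubElement(item, "description").text = "; ".join(reasons)
    return ET.tostring(rss, encoding="unicode")


# Hypothetical paper record and user subscriptions.
paper = {
    "title": "Awareness in digital libraries",
    "abstract": "We study activity awareness.",
    "cites": ["p42"],
}
subscriptions = {
    "alice": {"keywords": ["activity awareness"], "own_papers": {"p7"}},
    "bob": {"keywords": ["compilers"], "own_papers": {"p42"}},
    "carol": {"keywords": ["databases"], "own_papers": set()},
}
alerts = match_alerts(paper, subscriptions)
for user, reasons in alerts.items():
    print(to_rss(user, paper, reasons))
```

Because the feed is pulled by the reader's aggregator rather than pushed, this style of delivery matches the respondents' preference for notification that stays peripheral to their primary task.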
The implications are motivated by design rationale based on survey results and related HCI/CSCW literature. Design envisionment scenarios, conceptual schemas, and prototype screenshots are used to illustrate the implications for design.

6.1. Visualize query-based social networks

In regard to matching potential collaborators, survey results support existing claims. Literature from social psychology asserts that people are attracted to “similar others” [23, p. 416], with similarity in interests being one facet of this. In CiteSeer, identifying users with similar interests can be based on multiple criteria, such as mutual reading of papers, citations, and similar search behavior. Similar search behavior seems to be a feasible candidate among these choices for at least three reasons.

First, CiteSeer can easily keep track of users’ search behavior by storing and mining a history of user queries. CiteSeer queries (typically noun phrases such as “user-centered design”) essentially filter the space of available resources into specialized views. These views can be thought of as research investigations, research areas, or even sub-disciplines. Many queries are in effect reused in the sense that someone else entered that query, or one like it, before. Comparing these queries with similarity measures can provide social matching heuristics for users.

Second, search queries are universal. For example, social matching based on citations may not apply to all users, as not everyone has a critical mass of cited papers (e.g., graduate students).

Third, queries accurately convey first-hand information about a user’s interests. Queries that accumulate over time on the same topic can indicate a strong interest in that topic. Of course, two users submitting similar queries do not necessarily want to collaborate, but the chance that collaboration would be attractive at some level is higher than for individuals with totally different interests.
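The idea of comparing stored queries with a similarity measure can be sketched as follows. The choice of Jaccard similarity over normalized query strings is an assumption made for illustration, since the text does not commit to a particular measure; the user names, query histories, and function names are likewise hypothetical.

```python
from itertools import combinations


def normalize(query):
    """Lowercase a query and collapse whitespace so trivial variants match."""
    return " ".join(query.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_matches(query_log, min_score=0.0):
    """Return (user, user, score) triples sorted by descending similarity."""
    profiles = {u: {normalize(q) for q in qs} for u, qs in query_log.items()}
    pairs = [
        (u, v, jaccard(profiles[u], profiles[v]))
        for u, v in combinations(sorted(profiles), 2)
    ]
    # Keep only pairs that share at least some query overlap.
    return sorted((p for p in pairs if p[2] > min_score), key=lambda p: -p[2])


# Hypothetical query histories, one list of past queries per user.
query_log = {
    "alice": ["user-centered design", "activity awareness", "Social Networks"],
    "bob": ["activity awareness", "social networks", "notification systems"],
    "carol": ["compiler optimization"],
}
for u, v, score in rank_matches(query_log):
    print(u, v, round(score, 2))
```

Exact string matching after normalization is the simplest possible choice; matching queries that are merely "like" one another, as the text suggests, would need a softer measure such as token overlap between individual queries.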
Scholarly communities and sub-communities can form around queries, just as they have traditionally formed around shared resources. Providing a virtual place for scientists with common query interests to share perspectives, related and updated information and links, and so forth would enrich these queries for everyone, and help scholarship and scholarly communities of interest or practice to form and develop [25]. These scholarly communities could be codified through social network analysis where shared queries are the primary basis for links among persons in the network. Query-based social networks would connect persons more or less directly, depending on how many queries they shared, and how they were connected to others in the network. We might expect interesting community phenomena to emerge from such networks. For example, the network could foster scientific collaboration, not just between members within a particular scientific group, but also between weak ties [10], scholars principally belonging to different groups who are connected through others. This can help CiteSeer users to identify new colleagues and potential collaborators more easily.

Social structures can also be used to discover and reinforce cross-community bridges. Bridges are, at the most basic level, members of two or more distinct community organizations [14]. In a scientific