The Lane's Gifts v. Google Report

Alexander Tuzhilin

Table of Contents

1. Dr. Tuzhilin's Background
2. Materials Reviewed
3. Google Personnel Interviewed
4. Development of the Internet
5. Growth of Search Engines and Google's History
6. Development of the Pay-per-Click Advertising Model
7. Google's Pay-per-Click Advertising Model
8. Invalid Clicks and Google's Definition
9. Google's Approach to Detecting Invalid Clicks
10. Conclusions

Executive Summary

I have been asked to evaluate Google's invalid click detection efforts and to conclude whether these efforts are reasonable or not. As a part of this evaluation, I have visited Google's campus three times, examined various internal documents, interviewed several Google employees, seen different demos of their invalid click inspection system, and examined internal reports and charts showing various aspects of the performance of Google's invalid click detection system. Based on all these studied materials and the information narrated to me by Google's employees, I conclude that Google's efforts to combat click fraud are reasonable. In the rest of this report, I elaborate on this point.

1. Dr. Tuzhilin's Background

I have recently been appointed as a Professor of Information Systems at the Stern School of Business at New York University (NYU), having previously served as an Associate Professor at the Stern School. I received my Ph.D. in Computer Science from the Courant Institute of Mathematical Sciences, NYU, in 1989, an M.S. in Engineering Economics from the School of Engineering at Stanford University in 1981, and a B.A. in Mathematics from NYU in 1980. My current research interests include knowledge discovery in databases (data mining), personalization, Customer Relationship Management (CRM) and Internet marketing.
My prior research was done in the areas of temporal databases, query-driven simulations and the development of specification languages for modeling business processes. I have co-authored over 70 papers on these topics, published in major Computer Science and Information Systems journals, conferences and other outlets. I currently serve on the Editorial Boards of the IEEE Transactions on Knowledge and Data Engineering, the Data Mining and Knowledge Discovery Journal, the INFORMS Journal on Computing, and the Electronic Commerce Research Journal. I have co-chaired the Program Committees of the IEEE International Conference on Data Mining (ICDM) in 2003 and of the 2005 International Workshop on Customer Relationship Management, which brought together researchers from the data mining and marketing communities to explore and promote an interdisciplinary focus on CRM. I have also served on numerous program and organizing committees of major conferences in the fields of Data Mining and Information Systems, and have held visiting academic appointments at the Wharton School of the University of Pennsylvania, the Computer Science Department of Columbia University, and the Ecole Nationale Superieure des Telecommunications in Paris, France.

On the industrial side, I worked as a developer at Information Builders, Inc. in New York for two years and have consulted for various companies, including Lucent's Bell Laboratories on a data mining project and Click Forensics on a click fraud detection project. Additional information about my background can be found in my CV in the Appendix.

2. Materials Reviewed

During this project, I reviewed the following materials:

1. Internal documents provided to me by Google, including the following:

• Type of data collected and statistics/signals used for the detection of invalid clicks
• Description of the filtering methods
• Description of the log generation and log transformation/aggregation system used for the analysis and detection of invalid clicks
• Description of the AdSense auto-termination system
• Description of the duplicate AdSense account detection system
• Description of the ad conversion system
• Description of the AdSense publisher investigation, flagging and termination systems
• Description of various Click Quality investigative processes, including the rules on when and how to terminate publishers
• Description of the advertiser credit processes and systems
• Description of the inquiry handling processes and guidelines
• Description of the attack simulation system
• Description of the alerting system
• History of the double-clicking action
• Overview of the Click Quality team's high-priority projects
• Investigative reports generated by three different inspection systems that investigated three different cases of invalid clicking activities. One was an attack on an advertiser by an automated system, another was an attack on a publisher by an automated system, and the third was a general investigation of certain suspicious clicking activities. These reports were generated as a part of giving me demos of how Google's inspection systems work and how manual offline investigations are typically conducted by Google personnel
• Different internal reports and charts showing various aspects of the performance of Google's invalid click detection systems

2. Demos of various invalid click detection and inspection systems developed by the Click Quality team. Of course, these demos were provided only for the Click Quality systems that can be demoed (e.g., those that have appropriate User Interfaces).

3. Interviews with Google personnel, as described in the next section.

This report is based on this reviewed information and on the information narrated to me by Google personnel during the interviews.

3. Google Personnel Interviewed

All the invalid click detection activities are performed by the Click Quality team at Google. The Click Quality team consists of the following two subgroups:

• Engineering: Responsible for the design and development of online filters and other invalid click detection software. It consists primarily of engineers and currently has about a dozen staff members.
• Spam Operations: Responsible primarily for the offline operations and inspections of invalid clicking activities, including investigations of customers' inquiries. The group currently has about two dozen staff members.

In addition, several other groups at Google, including Web spam, Ads quality, Publications quality and others, interact with the Click Quality team and provide their expertise on issues related to invalid clicks (e.g., Web spam and click fraud have some issues in common). Overall, the Click Quality team can draw upon the knowledge and expertise of a few dozen other people on these teams, whenever required.

The two groups, although located in different parts of the Google campus, interact closely with each other.
In addition, the Product Manager of the Trust and Safety Group works closely with the Click Quality team on the more business-oriented and public-relations issues pertaining to invalid click detection.

During this project, I visited the Google campus three times and interviewed over a dozen Click Quality team members from the Spam Operations and Engineering groups, as well as the Product Manager of the Trust and Safety Group. I found the members of both groups to be well-qualified and highly competent to perform their jobs. Most of them have relevant prior backgrounds and strong credentials.

Before focusing on the Pay-per-Click advertising model and Google's efforts to combat invalid clicks, I first provide some background material on the Internet and the growth of search engines to put these main topics into perspective.

4. Development of the Internet

The Internet is a worldwide system of interconnected computer networks that transmit data using the packet switching methods of the Internet Protocol (IP). Computing devices attached to the Internet can exchange data of various types, from emails to text documents to video and audio files, over the pathways connecting computer networks. These data are partitioned into pieces, called packets, by the Internet Protocol and travel over the pathways in a flexible manner determined by routers and other devices controlling the Internet traffic. The packets are assembled back in the proper order at the destination site using the well-developed principles of the Internet Protocol.
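The reassembly idea can be illustrated with a short sketch. The following Python fragment is a toy model only, under the simplifying assumption that each packet carries just a sequence number and a payload; it is not the actual Internet Protocol.

    # Illustrative sketch of packet splitting and in-order reassembly.
    # A toy model of the idea, not the real Internet Protocol.
    import random

    def split_into_packets(message: bytes, packet_size: int) -> list[tuple[int, bytes]]:
        """Partition a message into (sequence_number, payload) packets."""
        return [
            (seq, message[offset:offset + packet_size])
            for seq, offset in enumerate(range(0, len(message), packet_size))
        ]

    def reassemble(packets: list[tuple[int, bytes]]) -> bytes:
        """Restore the original message by ordering packets by sequence number."""
        return b"".join(payload for _, payload in sorted(packets))

    message = b"A document travelling across the network as packets."
    packets = split_into_packets(message, packet_size=8)
    random.shuffle(packets)                # packets may arrive in any order
    assert reassemble(packets) == message  # the receiver restores the original

Because every packet is numbered, the order in which packets happen to arrive does not matter to the receiver.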
The Internet itself has a long history. Its predecessor (called the ARPANET) was developed in the late 1960s and early 1970s. The first wide-area Internet network was operational by January 1983, when the National Science Foundation constructed a network connecting various universities. The Internet was opened to commercial interests in 1985.

Prior to the 1990s, the Internet was used predominantly by people with strong technical skills, because most Internet applications of that era required such skills and relatively few people had them. This situation changed dramatically, and the Internet became much more accessible to the general public, after the invention of the World Wide Web (WWW) by Tim Berners-Lee in 1989.

The WWW is a globally connected network of Web servers and browsers that allows different types of Web pages and other documents, containing text, images, audio, video and other multimedia resources, to be transferred over the Internet using a protocol developed specifically for the Web (the so-called HTTP protocol). Each resource on the WWW (such as a Web page) has a unique global identifier, a Uniform Resource Identifier or Locator (URI or URL), so that each such resource can be found and accessed. Web pages are created using special markup languages, such as HTML or XML, that contain commands telling the browser how to display the information contained in these pages. The markup languages also contain commands for linking a page to other pages, thus creating a hypertext environment that lets Web users navigate from one Web page to another by clicking on these links and thereby "surf" the Web.
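To make the structure of a URL concrete, the following sketch uses Python's standard urllib.parse module to break a URL into its parts; the URL shown is a made-up example.

    # Decomposing a URL into its components with Python's standard library.
    # The URL itself is a hypothetical example.
    from urllib.parse import urlparse

    parts = urlparse("http://www.example.com/products/page.html?item=42#reviews")

    print(parts.scheme)    # "http": the protocol used to fetch the resource
    print(parts.netloc)    # "www.example.com": the Web server hosting it
    print(parts.path)      # "/products/page.html": the resource on that server
    print(parts.query)     # "item=42": optional parameters
    print(parts.fragment)  # "reviews": a location within the page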
The development of the World Wide Web, Web documents, and Web browsers for displaying these documents in a user-friendly fashion made the Internet far more approachable. This opened the Internet to the less technologically savvy general public, who simply wanted to display, access and exchange various types of information without resorting to the complicated technical means previously needed to achieve these goals. By making the tasks of displaying, accessing and exchanging information over the Internet much simpler, the Web spawned the development of various types of websites that collect, organize and provide systematic access to Web documents. The number of these websites grew explosively in the 1990s and has continued to grow rapidly worldwide ever since.

Massive volumes of Web documents were created over a short period of time after the invention of the WWW. To deal with this information overload, it became necessary to search for and find relevant documents among millions (and later billions) of Web pages spread across numerous websites all over the world. This gave rise to the creation and growth of search engines designed to find relevant information in these massive volumes of Web documents.

5. Growth of Search Engines and Google's History

A search engine finds information requested by the user that is located somewhere on the World Wide Web or in other places, including proprietary networks and sites, or on a personal computer. The user formulates a search query, and the search engine looks for documents and other content satisfying the search criteria of the query. Typically, these queries contain a list of keywords or phrases, and the engine retrieves documents that match them. Although searches can be performed in various environments, including corporate intranets, the majority of searching has been done on the Web, over the different kinds of documents and information available there. Since searching these documents directly on the Web is prohibitively time-consuming, all search engines use indexes to provide efficient retrieval of the searched information; these indexes are maintained regularly in order to keep them current (a small sketch of the indexing idea appears at the end of this section).

The history of search engines goes back to Archie and Gopher, two tools designed in 1990-1991 for searching files located at publicly accessible FTP sites over the Internet (and not over the WWW, which did not exist at that time). The early commercial search engines for Web documents were Lycos, Infoseek, AltaVista and Excite, which were launched around 1994-1995.
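As a small illustration of the indexing idea mentioned above, the following Python sketch builds an inverted index, a classic data structure for keyword search: each term is mapped to the set of documents containing it, so a query avoids rescanning every document. It is purely illustrative and is not a description of how any particular search engine, including Google, implements its index.

    # A minimal inverted index: maps each term to the documents containing it.
    # Illustrative only; real engines add ranking, phrase queries, and more.
    from collections import defaultdict

    documents = {
        1: "gifts and flowers for every occasion",
        2: "search engines index web documents",
        3: "web advertising and search marketing",
    }

    # Build the index once; queries then avoid rescanning every document.
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():
            index[term].add(doc_id)

    def search(query: str) -> set[int]:
        """Return ids of documents containing every keyword in the query."""
        ids = [index.get(term, set()) for term in query.split()]
        return set.intersection(*ids) if ids else set()

    print(search("web search"))  # {2, 3}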