ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2013)
By: Ori Stitelman, Claudia Perlich, Brian Dalessandro, Rod Hook, Troy Raeder, and Foster Provost
Data generated by observing the actions of web browsers across the internet is being used at an ever-increasing rate for both building models and making decisions. In fact, a quarter of the industry-track papers for KDD in 2012 were based on data generated by online actions. The models, analytics and decisions they inform all stem from the assumption that observed data captures the intent of users. However, a large portion of these observed actions are not intentional, and are effectively polluting the models. Much of this observed activity is either generated by robots traversing the internet or the result of unintended actions of real users. These non-intentional actions observed in the web logs severely bias both analytics and the models created from the data. In this paper, we will show examples of how non-intentional traffic adversely affects both general analytics and predictive models, and propose an approach using co-visitation networks to identify sites that have large amounts of non-intentional traffic. We will then show how this approach, along with a second-stage classifier that identifies non-intentional traffic at the browser level, is deployed in production at Media6Degrees (m6d), a targeting technology company for display advertising. This deployed product acts both to filter out the spurious traffic from the input data and to ensure that we don’t serve ads during unintended website visits.