Ignoring Robots - Or Better Yet, Counting Them Separately
- by [email protected]
It is quite common to have web sessions that are undesirable from the point of view of analytics. For example, when there are either internal or external robots that check the site's health, index it or just extract information from it. These robotic session do not behave like humans and if their volume is high enough they can sway the statistics and models.One easy way to deal with these sessions is to define a partitioning variable for all the models that is a flag indicating whether the session is "Normal" or "Robot". Then all the reports and the predictions can use the "Normal" partition, while the counts and statistics for Robots are still available.In order for this to work, though, it is necessary to have two conditions:1. It is possible to identify the Robotic sessions.2. No learning happens before the identification of the session as a robot.The first point is obvious, but the second may require some explanation. While the default in RTD is to learn at the end of the session, it is possible to learn in any entry point. This is a setting for each model. There are various reasons to learn in a specific entry point, for example if there is a desire to capture exactly and precisely the data in the session at the time the event happened as opposed to including changes to the end of the session.In any case, if RTD has already learned on the session before the identification of a robot was done there is no way to retract this learning.Identifying the robotic sessions can be done through the use of rules and heuristics. For example we may use some of the following:Maintain a list of known robotic IPs or domainsDetect very long sessions, lasting more than a few hours or visiting more than 500 pagesDetect "robotic" behaviors like a methodic click on all the link of every pageDetect a session with 10 pages clicked at exactly 20 second intervalsDetect extensive non-linear navigationNow, an interesting experiment would be to use the flag above as an output of a model to see if there are more subtle characteristics of robots such that a model can be used to detect robots, even if they fall through the cracks of rules and heuristics.In any case, the basic and simple technique of partitioning the models by the type of session is simple to implement and provides a lot of advantages.