% TEMPLATE for Usenix papers, specifically to meet requirements of
% USENIX '05
% originally a template for producing IEEE-format articles using LaTeX.
% written by Matthew Ward, CS Department, Worcester Polytechnic Institute.
% adapted by David Beazley for his excellent SWIG paper in Proceedings,
% Tcl 96
% turned into a smartass generic template by De Clarke, with thanks to
% both the above pioneers
% use at your own risk. Complaints to /dev/null.
% make it two column with no page numbering, default is 10 point
% Munged by Fred Douglis <douglis@research.att.com> 10/97 to separate
% the .sty file from the LaTeX source template, so that people can
% more easily include the .sty file into an existing document. Also
% changed to more closely follow the style guidelines as represented
% by the Word sample file.
% Note that since 2010, USENIX does not require endnotes. If you want
% foot of page notes, don't include the endnotes package in the
% usepackage command, below.
% This version uses the latex2e styles, not the very ancient 2.09 stuff.
\documentclass[letterpaper,twocolumn,10pt]{article}
\usepackage{usenix,epsfig,endnotes}
\usepackage{url}
\begin{document}
%don't want date printed
\date{}
%make title bold and 14 pt font (Latex default is non-bold, 16 pt)
\title{\Large \bf Mouse and Keyboard Tracking Detection \\ Project Update}
%for single author (just remove % characters)
\author{
{\rm Nathan G. Feldman}\\%in case you're wondering, there's already a few other Nathan Feldmans on Google Scholar, so I need to distinguish myself in scholarly work
University of California, San Diego
\and
{\rm Scott Finney}\\
University of California, San Diego
% copy the following lines to add more authors
% \and
% {\rm Name}\\
%Name Institution
} % end author
\maketitle
% Use the following at camera-ready time to suppress page numbers.
% Comment it out when you first submit the paper for review.
\thispagestyle{empty}
%1. What closely related work, if any, did you find and read?
%2. What things did you try on the project since you proposed it?
%3. What findings did #2 lead to?
%4. How did the related work you read and your findings change your proposal, if at all?
%5. What is your plan for the remainder of the quarter?
%6. What, if anything, do you need from me and Keaton to help you succeed
\subsection*{Abstract}
We aim to design and implement a technique for detecting when websites store data about mouse movement and keyboard presses and send it out over the internet, either back to the website's own server or out to the server of a third-party service. We would like to build a web crawler to visit as many of the top Alexa websites as possible and use this technique on each of them to find out which ones do mouse and keyboard tracking.
We wish to analyze our results to find out as precisely as possible what information each of them is sending, to whom, and for what purpose. In doing so, we will find out how much the typical web user's expectations of privacy on the web are being violated.
Time permitting, we may also implement a browser extension to detect mouse and keyboard tracking at the user's request.
\section{Goals}
The core of this project is to design and implement a technique for detecting mouse and keyboard tracking on the web. The technique should allow us to find out when a website's client-side script is sending information about the user's mouse movement and keyboard presses out over the internet. Mouse movement, while not a particularly severe security concern, may violate the typical user's expectation of privacy on the web. Keyboard press detection raises the same concern, but may also reveal sensitive information about the user, especially her password, to malicious sites she visits by accident or by coercion. (Of course, any website where the user has to log in can see the user's password for that site, which is most likely used by the user for other sites as well. We will discuss in more detail in our final report why keyboard detection gives the adversary more of an ability to steal a user's password or other sensitive information.)
Once the technique is implemented, we will perform a web crawl applying the technique to some number of top Alexa websites. This should allow us to determine which of the most ``popular'' websites are tracking mouse movement and keyboard presses. Additionally, by collecting the network traffic sent by the websites' client-side scripts, our data will indicate to whom the mouse and keyboard information is being sent.
After gathering the data from the web crawl, we will perform an analysis to try to find out how this mouse and keyboard information is being used. Our hypothesis is that this kind of tracking is being done quite widely, but mostly just for UX purposes. Web developers and UX designers have reason to learn exactly how users are using their site so that they may continuously rework the website design to make it easier to interact with. We would like to find out how many websites are using this information for illegitimate purposes. We believe we should gain some insight into the web developers' reasons for mouse and keyboard tracking by looking at the addresses to which the information is being sent. It is reasonable for the information to be sent back to the website's own server for usability testing purposes. Additionally, we are aware of some third-party mouse and keyboard tracking services that are commonly used by web developers for usability testing, such as ClickTale~\cite{clicktale}, Crazy Egg~\cite{crazyegg}, Inspectlet~\cite{inspectlet}, and Mouseflow~\cite{mouseflow}. %should those be endnotes instead of citations?
%Talk about what privacy violation means here?
\section{Related Work}
Our goal is similar in structure to fingerprinting detection. We wish to load a number of sites and somehow analyze their scripts to decide if they are tracking certain information from the browser.
Acar et al.~\cite{fpdetective} built a framework called FPDetective for crawling the web for instances of web-based fingerprinting. They made slight modifications to the JavaScript engines inside the browsers being used for crawling in order to log certain events in JavaScript, and used heuristics to decide whether a particular script is fingerprinting the browser. Primarily, they suggest that if a script attempts to load a large number of fonts, it is probably fingerprinting the browser, since there are likely no other reasons a website would have for loading that many fonts. They used PhantomJS~\cite{phantomjs},
a ``headless'' web browser, to load each of the top million Alexa sites and determine, based on the behavior of the loaded scripts, whether they were performing any fingerprinting. They found that 404 of the top million sites were doing fingerprinting in JavaScript, and an additional 95 sites in the top 10,000 were doing Flash-based fingerprinting (found through separate methods).
Mayer and Mitchell~\cite{fourthparty} rely on a framework called FourthParty, a Mozilla Firefox extension that also does JavaScript event logging for web measurement, to inform a discussion of web tracking policy. The framework is more general-purpose but less powerful than FPDetective: it is meant to be used for any type of measurement and analysis done over the web, but it is only a browser extension, not a browser modification.
Jang et al.~\cite{tainttracking} built a much more powerful tool to detect several forms of malice in JavaScript, such as cookie stealing, history sniffing, and mouse and keyboard behavior tracking, which was run over the top 50,000 Alexa sites. The tool is a modified version of Google Chrome instrumented with an ``information flow engine,'' which is capable of tracking how accessed data values are passed through a script. For instance, it can ``taint'' the $x$- and $y$-coordinate values taken from a mouse move event and detect whether those values are ever passed to a method that sends an HTTP request out over the network. Mouse movement and key presses were not their focus, however. In fact, the only events whose values they discuss in their analysis are clicks, mouseovers, and clipboard copies. While they did find evidence of mouse event values being passed to network functions, they do not discuss how basic mouse movement or keyboard press information is being used, or whether it is being used at all.
While building upon the taint-tracking system from~\cite{tainttracking} would have been an excellent starting point for our experiments, the code was not in working condition, and fixing and extending it would have taken more work than we could afford in one school quarter. We could have built on top of FPDetective or FourthParty as well, but both were built on versions of browsers that are now outdated, and there would have been a lot of overhead in getting them set up and getting acquainted with their codebases.
Instead, we felt our technique (described in the next section) was simple enough that building our own crawler directly on PhantomJS would be the quickest and easiest way to produce results by the end of the quarter. Writing our own crawler also guarantees that we can collect exactly the data we need and use whatever external tools we want, whereas the frameworks mentioned above might confine us to specific API calls. Finally, by using our own single-purpose script, our web crawl and analysis should run faster, without the overhead of all the other measurements taken by the frameworks above.
\section{Technique and Set-up}
We are writing a PhantomJS script that visits a target webpage, triggers mouse and keyboard events, and records the relevant network traffic. The script will be run over as many of the top Alexa sites as we have time for, aiming for at least the top 10,000. The network requests and responses will be stored in a MongoDB database~\cite{mongodb}, to be processed by separate analysis scripts after the data collection is complete. This network traffic analysis is meant to detect when a site is performing covert mouse and keyboard tracking.
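
The per-site crawl step might look roughly like the sketch below. This is only a sketch: the event coordinates, key choice, and timeouts are illustrative placeholders rather than our final parameters, and a wrapper script (not shown) would feed in each Alexa URL and load the JSON output into MongoDB.

\begin{verbatim}
// visit one site, inject input events, and log all
// network requests and responses with timestamps
var page = require('webpage').create();
var url  = require('system').args[1];
var log  = [];

page.onResourceRequested = function (req) {
  log.push({ url: req.url, t: Date.now() });
};
page.onResourceReceived = function (res) {
  if (res.stage === 'end') {
    log.push({ id: res.id, size: res.bodySize,
               t: Date.now(), response: true });
  }
};

page.open(url, function (status) {
  if (status !== 'success') { phantom.exit(1); }
  // let the initial page and resource loads finish
  setTimeout(function () {
    // synthetic user input
    page.sendEvent('mousemove', 100, 100);
    page.sendEvent('mousemove', 300, 250);
    page.sendEvent('keypress', 'a');
    // give tracking scripts time to phone home
    setTimeout(function () {
      console.log(JSON.stringify(log));
      phantom.exit();
    }, 5000);
  }, 2000);
});
\end{verbatim}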
Essentially, our heuristic is to look for correlations between mouse and keyboard events and network traffic. We first filter out the first several hundred milliseconds of HTTP requests and responses, since these will consist only of the initial page and resource loads, and we will not trigger any mouse or keyboard events during this time. After the initial loads are done, we hypothesize that there will be only two types of network traffic left. The first type is periodic background traffic. This includes messages to let the server know the user is still online, and requests for updates to the page content. Because the browser does not allow the server to send messages to the client without a request from the browser, content cannot be ``pushed'' directly to the client once it is created; it must be explicitly requested by the client. As an example, a social networking site might have the browser send HTTP requests to the server every five seconds to ask for new posts to the user's news feed and for new messages and notifications, and simply to let the server know that the user is still online for chat. This all happens because the server is not capable of sending messages to the browser the moment new posts are created or messages or notifications are received. We suspect that almost all HTTP requests of this type are sent at fixed intervals and with roughly constant size.
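
To make this concrete, a first-cut filter for periodic background traffic might group the logged requests by URL and check that both the inter-arrival gaps and the request sizes stay nearly constant. The sketch below assumes each logged record carries a timestamp \texttt{t} and a \texttt{size} field, and the 20\% spread threshold is a placeholder we would tune empirically.

\begin{verbatim}
// relative spread of a list of numbers
function relSpread(xs) {
  var max = Math.max.apply(null, xs);
  var min = Math.min.apply(null, xs);
  return max === 0 ? 0 : (max - min) / max;
}

// reqs: all logged requests for one URL, in order;
// near-constant gaps and sizes suggest background
// traffic rather than input-triggered traffic
function isPeriodic(reqs) {
  if (reqs.length < 3) return false;
  var gaps = [];
  for (var i = 1; i < reqs.length; i++) {
    gaps.push(reqs[i].t - reqs[i - 1].t);
  }
  var sizes = reqs.map(function (r) { return r.size; });
  return relSpread(gaps) < 0.2 && relSpread(sizes) < 0.2;
}
\end{verbatim}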
The second type of traffic is that triggered by user interaction. Because all user interaction is controlled by our scripts, any variations in the frequency and size of HTTP requests sent by the browser after the initial loads should be a result of our triggered mouse and keyboard events. Therefore, we should be able to detect a simple pattern of network traffic over the first few seconds after the initial loads, and after this point we should only see disruptions in the pattern as a result of mouse and keyboard events. Moreover, we think that patterns in sequences of mouse and keyboard events will result in detectable patterns of HTTP requests if the site is performing some type of mouse or keyboard tracking.
We think that there is no reason for mouse and keyboard events to trigger network traffic in this fashion except for covert tracking, unless the page changes as a result of the network traffic. For instance, as mentioned in~\cite{tainttracking}, moving the mouse to certain locations could cause an image to load, and so this would not be covert. For this to be the case, the server needs to send something back to the browser to be loaded into the page. Thus, we will filter out sites where most HTTP requests of the second type result in large HTTP responses from the server. We deduce that all remaining sites with the second type of traffic are doing covert tracking. In fact, we do not expect to find many of these false positives, where HTTP responses resulting from mouse and keyboard events contain real substance, because we believe that most traffic of this type will come from mouseover events, which we are not generating.
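
Putting the pieces together, the per-site decision might look like the sketch below, where \texttt{events} is the list of injected input events and \texttt{reqs} is the post-load, non-periodic traffic. The 500\,ms correlation window and 1\,KB response cutoff are illustrative thresholds, not measured values.

\begin{verbatim}
// flag a site if its leftover requests fire shortly
// after our injected events and draw only tiny
// responses (i.e., nothing visible comes back)
function isCovertTracking(events, reqs) {
  var correlated = reqs.filter(function (r) {
    return events.some(function (e) {
      var dt = r.t - e.t;
      return dt >= 0 && dt < 500;
    });
  });
  var covert = correlated.filter(function (r) {
    return r.responseSize < 1024;
  });
  // most correlated traffic carries no real payload
  return correlated.length > 0 &&
         covert.length >= 0.8 * correlated.length;
}
\end{verbatim}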
\section{Plan}
We have already written PhantomJS scripts for triggering mouse events on a site and logging all network traffic into a database.
This week (by February 28), we plan to finish scripting our web crawl and run it on as many of the top Alexa sites as possible, triggering mouse events and recording the network traffic.
This weekend and next week, we will focus on implementing our algorithm to detect when the recorded network traffic seems to indicate that a website is performing covert mouse and keyboard tracking, and on running it over our database to generate a list of sites we believe to be doing this. We will then inspect some of our results manually to verify that these sites are indeed doing covert tracking, to investigate the purpose of the tracking, and to identify major third-party services that do mouse and keyboard tracking.
\section{Things We Need from You}
Advice and moral support.
(Computing resources might be useful for our crawl, but we don't think this will be necessary.)
{\footnotesize \bibliographystyle{acm}
\bibliography{bib}}
\theendnotes
\end{document}