As the market has come to realize that the ability to communicate with and track actual consumers—as opposed to their various devices—is increasingly important, the use of probabilistic device matching data has been growing rapidly among advertisers, publishers and demand-side platforms.
Although probabilistic matching is a frequently discussed topic in ad tech, many marketers and executives remain unsure about exactly what it is and how it differs from deterministic matching. This article aims to clarify the issues.
Deterministic Device Matching
Let’s start with deterministic matching: this is a method of determining that several connected devices (such as a smartphone, tablet and laptop) all belong to the same person, based on that person’s unique login credentials for a particular service. Once the same login ID (such as an email address or username) has been used to log in to a service on more than one device, there is a good chance that those devices belong to one particular person. This is especially true if the same ID is used to log in on those different devices multiple times (as opposed to a one-time login which might occur, for example, when checking webmail on a friend’s computer or at a public computer station while traveling).
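The idea can be sketched in a few lines of Python. This is only an illustration: the login events, the device names and the `min_logins` cutoff are hypothetical, and a real system would work with hashed IDs and far more data.

```python
from collections import defaultdict

# Hypothetical login events: (login_id, device_id).
login_events = [
    ("user_a", "phone_1"),
    ("user_a", "phone_1"),
    ("user_a", "laptop_1"),
    ("user_a", "laptop_1"),
    ("user_a", "kiosk_9"),   # one-time login on a public computer
    ("user_b", "tablet_2"),
    ("user_b", "tablet_2"),
]

def deterministic_links(events, min_logins=2):
    """Group devices by login ID, keeping only devices with repeated logins."""
    counts = defaultdict(lambda: defaultdict(int))
    for login_id, device_id in events:
        counts[login_id][device_id] += 1

    links = {}
    for login_id, per_device in counts.items():
        devices = sorted(d for d, n in per_device.items() if n >= min_logins)
        if len(devices) >= 2:  # a link needs at least two devices
            links[login_id] = devices
    return links

links = deterministic_links(login_events)
print(links)
```

Here the public kiosk is dropped because it was seen only once, and `user_b` produces no link because only one device was observed for that login.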
Deterministic matching is very accurate, but it has limited reach in terms of the overall population. This is because the size of the registered user base of any given service determines the maximum number of consumers whose devices can be linked. There are, of course, companies with massive numbers of users who consistently log in with the same username across devices, but even those huge companies (such as Google and Facebook) do not cover all consumers. Furthermore, a large and increasing number of Facebook users, for example, only log in using mobile devices. For these reasons and others, even the behemoths resort to probabilistic matching to “close the gap.”
It is also important to mention that because the data used for deterministic matching contains personally identifiable details, companies that maintain this kind of data face a very real, ever-present threat of a damaging data breach.
Probabilistic Device Matching
So, what is probabilistic matching? Probabilistic matching also determines which connected devices belong to a single consumer, but it does not rely on explicit login information to do so. Instead, probabilistic matching applies principles from mathematics, data science and machine learning to analyze huge amounts of device activity data and infer, or predict, that various devices belong to an individual consumer. By analyzing the observed activity patterns of billions of different devices over time, it is possible to determine, with a high degree of confidence, which devices most likely belong to the same person.
The mathematical processes used by the probabilistic method do not result in binary (yes/no) indications of device relationships. Rather, the method indicates the probability that various devices are linked by an individual user. Hence, this method is called probabilistic matching. It is up to the user of probabilistic matching data to decide which probability threshold makes the most sense for each particular use case (more about this later).
The two primary advantages of probabilistic device matching over deterministic device matching are scale and privacy. Scale means that probabilistic matching solutions can potentially identify the devices of all consumers, not just those consumers logging in to one particular service. This is critical in order for advertisers and publishers to maximize the value of device matching data across the entire population of consumers.
In terms of consumer privacy, probabilistic device matching is superior because all collected device and activity data is anonymous. Unlike deterministic matching, which relies on identifying devices via an actual email address or other personally-identifiable ID, probabilistic matching solutions need only identify which devices are used by an individual without ever knowing or caring who that individual is. Therefore, there is no risk of a damaging data breach if hackers should gain access to this kind of data.
Note that the probabilistic techniques we’re discussing are used in a wide range of disciplines in which data scientists attempt to model massive amounts of data to make predictions. Other fields where the probabilistic approach is used include economics (to predict consumer behavior and other macroeconomic trends), stock market analysis (to predict market trends and individual share prices), meteorology (to forecast the weather) and pharmaceutical research (to predict the effects of new drugs).
How Probabilistic Device Matching Works
The probabilistic matching method involves a series of four logical steps.
First, the system acquires device activity logs from as many devices as possible within a targeted geographical area (typically, from billions of devices). These logs include information such as IP address, WiFi networks used, GPS coordinates, websites browsed, ads displayed, device type, operating system, browser cookies, mobile device IDs, time of day and much more. The more detailed the activity data, the better. This type of activity data is called “observation data.”
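To make observation data a little more concrete, here is a small Python sketch of how one signal could be turned into a pairwise feature. The record layout, the device names and the choice of Jaccard similarity over IP addresses are assumptions made for illustration; the article does not specify which features a real solution computes.

```python
from collections import defaultdict

# Hypothetical observation records: (device_id, ip_address, hour_of_day).
observations = [
    ("phone_1", "203.0.113.7", 8),
    ("phone_1", "198.51.100.4", 13),
    ("laptop_1", "203.0.113.7", 21),
    ("laptop_1", "198.51.100.4", 10),
    ("tablet_2", "192.0.2.55", 20),
]

def ip_sets(records):
    """Collect the set of IP addresses seen for each device."""
    seen = defaultdict(set)
    for device_id, ip, _hour in records:
        seen[device_id].add(ip)
    return seen

def jaccard(a, b):
    """Share of items common to both sets, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

ips = ip_sets(observations)
shared = jaccard(ips["phone_1"], ips["laptop_1"])
unrelated = jaccard(ips["phone_1"], ips["tablet_2"])
print(shared, unrelated)
```

A high overlap alone does not prove that two devices share an owner (they may simply share a household or office), which is why many such signals are combined by a trained model.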
Next, the system analyzes anonymous deterministic matching data on a small subset of the observed devices. This small deterministic dataset, which is usually purchased from a third-party service that collects it, indicates which devices belong to the same person (without identifying the person). This data is used both for training the machine learning models, which predict whether or not certain devices belong to the same person, and for validating the models’ results. This deterministic data has various names, including “labels,” “ground truth,” “deterministic set” and “truth set” (we’ll use “truth set” for the rest of this article). It is very important that this dataset is based on actions which indicate that the linked devices actually belong to the same individual person, and not, for example, just to the same household. The larger and more diverse this dataset is, the better the probabilistic model will perform.
The third step is to train the machine learning models using the truth set so that they can analyze the observation data and successfully determine which devices belong to individual users. Delving into the algorithmic complexities of how this is actually accomplished is beyond the scope of this article, but suffice it to say that accurately training machine learning models is an immense challenge, one that only the most talented teams of data scientists and software developers can address. Further complicating the challenge is the fact that consumer usage patterns are constantly changing and evolving; thus, this machine learning process is an ongoing and never-ending aspect of any reliable probabilistic matching solution.
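As a toy illustration of this training step, the sketch below fits a tiny logistic regression in plain Python. Everything in it is an assumption chosen for brevity: the two features, the six labeled pairs, the learning rate and the choice of logistic regression itself. Production models are far more elaborate.

```python
import math

# Hypothetical training pairs: ([shared_ip_ratio, shared_wifi_ratio], label).
# label 1 = same person according to the truth set, 0 = different people.
training_pairs = [
    ([0.9, 0.8], 1), ([0.8, 0.9], 1), ([0.7, 0.6], 1),
    ([0.1, 0.0], 0), ([0.2, 0.1], 0), ([0.0, 0.2], 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Return the estimated probability that a device pair is one person."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def train(pairs, epochs=2000, lr=0.5):
    """Fit weights with plain stochastic gradient descent on log loss."""
    weights, bias = [0.0] * len(pairs[0][0]), 0.0
    for _ in range(epochs):
        for features, label in pairs:
            error = predict(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

weights, bias = train(training_pairs)
p_same = predict(weights, bias, [0.85, 0.75])  # pair with heavy overlap
p_diff = predict(weights, bias, [0.05, 0.10])  # pair with little overlap
print(round(p_same, 3), round(p_diff, 3))
```

The model outputs a probability rather than a yes/no answer, which matches the way probabilistic matching results are described earlier in this article.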
To facilitate unbiased measurements of the models’ performance, the truth set needs to be split into two distinct, non-intersecting sets. One set is used for the actual training of the models and is called a “training set,” while the second set is used for testing the performance of the models and is called a “test set.” By splitting the truth set into two, unbiased performance evaluation is ensured, because the models have never “seen” the portion of the data used to test their performance. (Read “Probabilistic Device Matching Accuracy, Precision and Recall” to learn more about how the performance of probabilistic matching solutions is measured.)
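A minimal sketch of such a split follows. The truth set layout, the 25% test fraction and the fixed seed are hypothetical. Splitting by person, rather than by individual device pair, is one way to make sure that no device appears in both sets.

```python
import random

# Hypothetical truth set: anonymous person key -> devices known to be theirs.
truth_set = {f"p{i}": [f"phone_{i}", f"laptop_{i}"] for i in range(1, 9)}

def split_truth_set(truth, test_fraction=0.25, seed=42):
    """Split people (not device pairs) into disjoint training and test sets."""
    people = sorted(truth)
    random.Random(seed).shuffle(people)  # seeded for a repeatable split
    n_test = max(1, int(len(people) * test_fraction))
    test_people = set(people[:n_test])
    train = {p: d for p, d in truth.items() if p not in test_people}
    test = {p: d for p, d in truth.items() if p in test_people}
    return train, test

train_set, test_set = split_truth_set(truth_set)
print(len(train_set), len(test_set))
```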
In the fourth and final step, the system analyzes an entire observation dataset and generates a list of devices that it has determined are linked to each other by an individual user. As mentioned previously, the model assigns a probability score to each of these identified device links.
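One simple way to turn pairwise links into per-person device groups is a union-find pass, sketched below. This grouping method and the sample links are assumptions for illustration, not a description of how any particular vendor builds its output.

```python
# Hypothetical pairwise links that have already been accepted as matches.
accepted_links = [
    ("phone_1", "laptop_1"),
    ("laptop_1", "tablet_1"),
    ("phone_2", "tv_2"),
]

def group_devices(links):
    """Merge linked devices into groups using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for device in list(parent):
        groups.setdefault(find(device), set()).add(device)
    return sorted(sorted(g) for g in groups.values())

device_groups = group_devices(accepted_links)
print(device_groups)
```

Note that merging is transitive: `phone_1` and `tablet_1` end up in the same group even though they were never linked directly. With low-probability links this can chain unrelated devices together, so a real system would need safeguards that this sketch omits.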
Each particular use of the data (cross-device targeted advertising, cross-device retargeting, audience extension, cross-device content personalization, cross-device attribution analytics, etc.) will use its own pre-defined threshold for deciding whether two devices belong to the same person or not. For example, a device-match probability threshold of 60% might be used for ad targeting or content personalization (where there is little downside to “false positives”), whereas a threshold of at least 80% might make more sense for cross-device conversion attribution (which may influence advertising spending decisions).
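Applying per-use-case thresholds can be sketched as follows. The 60% and 80% values come from the example above; the use-case keys, device names and scores are hypothetical.

```python
# Thresholds from the example in the text; the key names are hypothetical.
THRESHOLDS = {"ad_targeting": 0.60, "conversion_attribution": 0.80}

# Hypothetical model output: (device_a, device_b, match_probability).
scored_links = [
    ("phone_1", "laptop_1", 0.92),
    ("phone_1", "tablet_2", 0.67),
    ("laptop_1", "tv_3", 0.41),
]

def links_for(use_case, links, thresholds=THRESHOLDS):
    """Keep only the device pairs that meet the threshold for a use case."""
    cutoff = thresholds[use_case]
    return [(a, b) for a, b, p in links if p >= cutoff]

targeting_links = links_for("ad_targeting", scored_links)
attribution_links = links_for("conversion_attribution", scored_links)
print(targeting_links)
print(attribution_links)
```

The same scored data serves both purposes: the lower cutoff yields more links for targeting, while the stricter one keeps only the most confident link for attribution.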
In summary, a well-performing probabilistic matching solution generalizes what it learned from a training set and successfully applies the resulting algorithms to any observation data. This is the “magic” of probabilistic matching. A sophisticated, carefully tweaked model will be able to deliver results which demonstrate very high accuracy along with great scale.
How Probabilistic Matching Data is Used
There are numerous uses for probabilistic device matching data, all of which represent significant added value to advertisers and publishers. Examples include the following:
Advertising-related Use Cases
- Cross-device advertising – Target known users via additional devices
- Segment expansion – Expand segments to include all relevant screens
- Cross-device retargeting – Capture a customer when timing is crucial
- Global frequency capping – Limit ad views across all of a user’s devices
- Sequential messaging – Run effective cross-screen sequential campaigns
Personalization-related Use Cases
- Content personalization – Personalize content across all of a user’s devices
- Offer personalization – On-site/in-app offers based on cross-device interests
Analytics-related Use Cases
- Cross-device behavior analytics – Understand multi-touch user flow across devices
- Cross-device conversion attribution – Understand when multiple devices are involved in a conversion
- True unique reach and frequency – Count people, not devices
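Two of the use cases above, true unique reach and global frequency capping, can be illustrated with a short Python sketch. The device-to-person mapping, the impression log and the cap of three views are all hypothetical.

```python
from collections import Counter

# Hypothetical device matching output: device_id -> anonymous person key.
device_to_person = {"phone_1": "p1", "laptop_1": "p1", "tablet_2": "p2"}

# Hypothetical ad impression log, one device ID per impression served.
impressions = ["phone_1", "laptop_1", "phone_1", "tablet_2", "tv_9"]

def person_key(device_id):
    """Map a device to its person; unmatched devices count as their own person."""
    return device_to_person.get(device_id, device_id)

views_per_person = Counter(person_key(d) for d in impressions)

device_reach = len(set(impressions))   # distinct devices reached
people_reach = len(views_per_person)   # distinct people reached
avg_frequency = len(impressions) / people_reach

# Global frequency cap: people who should not be shown the ad again.
FREQUENCY_CAP = 3
capped = sorted(p for p, n in views_per_person.items() if n >= FREQUENCY_CAP)
print(device_reach, people_reach, capped)
```

Counting by device would report four "users" and never trigger the cap, while counting by person shows three people, one of whom has already seen the ad three times across two devices.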