DS Without The BS Episode 1: Data Quality
Welcome to our new podcast series, “DS Without The BS,” where Caroline Allen, Content Manager, helps to demystify AI, Machine Learning, and Media for marketers and market researchers. Gilad Barash, Director of Analytics, joins our first episode for a conversation on the top concern for B2B and B2C marketers: Data Quality. What does it mean for transparency, privacy, and accuracy? Listen to find out.
Welcome to the third podcast from Dstillery, a predictive marketing intelligence company. I’m Caroline Allen, Marketing Manager and host of our channel. In our first two episodes, we discussed the 2018 election and how our data scientists were able to predict the outcome from PA-18. As I spent so much time with our data science team over the past few weeks, I realized how lucky I am, as a marketer, to be surrounded by an entire company that can help to demystify all of the concepts and applications around AI, machine learning, and media. Because I’m a marketer, like you, I know that it’s information you’re dying for access to, so I invite you to tune into “DS Without The BS”, a series where we lift the veil on how you can and should use data science, AI, and machine learning to discover new customers and drive growth for your brand.
In our first episode, I’m joined by Gilad Barash, Director of Analytics, and we’re talking about data quality, which an Ascend2 study identified as the top challenge for marketers in 2018. Gilad, in meetings with our clients and partners, you’ve been hearing a lot of questions around data quality and how it pertains to privacy, transparency, and accuracy. Can you start by telling us what is your definition of data quality?
First of all, I’m very happy to be here, and thanks for having me. The definition of data quality for me is data that is accurate and usable in order to ultimately show the right ad to the right person at the right time. This pertains not only to the quality of the data, but also pertains to the quality of the media and the brand safety. All of these terms are sort of intertwined, and data quality is something that enables us to achieve these other objectives.
When we look at our data, it’s very important for this data to be, first of all, transparent: that we understand where it’s coming from and what it shows us, and that it’s something we can also express to our clients. We also really look at the issue of fraud, and that’s a very big issue when working with data, especially in ad tech. Fraud is a multi-billion dollar issue that happens annually. It is anything from bot traffic to domain spoofing, where certain fake ad spaces are sold as premium inventory. We look to weed out all of this fraud and not use it in our modeling and certainly not in our targeting, so we ensure that human eyes are the ones seeing the media that is being served and also that the consumer insights and the behavior patterns that we research and understand are human behavioral patterns and not ones of fraudulent bots.
When you’re looking at this data that’s coming through our system, are there any telltale signs that you see where you can say, “Oh, this is clearly bot traffic,” or, “Oh, this is clearly human traffic, human eyes, seeing the ads”?
That’s a really good question, and the answer is yes and no. There are cases where yes, some data quality issues are very easy to spot. For example, what we noticed when we started using mobile data that has geo-location in it, latitude longitude points … Let me ask you a question. Let me ask you a riddle. Do you know if you only looked at mobile data, mobile signal that comes from mobile bid requests that have lat/longs attached to them, that have geo-location, based on just looking at that data, what is the most populous area in the United States?
I would say maybe … We’re in New York. I would say maybe Times Square.
That’s a very good, intuitive answer, but the truth is that, based just on this mobile location data, the most populous area in the United States is actually at the center of a cornfield in Kansas. The reason for that is that a lot of times that geo-location in the mobile bid request is populated with just default or random values because there’s an incentive for developers in apps to include geo-location. It’s not important for their applications, and so they could just put in default values or random values. One very popular default value to put in the geo-location of mobile bid requests is the geographical center of the United States, which happens to be a cornfield in Kansas.
Things like that are easily detectable. Geographical centers of states, for instance, are things that we see a lot of. We weed those out. Another example of bad data is when we see a certain mobile device that one minute appears in Times Square in New York and the next minute we get another mobile bid request from it, and it’s in Anchorage, Alaska. It is impossible for that person to have gotten there that fast, and so we realize that these are apparently just random values that get placed into that geo-location, and so we weed them out and we get rid of them.
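The two checks described here, default coordinates and impossible travel, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Dstillery’s actual filter: the coordinates in `KNOWN_DEFAULTS` are an approximate stand-in for the center of the contiguous United States, and the `MAX_SPEED_KMH` ceiling and all function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0
# Assumption: an approximate, illustrative default coordinate
# (roughly the geographic center of the contiguous US, in Kansas).
KNOWN_DEFAULTS = [(39.83, -98.58)]
# Assumption: anything faster than roughly airliner speed is treated as impossible.
MAX_SPEED_KMH = 1000.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_default_location(lat, lon, tolerance_km=1.0):
    """True if the point sits on (or within tolerance of) a known default value."""
    return any(
        haversine_km(lat, lon, dlat, dlon) <= tolerance_km
        for dlat, dlon in KNOWN_DEFAULTS
    )


def is_impossible_travel(prev, curr):
    """prev and curr are (timestamp_seconds, lat, lon) for the same device."""
    t1, lat1, lon1 = prev
    t2, lat2, lon2 = curr
    # Treat the gap as at least one second to avoid dividing by zero.
    hours = max(abs(t2 - t1), 1) / 3600.0
    return haversine_km(lat1, lon1, lat2, lon2) / hours > MAX_SPEED_KMH
```

With these assumptions, a device seen in Times Square and then in Anchorage one minute later would be flagged, while the same device staying put would not.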
In the web space, there are also certain types of fraudulent bot activity that are easily detectable. When we see certain co-visitation of websites that happens very, very frequently, among sites that don’t really make sense to visit together, that indicates that this is fraudulent traffic, and we’re able to … We have patents around being able to weed out that kind of fraudulent traffic.
I will say, though, that especially with fraud in ad tech, this is an arms race, meaning that there are new types of fraud invented all the time. We’re always trying to combat it, and we will find ways to combat certain fraud. We’ll detect it. We’ll be able to weed it out, but then there’ll be new types. We’re always on the watch, always looking out for irregular behavior and trying to see where we can identify the fraud and weed it out from our data.
The reason I say that this is an arms race is that it’s difficult to predict new, abnormal behavior. We know of certain types of fraud and bot behavior that happen based on historical behavior. We can predict that, and we can automate mechanisms that will look for that and prevent that or isolate the cookies or the websites that are involved in that. However, you can’t predict something that you’ve never seen before, behavior that you’ve never witnessed before. This is what happens when new fraud elements appear. We have to be vigilant and keep an eye on our systems to see if any new type of abnormal behavior happens.
For example, a couple years ago we had a website where we suddenly noticed, over a two-week period, a spike in bid requests that was way beyond what it had ever had before. One of the exchanges that those bid requests were coming from had more bid requests than all of the other exchanges combined. More than any. The spike was so huge, we noticed this abnormality. Digging into it, we realized that this was Methbot. This was spoofing that was happening, and we were able to counteract it.
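As a rough illustration of how a spike like this might be surfaced automatically, here is a minimal Python sketch. It is not the monitoring Dstillery used: the trailing-median baseline, the `window` and `threshold` values, and both function names are assumptions chosen for the example.

```python
from statistics import median


def find_spikes(daily_counts, window=14, threshold=5.0):
    """Return indexes of days whose count sits far above the trailing baseline.

    The baseline is the median of the previous `window` days, and the spread
    is the median absolute deviation (MAD), so one past spike does not
    distort the baseline for the days that follow it.
    """
    spikes = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        base = median(history)
        mad = median([abs(x - base) for x in history])
        scale = max(mad, 1.0)  # guard against a perfectly flat history
        if (daily_counts[i] - base) / scale > threshold:
            spikes.append(i)
    return spikes


def dominant_exchange(counts_by_exchange):
    """Return the exchange sending more requests than all others combined, if any."""
    total = sum(counts_by_exchange.values())
    for exchange, count in counts_by_exchange.items():
        if count > total - count:
            return exchange
    return None
```

A flagged day or a dominant exchange is only a prompt for a person to dig in, as described above; it does not say what kind of fraud, if any, is behind the numbers.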
We also have to be vigilant and keep an eye on the system, identify and research and analyze any abnormalities that happen, and then find ways to deal with that fraud moving forward.
You mentioned just now about the patent that Dstillery has. Can you give a little detail to our listeners about what those patents are?
Yes. We have numerous patents around modeling and data collection and data quality. Those pertain to some of those examples that I mentioned of seeing mobile data that seems to be bad quality in the geo-location. In the web space, one of our big patents is what we call “The Penalty Box” which is, by the way, open source and freely available to all our competitors because we believe in having a level playing field. We believe that if everybody did everything they could to battle fraud, we would be in a much better state in our industry.
Essentially what this patent does is it identifies cookies and websites that we suspect are fraudulent based on these co-visitation patterns that seem irregular, and it isolates them from our entire data and modeling pipeline so that they’re not used as input for modeling in order to create our audiences, and they’re also not used for targeting and for insights to understand people’s behavior.
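To make the idea of co-visitation concrete, here is a toy Python sketch. It is not the patented method: the overlap measure, the `min_overlap` and `min_partners` thresholds, and the function names are all assumptions made up for illustration. It flags sites whose audiences overlap heavily with several other sites, then sets aside every cookie seen on a flagged site.

```python
from collections import defaultdict
from itertools import combinations


def suspicious_sites(visits, min_overlap=0.5, min_partners=2):
    """visits is an iterable of (cookie_id, site) pairs.

    A site is flagged when it shares at least `min_overlap` of the smaller
    audience with at least `min_partners` other sites.
    """
    audience = defaultdict(set)
    for cookie, site in visits:
        audience[site].add(cookie)

    partners = defaultdict(int)
    for a, b in combinations(sorted(audience), 2):
        shared = len(audience[a] & audience[b])
        overlap = shared / min(len(audience[a]), len(audience[b]))
        if overlap >= min_overlap:
            partners[a] += 1
            partners[b] += 1
    return {site for site, n in partners.items() if n >= min_partners}


def quarantine(visits, flagged_sites):
    """Split visits into kept records and the set of cookies set aside."""
    visits = list(visits)
    bad_cookies = {c for c, s in visits if s in flagged_sites}
    kept = [(c, s) for c, s in visits
            if c not in bad_cookies and s not in flagged_sites]
    return kept, bad_cookies
```

In practice, legitimate sites share audiences too, so a real system would need a far more careful measure than this raw overlap; the sketch only shows the shape of the isolate-before-modeling step.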
We talked a lot about bot traffic and fraud as it pertains to data quality, but another aspect of that is the process of collecting this data and how that pertains to making sure that we are protecting the privacy of the consumer.
First of all, you know, you mentioned privacy, and I think that’s certainly a very germane point right now that everybody’s talking about. I think it’s important to mention that good quality data doesn’t necessarily have to violate any kind of privacy standards. We use data that is transparent and anonymous. Then there is any kind of data that violates PII standards. PII is personally identifiable information. That’s information that could be used on its own or with another information source to identify or locate a single person. We do not use this kind of data. We don’t want to use this kind of data.
We like to say here at Dstillery that the demographics just don’t tell a good story.
It really doesn’t. That’s right. It tells a very, very partial story. I have an example of something similar to what you mentioned. I’m always reminded of a certain car manufacturer that wanted to target men over 50 for their luxury brand. They were very dead set on that demographic. Male, over 50 was their customer. When we showed them our data, when we showed them the behavioral data that we looked at based on people’s web behavior, we saw that the behavioral patterns also indicated that there were their wives, their partners, their children, their friends who were also looking and researching and perhaps influencing them. When you target solely based on demographics, you actually lose out on that audience, on that potential influencer audience that may also be the outlier in your customer base.
Using only demographics will cause you to miss out on that. Looking at behavioral patterns, you will include the men over 50 that are looking at this and are buying this, but you will also include other demographics that you may not even have realized influence, research, and are involved in or interested in your brand.
That’s a great point. To be frank, from Dstillery’s perspective, it’s not that we are the “boy scouts”. It’s that we truly understand the value of good quality data, not just to be able to serve media, but to be able to provide accurate and actionable insights to brands. You have to be really picky about the data that you’re using.
Right. Exactly. We want to use the best data that we can. This happens to be data that is anonymous and transparent. Data that we can use to aggregate and understand behavioral patterns of audiences rather than specific individuals. That’s where our strength lies, and that’s what we are trying to do with the data. We look at this focus group of 300 million devices that we see in our country and are able to see observed behavior of what they do, aggregate that, and understand their behavior patterns without having to drill down to the level of the individual device. In fact, like you said, the reason we don’t want to use demographic data is because we believe that demographic data doesn’t tell a good story.
One of the top topics around our lunch table these days is blockchain. Do you have an opinion or some thoughts that you can share on this emerging technology and how it relates to data quality?
We’re always looking at new technologies that are out there on our radar. New applications of AI and machine learning as well as blockchain technology that’s been getting a lot of buzz recently as something that can really help with brand safety, media quality, and really transparency around ad tech because it builds a shared accountability across media partners to better ensure getting that right message in front of the right person at the right time. It allows us to coordinate and come to a trusted conclusion around data. That said, it’s new and emerging. It also has issues of its own. It doesn’t answer all of the questions, and it raises a few new ones. It’s something that we’re looking at that’s on our radar. Then we will continue to evaluate how well it can be integrated into our processes in the future, if it makes sense.
If you guys want more information on the work that we’re doing here at Dstillery or if you have questions for Gilad around data quality and brand safety, you can e-mail him directly. His e-mail is firstname.lastname@example.org. That’s G-I-L-A-D @dstillery.com. It’s also going to be in the transcript below. You can also check out our website, dstillery.com. Our blog has all of our recent articles as well as more information on brand safety and privacy. Follow us on Facebook, Facebook.com/dstillery.intelligence, or hit us up on Twitter or Instagram @dstillery. Don’t forget, that’s dstillery without the I. D-S-T-I-L-L-E-R-Y. Thanks guys. Talk to you soon.
Correction: Dstillery’s patent was incorrectly referred to as “penalty box.” The open-sourced patent is “co-visitation patent.”