Data Quality

January 25, 2021

Beyond Checking for Duplicates and IP Addresses: 10 Ways to Deep Clean Your Data

Data quality is a massive problem in the marketing research industry, especially given the dominance of online surveys. Bad data can produce erroneous conclusions and lead organizations to take the wrong actions, wasting time and money.

Bad Respondents Can Lead to Bad Decisions

The bottom line is that businesses depend on marketing research to provide reliable and accurate data to inform critical decisions. If that data is not dependable, then the decisions made could be disastrously wrong.

For example, suppose a big-box retailer hires a marketing research partner to help redesign its retail layout. The retailer relies on the results for decisions about merchandising, shelf layout, and lighting as it redesigns (or worse yet, new-builds) its prototype store of the future. What if the results were wrong? The client would have spent millions of dollars creating the wrong prototype, and it would have to pay millions more when the mistakes come to light and the prototype needs a redesign.

But the ramifications of low-quality data are not always so obvious. For example, research has shown that low-quality respondents often overstate their brand awareness and familiarity. If an advertiser thinks they have 86% awareness and 75% familiarity, they will make very different decisions about their marketing tactics than if they have 24% awareness and 8% familiarity. Their marketing might address a trial problem when, instead, they actually have an awareness issue.

As an industry, marketing research has rarely been held accountable for these types of issues. Nonetheless, these outcomes result in a lack of confidence in, and enthusiasm for, marketing research, which hurts the industry as a whole.

Data Quality is Everyone’s Problem

Whether you're the research provider, the sample provider, or the client, we all play a part in addressing this challenge.

First, checking the IP addresses in your sample is table stakes: every reputable company in the industry does this. At login, your research partner identifies and removes respondents whose IP addresses are located outside the area you are researching (for example, a non-U.S. IP address in a survey of U.S. residents) or are known to be associated with fraud and are therefore likely to be low quality. But few providers go much further than looking at IP addresses, and this is a mistake. You or your provider need to do more to ensure data quality.
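
As a rough illustration of the login-time IP check described above, here is a minimal Python sketch. The blocklist, the network-to-country table, and the function name screen_ip are all assumptions made for this example; real providers rely on commercial geolocation and fraud databases, and the addresses below are reserved documentation ranges, not real data.

```python
import ipaddress

# Hypothetical placeholder data (reserved documentation address ranges).
# A real provider would query a geolocation and fraud-scoring service.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
COUNTRY_BY_NETWORK = {
    ipaddress.ip_network("198.51.100.0/24"): "US",
    ipaddress.ip_network("192.0.2.0/24"): "DE",
}


def screen_ip(ip_text, target_country):
    """Return a rejection reason, or None if the address passes both checks."""
    ip = ipaddress.ip_address(ip_text)

    # Check 1: is the address inside a network flagged for fraud?
    if any(ip in network for network in BLOCKED_NETWORKS):
        return "known_fraud_ip"

    # Check 2: does the address geolocate to the country being researched?
    # Addresses missing from the table are treated as outside the target area.
    country = next(
        (c for network, c in COUNTRY_BY_NETWORK.items() if ip in network),
        None,
    )
    if country != target_country:
        return "outside_target_area"

    return None


# Usage: screen three logins for a survey of U.S. residents.
for address in ["198.51.100.7", "203.0.113.9", "192.0.2.5"]:
    print(address, screen_ip(address, "US"))
```

Treating unknown addresses as out of area is a conservative choice for this sketch; a provider might instead route them to manual review.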

Here are additional checks to identify low-quality survey responses:

Time zone checks: All devices are set to a specific time zone. Check whether the respondent's time zone is the same as the one you are researching. For example, if your respondent should live in the U.S. and the computer is set to an East Asia or European time zone, you may have a problem.
Duplicate respondents: Multiple completed surveys from the same respondent are low quality and should be removed from your dataset. Sample providers often check the survey device's digital fingerprint to prevent this type of fraud.
Length of time to complete the survey: When a respondent completes your survey much faster than the average respondent, there is a good chance the "speeder" is not delivering quality responses. Look at the average completion time for your completed surveys; anyone who finishes too quickly (for example, in less than half the average time) should be flagged and their data reviewed for quality.
Open-ended questions: While this may take some time for the researcher or the sample provider, the quality of open-ended responses can be critical to optimal results and must be judged by a human. Therefore, someone must check whether respondents give nonsense answers ("jfkjfkfjfk;") or give the same response to every open-end. If an open-end response makes it obvious that the respondent has no idea what they are talking about, you can remove them from the dataset.
Knowledgeable respondents: Include an open-ended question in your survey to verify the respondent actually knows something about the survey topic. For example, if you are surveying IT decision makers, ask an IT-related open-ended question.
Red-herring questions: Burying a fake response in a list of legitimate answers can quickly identify respondents who do not intend to provide quality information. (Question: Which of the following soft drinks have you purchased in the last month? Answer: Coke, Pepsi, Dr. Pepper, Mountain Dew, Rolling Thunder, Other) Respondents who answer these questions incorrectly are removed from the survey.
Trap questions: Like red-herring questions, trap questions direct the respondent to answer the question in a certain way. ("Select Do Not Agree as your answer to this question." or "Select Dog from this list of other animals.") Again, the survey should be programmed to end participation for respondents who answer incorrectly.
Bot questions: Researchers must also worry about bots completing surveys. Including a CAPTCHA, an image challenge where one has to pick particular frames, or a math question can weed out bots, which are, of course, automatically deleted from the survey.
Straight-lining grid questions: Respondents who answer every question the same in a grid are called straight-liners. They should be removed from your dataset.
Contradictory data: Answers that lack internal consistency, or are illogical, are signs of a respondent who is not being thoughtful in their responses or simply should not be taking your survey. Again, this requires human review (and most likely from the client or subject-matter expert). Removing these respondents improves overall data quality.
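
Several of the checks above (duplicates, speeders, straight-liners, and repeated open-ends) lend themselves to simple automated flags. The Python sketch below is one possible way to compute them, under stated assumptions: the function names, the data layout (dictionaries keyed by respondent ID), and the 0.5 speeder threshold are illustrative choices, not an industry standard. The flags mark respondents for review; judging whether an open-end is nonsense still needs a human.

```python
from statistics import mean


def flag_speeders(durations, fraction=0.5):
    """Return IDs whose completion time is below `fraction` of the mean.

    `durations` maps respondent ID -> seconds. The 0.5 default mirrors the
    "half the average time" example and is an assumption, not a rule.
    """
    cutoff = mean(list(durations.values())) * fraction
    return {rid for rid, secs in durations.items() if secs < cutoff}


def flag_duplicates(fingerprints):
    """Return IDs of every completion after the first per device fingerprint.

    `fingerprints` maps respondent ID -> fingerprint string, in completion
    order (Python dicts preserve insertion order).
    """
    seen, duplicates = set(), set()
    for rid, fingerprint in fingerprints.items():
        if fingerprint in seen:
            duplicates.add(rid)
        seen.add(fingerprint)
    return duplicates


def is_straight_liner(grid_answers):
    """True if a grid with more than one item got the same answer throughout."""
    return len(grid_answers) > 1 and len(set(grid_answers)) == 1


def repeats_open_ends(open_ends):
    """True if every open-end response is identical, ignoring case and spacing."""
    normalized = {text.strip().lower() for text in open_ends}
    return len(open_ends) > 1 and len(normalized) == 1


# Usage with made-up data: r4 finishes in 200 seconds against a 500-second mean.
durations = {"r1": 600, "r2": 560, "r3": 640, "r4": 200}
print(flag_speeders(durations))
print(flag_duplicates({"r1": "fpA", "r2": "fpB", "r3": "fpA"}))
print(is_straight_liner([4, 4, 4, 4, 4]))
print(repeats_open_ends(["good", "Good ", "good"]))
```

Note that speeders pull the mean down, so some teams prefer the median as the baseline; that is a design choice, not something the checks above require.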

To Remove or Not to Remove

Deciding to delete a respondent can be difficult when you are already having a hard time filling tough quotas. It is tempting to negotiate on a removal: Is one wrong answer enough to remove a respondent? Maybe they just let their attention slip momentarily?

There are three types of respondent deletions:

Automatic Deletes: Most of these are done with digital fingerprint software used by the sample provider (such as checking for troublesome IP addresses, incorrect time zones, and duplicate responses). These are typically done at login or in the screener, so these respondents should never make it into the actual survey. Respondents who fail trap, red-herring, or bot questions are normally removed immediately.
Judgment Deletes: Sometimes a respondent legitimately completes the survey faster than others, but not so fast as to be deleted. (Some people do read faster than others.) Straight-lining can also fall into this category--maybe a respondent really does agree strongly with these statements. If a respondent has given imprecise open-end responses throughout the entire survey, they might be eliminated. Researchers might insert 2-3 low-incidence questions in a survey ("Have you flown a plane?" and "Have you traveled to Antarctica?"), where one would have to answer the combination correctly to avoid being eliminated.
Expert Deletes: These deletes are based on the client's or subject-matter expert's judgment. Poor open-end responses, lack of internal consistency, and decisions about borderline speeding (faster than average, but not so fast as to be removed automatically) usually fall into this category.
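
The three deletion types can be read as a triage rule over whatever quality flags a respondent has accumulated. The Python sketch below shows one way to express that, with loud caveats: the flag names, the tier assignments, and the priority order are assumptions for illustration. The low-incidence helper also assumes one particular reading of that check, namely that claiming more than one rare experience is implausible.

```python
# Hypothetical flag names, grouped by the three deletion types.
AUTOMATIC = {"fraud_ip", "wrong_time_zone", "duplicate",
             "failed_trap", "failed_red_herring", "failed_bot_check"}
EXPERT = {"poor_open_end", "inconsistent_answers"}
JUDGMENT = {"borderline_speeder", "straight_liner",
            "imprecise_open_ends", "implausible_low_incidence"}


def implausible_low_incidence(answers, max_yes=1):
    """True if a respondent says yes to more low-incidence items than `max_yes`.

    `answers` maps question text -> bool. The `max_yes=1` default is an
    assumption; set it to match how rare the behaviors really are.
    """
    return sum(answers.values()) > max_yes


def triage(flags):
    """Map a set of quality flags to the most severe applicable action."""
    if flags & AUTOMATIC:
        return "automatic_delete"
    if flags & EXPERT:
        return "expert_review"
    if flags & JUDGMENT:
        return "judgment_review"
    return "keep"


# Usage: a respondent who claims both rare experiences and straight-lined a grid.
flags = {"straight_liner"}
if implausible_low_incidence({"flown a plane": True, "traveled to Antarctica": True}):
    flags.add("implausible_low_incidence")
print(triage(flags))
```

Only the automatic tier removes a respondent outright here; the other two tiers return a review label, which keeps the final call with a person.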

As we said before, data quality is everyone’s problem – but it is not one we have to live with. It will take a team effort to fix this problem. Clients and marketing research companies should do their due diligence by writing a good screener and questionnaire and including the checks suggested above. Sample providers should take a no-nonsense approach when cheaters are caught in their panel and remove them immediately to prevent a future problem. Clients should demand better scrutiny of the data and reporting on sample efficiency from their providers. Providers should be transparent and candid about their low-quality respondents and advise clients on how to improve the data quality they are getting with their questionnaires. Together, we can address this issue – and everyone will benefit.