Observational Studies, Surveys, Quasi-Experiments, and Experiments

Across the sciences, researchers use a spectrum of tools or “instruments” to collect information and then make inferences about human preferences and behavior. These tools vary in the degree of control the researcher has traditionally had over the conditions of data collection. Surveys are one such instrument. Though widely used across social science, business, and even computer science (as user studies), surveys are known to have bugs. Although there are many tools for designing web surveys, few address known problems in survey design.

These instruments have also traditionally varied in their media and the conditions under which they are administered. Some tools we consider are:

Observational studies: Allowing no control over how data are gathered, observational studies are analogous to data mining – if the information is not readily available, the researcher simply cannot get it.

Surveys: The next best approach is to run a survey. Surveys have a similar intent to observational studies, in that they are not meant to have an impact on the subject(s) being studied. However, surveys are known to have flaws that bias results. These flaws typically relate to the language of individual survey questions and to the structure and control flow of the survey instrument itself.

True Experiments: If a researcher is in the position of having a high degree of control over all variables of the experiment, they can randomly assign treatments and perform what is known as a “true experiment”. These experiments require little modeling, since the researcher can simply use hypothesis testing to distinguish between effect and noise (see the sketch after this list).

Quasi-Experiments: Quasi-experiments are similar to true experiments, except that they relax some of the requirements of true experiments (most notably random assignment) while still typically aiming to understand causality.
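To make the true-experiment case concrete, here is a minimal sketch of what it boils down to computationally: randomly assign units to treatment or control, then use a standard hypothesis test to decide whether the observed difference is an effect or just noise. Everything below is simulated; the effect size, noise level, and group sizes are invented for illustration.

```python
# Minimal sketch of a "true experiment": random assignment followed by a
# two-sample t-test to distinguish a treatment effect from noise.
# All data here are simulated; the effect size and noise are invented.
import random
from scipy import stats

random.seed(0)
participants = list(range(200))                 # hypothetical participant IDs
random.shuffle(participants)                    # random assignment
treatment, control = participants[:100], participants[100:]

def measure(group, effect):
    """Simulated outcome: baseline 5.0 plus an (invented) effect and noise."""
    return [random.gauss(5.0 + effect, 1.0) for _ in group]

treated_scores = measure(treatment, effect=0.5)
control_scores = measure(control, effect=0.0)

t_stat, p_value = stats.ttest_ind(treated_scores, control_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # small p: likely a real effect
```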

In the past, there has been little fluidity between these four approaches to data collection, since the media used to implement each was dramatically different. However, with the proliferation of data on the web and the ease of issuing questionnaires on platforms such as Facebook, SurveyMonkey, and Mechanical Turk, the implementation of these studies of human preferences and behavior has come to share many core features.

Despite similarities between these tools, quality control techniques for experiments have been largely absent from the design and deployment of surveys. There has been an outpouring of web tools and services for designing and hosting web surveys, aimed at non-programmers. While there are some tools and services available for experiments, they tend to be domain-specific and targeted to niche populations of researchers. The robust statistical approaches used in experimental design should inform survey design, and the general, programmatic approaches to web survey design should be available for experimental design.



Surveys: A History

What is a survey?

Everyone has seen a survey – we’ve all had the customer satisfaction pop-up on a webpage, or have been asked by a college student PIRG worker to answer some questions about the environment. We tend to think of surveys as a series of questions designed to gauge opinion on a topic. Sometimes the answers are drawn from a pre-specified list of options (e.g., the so-called Likert scales); sometimes the response is free-form.

What distinguishes surveys from other, similar instruments is that surveys (a) typically return a distribution of valid responses, and (b) are observational.

Or rather, surveys are supposed to be observational. That is, surveys are meant to reveal preferences or underlying assumptions, behaviors, etc. and not sway the respondent to answer in one way or another. There are some similar-looking instruments that are not meant to be observational. Some of these instruments fall under the umbrella of what’s called an “experiment” in the statistics literature.

Why surveys?

Perhaps in the future we won’t have a need for surveys anymore – all of our data will be floating around on the web, free to anyone who wants to analyze it. If, after all, surveys are really just observational studies, we should be able to simply apply some clustering, learn a model, do some k-fold validation, etc.

There are many problems with attempting to just use data available in the wild. First of all, though we may be in the era of “big data,” there are plenty of cases where the specific data you want is sparse. A worse situation is when the sparsity can be characterized by a Zipfian distribution – depending upon how you set up your study and what your prior information is, it’s possible that you will never sample from the tail of this distribution and may never know that the tail exists. This leads us to a second problem with simply mining data: we cannot control the conditions under which the data is obtained. When conducting a survey, researchers typically use probability sampling (the popularity of convenience sampling for web surveys will be discussed in a later post). This allows them to adequately estimate the denominator and to estimate error due to people opting out of the survey (so-called “unit nonresponse”).
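As an illustration of the sparsity problem, here is a small simulation (the number of categories and the sample size are invented) showing how a sample of “found” data drawn from a Zipf-like distribution can miss the tail entirely:

```python
# Simulated illustration of the sparsity problem: when category frequencies
# follow a Zipf-like distribution, a modest sample of "found" data can miss
# the tail entirely, so we never even learn those categories exist.
# All sizes here are invented.
import numpy as np

rng = np.random.default_rng(0)
n_categories = 1000
ranks = np.arange(1, n_categories + 1)
probs = (1.0 / ranks) / (1.0 / ranks).sum()     # Zipf-like category probabilities

sample = rng.choice(n_categories, size=500, p=probs)
seen = np.unique(sample)
print(f"categories observed: {len(seen)} of {n_categories}")
print(f"deepest rank observed: {seen.max() + 1}")   # the long tail rarely shows up
```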

Finally, conducting a survey explicitly, rather than mining data, gives the researcher a more complete view of the context of the survey. As we will discuss later, understanding context is critical, since context can lead to unpredictable responses.

A brief history of survey modes

While simple surveys such as a census have been around for thousands of years, the customer satisfaction or political survey of today is a more recent development. Market research and political forecasting are products of capitalism and require access to resources to conduct and make use of surveys. Survey research is intrinsically tied to the technologies used to conduct that research and to the statistical methodologies that are available and understood at the time the survey is conducted. Before mail service, a survey would have to have been conducted in person. Although random sampling is a very old idea, it was not until Laplace that tight bounds were calculated on the number of samples needed to estimate a parameter of a population.

Centralized mail service helped lower the cost of conducting surveys. The response time for surveys was lowered with widespread adoption of telephones. Mail and telephone surveys dominated survey modes in the latter half of the 20th century. Since landline telephones are associated with an address, these now-traditional survey modes relied on accurate demographic information. The introduction of the World Wide Web and increasingly wide-spread use of cellular phones prompted survey designers to reconsider traditional instruments in favor of ones better suited to growing technologies.

A 2002 paper from the RAND Corporation describes growing interest among researchers in using the Web to conduct surveys. The paper addresses the assertion that internet surveys have higher response rates than traditional mail or telephone surveys. The authors found this not to be the case, except for technologically savvy populations (e.g., employees of Bell Labs). However, they noted that the web would only become more pervasive, and they recommended that survey designers keep it in mind.

The RAND paper describes web surveys not as web forms, but simply as paper surveys distributed online. The surveys were sent over email, in a model that exactly mirrors mail surveys. The authors noted that spam could become an issue over time, and specifically suggested that the populations with higher response rates to emailed surveys may have been those with a lower junk-to-relevant ratio for email than for snail mail.

The view of web surveys from twelve years ago predated “web 2.0”. It also predated the widespread use of cellular phones. Six years ago Pew Research published an article on the growing proportion of cell-phone-only households. They found that this population still only comprised a small proportion of the total US population, and so for polls that targeted the entire US population, unit nonresponse from cell phone users could be explained by typical error estimates. However, cell phone users were found to have a distinct population profile from the total US population. Therefore, any stratified sampling needed to take cell phone users more seriously.
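To make the stratification point concrete, here is a toy post-stratification calculation; the population shares and sample counts are invented, and the point is only the reweighting arithmetic:

```python
# Toy post-stratification: reweight respondents so the cell-only stratum
# matches its (hypothetical) share of the population. All numbers invented.
population_share = {"cell_only": 0.15, "landline": 0.85}   # assumed targets
sample_counts    = {"cell_only": 40,   "landline": 960}    # assumed sample

n = sum(sample_counts.values())
weights = {
    stratum: population_share[stratum] / (sample_counts[stratum] / n)
    for stratum in sample_counts
}
print(weights)   # cell-only respondents are upweighted (3.75x in this example)
```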

Why Web Surveys?

As technology changes, the mode used to collect survey information changes. Clearly the rising use of smartphones makes web surveys increasingly attractive for researchers.

On top of the obvious appeal of being able to reach more people, web surveys afford researchers unique advantages that other modes do not allow, or that are prohibitively expensive to implement. Web surveys do not require people to administer them (while many organizations use automated calling services for phone surveys, there is still an associated cost for the service, as well as growing discontent with robocalls – okay, that still supports argument one).

Web surveys allow for rapid design modification, cheap pilot studies, and (what we believe to be most important) the ability to control for known problems in survey design. What problems could there be, other than not being able to reach people? Consider the dominant view of survey design:

The goal is to present a uniform stimulus to respondents so that their responses are comparable. Research showing that small changes in question wording or order can substantially affect responses has reinforced the assumption that questions must be asked exactly as worded, and in the same order, to produce comparable data. (Martin 2006)

We believe that this view -- this static view -- of survey design leads to overly complicated models and cumbersome statistical analyses that arise solely from being restricted to post-hoc data analysis. Our view is that, since there are so many variables that may affect the outcome of a particular survey response, we should not try to control for everything, since controlling for everything is impossible. Instead, we randomize aspects of the instrument over a population, promote a "debug phase" akin to pilot studies, and encourage easy replication experiments. If we can reduce known biases in survey design to noise, we can perform more robust analyses. And what do more robust analyses give us? Better science! Who wouldn't want that?
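As a rough sketch of the randomization idea (the questions and wording variants below are invented, and this is not the SurveyMan implementation itself): each respondent sees an independently shuffled question order and a randomly chosen wording variant, so order and wording biases spread out as noise over the population.

```python
# Sketch of the randomization idea: each respondent gets an independently
# shuffled question order and a randomly chosen wording variant.
# The questions and variants below are invented examples.
import random

survey = [
    {"id": "q1", "variants": ["How often do you recycle?",
                              "How frequently do you recycle?"]},
    {"id": "q2", "variants": ["Do you own a car?"]},
    {"id": "q3", "variants": ["How far is your commute?",
                              "How long is your commute?"]},
]

def instance_for(respondent_seed):
    """Return one randomized instance of the survey for a single respondent."""
    rng = random.Random(respondent_seed)
    order = survey[:]                 # copy so the master survey stays fixed
    rng.shuffle(order)                # randomize question order
    return [(q["id"], rng.choice(q["variants"])) for q in order]

print(instance_for(1))
print(instance_for(2))               # different order and wording per respondent
```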



SurveyMan's debut

Dear Internet Diary,

This past weekend, we presented the SurveyMan work for the first time, at the Off the Beaten Track workshop at POPL. I first want to say that PLASMA seriously represented. We had talks in each of the sessions. Though I didn't have the chance to see Charlie's talk on Causal Profiling, Dan said it definitely engendered discussion and that people in the audience were "nodding vigorously" in response to the work. Dimitar presented Data Debugging, which people clearly found provocative.

I was surprised by the audience's response to my talk; I know Emery had said that people whom he talked to were excited about this space, but sometimes that's hard to believe when you're a grad student chugging away at the implementation and theory behind the work. It was invigorating to be able to describe what we've done so far and hear enthusiastic feedback. In all my practice talks, I had focused on the language itself, but for OBT, at the behest of my colleagues, I took the debugging angle instead. Most of the people in the audience had used surveys for their research and were quite familiar with these problems. While language designers have tried to tackle surveys before, they frequently come from the perspective of embedding surveys in a language *they* already use. The approach we take leverages tools that our target audience uses. We limit the expressivity of the language and make statistical guarantees, which is what our users care about the most.

I had a few really interesting questions about system features. Someone made the point that bias cannot be entirely removed through redundancy -- that we can't know if we've found enough ways of expressing a question to control for the underlying different interpretations. In response, I suggested that we could think about using approaches from cross-language models to determine whether we have categorically the same questions. The idea is that if a set of questions produces the same distribution of responses, it is sufficiently similar. Of course, this approach neglects the non-local effects of question wording. Whether or not this can be controlled through question order randomization is something I'll have to think about more.
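One way to operationalize "produces the same distribution of responses" (my choice of test here, not something settled in the talk) is a chi-squared test of homogeneity over the variants' response counts; the Likert counts below are invented for illustration.

```python
# One way to check "same distribution of responses" across wording variants:
# a chi-squared test of homogeneity over response counts.
# The Likert counts below are invented for illustration.
from scipy.stats import chi2_contingency

# Rows: wording variants; columns: counts for Likert options 1..5.
variant_a = [12, 30, 45, 28, 10]
variant_b = [14, 27, 48, 25, 11]

chi2, p, dof, _ = chi2_contingency([variant_a, variant_b])
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
# A large p is consistent with the two wordings eliciting the same distribution;
# a small p flags a wording effect worth inspecting.
```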

As a followup question, I was also asked if we could reverse-engineer the distributions we get from the variants to identify different concepts. This was definitely not something I had considered before. I wasn't sure we would, in practice, have sufficient variants and responses to produce meaningful results, but it's something to consider as future work.

A lot of the other questions I got were about features of the system that I did not highlight. For example, I did not go into any detail about the language and its control flow. I was also asked whether we were considering adding clustering and other automated, domain-independent analyses, which I am working on right now. Quite a few of the concerns are addressed by our preference for breakoff over item nonresponse. There was also an interesting ethics question about using our system to manipulate results. Of course, SurveyMan requires active participation from the survey designer; the idea is not to prevent the end-user from adding bias, but to illuminate its presence.
