# Metrics to answer our research questions

We had a number of excellent questions and comments from our reviewers. I've been working on a few entries to address them.

One of our reviewers pointed out that we didn't provide metrics for our research questions. Another pointed out that our results were qualitative, but that this was okay because we provided case studies, rather than user studies. Since we'd like to be able to do user studies in the future (hopefully the exposure we'll get at OOPSLA will lead to more collaborations!), it's certainly worth thinking about.

### Research Question 1: Is SurveyMan usable by survey authors and sufficiently expressive to describe their surveys?

**Usable by survey authors?** We think so. Certainly our colleagues in linguistics didn't have any problems with the CSV structure and the PL-ness of it. We were warned a bit more by our colleagues in econ. The argument I've heard is that (a) people are accustomed to flashy interfaces and would be put off by using something text-based and (b) the formatting of the data, both the input and output, is counter-intuitive. I'm not entirely sure this is true. First of all, people have been using SAS and SPSS for years. When I took my first stats course in Spring 2004, we used one of those tools (too long ago to remember which). I remember the painstaking data entry involved. The summaries it spit out, full of text tables (no graphics), weren't the easiest to interpret. This was the standard tool for statistical analyses in the social sciences at the time (I was an econ major then).

This project has been a challenge *because* less-powerful, flashy alternatives exist. Our colleagues in linguistics liked the csv idea because they were already storing their data in csvs (since they're easy to load into R and since they can -- and already were -- programmatically generate the data). It's unclear whether people in fields that are used to generating these things by hand will be as receptive. There's no reason I can see why someone wouldn't be able to write in our csv language, but drag and drop doesn't require reading a spec, and it's what most commercial services are using.

While I think the language is usable as-is, it would be more usable if we had a spreadsheet plugin to statically check the thing. How would we measure the usability of the tool, empirically?

I think one way of evaluating its usability would be to gather a bunch of surveys and recruit people for a user study. A traditional, offline version of this might go as follows:

- Generate a list of available survey tools.
- Recruit a representative sample of participants across the social sciences (probably other graduate students at UMass).
- Administer a survey to those graduate students to learn which survey technologies they've used before, whether they are familiar with any programming languages, and whether they have any experience with spreadsheets and/or databases.
- Ask the participants to take a survey specification and design a series of surveys with various features on one of the competing survey software packages (that they had not used before) and have them do the same tasks with SurveyMan.
- Perform a post-treatment survey to ascertain results.

We would measure the time it took to complete the task, the number of clarifying questions the participants ask, and the perceived ease of the task. It might be a good idea to have the post-study follow-up survey about a week later.

Some design features we might want to consider are whether we assign the SurveyMan task first, how much prior experience with programming languages or survey software acts as training, and how we present the survey tasks. If we give the participants csvs of questions and just ask them to manipulate the data, we will be biasing them to a more favorable view of SurveyMan. We might instead give them a variety of inputs, such as a csv, a pdf, an English description of a survey we might want to run, and a followup survey based on that English description.

**Expressive enough to describe the target population's surveys?** Yes. We can certainly express all of the surveys that our clients wanted to be able to express, and in some cases we were able to express more. One interesting aspect of our collaborations is that we've discovered that some things we've been calling surveys are actually experiments. While we can describe them using our language, we currently lack the support in our runtime system to perform the kinds of analyses we'd like to see. This has caused us to pursue work in a new direction.

I think the best way to address this question would be with a list of features describing what it means to be a survey. I've been thinking about the difference between experiments and surveys a lot recently and a checklist like this would be illuminating.

### Research Question 2: Is SurveyMan able to identify survey errors?

We think so. We use standard statistical methods to identify wording bias and question order bias.

For breakoff, we report the distribution of breakoff over questions and positions. Diagnosing whether we have an unusual amount of breakoff requires a prior over comparable surveys, so we don't attempt to define a threshold for whether the amount of breakoff is statistically significant.

We could model a small amount of noise that says, "if this survey is well-formed, every respondent will have equal probability of breaking off at any point before the final question." We could say that there is some small $$\epsilon$$ representing this probability, which is independent of any features of the survey. Let $$r$$ represent our sample size. Then the expected number of respondents who broke off from this survey would be $$r\epsilon$$. If the total number of respondents who broke off before the final question is unusually high, then the survey author would need to debug the survey to figure out why so many respondents were leaving early.
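To make "unusually high" concrete, here is a minimal sketch. It assumes, beyond what the model above states, that respondents break off independently of one another, so that total breakoff follows a Binomial($$r$$, $$\epsilon$$) distribution. The function name and the numbers are illustrative only and are not part of SurveyMan.

```python
from math import comb


def binomial_upper_tail(r: int, eps: float, observed: int) -> float:
    """P(X >= observed) for X ~ Binomial(r, eps).

    Assumption: each of the r respondents breaks off independently
    with probability eps before the final question.
    """
    return sum(
        comb(r, k) * eps**k * (1 - eps) ** (r - k)
        for k in range(observed, r + 1)
    )


# Illustrative values, not taken from any of our surveys.
r, eps = 100, 0.05
print("expected breakoff:", r * eps)
print("P(at least 12 break off):", binomial_upper_tail(r, eps, 12))
```

A small tail probability would be a signal to go debug the survey. Choosing $$\epsilon$$ is the hard part, since it still needs a prior over comparable surveys.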

We're clearly interested in where respondents were breaking off, and not just that they were breaking off. It's possible that the survey is uniformly poorly designed; it's also possible that it's too long or that there is some point in the survey (such as a transition from one block to another) that causes an unusual amount of breakoff. For a flat survey having $$n$$ questions, every position is identical insofar as the set of questions that may be seen there. In such a case, we might want to model breakoff as a series of Bernoulli trials -- at each point, a respondent may break off with probability $$\frac{\epsilon}{n-1}$$. We would then expect there to be $$\frac{\epsilon}{n - 1} \times r$$ people breaking off at each position.

The problem with this model is that although the probability of breaking off at any given point is the same, the number of participants still in the survey is not, *because a single person cannot break off twice*. First, let's model this as a recurrence. Let $$B_i$$ be the random variable denoting the number of respondents who break off at position $$i$$. Then we get the recurrence for its expectation:

$$\mathbb{E}(B_1) = \frac{r\epsilon}{n-1}$$

$$\mathbb{E}(B_{i<n}) = \frac{\epsilon}{n-1}(r - \sum_{k=1}^{i-1}\mathbb{E}(B_k))$$

Let's see if we can do better by substituting in values:

$$\mathbb{E}(B_2) = \frac{\epsilon}{n-1}\biggl(r - \frac{r \epsilon}{n-1}\biggr) = \frac{r\epsilon}{n-1}\biggl(1 - \frac{\epsilon}{n-1}\biggr)$$

$$\mathbb{E}(B_3) = \frac{\epsilon}{n-1}\biggl(r - \frac{r\epsilon}{n-1} - \frac{r\epsilon}{n-1}\Bigl(1 - \frac{\epsilon}{n-1}\Bigr)\biggr) = \frac{r\epsilon}{n-1}\biggl(1 - \frac{\epsilon}{n-1}\biggr)^2$$

Note that the sum in the recurrence has to include every earlier position, not just the previous one. The pattern suggests:

$$\mathbb{E}(B_{i<n}) = \frac{r\epsilon}{n-1}\biggl(1 - \frac{\epsilon}{n-1}\biggr)^{i-1}$$

To check this guess against the recurrence, recall the closed-form formula for a geometric series, shamelessly ripped off Wikipedia:

$$\sum_{k=0}^{n-1}ar^k = a\frac{1-r^n}{1-r}$$.

Since we're overloading variable names, let's be clear about what represents what: $$a_{wiki} \equiv \frac{r\epsilon}{n-1}$$, $$r_{wiki}\equiv 1 - \frac{\epsilon}{n-1}$$, $$n_{wiki}\equiv i - 1$$. That means the expected breakoff over the first $$i-1$$ positions is $$\sum_{k=1}^{i-1}\mathbb{E}(B_k) = \frac{r\epsilon}{n-1}\times\frac{1 - \bigl(1 - \frac{\epsilon}{n-1}\bigr)^{i-1}}{\frac{\epsilon}{n-1}} = r\biggl(1 - \bigl(1 - \frac{\epsilon}{n-1}\bigr)^{i-1}\biggr)$$. Plugging that back into the recurrence leaves $$\frac{\epsilon}{n-1}\times r\bigl(1 - \frac{\epsilon}{n-1}\bigr)^{i-1}$$, which matches the guess. This is less formidable than it could have been: a respondent breaks off at position $$i$$ only after surviving the $$i-1$$ positions before it, which is just the geometric distribution. I'm pretty sloppy with the algebraic manipulation, so I welcome any spot-checking.
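One way to spot-check the algebra is to evaluate the recurrence numerically and compare it against a simulation of the process it describes. The sketch below does both. The function names, the parameter values, and the assumption that respondents behave independently are mine, for illustration only.

```python
import random


def expected_breakoff(r: int, eps: float, n: int) -> list[float]:
    """Evaluate E(B_1), ..., E(B_{n-1}) directly from the recurrence."""
    p = eps / (n - 1)  # per-position breakoff probability
    expected: list[float] = []
    for _ in range(n - 1):
        remaining = r - sum(expected)  # respondents not yet broken off
        expected.append(p * remaining)
    return expected


def simulate_breakoff(r: int, eps: float, n: int,
                      trials: int = 500, seed: int = 0) -> list[float]:
    """Average breakoff count per position over simulated surveys.

    Assumption: each respondent independently breaks off with
    probability eps / (n - 1) at each position, at most once.
    """
    rng = random.Random(seed)
    p = eps / (n - 1)
    totals = [0] * (n - 1)
    for _ in range(trials):
        for _ in range(r):
            for pos in range(n - 1):
                if rng.random() < p:
                    totals[pos] += 1
                    break  # a single person cannot break off twice
    return [t / trials for t in totals]


# Illustrative values only.
for i, (e, s) in enumerate(zip(expected_breakoff(100, 0.2, 5),
                               simulate_breakoff(100, 0.2, 5)), start=1):
    print(f"position {i}: recurrence {e:.3f}, simulated {s:.3f}")
```

If the two columns disagree by more than sampling noise, either the recurrence or the simulation has a bug.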

So what does this mean? We have a formula for the expected number of respondents who break off at each index. Now we can at least use the Markov inequality to flag really egregious positional breakoff. Since the Markov inequality is very coarse, I have little doubt that researchers would be able to spot the cases it would find. That is, Markov doesn't give us much value for small data sets (note that it's still useful as a feature in an automated tool).
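As a sketch of what such a flag could look like: Markov's inequality says that for a nonnegative count, $$\Pr(B_i \geq a) \leq \mathbb{E}(B_i)/a$$. Setting $$a = \mathbb{E}(B_i)/\delta$$ therefore bounds the chance of a false flag by $$\delta$$. The function name, the choice of $$\delta$$, and the counts below are hypothetical.

```python
def markov_flags(observed: list[int], expected: list[float],
                 delta: float = 0.05) -> list[int]:
    """Return the 1-based positions whose observed breakoff count is at
    least expected / delta. By Markov's inequality, a nonnegative count
    reaches that threshold with probability at most delta.
    """
    return [
        i
        for i, (obs, exp) in enumerate(zip(observed, expected), start=1)
        if exp > 0 and obs >= exp / delta
    ]


# Made-up counts for a survey with 100 respondents.
expected = [5.0, 4.75, 4.5]
observed = [6, 40, 3]
print(markov_flags(observed, expected, delta=0.25))  # threshold is 4x the expectation
print(markov_flags(observed, expected, delta=0.05))  # threshold is 20x the expectation
```

With $$\delta = 0.05$$ the threshold is twenty times the expected count, which shows how coarse the bound is: with these numbers nearly every respondent would have to quit at one position before it fires.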

In order to get tighter bounds, we are going to need to know more about the distribution. We can use Chebyshev's inequality if we know the variance. We can get a Chernoff-like bound by applying Markov's inequality to the moment generating function, if we have it.
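For reference, here are the three bounds in their standard textbook forms, written for the breakoff count $$B_i$$ and a threshold $$a > 0$$. Nothing here is specific to SurveyMan; the variance and the moment generating function are the pieces we'd still need to work out.

Markov: $$\Pr(B_i \geq a) \leq \frac{\mathbb{E}(B_i)}{a}$$

Chebyshev: $$\Pr\bigl(|B_i - \mathbb{E}(B_i)| \geq a\bigr) \leq \frac{\mathrm{Var}(B_i)}{a^2}$$

Chernoff: $$\Pr(B_i \geq a) \leq \min_{t > 0} e^{-ta}\,\mathbb{E}\bigl(e^{tB_i}\bigr)$$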

I think this process is similar to a martingale, but I will have to spend more time thinking about this. The total expected breakoff should still be approximately $$r\epsilon$$ when $$\epsilon$$ is small. The number of participants in the survey is expected to decay. It would be great to get some tight bounds on this. Once we can characterize breakoff for flat surveys, we have a prior on the role position plays and can use this information for diagnosing breakoff in more structured surveys. I haven't considered an equivalent analysis for question-based breakoff yet.

For now, in the analyses, we just stick with reporting the top locations and questions for abandonment/breakoff.
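A report like that amounts to a pair of frequency tables. Here is a minimal sketch, assuming a hypothetical record format of one `(position, question_id)` pair per respondent who broke off. This is not SurveyMan's actual output format.

```python
from collections import Counter


def top_breakoff(records: list[tuple[int, str]], k: int = 3):
    """Return the k most common breakoff positions and questions.

    Each record is (position, question_id) for the last question a
    respondent answered before abandoning the survey.
    """
    by_position = Counter(pos for pos, _ in records)
    by_question = Counter(q for _, q in records)
    return by_position.most_common(k), by_question.most_common(k)


# Made-up records for illustration.
records = [(1, "q3"), (1, "q7"), (2, "q3"), (5, "q3")]
positions, questions = top_breakoff(records, k=2)
print("top positions:", positions)
print("top questions:", questions)
```

Keeping the two tallies separate matters for randomized surveys, where the same question can land at different positions for different respondents.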

### Research Question 3: Is SurveyMan able to identify random or inattentive respondents?

I am working on a more in-depth analysis to help answer this question. It's going to need a whole blog post to itself. We have some gold-standard data from Joe and company for the phonology survey, and there are some heuristics I can use for the prototypicality survey. Graphs forthcoming!