This post follows directly from my previous post on correctness. Rather than dive directly into the details of correctness for surveys, I wanted to spend some time thinking about some of the things we might want to prove, show or derive first.

We've stated that one way to define what we mean by survey correctness is to say that the same survey (modulo randomization) administered to the same person, under the same set of conditions, will yield the same set of results. Let's start by assuming that it is possible to repeat the survey for a respondent under the exact same conditions. Then there is is some underlying gold standard for the survey. While we cannot repeat the survey for the same set of respondents, under the exact same conditions in practice, we can emulate this scenario with our simulator. We choose a set of answers, according to a particular profile and let this be the gold standard. The simulated respondent then chooses a response according to the map from questions to answers. Once the survey is complete, we compare the results returned by the respondent with our generated gold standard. The question-answer sets for the gold standard and the returned values should always be equal.

At first, it may seem that this definition of equality is trivial. We have unique identifiers for each question and answer option that are created directly from their data's location in the input csv. The response form uses these ids. One question is displayed at a time and no questions ever repeat (we'll prove this later). We simply check the id of the question currently displayed, look up its answer in our response map, and select the option(s) that has/have matching identifier(s) to our gold standard's responses. So long as the HTML is correct, we will get back the same id, right?

Wrong. Well, a little wrong. If your survey does not use variants, you're fine. If it does, here's the problem: the variant is chosen at runtime i.e. when the respondent is actually executing the program (taking the survey). Since we cannot divine which question will be shown, we need a different way of deciding the appropriate answer.

Solution 1: Specify the random choice at compile-time, rather than run-time.

This is certainly a compelling argument, and it solves our problem of not knowing what question will be displayed at runtime. When we began work on SurveyMan, we generated all of our surveys statically. All of the randomization was done in Java. We produced the full HTML to be displayed and sent this over as the payload. We didn't have much in the way of an interpreter; we just had some jQuery running in document.ready that hid all question divs and then showed the first question. We were not playing with breakoff at this point, so a next and a previous button were always displayed (except for the first and last questions). We simply swapped which divs were visible and which weren't.

Practical issues We found considerable issues with this approach, enumerated below:

• AMT limitations While this may change at any time, historically there haven't been many degrees of freedom when deciding assignments of workers to tasks on AMT. Since our primary source of respondents was AMT, we had to keep this in mind. Since each survey was static HTML, we had to send each over in its own HIT. In the past, this meant that we would post some number of HITs at a time. The easiest way to do this was to post a survey, wait for a response, validate that response, issue "reverse" qualifications, and then post some more. The "reverse" qualification thing is an Automan technique to disqualify people who had already responded to that HIT under a different randomization.

There are some logistical issues with this approach. Posting one HIT that has one assignment at a time means that we incur significant latency in our response period. We cannot take advantage of the massive parallelism of the crowd. If we instead post the HITs in batches, we have to hedge against workers who will answer some number of HITs in a row before the reverse qualification kicks in. We end up paying for redundant collection, as well as incurring additional latency (due to the workers who completed the HIT more than once). Repeaters are a major threat to this approach -- AMT groups HITs in such a way that similar HITs appear together. The researcher can get around this automated grouping by creating a new group id and putting each HIT in its own group. Now we are back to the original latency problem, because we have one HIT wonders floating around in the MTURK ether.

Note that the AMT system is designed as follows: Requesters may either specify one HIT that has many assignments (redundancy), or they may specify many HITs that will be grouped together with the express purpose of making it easier for one person to perform many HITs. We would like to be able to take advantage of the former, and posting static HTML does not allow this.

• Page Load Time The prototypicality survey had almost 100 questions. There was an obvious lag in loading on several browsers; the contents flashed before hiding the appropriate divs. We could have denoted these divs as hidden, but we would still have the lag of loading in a large file. The problem was even worse when we loaded in media content. It was clear from this alone that some of the content would have to be loaded on an as-needed basis.
• Payload size AMT has a maximum payload size. I can look it up, but needless to say, we hit it with one of the linguistics surveys. I believe that survey had approximately 200 questions -- they were similar in nature to the phonology questions. The true payload would have been smaller, since that survey was designed to be sampled from. However, there will be cases where survey writers are going to want to simulate an infinite stream of survey questions. We would like to be able to send over as large a survey as possible, and this is just not going to happen if we send over a static webpage. Now, one could argue that the cost of the payload was really in the redundant HTML we were sending over -- we could easily have swapped out the question and option text, while still specifying the full set of questions to be seen a priori. This is a fairly compelling argument that adequately addresses both the page load time, as well as the payload size.
• Opportunities for exploitation It's very easy for someone to write a script that, for instance, selects every radio button answer and hits submit. In fact, if we have a submit button sitting around anywhere, we have a problem. We could dynamically generate all buttons as needed, while still sending over the statically chosen questions. The solution at this point would be to send over minimal HTML and instead swap in the data that we need, as needed. All buttons and control flow through the survey would be controlled by a small state machine. This would prevent respondents from writing a script to click on all input of a given type and submit. It's certainly still possible to write a "bot" to submit a form when inputs are generated dynamically, but it will now require the script to simulate the survey, which is more effort. Furthermore, the user has no guarantee that the script will terminate (although we know it will).
• Opportunities to "cheat." Besides automated "cheating," it's also possible for users to see the set of questions being asked, simply by looking at the source code. If we store the questions and option text in a lookup table, they become significantly more difficult for a human to interpret -- while they can still see what the questions are asking, human adversaries will not have the full picture. Even though there is still room for adversarial action, we hope that the increased indirection of dynamically generating the questions increases the complexity of the problem and deters adversaries from gaming the system.

The above arguments make a strong case for generating the survey dynamically. The first point is the strongest practical reason for performing randomization at runtime; for the latter points, randomizing at runtime gives us a nice defense against some adversarial behavior, but it isn't particularly compelling.

A major advantage to randomizing at runtime is that it is fundamentally more principled than randomizing a priori. Our randomization mechanism can also be framed as a an issue of nondeterministic choice -- in the case of variants, we have choose one of a set of these variants. In the case of the total set of questions, it is one of the set of all possible permutations. In principle, we can say that the behavior of the machine at runtime is the same for all permutations of the survey, since the machine itself is selecting which permutation. All equivalent surveys are statically equivalent, with only the runtime behavior differing. This will make certain properties easier to prove later on.

Solution 2: Mark every question in a variant block with equivalent answers

Recall that the problem we're trying to solve is how to define equivalence between questions. The idea behind fixing the random order at compile time was so that our two parallel universe survey takers would end up with the exact same answers. Even if you don't buy the argument to randomize at runtime, rather than randomizing at compile-time, there's still the problem of comparing across different respondents who see different variants.

A new pass solution might be to answer every question in a variant's block. However, if we do this, we will have answers for every question in that block sitting around in our answer set. The resulting set of question-answer pairs will not be equal to the gold standard set.

We could try to get around this shortcoming by removing the questions that weren't seen. This doesn't quite work either. One of the things we're trying to show is that, not only are the answers consistent (this is the part of the correctness argument that addresses bugs that are unique to surveys), but also that the same set of questions are seen (this is the part of the correctness argument that will address the more traditional programming languages issues in our design). If we just remove the other questions, we run into a dangerous situation where we are adding and removing data in the course of our correctness argument, which will make some things harder to prove. Furthermore, we will need to wait until runtime to decide whether two answer sets are equivalent.

Solution 3: Define a set of equality relations for our objects of interest

Thus far our solutions have been computational in nature -- their correctness is defined by their implementation. We'd really like a more general solution that will allow us to reason about a survey program from first principles. We would also like a solution that can be applied to a broader swatch of implementations. We'd like our definitions to be the gold standard of correctness for future survey language writers. Let's start with some definitions.

A survey is a set of blocks. A block is a heterogenous set of questions and blocks. A question is a record with a survey-scoped unique identifier and a (possibly empty) set of answer options. An option is a record with a survey-scoped unique identifier. These all follow from our definition of the syntax.

Answer Set An answer set is a set of 3-tuples of question identifiers, option identifiers, and indices.

Equality of Answer Sets

Two sets of answers are "equal" (denoted as $$\equiv_{resp}$$, called "response-equivalent") if and only if two conditions are satisfied:

1. There exists a bijective mapping between two answer sets.
2. The tuples in the mapping satisfy an equivalence relation, which we will define.
Begin Algebra Tangent

I asked Dan Stubbs if there was an algebraic structure where these properties would make sense. He suggested that perhaps I was defining a quotient group. After a nap and some perusing of Wikipedia (and a dense Algebra text), I'm not entirely sure this is what I want. Here are my unstructured thoughts on this:

• The answer set is a subset of all possible answers. All possible answers for each question type are defined in the language as follows:
• Exclusive (Radio Button) questions The set of individual response options.
• Non-exclusive (Checkbox) questions The power set of response options minus one (you cannot have the empty set -- that is, the respondent must check at least one box).
• Regex questions Any string satisfying the regex, up to the size allowable in the response field (would have to look this up).
• Freetext questions The subset of all strings up to the size allowable in the response field.
• Since we have a clear mapping between questions and answers, we can talk about questions instead. Really, when we say that the response set is invariant under randomization, we're saying that the question-answer pairs should be invariant under randomization.
• According to Wikipedia

In mathematics, specifically group theory, a quotient group (or factor group) is a group obtained by aggregating similar elements of a larger group using an equivalence relation that preserves the group structure.

What's the larger group here? Presumably it would be the set of all possible answers. This doesn't seem right, thought -- we don't want the set of all answers, because this would mean that a valid response could include answering, say, two questions from the same variant block. Instead, I think it ought to be the set of unique answer sets. We're not worrying about breakoff for now, so we don't need to define this as the power set of some set of answers. If we have $$n$$ questions, each with $$m$$ radio-button (exclusive) options and no variants and no breakoff, then we have $$m^n$$ possible answer sets. To account for variants, we multiply by the cardinality of each variant set.

What are the smaller groups? They're the set of question-answer pairs we've collapsed. Now what's the group structure? To be honest, I'm not entirely sure an answer set satisfies the definition of a group. For starters, what's our group operator? I suppose a tempting first choice might be a set addition operator. We could say that if $$\lbrace (q_i, a_i, i) \rbrace$$ and $$\lbrace (q_j, a_j, j) \rbrace$$ are each in the set of possible question-answer pairs, then $$\lbrace (q_i, a_i, i), (q_j, a_j, j)\rbrace$$ is also in the set of possible question-answer pairs. Of course, for $$\lbrace (q_i, a_i, i) \rbrace$$ and $$\lbrace (q_j, a_j, j) \rbrace$$ to be floating around, we would need to allow breakoff. Now, what would happen if we tried to combine two complete answer sets? If the answer sets are equivalent, we get back the same answer set. If the answer sets are equivalent, modulo variant questions, how do we decide what's returned? If the answer sets are radically different, then what?

I think at this point, we need to be talking about questions, rather than answer sets. Two answer sets can differ either because the questions seen are different (due to breakoff or variants) or because the answers chosen are different, or both. If we only consider questions, we can define a question-combining operator (rather than a response-combining operator) and get the properties we want. We would need to define some unified variable to represent the set of variants, so any time a variant is combined, we represent the variant with that variable.

And this is about where I'm stuck. It's not clear to me that this kind of algebraic structure is what we want. Wait -- why would we want it in the first place? Well, I like challenging myself to think about these problems from a variety of angles, so it's intellectually satisfying as an exercise in its own right. But it could also prove fruitful later down the line. It might be the case that by having these algebraic properties, we get some proofs of correctness or other properties for free. Reducing a known and well-studied problem to a new domain for the win!

End Algebra Tangent

So basically, we need to define this bijective mapping. Here's the short version (since this post has been sitting, incomplete and open in a tab for well over a month):

Two questions are eqivalent if either:

• The questions are identical. That is, they have the same ids (and should belong to the same block, have the same flags, etc. Questions are immutable entries in a spreadsheet, so you could only get two questions with equal ids but unequal other stuff if your lexer/parser/compiler screwed something up).
• The questions are both contained in a block having type ALL. In this case, the number of options will be equivalent. The order of their entry denotes equivalence. That is, the $$i^{th}$$ option for each question variant is assumed to be equal in its meaning.

Now here's the punchline: since we use different paths through the blocks (and thus different containing block ids) to denote an experimental control (really, a "control path") and to denote experimental designs such as Latin squares, we currently have no way in either our static nor our dynamic analyses to understand equivalence across these paths. Although we can express experimental surveys using SurveyMan, our system currently cannot analyse these surveys as experiments.