# On calculating survey entropy

I've been spending the past two weeks converting analyses that were implemented in Python and Julia into Clojure. The OOPSLA Artifact Evaluation deadline is June 1 and moving these into Clojure means that the whole shebang runs on the JVM (and just one jar!).

One of the changes I really wanted to make to the artifact we submit was a lower upper bound on survey entropy. Upper bounds on entropy can be useful in a variety of ways: in these initial runs we did for the paper, I found them useful for comparing across different surveys. The intuition is that surveys with similar max entropies have similar complexity, similar runtimes, similar costs, and similar tolerance to bad behavior. Furthermore, if the end-user were to use the simulator in a design/debug/test loop, they could use max entropy to guide their survey design.

We've iterated our calculation of the max entropy. Each improvement has lowered the upper bound for some class of surveys.

**Max option cardinality** Our first method for calculating maximum entropy of a survey was the one featured in the paper: we find the question with the largest number of options and say that the entropy of the survey must be less than the entropy of a survey having equal number of questions, where every question has this maximum number of answer options. Each option has equal probability of being chosen. For some $$survey$$ having $$n$$ questions, the maximum entropy would then be $$\lceil n \log_2 (\max ( \lbrace \lvert \lbrace o : o \in options(q) \rbrace \rvert : q \in questions(survey) \rbrace ) ) \rceil$$.

The above gives a fairly tight bound on surveys such as the phonology survey. For surveys that have more variance in the number of options proffered to the respondent, it would be better to have a tighter bound.

**Total survey question max entropy** We've had a calculation for total survey question max entropy implemented in Clojure for a few weeks now. For any question having at least one answer option, we calculate the entropy of that question, and sum up all those bits. For some $$survey$$ having $$n$$ questions, where each question $$q_i$$ has $$m_i$$ options, the maximum entropy would then be $$\lceil \sum_{i=1}^n \mathbf{1}_{\mathbb{N}^+}(m_i)\log_2(m_i)\rceil$$

While the total survey question max entropy gives a tighter bound on surveys with higher variance, it is still a bit too high for surveys with branching. Consider the wage survey. In Sara's initial formulation of the survey (i.e. not the one we ran), the question with the greatest number of answer options was one asking for the respondent's date of birth. The responses were dates ranging from 1900 to 1996. Most of the remaining questions have about 4 options each:

[table id=4 /]

Clearly in this case, using max option cardinality would not give much information about the entropy of the survey. The max cardinality maximum entropy calculation gives 258 bits, whereas the total survey question max entropy gives 80 bits.

This lower upper bound still has shortcomings, though -- it doesn't consider surveys with branching. For many surveys, branching is used to ask one additional question, to help refine answers. For these surveys, many respondents answer every question in the survey. However, there are some survey that are designed so that no respondent answers every question in the survey. Branching may be used to re-route respondents along a particular path. We used branching in this way when we actually deployed Sara's wage survey. The translated version of Sara's survey has two 39-question paths, with a 2-option branch question to start the survey and zero option instructional question to end the survey. This version of the survey has a max cardinality maximum entropy calculation of $$80 * \log_2 97 = 528$$ and a total survey question max entropy of 160 bits (without the ceiling operator, this is approximately equal two two times the entropy of the previous version, plus one bit for the introductory branch question).

The maximum number of bits needed to represent this survey approximately doubled from one version to the next. This isn't quite right -- we know that the maximum path through the survey is 41 questions, not 80. In this case, branching makes a significant difference in the lower bound.

**Max path maximum entropy** Let's instead compute the maximum entropy over distinct paths through the survey. We've previously discussed the computational complexity of computing distinct paths through surveys. In short, randomization significantly increases the number of possible paths through the survey; if we focus on path through blocks instead, we have more tractable results. Rather than thinking about paths through the survey as distinct lists of questions, where equivalent paths have equivalent lengths and orderings, we can instead think about them as unique sets of questions. This perspective aligns nicely with the invariants we preserve.

Our new maximum entropy calculation will compute the entropy over unique sets of questions and select the maximum entropy computed over this set. Some questions to consider are:

- Are joined paths the same path?
- If we are computing empirical entropy, should we also consider breakoff? That is, do we need the probability of answering a particular question?

We consider paths that join distinct from each other; the probability of answering that question will sum up to one, if we don't consider breakoff. As for breakoff, for now let's ignore it. If we need to compute the empirical entropy over the survey (as opposed to the maximum entropy), then we will use the subset relation to determine which questions belong to which path. That is, if we have a survey with paths $$q_1 \rightarrow q_2 \rightarrow q_4$$ and $$q_1 \rightarrow q_3 \rightarrow q_4$$, then a survey response with only $$q_1$$ answered will be used to compute the path frequencies and answer option frequencies for both paths. The maximum entropy is then computed as $$\lceil max(\lbrace -\sum_{q\in survey} \sum_{o \in ans(q)} \mathbb{P}(o \cap p) \log_2 \mathbb{P}(o \cap p) : p \in paths \rbrace) \rceil$$.

There are two pieces of information we need to calculate before actually computing the maximum entropy path. First, we need the set of paths. Since paths are unique over blocks, we can define a function to return the set of blocks over the paths. The key insight here is that for blocks that have the NONE or ONE branch paradigm, every question in that block is answered. For the branch ALL paradigm, every question is supposed to be "the same," so they will all have the same number of answer options. Furthermore, since the ordering of floating (randomizable) top level blocks doesn't matter, and since we prohibit branching from or to these blocks, we can compute the DAG on the totally ordered blocks and then just concatenate the floating blocks onto the unique paths through those ordered blocks.

The second thing we need to compute is $$\mathbb{P}(o \cap p)$$. The easiest way to do this is to take a survey response and determine which unique path(s) it belongs to. If we count the number of times we see option $$o$$ on path $$p$$, the probability we're estimating is $$\mathbb{P}(o | p)$$. We can compute $$\mathbb{P}(o \cap p)$$ from $$\mathbb{P}(o | p)$$ by noting that $$\mathbb{P}(o \cap p) = \mathbb{P}(o | p)\mathbb{P}(p)$$. This quantity is computed by $$\frac{\# \text{ of } o \text{ on path } p}{\#\text{ of responses on path } p}\times\frac{\#\text{ of responses on path } p}{\text{total responses}}$$, which we can reduce to $$\frac{\# \text{ of } o \text{ on path } p}{\text{total responses}}$$. It should be clear from this derivation that even if two paths join, the entropy for the joined sub path is equal to the case where we treat paths separately.

The maximum entropy for the max path in the wage survey, computed using the current implementation of SurveyMan's static analyses, is 81 bits -- equivalent to the original version of the survey, plus one extra bit for the branching.