We need some way of determining whether the diagnoses of SurveyMan's bugs are correct. It's always possible that a particular technique has a flaw in it, or that a test for a certain feature is not sensitive enough to detect the differences we would like it to detect. We have designed a simulator as a sanity check for our algorithms.

### Simulator setup

The first step in our simulator setup is to generate gold-standard data; this is, after all, the reason for bothering with a simulator in the first place.

Consider the problem of bot detection. We will need to know the ground truth of who is a bot and who is not. Modeling bots explicitly is easy. We already do this in our static analysis. Modeling human respondents is more challenging.

We define a profile to be a collection of preferences over a survey. These preferences are the probabilities that an instance of a profile (i.e. a respondent) will choose a particular answer option for a question. For example, uniform adversaries will choose each answer option with equal probability.

In order to emulate human behavior, we allow the non-bot population of responses to be drawn from some number of clusters. A cluster is generated by randomly assigning, for each question $$q_i$$ with $$m_i$$ answer options, a probability $$p_i$$ drawn from the interval $$(1/m_i, 1)$$ to one answer option. We say that a respondent belonging to one of these clusters has a preference for that answer, but may choose another answer due to factors we either cannot control or did not account for. The other answer options are assigned uniform probability: $$\frac{1-p_i}{m_i-1}$$. Sometimes a preference will be very strong (e.g. assigned a probability > 0.8). Sometimes the preference will only be slight, in which case it will be close to $$1/m_i$$.
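The profile and cluster definitions above can be sketched in a few lines of Python. This is a minimal illustration, not SurveyMan's actual simulator: the function names (`uniform_profile`, `cluster_profile`, `sample_response`, `simulate`) are hypothetical, and the sketch assumes that the preferred answer is picked uniformly at random and that $$p_i$$ is drawn uniformly from the interval, neither of which is stated above.

```python
import random

def uniform_profile(option_counts):
    """Uniform adversary: every option of every question is equally likely."""
    return [[1.0 / m] * m for m in option_counts]

def cluster_profile(option_counts, rng):
    """One cluster: per question, a preferred option with probability p_i
    and the remaining mass spread evenly over the other m_i - 1 options."""
    profile = []
    for m in option_counts:  # each question needs m >= 2 options
        preferred = rng.randrange(m)  # assumption: preferred option is uniform
        # Assumption: p_i is uniform on the interval. rng.uniform can return an
        # endpoint, so the open interval (1/m_i, 1) is only approximated.
        p = rng.uniform(1.0 / m, 1.0)
        probs = [(1.0 - p) / (m - 1)] * m
        probs[preferred] = p
        profile.append(probs)
    return profile

def sample_response(profile, rng):
    """Draw one respondent: one answer index per question."""
    return [rng.choices(range(len(probs)), weights=probs)[0] for probs in profile]

def simulate(option_counts, n_humans, n_bots, n_clusters, seed=0):
    """Return a shuffled list of (is_bot, answers) pairs with known ground truth."""
    rng = random.Random(seed)
    clusters = [cluster_profile(option_counts, rng) for _ in range(n_clusters)]
    bot = uniform_profile(option_counts)
    population = []
    for _ in range(n_humans):
        population.append((False, sample_response(rng.choice(clusters), rng)))
    for _ in range(n_bots):
        population.append((True, sample_response(bot, rng)))
    rng.shuffle(population)
    return population
```

Because every response carries its `is_bot` label, the output of a detection algorithm can be scored directly against the ground truth, for example with `simulate([2, 4, 5], n_humans=30, n_bots=10, n_clusters=3)`.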

We can then inject biases into the generated responses and test our bias detection algorithms, testing the robustness of our techniques by varying the impact of bad actors on our results.

#### Correlation

Any measure of correlation between questions in the survey must consider what's called the "level of measurement" of each question. Levels of measurement determine the statistical tools we can use to analyze the data. There are four levels of measurement in total:

1. Nominal: Data that fall into categories that have no order are said to be nominal. Generally this will correspond to radio button questions such as "What is your gender?" This data will be represented by a categorical variable and permutation tests will have to be used to analyze any correlations. Tests on nominal data are sensitive to sparsity; since they are not continuous, we cannot use interpolation to make inferences.
2. Ordinal: Ordered questions fall into this category. This is probably the most common type of survey question. Surveys that ask users about their preferences or to provide rankings for data are using ordinal data. The more common and powerful statistical significance and correlation tests begin at this level.
3. Interval: Where ordinal questions required the ability to rank, interval questions require there to be meaningful distances between answer options. Likert scale questions are an attempt to capture interval questions (although they are often analyzed using ordinal tests, since their measurement is imperfect). Interval questions attempt to capture the magnitude of difference between ranked answers.
4. Ratio: Ratio questions are "true" numeric questions; that is, individual answers have meaningful magnitude because there is a known underlying zero grounding the measurement. Weight, age, and income are all ratio questions. These questions permit the most powerful statistical tests because data can be interpolated.

SurveyMan uses correlation in two ways. The CORRELATED column can be used to flag sets of questions that the survey designer expects to be statistically correlated. Flagged questions can be used to validate or reject hypotheses and to help detect bad actors. Alternatively, if a question that is not marked as correlated is found to have a statistically significant correlation with another question, then we flag this question. Questions are compared on a pair-wise basis. This information can be used in a variety of ways:

- The survey designer could decide to remove one or more of the correlated questions, if their predictive power is strong enough to infer responses from the remaining questions. It is ultimately the responsibility of the survey designer to use good judgement and domain knowledge when deciding to remove questions; note that because we only check pair-wise correlation, we cannot capture the impact of groups on a particular outcome. We do not model interactions between variables.
- The survey designer could use discovered correlations to assist in identification of cohorts or bad actors by updating the entries in the CORRELATED column appropriately.

We only support automated correlation analysis between exclusive (radio button) questions. These questions may be ordered or unordered.

When at least one of the two questions is unordered, which includes comparing a nominal question with an ordinal one, we return the $$\chi^2$$ statistic and its p-value, and compute Cramer's $$V$$ to measure the strength of the association. Pairs of ordinal questions are compared using Spearman's $$\rho$$. Since in practice we rarely have sufficient data to return confidence intervals on such point estimates, we simply flag the pair and leave the interpretation of the values up to the survey designer.
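As a rough illustration of the two statistics, the sketch below computes the $$\chi^2$$ statistic with Cramer's $$V$$, and Spearman's $$\rho$$ with average ranks for ties, using only the standard library. It is not SurveyMan's implementation: the function names are hypothetical, ordinal answers are assumed to be encoded as integers in rank order, and the p-value is omitted because it needs a $$\chi^2$$ distribution function that the standard library does not provide.

```python
import math
from collections import Counter

def cramers_v(xs, ys):
    """Return (chi2, V) for two equal-length sequences of categorical answers."""
    n = len(xs)
    row_totals, col_totals = Counter(xs), Counter(ys)
    observed = Counter(zip(xs, ys))  # missing cells count as 0
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    v = math.sqrt(chi2 / (n * k)) if k > 0 else 0.0
    return chi2, v

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation of the rank vectors (undefined if either is constant)."""
    a, b = average_ranks(xs), average_ranks(ys)
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def correlate(xs, ys, x_ordered, y_ordered):
    """Pick the statistic from the question types, as described above."""
    if x_ordered and y_ordered:
        return {"spearman_rho": spearman_rho(xs, ys)}
    chi2, v = cramers_v(xs, ys)
    return {"chi2": chi2, "cramers_v": v}
```

Both statistics are point estimates; the sketch makes no attempt to attach confidence intervals to them.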

For non-exclusive (checkbox) ordered questions, we would need a meaningful metric to understand what the relationship between subsets of checkboxes is. For example, in a question with four answer options A, B, C, and D, we would need to know how to compare the answers {A,B}, {B,C}, and {A,C}. If their values are additive and we let their weights correspond to their indices, how far apart are the choices {A,B} and {C}? Any analysis would have to be domain-specific and thus falls outside the scope of SurveyMan.

For non-exclusive (checkbox) unordered questions, we also run into trouble. We don't have to worry about specialized distance functions, but we do have to worry about the fact that our categories are not exclusive. That is, we can no longer use a categorical random variable to represent the question, since a single respondent may belong to multiple categories. This violates the conditions of all known tests. We could use subsets as our events instead and analyze them as we do with exclusive data. However, the contingency table for a question $$q_i$$ having $$m$$ options will have $$2^m - 1$$, the number of non-empty subsets, as one of its dimensions. While Cramer's $$V$$ reduces the impact of the degrees of freedom on the $$\chi^2$$ test, we still have the problem of sparsity in the table's cells. We observed in simulation that, as sparsity increased, the range of errors increased. While we would still sometimes see the injected correlated questions show up, we also saw many more cases of a question being classified as having a correlation coefficient close to 0 when compared against itself. The misclassification wasn't too bad for three checkbox options, but it was unacceptable at four. As a result, we do not support correlation on checkbox questions.

If users want to do correlation on checkbox questions anyway, they can enumerate the subsets and display these as exclusive questions. It's true that the very problem we try to avoid with checkbox questions could still be a problem with radio button questions. However, it's unusual in practice to have a large number of nominal choices. We could compute the required number of random respondents needed to have at least 5 entries in each cell of the contingency table and only analyze correlation if this condition is met. This is something to consider for future SurveyMan releases and requires further investigation.
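To get a feel for how quickly the table grows, the following sketch enumerates the non-empty subsets of a checkbox question and estimates how many respondents would be needed. The function names are hypothetical, and the estimate rests on a simplifying assumption: it requires only that the expected count in each cell be at least 5 when random respondents fill the table uniformly, which is weaker than guaranteeing 5 observed entries in every cell.

```python
from itertools import combinations

def nonempty_subsets(options):
    """All non-empty subsets of the checkbox options, smallest first."""
    return [
        frozenset(combo)
        for size in range(1, len(options) + 1)
        for combo in combinations(options, size)
    ]

def min_uniform_respondents(rows, cols, min_expected=5):
    """Smallest n with n / (rows * cols) >= min_expected.

    Assumes uniformly random respondents, so every cell of the
    rows-by-cols contingency table has the same expected count.
    """
    return min_expected * rows * cols

# Two checkbox questions with four options each, recoded as exclusive
# questions over their non-empty subsets.
subsets = nonempty_subsets("ABCD")
table_side = len(subsets)  # 2**4 - 1 = 15
needed = min_uniform_respondents(table_side, table_side)  # 5 * 15 * 15 = 1125
```

Under the same assumption, two three-option checkbox questions give a 7 by 7 table and need 245 respondents, so one extra option per question multiplies the requirement by more than four.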