Static Analysis
Tell a PL researcher that you have a new programming language and one of the first things they want to know is whether you can do static analysis on it. Lucky for us, we can! But first, some background in this space:
Prior work on static analysis for surveys
I mentioned earlier that there is some prior work on using programming languages to address survey problems.
Static analysis in Topsl is constrained by the fact that Topsl permits users to define questions whose text depends on the answers to previous questions. Matthews is primarily concerned with ensuring that each question is asked. While Topsl can support randomization, this feature belongs to the containing language, Scheme, and so is not considered part of the Topsl core. Although the 2004 paper mentions randomization as a feature that can be implemented, there is no formal treatment of it in any of the three Topsl-related publications.
If we also count type-checking answer fields or enforcing constraints on responses as static analysis, then most online survey tools, Blaise, QPL, and probably many more tools and services perform some kind of static analysis.
Static Analysis in SurveyMan
Surveys are interpreted by a finite state machine implemented in JavaScript. To ensure that we don't cause undefined behavior, we must check that the input CSV is well-formed. After parsing, the SurveyMan runtime performs the following checks:
- Forward Branch Since we require the survey to be a DAG, all branching must be forward. We can do this easily by comparing branch destination ids with the ids of the branch question's containing block. This check takes time linear in the block's depth.
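A minimal sketch of this check in Python (the attribute names and the tuple-of-indices block IDs here are illustrative assumptions, not SurveyMan's actual API):

    def check_forward_branch(branch_question, containing_block):
        # Block IDs are index paths, e.g. block 2.3 -> (2, 3), so a
        # lexicographic comparison costs time linear in the block's depth.
        for target in branch_question.branch_map.values():
            if target is not None and target.block_id <= containing_block.block_id:
                raise ValueError("backward branch: %s -> %s"
                                 % (containing_block.block_id, target.block_id))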
- Top Level Branch At the moment, we only allow branching to top level blocks. This check takes constant time.
- Consistent Branch Paradigms Having one "true" branch question per top level block is critically important for our survey language. Every block has a flag for its "branch paradigm." This flag indicates whether there is no branching in the block, one branch question, or whether the block should emulate sampling behavior. This check ensures that for every block, if it has a parent or siblings, the following relations hold:
[table id=1 /]
We use the following helper and classification algorithm to assign branch paradigms (the helper also reappears in the path-length computations below):
    import random

    def get_all_questions_for_block(block):
        # Return the questions a respondent could see in this block.
        # (Attribute names are illustrative stand-ins for SurveyMan's internals.)
        questions = list(block.questions)
        if block.branch_paradigm == "SAMPLE":
            # A SAMPLE block shows exactly one of its questions, chosen at random.
            return [random.choice(questions)]
        # Otherwise, shuffle this block's own questions and append every
        # subblock's questions recursively.
        random.shuffle(questions)
        for subblock in block.subblocks:
            questions.extend(get_all_questions_for_block(subblock))
        return questions
    def classify_block_paradigm(block):
        # Classify a block as one of "NONE", "ONE", or "SAMPLE".
        branching = [q for q in block.questions if q.branch_map is not None]
        if block.questions and len(branching) == len(block.questions):
            # Every question branches. Branch maps must be uniform (checked
            # separately), so inspecting any one of them is representative.
            branch_map = branching[0].branch_map
            if all(target is None for target in branch_map.values()):
                return "SAMPLE"  # all-NULL targets emulate sampling behavior
            return "ONE"
        if len(branching) == 1:
            return "ONE"  # the block's single true branch question
        # No branch question at this level: ONE propagates up from subblocks.
        for subblock in block.subblocks:
            if classify_block_paradigm(subblock) == "ONE":
                return "ONE"
        return "NONE"
The ONE branch paradigm propagates up the block hierarchy and pushes constraints down it. The SAMPLE paradigm also pushes constraints down the hierarchy, but has no impact on parent blocks. Finally, NONE is the weakest paradigm, imposing no constraints on its parents or children. All blocks default to NONE and are overwritten by ONE where appropriate.
- No Duplicates We check the features of each question to ensure that there are no duplicate questions. Duplicates can creep in when surveys are constructed programmatically.
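A sketch of this check, where the feature tuple below (question text, options, flags) is an illustrative stand-in for the actual feature set:

    def check_no_duplicates(survey):
        seen = {}
        for q in survey.questions:
            # Two questions agreeing on all of their features are duplicates.
            key = (q.text, tuple(q.options), q.exclusive, q.ordered)
            if key in seen:
                raise ValueError("question %s duplicates %s" % (q.id, seen[key]))
            seen[key] = q.id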
- Compactness Checks whether we have any gaps in the order of blocks. Block ordering begins at 1. This check is linear in the total number of blocks. (This check should be deprecated, due to randomized blocks.)
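A sketch, assuming top-level blocks carry integer order labels:

    def check_compactness(top_level_blocks):
        # Block ordering must be exactly 1..n, with no gaps.
        ids = sorted(b.block_id for b in top_level_blocks)
        if ids != list(range(1, len(ids) + 1)):
            raise ValueError("gap in block ordering: %s" % ids)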
- No branching to or from top level randomized blocks We can think of the survey as a container for blocks, and blocks as containers for questions and other blocks. The top-level blocks are those immediately contained by the survey. While we permit branching from randomizable blocks that are sub-blocks of some containing block, we do not allow branching to or from top-level randomized blocks. The check for whether we branch from a top-level randomized block is linear in the number of top-level blocks. The check for whether we branch to a top-level randomized block is linear in the number of questions.
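A sketch of both directions of this check, under the same illustrative attribute names as the snippets above:

    def check_randomized_branching(survey):
        randomized = {b for b in survey.top_level_blocks if b.randomizable}
        # Branching *from*: linear in the number of top-level blocks.
        for b in randomized:
            if b.branch_paradigm == "ONE":
                raise ValueError("branch from randomized top-level block %s" % b.block_id)
        # Branching *to*: linear in the number of questions.
        for q in survey.questions:
            for target in (q.branch_map or {}).values():
                if target in randomized:
                    raise ValueError("branch to randomized top-level block %s" % target.block_id)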
- Branch map uniformity If a block has more than one branch question, then all of its questions must be branch questions, and their options and branch targets are expected to be aligned. In the notation we used in a previous post, for a survey csv $$S$$, if one question $$q_1$$ in the block spans indices $$i,\ldots,j$$ and another question $$q_2$$ spans indices $$k,\ldots,l$$, then we expect $$S[BRANCH][i] = S[BRANCH][k] \wedge \ldots \wedge S[BRANCH][j] = S[BRANCH][l]$$.
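A sketch of the uniformity check; it assumes branch maps preserve option order (true of Python dicts), which stands in for the row alignment described above:

    def check_branch_map_uniformity(block):
        maps = [q.branch_map for q in block.questions if q.branch_map is not None]
        if len(maps) > 1:
            if len(maps) != len(block.questions):
                raise ValueError("mixed branch and non-branch questions in block")
            # Option-by-option, every branch map must name the same targets.
            expected = list(maps[0].values())
            for m in maps[1:]:
                if list(m.values()) != expected:
                    raise ValueError("misaligned branch maps in block")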
- Exclusive Branching Branching is only permitted on questions where EXCLUSIVE is true.
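A sketch; the rationale is that branching from a non-exclusive (checkbox-style) question would be ambiguous, since several selected options could name different targets:

    def check_exclusive_branching(question):
        # Branching is only well-defined when exactly one option can be chosen.
        if question.branch_map is not None and not question.exclusive:
            raise ValueError("branch question %s must be EXCLUSIVE" % question.id)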
The above are required for correctness of the input program. We also report back the following information, which can be used to estimate some of the dynamic behavior of the survey:
- Maximum Entropy
For a survey of $$n$$ questions, each having $$m_i$$ responses, we can trivially obtain a gross upper bound on the entropy of the survey: $$n \log_2 (\max\lbrace m_i : 1 \leq i \leq n \rbrace)$$.
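As a quick sketch (questions are assumed to expose their option lists, as in the other snippets here), the bound is nearly a one-liner:

    import math

    def max_entropy_bound(questions):
        # n * log2(max m_i): treat every question as if it offered as many
        # options as the largest question in the survey.
        m_max = max(len(q.options) for q in questions)
        return len(questions) * math.log2(m_max)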
- Path Lengths
- Minimum Path Length Branching in surveys is typically related to some kind of division in the underlying population. In the previous post, we showed how branching could be used to run two versions of a survey at the same time. More often, branching is used to restrict questions by relevance: it is appropriate for some subpopulations to see certain questions, but not others.
When surveys have sufficient branching, it may be possible for some respondents to answer far fewer questions than the survey designer intended -- they may "short circuit" the survey. Sometimes this is by design: if we are only interested in curly-haired respondents, we have no way to screen the population over the web, so we may design the survey so that answering "no" to "Do you have curly hair?" sends the respondent straight to the end. In other cases, short circuiting is not the intended effect and indicates either a typographical error or poor survey design.
We can compute minimum path length using a greedy algorithm:
    def min_path_length(survey):
        # Greedy computation of the fewest questions any respondent can see.
        # (Attribute names are illustrative stand-ins for SurveyMan's internals.)
        size = 0
        branch_destination = None  # the block a pending branch will jump to
        # Randomized top-level blocks can land anywhere, so every path sees them.
        for block in survey.randomizable_blocks:
            size += len(get_all_questions_for_block(block))
        # Walk the static top-level blocks in order.
        for block in survey.static_blocks:
            if branch_destination is not None and block is not branch_destination:
                continue  # this block is skipped over by the pending branch
            size += len(get_all_questions_for_block(block))
            if block is branch_destination:
                branch_destination = None  # the pending branch has landed
            if block.branch_paradigm == "ONE":
                # Take the farthest-forward target so as many blocks as possible
                # are skipped; NULL (i.e., NEXT) targets impose no jump.
                targets = sorted((t for t in block.branch_question.branch_map.values()
                                  if t is not None), key=lambda b: b.block_id)
                branch_destination = targets[-1]
        return size
- Maximum Path Length Maximum path length can be used to estimate breakoff. This information may also be of interest to the survey designer -- surveys that are too long may require additional analysis for inattentive and lazy responses.
Maximum path length is computed using almost exactly the same algorithm as min path length, except we choose the first block in the list of branch targets, rather than the last.
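Both extremes can be folded into one function; a sketch under the same illustrative assumptions as min_path_length above:

    def extreme_path_length(survey, farthest):
        # farthest=True reproduces min_path_length (skip as much as possible);
        # farthest=False takes the nearest-forward target at every branch,
        # which yields the maximum path length.
        size = 0
        branch_destination = None
        for block in survey.randomizable_blocks:
            size += len(get_all_questions_for_block(block))
        for block in survey.static_blocks:
            if branch_destination is not None and block is not branch_destination:
                continue
            size += len(get_all_questions_for_block(block))
            if block is branch_destination:
                branch_destination = None
            if block.branch_paradigm == "ONE":
                targets = sorted((t for t in block.branch_question.branch_map.values()
                                  if t is not None), key=lambda b: b.block_id)
                branch_destination = targets[-1] if farthest else targets[0]
        return size

    max_path = extreme_path_length(survey, farthest=False)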
We verified these algorithms empirically by simulating 1,000 random respondents and checking the minimum (and maximum) observed path lengths.
- Average Path Length Backends such as AMT require the requester to provide a time limit and payment for surveys. While we currently cannot compute the optimal payment in SurveyMan, we can use the average path length through the survey to estimate the time it takes to complete, and from that compute a baseline payment.
The average path length is computed empirically. We have implemented a random respondent that chooses answers on the basis of positional preferences. One of the implemented profiles is a Uniform respondent; this profile simply selects one of a question's options with uniform probability. We run 5,000 iterations of this respondent to compute the average path length.
Note that average path length is the average over possible paths; the true average path length will depend upon the preferences of the underlying population.
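A minimal sketch of the empirical computation, under the same illustrative attribute names as above (the simulated respondent here ignores block-level randomization for brevity):

    import random

    def simulate_uniform_respondent(survey):
        # Answer every question by picking an option uniformly at random,
        # honoring forward branches; returns the number of questions answered.
        answered = 0
        branch_destination = None
        for block in survey.static_blocks:
            if branch_destination is not None and block is not branch_destination:
                continue
            branch_destination = None
            for q in get_all_questions_for_block(block):
                answered += 1
                choice = random.choice(q.options)  # uniform over options
                target = (q.branch_map or {}).get(choice)
                if target is not None:
                    branch_destination = target
        return answered

    def average_path_length(survey, iterations=5000):
        lengths = [simulate_uniform_respondent(survey) for _ in range(iterations)]
        return sum(lengths) / len(lengths)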