Speeding up SurveyMan analyses
A major bottleneck in some of our analyses is that we need to resample survey responses. Let denote the number of responses we’ve seen. Let denote the number of questions in the survey. is the number of bootstrap iterations. is a list of bootstrap samples. is a list of scores. Our resampling approach is as follows:
 For each response in the list of all responses ():
 All other survey responses that have answered at least all the questions in ().

For in ():
Truncate so we are only looking at the set that overlaps with ().

For to ():
Randomly select samples from and add to the list ().

For in ():
Compute the scores for and add to the list ().
 Sort in ascending order ().
 Return true or false if ’s score falls outside the appropriate density ().
Reminder: randomization and equivalence are nontrivial
The randomization of SurveyMan introduces some nontrivial complications to the above process. I’ve written before about how (1) relies on defining the notion of question equivalence carefully. When we have variants, if we attempt to match on the literal question, we may not have a statistically significant number of samples to compare. Consider the prototypicality survey. In that case, we had 16 blocks, each having 4 variants. This means we have up to distinct surveys! Resampling won’t help us at all in that case.
Can we resample fewer times?
Although resampling becomes possible when we collapse variants into a single question, it’s still timeconsuming. Calculating scores for the phonology survey – which had almost 100 questions, and for which we gathered about 300 responses – takes upwards of 25 minutes. It would be nice if we could do this faster.
The current resampling code is very naive. As written above, we truncate the response lists first and then resample. In order to truncate, we first compare our response with our survey responses. Resampling involves drawing from the pool of survey responses times. When I am impatient, I set to 500. When I want to do things the right way, I set to 2000. Clearly the number of dominates the computation. We end up with = running time.
Caching
Would things be any better if we only computed the bootstrap samples once, and performed the truncation later? Let’s consider the following alternative algorithm:
 For to (): Randomly select samples from the responses and add to the list ().
 For each response in the list of all responses ():
 For in ():
 All other survey responses that have answered at least all the questions in ().
 For in ():
 Truncate so we are only looking at the set that overlaps with ().
 Compute the scores for and add to the list ().
 Sort in ascending order ().
 Return true or false if ’s score falls outside the appropriate density ().
 For in ():
The above gives us a running time of
Yikes! Even though we are only computing the bootstrap sample once, we need to iterate over it. This iteration occurs in an outer loop, causing a blowup in the time to compute.
There are clearly more subtle analyses we can do. The first approach only computes the bootstrap sample over the truncated responses, which are often fewer than the total number of responses. We might be concerned about the garbage collector when we recompute new samples.
Another concern I have with caching is that it introduces a bias. We are essentially reusing the same data (the bootstrap samples) and may run into multiple comparison issues. Of course, the point of the bootstrap simulation is that it approximates the true distribution and therefore this should be less of an issue the more bootstrap iterations we have.
Parallelizing
Another approach to speed things up would be to try to parallelize the score computation. When all of the analysis was written in Clojure, this would have been fine, since nothing was mutable. However, the Java implementation mutates scores in the original SurveyResponse object. We could get around this by completely decoupling the truncated view from the original SurveyResponse. I might do this to see if it makes a difference.