Participation and Contribution in Crowdsourced Surveys, a recent PLOS ONE article, discusses some interesting approaches to crowdsourced surveys. Not only the answers but also the questions themselves are crowdsourced. The surveys are seeded with a small number of questions and later augmented with questions supplied by respondents. These questions are curated by hand and presented to respondents in a random order.
Approach and comparison with Quizz
The authors accomplished this by setting up three separate websites to collect the data. The only one that is still active is for Energy Minder, which is an ongoing research project. The other two surveys were about BMI and personal finance.
The motivation for this work is very similar to Quizz. The authors state:
The crowdsourcing method employed in this paper was motivated by the hypothesis that there are many questions of scientific and societal interest, for which the general (non-expert) public has substantial insight. For example, patients who suffer from a disease, such as diabetes, are likely to have substantial insight into their condition that is complementary to the knowledge of scientifically-trained experts. The question is, how does one collect and combine this non-expert knowledge to provide scientifically-valid insight into the outcome of interest.
Like Quizz, their system eschews financial incentives for survey completion. Unlike Quizz, new questions are added by the users themselves, rather than by a system. In Quizz, the objective is to complete a knowledge base -- responses to questions are point estimates. In this system, the questions serve as features designed to predict a variable of interest, whether it be energy consumption, BMI, or the amount an individual has in their savings. The paper does not explicitly state the measurement level of the outcome variable; it isn't clear if, for example, energy consumption is a binary variable (high/low), a categorical variable defined by buckets, or a real-valued prediction of kWh.
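To make the measurement-level question concrete, here is a minimal sketch of the three encodings the same raw reading could take. Everything in it is an assumption for illustration: the function names, the cutoff, and the bucket edges are hypothetical and do not come from the paper.

```python
from bisect import bisect_right


def as_binary(kwh, cutoff):
    """Binary outcome: 'high' at or above the (hypothetical) cutoff, else 'low'."""
    return "high" if kwh >= cutoff else "low"


def as_bucket(kwh, edges):
    """Categorical outcome: index of the bucket containing kwh.

    `edges` must be sorted ascending. Bucket 0 holds everything below
    edges[0]; the last bucket holds everything at or above edges[-1].
    """
    return bisect_right(edges, kwh)


def as_real(kwh):
    """Real-valued outcome: the reading itself, as a float."""
    return float(kwh)


if __name__ == "__main__":
    reading = 450  # hypothetical monthly kWh
    print(as_binary(reading, cutoff=600))
    print(as_bucket(reading, edges=[300, 600, 900]))
    print(as_real(reading))
```

Which encoding is used matters for everything downstream: a binary or bucketed outcome calls for classification metrics, while a real-valued one calls for regression error.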
Questions, Observations, Insights
- Are there any baseline/ground-truth studies? All three surveys ask questions that could be influenced by a variety of biases (e.g., Vermonters are hippies who desire a low energy footprint, biases against overweight and poor people, etc.).
- One big advantage of using crowdsourced questions is that they can give insights into how to get around social desirability bias. This isn't discussed in the paper, but would be of interest to social scientists.
- Early in the paper they state, "...a real-time machine learning module continuously builds models from this growing store of data to predict the outcome variable, and the resulting predictions are shown to the users." The machine learning module they refer to is Hod Lipson's symbolic regression package. It's not clear to me when the predictions are shown. Aren't there methodological issues with telling the respondent what you're trying to predict? Disclosure of this kind is sometimes acceptable, but social desirability and other biases may still have a significant impact on the outcome variable.
- Related work: Program Boosting uses GP and crowdsourcing.
- "If the user decides to submit their own question, they are then required to respond to it. We have found that being forced to immediately respond to the question helps users to recognize when their questions are unclear or confusing." Do they have revision data for questions? Or do respondents just re-enter the question if it isn't clear? Is there feedback on question clarity, or is this something that the human curator determines? It's not clear to me how this works, but this data might be an interesting feature to use in quality control.
- Beyond surveys, this is an interesting way to collect features for some other task. The questions are basically features here.
- Problems of error propagation could be connected to issues we're looking at in Automan vis-à-vis multiple comparisons.
- The learning module: could we use these techniques to build up blocks dynamically? Learn blocks?
- The validation checks (e.g., valid ranges) are a very weak adversarial model. Since there are no financial incentives for this survey, a greater threat to validity is inattentive respondents.
- I'd like to see a stronger comparison with the active learning literature. There are issues of compounded error when using stepwise regression and the kind of user-generated question dogfooding fu that's happening here. I suspect the active learning literature addresses some of these issues and would give insight into how to have greater statistical validity.
- Testing for a correlation coefficient different from 0 is too sensitive: a correlation of exactly 0 hardly ever occurs in real data. To guard against this, or at least establish a kind of prior on false correlations, the authors could inject seemingly unrelated questions into the survey. Of course, there is some probe bias here that could cause unintended consequences, so it would have to be thought out carefully. I'm just not satisfied with, "The lack of correlation between participation and contribution falsifies the hypothesis that a higher level of participation is indicative of interest in and knowledge of the subject area."
- Also asked above: what's the baseline? What's to stop the system from predicting the most common answer, given the class? How does this perform against a naive Bayes or decision tree classifier?
- I would like to see some regularization in the modelling. Symbolic regression can be very sensitive to outliers. I'm not sure what's in this implementation, though. The paper would benefit from a discussion of regularization.
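Two of the checks suggested in the list above, injected unrelated questions and a naive baseline, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' method: the function names are hypothetical, the decoy questions are assumed to be truly unrelated to the outcome, and the majority-class baseline only applies if the outcome is categorical (for a real-valued outcome, predicting the mean would play the same role).

```python
import random
from collections import Counter
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (vx * vy) ** 0.5


def decoy_threshold(decoy_columns, outcome, quantile=0.95):
    """Empirical |r| cutoff taken from decoy questions.

    Each decoy column holds responses to a question assumed to be
    unrelated to the outcome, so its |r| is an example of a spurious
    correlation. A real question is flagged only if it exceeds the cutoff.
    """
    null = sorted(abs(pearson(col, outcome)) for col in decoy_columns)
    idx = min(len(null) - 1, int(quantile * len(null)))
    return null[idx]


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


if __name__ == "__main__":
    random.seed(0)  # synthetic data, for illustration only
    n = 200
    outcome = [random.gauss(0, 1) for _ in range(n)]
    decoys = [[random.gauss(0, 1) for _ in range(n)] for _ in range(50)]
    candidate = [y + random.gauss(0, 1) for y in outcome]  # truly related

    cutoff = decoy_threshold(decoys, outcome)
    print("decoy cutoff on |r|:", round(cutoff, 3))
    print("candidate passes:", abs(pearson(candidate, outcome)) > cutoff)
    print("baseline accuracy:", majority_baseline_accuracy(["low", "low", "high"]))
```

Comparing each question's |r| against the decoy cutoff, rather than against 0, gives a rough sense of how large a correlation this survey produces by chance. Any learned model should likewise beat the baseline accuracy before its predictions are taken seriously.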