On Experiments vs Surveys (Ramble Time!)

Posted on July 9, 2014

“Big data hubris” is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis.
...
The core challenge is that most big data that have received popular attention are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.
Lazer et al.

I wrote previously about the range of experimental paradigms available to us on the web. In that post, we looked at observational studies, surveys, quasi-experiments, and experiments on axes that captured the range of control available to us. We might to consider other axes as well:

Exploratory vs. Confirmatory

Observational studies and surveys can be considered more exploratory, whereas quasi- and true experiments are more confirmatory. This follows from the level of control we expect to have. Observational studies, as we've said before, are a lot like data mining. The environment that we are observing is rich with potential features, but may also be sparse with respect to observations on those features. Suppose we were interested in improving a recommendation system. We might start by looking for correlations between products. That is, we might observe that people who watched The Wizard of Oz also watched It's a Wonderful Life on Netflix, or people who bought Bishop's Machine Learning book also bought Hastie et al's Elements of Statistical Learning on Amazon. We would quickly find that while there are a few strongly correlations, we would have a long tail of single observations. Furthermore, our observations do not differentiate between a lack of information and negative training data. That is, we do not know if someone hasn't watched 2 Fast 2 Furious because they didn't know it existed or if they didn't watch it because they already know they won't like it. Many of these recommender systems have user-provided ratings to help differentiate these two categories, but they cannot cover all the cases, since ratings are still opt-in.

While I am personally not particularly interested in recommender systems, they do provide a nice example of how different instruments correspond to exploratory vs. confirmatory data analysis. We could start by extracting interesting features from our user and product base. By "interesting" we mean features that have sufficient observations, that are as close to orthogonal as possible. The researcher should use domain knowledge to enumerate features.

Slight tangent: One thing I want to point out at this point is that enumerating features is the first step in generating a hypothesis we might want to test. In a lot of ways, features are the source of the hypotheses we're testing. Features are the basic building blocks of hypotheses. Emery and I discussed recently the idea that hypotheses are like a probabilistic grammar over these features. Typically in machine learning research, the model is the subject of interest. The model is a representation of the variables of interest. What is the relationship between the features and the variables? Features are either directly observed or computed from the data. Personally, I would say that "computed from the data" here should have kind of bound on it. Clearly in the case of text classification, surface string suffixes are valid features. But what about parts of speech? If you are performing co-reference resolution or doing some kind of entity identification, then parts of speech could be a useful feature. But what if your task *is* part of speech tagging? Would it be fair to use the tags from one part of speech tagger as input to another? What if you did a first pass with your tagger, and then used that information as input on another run (that is, your tagger can take its own output as input)? Now what are the differences between variables in your model and features? /endrant

We would begin our exploratory analyses by looking at the features and the raw data and seeing what we find there. We can use this information or our own domain knowledge to construct a survey. The purpose of this survey is to collect data we might not otherwise have access to. For example, we might explicitly ask respondents who use Netflix about their viewing preferences. As discussed in my earlier post, this helps us fill in some data we might not have already.

A difference we generally see between traditional surveys and what we'll call experimental questionnaires is the presence of a hypothesis to test. We'll get into this in a later blog post, but generally in surveys we do not have the notion of a null hypothesis, nor an experimental control. I would say that surveys have their origins in much more qualitative goals. While they are used to guide decision-making, they are also widely used to help tell a narrative. This isn't to say that they can't or aren't being used They can be manipulated easily, if they are not designed well, and it can be difficult to differentiate between well-designed surveys and poorly designed surveys. This susceptibility to manipulation strongly motivated our work on SurveyMan.

We noticed that there were considerable differences in the objectives and designs of the surveys our collaborators were using. For example, our collaborators in linguistics used surveys in a more experimental context -- the questions in their surveys bore strong similarity to each other and their differences could be subtle to detect. Many of the questions belonged to discrete categories (what we might call "factors" in experimental design). There wasn't much variability between questions. Those surveys were very focused on acquiring enough information to answer a particular question, or test a particular hypotheses.

The wage survey we ran was quite different. The questions were structurally heterogenous. None of the questions were "redundant" (that is, they did not appear to represent categories of questions that are equivalent in some way). The nature of survey design will always inject some bias for generating hypotheses -- for example, the wage survey did not ask questions about favorite foods or whether the respondents had a family history of breast cancer. The authors of the wage survey were clearly interested in the relationships between education, employment, attitudes toward the AMT marketplace, and willingness to negotiate wages. However, the survey was not encoded as an experiment. It was more exploratory. We observed some interesting behavior in the context of randomization and breakoff, and would like to use this information in a future study.

Missing Data

In a survey, if we have missing data, it's due to breakoff. We are faced with some choices: if breakoff occurs more frequently at a particular question, we ought to look at the question and see what's going on -- is it worded poorly? Are there insufficient options to cover potential responses? Is the question potentially offensive? If breakoff occurs most frequently at a position, this might indicate that the survey is too long, or that there's a jarring shift between blocks in the survey. We would generally use this information to help debug the survey.

In an experiment, it's not quite clear that the researcher would use the same information in the same way. Since we expect experiments to have some redundancy built in, breakoff at a particular question might tell us less than breakoff at a question type. That is, we might suspect there to be a latent variable influencing breakoff. Our analysis will have more hypotheses to consider when diagnosing the cause of breakoff.

Confounding variables

A major threat to validity in experiments that is not typically addressed (so far as I can tell) is the failure to model confounding variables. This is where we might use surveys, or a survey-like section of an experiment, to help identify these variables. We've done this sort of thing before -- in the linguistics surveys, we had a demographics section. We could expand this and ask more questions in an attempt to capture these other variables. In any case, there is something qualitatively different about the demographic questions, when compared with the core questions of interest. In both surveys and experiments, we can view demographic questions as a path to stratifying the population. However, there seems to me to be a clearer divide between how these questions are used in surveys versus experiments.

Correlation vs. Causation

With surveys, we are looking primarily for two things: the raw data (e.g. 35% of Netflix subscribers have viewed The Wizard of Oz) and correlations (viewing It's a Wonderful Life is positively correlated with having viewed The Wizard of Oz more than once). In experiments, we are also interested in the raw data and correlations, but we are also looking for causal relationships and patterns in the latent variables. If we are lucky, we might be able to find causal relationships in a dataset, but the permutations required to infer a causal relationship may be too spare for us to discover them through data analysis alone. Instead, we will need to design experiments to ensure coverage of the space. This will require some changes to the specification of the survey language, to capture the abstractions we need.

[Read More]

What does it mean for a survey to be "correct"? -- Part I.

Posted on June 19, 2014

One of the arguments we keep making about SurveyMan is that our approach allows us to debug surveys. We talk about the specific biases (order, wording, sampling) that skew the results, but one thing that we don’t really emphasize in the tech report is what it really means for a survey to be correct. Here I’d like to tease out various notions of correctness in the context of survey design.

#Response Set Invariant

Over the past few months, I’ve been explaining the central concept of correctness as one where we’re trying to preserve an invariant: the set of answers returned by a particular respondent. The idea is that if this universe were to split and you were not permitted to break off, the you in Universe A would return the same set of answers as the you in Universe B. That is, your set of answers should be invariant with respect to ordering, branching, and sampling.

The Real Kirk This guy These guys should give the same responses to a survey, up to the whole timeline split thing.

Invalidation via biases Ordering and sampling can be flawed and when they are, they lead to our order bias and wording bias bugs. Since we randomize order, we want to be able to say that your answer to some question is invariant with respect to the questions that precede it. Since we sample possible wordings for variants, we want to be able to say that each question in an ALL block is essentially "the same." We denote "sameness" by treating each question wording variant as exchangeable and aligning the answer options. The prototypicality survey best illustrates variants, both in the question wording and the option wording:

/_tablepress_tables/2014-06-18-prototypicality-csv.html %}}

[table id=5 /]

Invalidation via runtime implementation

Clearly ordering and sampling are typical, well-known biases that we are framing as survey bugs. These bugs violate our invariant. We have also put restrictions on the survey language in order to preserve this invariant. I discussed these restrictions previously, but didn’t delve into their relationship with correctness. In order to understand the correctness of a survey program, we will have to talk a little about semantics.

Generally when we talk about language design in the PL community, we have two sets of rules for understanding a language: the syntax and the semantics. As with natural language, the syntax describes the rules for how components may be combined, whereas the semantics describes what they “really mean”. The more information we have encoded in the syntax, the more we can check at compile time – that is, before actually executing the program. At its loosest, the syntax describes a valid surface string in the language, whereas the semantics describes the results of an execution.

When we want “safer” languages, we generally try to encode more and more information in the syntax. A common example of this is type annotation. We can think of this extra information baked into the language’s syntax as some additional structure to the program. Taking the example of types, if we say some variable foo has type string and then try to use it in a function that takes type number, we should fail before we actually load data into foo – typically before we begin executing the program.

In SurveyMan, these rules are encoded in the permitted values of the boolean columns, the syntax of the block notation, and the permitted branch destinations. Let’s first look at the syntax rules for SurveyMan.

##SurveyMan Syntax : Preliminaries

Before we define the syntax formally, let us define two functions we’ll need to express our relaxed orderings over questions, blocks, and column headers:</p>

$$ \sigma_n : \forall (n : int), \text{ list } A \times n \rightarrow \text{ list } A \times n$$

This is our random permutation. Actually, it isn't *really* a random permutation, because we are not using the randomness as a resource. Instead, we will be using it more like a nondeterministic permutation (hence the subscript "n"). We will eventually want to say something along the lines of "this operator returns the appropriate permutation to match your input, so long as your input is a member of the set of valid outputs." $$\sigma_n$$ takes a list and its length as arguments and returns another list of the same length. We can do some hand-waving and say that it's equivalent to just having a variable length argument whose result has the same number of arguments, so that

$$ \sigma_n([\text{apples ; oranges ; pears}], 3) \equiv \sigma_n(\text{apples, oranges, pears})$$

and the set of all valid outputs for this function would be

$$ \lbrace \text{[apples ; oranges ; pears], [apples ; pears ; oranges], [oranges ; pears ; apples],}$$
$$\quad \text{[oranges ; apples ; pears], [pears ; apples ; oranges], [pears ; oranges ; apples]} \rbrace$$.

We are going to abuse notation a little bit more and say that in our context, the output is actually a string where each element is separated by a newline. So, if we have for example

$$\sigma_n(\text{apples, oranges, pears}) \mapsto [\text{oranges ; pears ; apples}]$$,

then we can rewrite this as

$$\sigma_n(\text{apples, oranges, pears}) \mapsto \text{oranges}\langle newline \rangle \text{pears}\langle newline \rangle \text{apples}$$,

where $$\langle newline \rangle$$ is the appropriate newline symbol for the program/operating system.

Why not encode this directly in the typing of $$\sigma_n$$? The initial definition makes the syntax confined and well-understood for the (PL-oriented) reader. Since we're defining a specific kind of permutation operator, we want its type to be something easy to reason about. The "true" syntax is just some sugar, so we have some easy-to-understand rewrite rules to operate over, rather than having to define the function as taking variable arguments and returning a string whose value is restricted to some type. Note that I haven't formalized anything like this before, so there might be a "right way" to express this kind of permutation operator in a syntax rule. If someone's formalized spreadsheets or databases, I'm sure the answer lies in that literature.

The above permutation operator handles the case where we have a permutation where we can use some kind of equality operator to match up inputs and outputs that are actually equal. Consider this example: suppose there are two rules in our syntax:

$$\langle grocery\_list \rangle ::= \langle fruit\rangle \mid \sigma_n(\langle fruit\rangle,...,\langle fruit\rangle)$$
$$\langle fruit \rangle ::= \text{ apple } \mid \text{ orange } \mid \text{ pear } \mid \text{ banana } \mid \text{ grape }$$

Suppose store our grocery list in a spreadsheet, which we export as a simple csv:
apple banana orange
Then we just want to be able to match this with the $$grocery\_list$$ rule.

Next we need to introduce a projection function: a function to select one element from an ordered, fixed-sized list (i.e. a tuple). This will actually be a family of functions, parameterized by the natural numbers. The traditional projection function selects elements from a tuple and is denoted by $$\pi$$. If we have some tuple $$(3,2,5)$$, then we define a family of projection functions $$\pi_1$$, $$\pi_2$$, and $$\pi_3$$, where $$\pi_1((3,2,5))=3$$, $$\pi_2((3,2,5))=2$$, and $$\pi_3((3,2,5))=5$$. Our projection function will have to operator over list, rather than a tuple, since it will be used to extract values from columns and since we will not know the total number of columns until we read the first line of the input. The indices correspond to the column indices, but since columns may appear in any order, we must use the traditional permutation operator to denote the correct index:

$$\pi_{\sigma(\langle column \rangle)}(\langle col\_data \rangle_{(\sigma^{-1}(1))}, \langle col\_data \rangle_{(\sigma^{-1}(2))},...,\langle col\_data \rangle_{(\sigma^{-1}(n))})$$

Now we have some set of projection functions, $$\pi_{\sigma(\langle column \rangle)}$$, the size and values of which we do not know until we try to parse the first row of the csv. However, once we parse this row, the number of columns ($$n$$ above) and their values are fixed. Furthermore, we will also have to define a list of rules for the valid data in each column, but for column headers recognized by the SurveyMan parser, these are known ahead of time. All other columns are treated as string-valued. We parameterize the column data rules, using the subscripts to find the appropriate column name/rule for parsing.

SurveyMan Syntax : Proper

$$\langle survey \rangle ::= \langle block \rangle \mid \sigma_n(\langle block \rangle, ...,\langle block \rangle)$$

$$\langle block \rangle ::= \langle question \rangle \mid \sigma_n(\langle block \rangle, ..., \langle block \rangle)$$
$$\quad\quad\mid \; \sigma_n(\langle question \rangle,...,\langle question \rangle) \mid \sigma_n(\langle question \rangle, ... ,\langle block \rangle,...)$$

$$\langle row \rangle ::= \langle col\_data \rangle_{(\sigma^{-1}(1))},...,\langle col\_data \rangle_{(\sigma^{-1}(n))} $$

$$\langle question \rangle ::= \langle other\_columns \rangle\; \pi_{\sigma(\mathbf{QUESTION})}(\langle row \rangle)$$
$$\quad\quad\quad\quad\quad\;\langle other\_columns\rangle\;\pi_{\sigma(\mathbf{OPTIONS})}(\langle row \rangle) \;\langle other\_columns \rangle$$
$$\quad\quad\mid\;\langle question \rangle \langle newline\rangle \langle option \rangle$$

$$\langle option \rangle ::= \langle empty\_question\rangle \; \langle other\_columns\rangle \; \pi_{\sigma(\mathbf{OPTIONS})}(\langle row \rangle) \; \langle other\_columns \rangle$$
$$\quad\quad\mid\;\langle other\_columns\rangle \;\langle empty\_question\rangle \; \pi_{\sigma(\mathbf{OPTIONS})}(\langle row \rangle) \; \langle other\_columns \rangle$$
$$\quad\quad\mid\;\langle other\_columns\rangle \; \pi_{\sigma(\mathbf{OPTIONS})}(\langle row \rangle) \; \langle empty\_question\rangle \;\langle other\_columns \rangle$$
$$\quad\quad\mid\;\langle other\_columns\rangle \; \pi_{\sigma(\mathbf{OPTIONS})}(\langle row \rangle) \; \langle other\_columns\; \rangle\langle empty\_question\rangle \;$$

$$\langle empty\_question \rangle ::= \langle null \rangle$$

$$\langle other\_columns \rangle ::= \langle null \rangle \mid \langle other\_column \rangle \; \langle other\_columns \rangle$$

$$\langle other\_column \rangle ::= \langle repeatable\_column \rangle$$
$$\quad\quad \mid\; \langle nonrepeatable\_column \rangle $$
$$\quad\quad \mid \;\langle changable\_column\rangle$$

$$\langle nonrepeatable\_column \rangle ::= \;, \;\pi_{\sigma(\mathbf{FREETEXT})}(\langle row \rangle) \mid \pi_{\sigma(\mathbf{FREETEXT})}(\langle row \rangle)$$

$$\langle repeatable\_column \rangle ::= \; , \;\pi_{\sigma(\mathbf{BLOCK})}(\langle row \rangle) \mid \pi_{\sigma(\mathbf{BLOCK})}(\langle row \rangle)$$
$$\quad\quad\mid\; , \;\pi_{\sigma(\mathbf{EXCLUSIVE})}(\langle row \rangle) \mid \pi_{\sigma(\mathbf{EXCLUSIVE})}(\langle row \rangle)$$
$$\quad\quad\mid\; , \;\pi_{\sigma(\mathbf{ORDERED})}(\langle row \rangle) \mid \pi_{\sigma(\mathbf{ORDERED})}(\langle row \rangle)$$
$$\quad\quad\mid\; , \;\pi_{\sigma(\mathbf{RANDOMIZE})}(\langle row \rangle) \mid \pi_{\sigma(\mathbf{RANDOMIZE})}(\langle row \rangle)$$
$$\quad\quad\mid\; , \;\pi_{\sigma(\mathbf{CORRELATION})}(\langle row \rangle) \mid \pi_{\sigma(\mathbf{CORRELATION})}(\langle row \rangle)$$

$$\langle changable\_column \rangle ::= , \;\pi_{\sigma(\mathbf{BRANCH})}(\langle row \rangle) \mid \pi_{\sigma(\mathbf{BRANCH})}(\langle row \rangle)$$
$$\quad\quad\mid\; , \;\pi_{\sigma(\langle user\_defined \rangle)}(\langle row \rangle) \mid \pi_{\sigma(\langle user\_defined \rangle)}(\langle row \rangle)$$

$$\langle col\_data \rangle_{(\mathbf{FREETEXT})} ::= \mathbf{TRUE} \mid \mathbf{FALSE} \mid \mathbf{YES} \mid \mathbf{NO} \mid \langle string \rangle \mid \#\;\lbrace\; \langle string \rangle \; \rbrace$$
$$\langle col\_data \rangle_{(\mathbf{BLOCK})} ::= (\_|[a-z])?[1-9][0-9]*(\setminus .\_?[1-9][0-9]*)*$$
$$\langle col\_data \rangle_{(\mathbf{BRANCH})} ::= \mathbf{NEXT} \mid [1-9][0-9]*$$
$$\langle col\_data \rangle_{(\mathbf{EXCLUSIVE})} ::= \mathbf{TRUE} \mid \mathbf{FALSE} \mid \mathbf{YES} \mid \mathbf{NO}$$
$$\langle col\_data \rangle_{(\mathbf{ORDERED})} ::= \mathbf{TRUE} \mid \mathbf{FALSE} \mid \mathbf{YES} \mid \mathbf{NO}$$
$$\langle col\_data \rangle_{(\mathbf{RANDOMIZE})} ::= \mathbf{TRUE} \mid \mathbf{FALSE} \mid \mathbf{YES} \mid \mathbf{NO}$$

Notes :

The last match for the $$\langle block\rangle$$ rule should really be four rules -- one for one question and one block, one for one question and many blocks, one for one block and many questions, and one for many blocks and many questions, but I thought that would be cumbersome to read, so I shortened it.

All $$\langle col\_data \rangle$$ rules not explicitly listed are unconstrained strings.

Remember, the subscripted $$n$$ in $$\sigma_n$$ is for nondeterminism; the $$n$$ indexing the column data denotes the number of input columns.

We also accept json as input; the json parser assigns a canonical ordering to the columns, so generated ids and such are equivalent.

</small>

What's worth noting here are the things that we intend to check on the first pass of the parser and compare them against those that require another iteration over the structure we've built. We would like to be able to push as much into the syntax as possible, since failing earlier is better than failing later. On the other hand, we want the syntax of the language to be clear to the survey writer; if the survey writer struggles to just get past the parser, we have a usability problem on our hands.

A good example of the limitations of what can be encoded where is the BRANCH column. Consider the following rules for branching:

Branch only to top-level blocks.
Branch only to static (i.e. non-floating) blocks.
Only branch forward.
Either branch to a specific block, or go to the next block (may be floating).

(1) and (2) are encoded in the grammar -- note that the regular expression does not include the subblock notation used in the BLOCK column's regular expression. Furthermore, the BRANCH rule will not match against floating blocks. Any survey that violates these rules will fail early.

(3) is not encoded directly in the grammar because we must build the question and top-level containing block first. Since it cannot be verified in the first pass, we perform this check after having parsed the survey into its internal representation.

(4) cannot be determined until runtime. This is an example of where our syntax and static checks reach their limits. $$\mathbf{NEXT}$$ is typically used when we want to select exactly one question from a block and continue as usual. A user could conceivably use $$\mathbf{NEXT}$$ as a branch destination in the usual branching context; the only effect this could have would be the usual introduction of constraints imposed by having that block contain a "true" branch question.

This leads into my next post, which will (finally!) be about the survey program semantics.

[Read More]

Celebrities : they're just like us!

Posted on June 16, 2014

Hopefully this post will be short and sweet, since I'm trying to get back on EST.

Highlights from this year's PLDI:

Year of PLASMA! John Vilk's DoppioJVM won Best Artifact. His talk was great, and he event got a mid-talk round of applause for a meta-circular evaluator joke. Nothing like Scheme to whet the appetites of PL nerds! (I admit it, I clapped and laughed, too.)
Year of PLASMA! In a surprising turn of events, my work on SurveyMan won top prize in the Graduate category of the ACM Student Research Competition. This means I'll submit a short paper in a few months to compete in the Grand Finals! Exciting!
The APPROX was awesome! It was very exciting to see current work presented across approximate computing and probabilistic programming. Emery was chair of the event. Given the amount of discussion it engendered, I would say it was a resounding success.
I met a bunch of new people, and connected with those I haven't seen in a while. Shoutouts to Adrian Sampson and Michael Carbin. I'll be following Adrian's blog now, and pestering Michael about formalizing the SurveyMan semantics (using his work on reasoning about relaxed programs as a guide).
A cheeky dig at the New York Times lead to Phil Wadler telling me that I had the best teaser! Famous professors : they're just like us!
Shriram Krishnamurthi declared he'd read this blog.

In other news, I need to try uploading my VM for the OOPSLA artifact evaluation, now that I have reasonable internet again. But first, I need to sleep (though I did set aside time to watch GoT -- OMG, the ending was awesome! Arya's face! That exchange! WTF just happened?!?!? Also, shit's finally starting to get real, north of the wall! You know nothing, Jon Snow...)

[Read More]

Reading Rainbow

Posted on May 17, 2014

It's only a few short weeks until PLDI 2014. Oh, the tedious and expensive travel! Just kidding (well, not really -- it will involve quite a few trains and many, many dollars).

Inspired by Alex Passos's yearly NIPS reading list, I'm going throw together one of my own. Rather than listing abstracts, I'm going to just post an ordered list of the papers I'm going to read and post on individual papers as I see fit.

Tier 1 : Authors I know

Unless the conference is massively multi-tracked, I find having to ask if someone I've actually met and spoken with IRL if they have a paper at the conference a bit tactless. This isn't to say I haven't done it, or that I've done so in a completely shameless way. I do however recognize that refraining from such behavior is A Good Thing.

Doppio: Breaking the Browser Language Barrier
John Vilk, University of Massachusetts, Amherst; Emery Berger, University of Massachusetts, Amherst.
Expressing and Verifying Probabilistic Assertions
Adrian Sampson, University of Washington; Pavel Panchekha, University of Washington; Todd Mytkowicz, Microsoft Research; Kathryn S McKinley, Microsoft Research; Dan Grossman, University of Washington; Luis Ceze, University of Washington.
Resugaring: Lifting Evaluation Sequences through Syntactic Sugar
Justin Pombrio, Brown University; Shriram Krishnamurthi, Brown University.
Taming the Parallel Effect Zoo: Extensible Deterministic Parallelism with Lvish
Lindsey Kuper, Indiana University; Aaron Todd, Indiana University; Sam Tobin-Hochstadt, Indiana University; Ryan R. Newton, Indiana University.
Introspective Analysis: Context-Sensitivity, Across the Board
Yannis Smaragdakis, University of Athens; George Kastrinis, University of Athens; George Balatsouras, University of Athens.
Dynamic Space Limits for Haskell
Edward Z. Yang, Stanford University; David Mazières, Stanford University.

Tier 2 : Authors my advisor knows

It's a reasonable assumption that my advisor probably knows at least one author on each paper, so we can also call this category "Authors whom I might reasonably expect to be introduced to by My Advisor." These papers include authors whose work I've read before and whose names I know from discussions with my advisor. Reading these papers will help prevent the awkward standing-there thing that happens when someone who is much more comfortable than you are (er, than I am) is deep in a conversation and you (I) have nothing to add. It'll also provide a hook that's socially more acceptable to whatever random thought happens to be passing through your (my) head. Genius, this plan is!

Fast: a Transducer-Based Language for Tree Manipulation
Loris D'Antoni, University of Pennsylvania; Margus Veanes, Microsoft Research; Benjamin Livshits, Microsoft Research; David Molnar, Microsoft Research.
Automatic Runtime Error Repair and Containment via Recovery Shepherding
Fan Long, MIT CSAIL; Stelios Sidiroglou-Douskos, MIT CSAIL; Martin Rinard, MIT CSAIL.
Adapton: Composable, Demand-Driven Incremental Computation
Matthew A. Hammer, University of Maryland, College Park; Yit Phang Khoo, University of Maryland, College Park; Michael Hicks, University of Maryland, College Park; Jeffrey S. Foster, University of Maryland, College Park.
FlashExtract : A Framework for Data Extraction by Examples
Vu Le, UC Davis; Sumit Gulwani, Microsoft Research Redmond.
Test-Driven Synthesis
Daniel Perelman, University of Washington; Sumit Gulwani, Microsoft Research Redmond; Dan Grossman, University of Washington; Peter Provost, Microsoft Corporation.
Consolidation of Queries with User Defined Functions
Marcelo Sousa, University of Oxford; Isil Dillig, Microsoft Research; Dimitrios Vytiniotis, Microsoft Research; Thomas Dillig, UCL; Christos Gkantsidis, Microsoft Research.
Atomicity Refinement for Verified Compilation
Suresh Jagannathan, Purdue University; Vincent Laporte, INRIA Rennes; Gustavo Petri, Purdue University; David Pichardie, INRIA Rennes; Jan Vitek, Purdue University.

Tier 3 : The Competition

The Student Research Competition, that is. Some of those presenting at SRC are also presenting work at the main event. Since we'll presumably have some forced socialization, it's probably a good call to get an idea of what some of their other work is about.

A Theory of Changes for Higher-Order Languages - Incrementalizing Lambda-Calculi by Static Differentiation
Yufei Cai, Philipps-Universität Marburg; Paolo G. Giarrusso, Philipps-Universität Marburg; Tillmann Rendel, Philipps-Universität Marburg; Klaus Ostermann, Philipps-Universität Marburg.
Commutativity Race Detection
Dimitar Dimitrov, ETH Zurich; Veselin Raychev, ETH Zurich; Martin Vechev, ETH Zurich; Eric Koskinen, New York University.
Code Completion with Statistical Language Models
Veselin Raychev, ETH Zurich; Martin Vechev, ETH Zurich; Eran Yahav, Technion.
Verification Modulo Versions: Towards Usable Verification
Francesco Logozzo, Microsoft Research; Manuel Fahndrich, Microsoft Research; Shuvendu Lahiri, Microsoft Research; Sam Blackshear, University of Colorado at Boulder.
Adaptive, Efficient Parallel Execution of Parallel Programs
Srinath Sridharan, University of Wisconsin-Madison; Gagan Gupta, University of Wisconsin-Madison; Gurindar Sohi, University of Wisconsin-Madison.
Globally Precise-restartable Execution of Parallel Programs
Gagan Gupta, University of Wisconsin-Madison; Srinath Sridharan, University of Wisconsin-Madison; Gurindar S. Sohi, University of Wisconsin-Madison.

There are 13 participants in the SRC total. Five are presenting at the conference proper (One is on a paper in another tier).

Tier 4 : Pure Interest

No motivation, except that the papers look interesting.

Improving JavaScript Performance by Deconstructing the Type System
Wonsun Ahn, University of Illinois at Urbana Champaign; Jiho Choi, University of Illinois at Urbana Champaign; Thomas Shull, University of Illinois at Urbana Champaign; Maria Garzaran, University of Illinois at Urbana Champaign; Josep Torrellas, University of Illinois at Urbana Champaign.
Automating Formal Proofs for Reactive Systems
Daniel Ricketts, UC San Diego; Valentin Robert, UC San Diego; Dongseok Jang, UC San Diego; Zachary Tatlock, University of Washington; Sorin Lerner, UC San Diego.
Tracelet-Based Code Search in Executables
Yaniv David, Technion; Eran Yahav, Technion.
Getting F-Bounded Polymorphism into Shape
Benjamin Lee Greenman, Cornell University; Fabian Muehlboeck, Cornell University; Ross Tate, Cornell University.
Compositional Solution Space Quantification for Probabilistic Software Analysis
Mateus Borges, Federal University of Pernambuco; Antonio Filieri, University of Stuttgart; Marcelo D'Amorim, Federal University of Pernambuco; Corina S. Pasareanu, Carnegie Mellon Silicon Valley, NASA Ames; Willem Visser, Stellenbosch University.
Test-Driven Repair of Data Races in Structured Parallel Programs
Rishi Surendran, Rice University; Raghavan Raman, Oracle Labs; Swarat Chaudhuri, Rice University; John Mellor-Crummey, Rice University; Vivek Sarkar, Rice University.
VeriCon: Towards Verifying Controller Programs in Software-Defined Networks
Thomas Ball, Microsoft Research; Nikolaj Bjorner, Microsoft Research; Aaron Gember, University of Wisconsin-Madison; Shachar Itzhaky, Tel Aviv University; Aleksandr Karbyshev, Technical University of Munich; Mooly Sagiv, Tel Aviv University; Michael Schapira, Hebrew University of Jerusalem; Asaf Valadarsky, Hebrew University of Jerusalem.
AEminium: A permission based concurrent-by-default programming language approach
Sven Stork, Carnegie Mellon University; Karl Naden, Carnegie Mellon University; Joshua Sunshine, Carnegie Mellon University; Manuel Mohr, Karlsruhe Institute of Technology; Alcides Fonseca, University of Coimbra; Paulo Marques, University of Coimbra; Jonathan Aldrich, Carnegie Mellon University
First-class Runtime Generation of High-performance Types using Exotypes
Zachary DeVito, Stanford University; Daniel Ritchie, Stanford University; Matt Fisher, Stanford University; Alex Aiken, Stanford University; Pat Hanrahan, Stanford University.

Why is pure interest ranked last?

People say that a talk should be an advertisement for the paper. If I don't get through the papers in tier 4 before PLDI, I'll at least know which talks I want to go to and perhaps prune that list accordingly. Since a conference is actually a social event, it seems like a better use of time to target papers that I would expect to come up in conversation. I haven't tried this tactic before, so we'll see how things go!

Finally, I'd like to thank the NSF, the ACM, and President Obama for help on my upcoming travel.

[Read More]

Get money, get paid.

Posted on May 17, 2014

When we allow breakoff, we typically tell respondents that they will be paid a bonus commensurate with the amount and quality of work they do at the end of the study. We don't typically inform them of when the study will end, since we may need to re-run surveys and we'd like to keep respondents from trying to game our algorithms. The experiments we did in the fall that awarded bonuses used a tiered system. The ones listed here are using respondent's scores to identify the top 95% of scores and award those respondents two cents per question. I'd like to investigate more sophisticated pricing schemata with Sara in the future.

Wage Survey

It was my intention to only have one wage survey running on AMT. However, as I've been porting the Python analyses into Clojure, I found that I actually had three instances running. Given the expiration dates on the latter two, I'm pretty sure they were posted accidentally. I should probably consider asking the user if their sure they want to continue, or have a safe mode that asks a million times "are you sure you want to do X?" so future users don't make this mistake. There's also small possibility that when I extended the original HIT, it somehow spawned two new HITs instead. This isn't documented anywhere, but it's something I probably want to double check on sandbox.

So we had three surveys running. At the time of our OOPSLA submission, we had the wage survey running for about four days and only accrued 69 responses. I extended that HIT twice. It expired Mar 26 2014, 04:42 PM PDT. The other two HITs had expiration dates of Mar 28 2014, 04:44 PM PDT and Mar 30 2014, 07:22 PM PDT. Each HIT requested 150 assignments, and paid $0.10 base wage per survey. Between the three surveys, we collected 154 responses. Under normal circumstances, I wouldn't have three of the same HIT running concurrently -- a feature I might consider adding to SurveyMan is a check for whether a survey with similar parameters has been posted before. If I implement a "safe mode" version of SurveyMan, I could ask then ask the user if they really want to post this survey.

Anyway, the point is that because I had three versions running, I had repeaters. We had only 132 unique respondents. I typically exclude repeaters from the analyses, since we tell them to return the surveys if they haven't taken them before. After running our new dynamic analyses report, I found that 98 respondents were classified as bad actors. I had a similarly high percentage in the Python analysis and wasn't confident that it was correct. Since we hadn't tested the effects of breakoff on bot classification in simulation, I was hesitant to make any strong assertions about these classifications, without investigating further. Furthermore, since we had so few data points at the time of the OOPSLA submission, we decided to simply report qualitative results.

Examining the larger data set, we found that the maximum number of questions answered by any one respondent was still 26. I'll leave a more thorough analysis of our quality control to another blog post, but the results were quite interesting and corroborated some of my suspicions about the the original set of data. In any case, the number of questions bad actors answered had a high of 18 and a low of 2 (interestingly, those who only answered one question were confined to repeaters. Inspecting manually, I saw 10 repeaters in total, of whom only one appeared to be a legitimate bad actor.).

Using the federal minimum wage of $7.25 an hour and an estimated 10 seconds per question, we should award $0.02 per question. Since we already awarded a base pay of $0.10, we subtracted this quantity off the total payment calculated for honest respondents. The static analyses gave us an average path length of 41 questions and an expected payment of $0.825694 for each respondent who answered the survey to completion. Since we requested 150 responses, if every respondent were categorized as honest and answered the survey to completion, it would cost us $123.8541, not counting the AMT commission. AMT charges a 10% commission on both the base pay and the bonuses. This gives us a total expected cost of $136.23951.

Our actual costs were much lower (although the quality of the data was presumably also lower). Our 154 respondents cost us $22.53 in base pay, including AMT commission. We calculated $7.78 in bonuses to be awarded, costing us $8.558 for the commission. In total, the experiment under these conditions cost us $31.088.

Prototypicality

The next survey we'll look at is Presley's prototypicality survey. We had 149 responses total. The survey had an average path length of 17 questions. The estimated base price for honest respondents who answer the survey in full is $0.342361 per survey. Our expected cost for all honest respondents, plus AMT commission is $56.489565. We classified 65 respondents as bad actors, and 84 as valid responses. These results differ from our initial reported results due to how we calculate the frequencies for questions that are variants. In our Python code, we only compared questions that were exactly the same -- that is, we didn't unify the distributions of the variants. In the Clojure implementation, we first remove bots, and assume that there is no statistical difference between the questions, unifying all the variants. We'll leave a discussion of the pros and cons of this approach to a future blog post.

The result of unifying variants is that significantly more responses are classified as bots, but none of the variants are flagged as being drawn from different distributions! I'll have to double check to make sure that the variant code is running as expected, but since my unit test for flagging variants seems to work, I'm going to assume that the differences we detected were due to outliers. We can discuss what might be going on here in a later blog post, since this one is mostly about pricing. The bonuses to be paid amount to $20.32. With AMT commission this is $22.35. In total, this survey costs $38.742.

Choose Randomly

I ran a survey that wasn't featured in our OOPSLA paper because it was one I made up entirely. The idea was to post a survey with two floating blocks, where both asked respondents to choose one of the responses randomly, but the one block had identical options, whereas the other had arbitrary categories of things. I wanted to see how random people could actually be. I also wanted to track timing information. One of the things I noticed about this survey, which I posted on a Wednesday morning, was that I was able to collect all of my responses within 90 minutes. The time per question here is clearly less than 10s. If it takes about 2 seconds per question, then the expected cost for a completed survey would be about $0.08. As a result, I decided to not award bonuses. The total cost was $16.50.

Phonology

Finally we have our classic phonology survey. With an average path length of 99 questions, we would expect to pay $1.993750 per survey. With a target of 150 responses, our total cost, including commission, would be $328.96875.

We ran this survey 3 times (not counting our preliminary run). The details of each run can be found in an earlier post. We collected 395 responses total and had 311 unique respondents. 22 respondents accounted for the 84 duplicate responses. This initial run cost us $43.45. 182 responses were classified as valid responses. The total bonuses to be paid were calculated to be $327.96. Factoring in AMT commission, this comes to $360.756. In all, this survey will have cost us $404.206.

Notes on timing

Some of the the timing data we returned was flawed, so we couldn't use that information to improve our payment scheme. One possible use of this information would be to use it as a proxy for the difficulty the question and vary the payment in accordance with this information.

[Read More]