Reproducibility and Privacy

What would it take to have an open database for various scientific experiments? An increasing number of researchers are posting data online, and many are willing (and required) to share their data if you ask for it. This is fine for a single experiment, but what if you'd like to reuse data from two different studies?

There is a core group of AMT (Amazon Mechanical Turk) respondents who are very active. Sometimes AMT respondents contact requesters, at which point they are no longer anonymous. My colleague, Dan Barowy, received an email from a respondent thanking him for the quality of the HIT. I asked him for the respondent's name and, as it turned out, they had contacted me when I was running my experiments as well.

So if we have the general case of trying to pair similar pieces of data into a unit (i.e., a person) and the specific case of AMT workers who are definitely the same people (they have unique identifiers), how can we combine this information in a way that's meaningful? In the case of the AMT workers, we will need to obfuscate some information for the sake of privacy. For other sources of data, could we take specific data, infer something about the population, and build a statistical "profile" of that population to use as input to another test? Clearly we can use standard techniques to learn summary information about a population, but could we take pieces of data, unify them into a single entity, and say that with high probability these measurements are within some epsilon of a "true" respondent? How would we use the uncertainty inherent in this unification to ensure privacy?
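For the AMT case, one way to picture the obfuscation step is a keyed hash over worker identifiers: the same worker maps to the same pseudonym in every study, but the raw ID never appears in the shared data. This is only a minimal sketch under my own assumptions. The function names, the `{worker_id: measurement}` layout, and the idea of a shared secret key are all hypothetical, and a keyed hash alone does nothing about re-identification from the measurements themselves.

```python
import hashlib
import hmac


def pseudonymize(worker_id: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for worker_id under secret_key.

    Hypothetical helper: anyone without secret_key cannot recompute
    the mapping from raw worker IDs to pseudonyms.
    """
    digest = hmac.new(secret_key, worker_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def link_studies(study_a: dict, study_b: dict, secret_key: bytes) -> dict:
    """Join two {worker_id: measurement} dicts on pseudonymized IDs.

    Only workers present in both studies appear in the result, keyed
    by pseudonym rather than by raw worker ID.
    """
    a = {pseudonymize(w, secret_key): m for w, m in study_a.items()}
    b = {pseudonymize(w, secret_key): m for w, m in study_b.items()}
    return {p: (a[p], b[p]) for p in a.keys() & b.keys()}
```

Calling `link_studies(my_results, dans_results, key)` would pair up measurements from respondents who took part in both experiments, without either of us publishing a worker ID.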

Is it possible to unify data in such a way that an experimenter could execute a query asking for samples of some observation, and get a statistically valid Frankenstein version of that sample? I'm sure there's literature out there on this. Might be worth checking into...
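To make the "Frankenstein" idea concrete, here is the crudest version I can think of: build each synthetic respondent by drawing every field independently from the pool of values observed for that field. This is an assumption-laden sketch, not a claim about how it should be done. `frankenstein_sample` and the list-of-dicts record layout are hypothetical, and drawing fields independently preserves each field's marginal distribution while throwing away any correlation between fields, so it is only "statistically valid" in a very limited sense.

```python
import random


def frankenstein_sample(records, n, seed=None):
    """Draw n synthetic respondents from a list of record dicts.

    Each field of a synthetic respondent is sampled independently
    from the values observed for that field, so a single output row
    can mix measurements that came from different real respondents.
    """
    rng = random.Random(seed)
    # Collect every field name that appears in at least one record.
    fields = sorted({f for r in records for f in r})
    # Pool the observed values per field, skipping records that lack it.
    pools = {f: [r[f] for r in records if f in r] for f in fields}
    return [{f: rng.choice(pools[f]) for f in fields} for _ in range(n)]
```

A query for, say, 100 samples would then be `frankenstein_sample(records, 100)`. Anything better would need to model the joint distribution rather than the marginals.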
