Toybox: Rethinking Atari Benchmarks
I have been working with John Foley, Kaleigh Clary, and David Jensen on developing a new testing framework for reinforcement learning. I designed a prototype version that my collaborators and I used to investigate the behavior of generalized agents in our work, Measuring and Characterizing Generalization in Deep Reinforcement Learning. I presented Toybox at the IBM AI Systems Day and as a poster at the 2018 NeurIPS Systems for ML Workshop. We are currently preparing a full-length conference submission. If you are interested in contributing to our suite, please join our Slack team.
PlanAlyzer: Static Analysis for Online Field Experiments
A few years ago, Eytan Bakshy designed PlanOut, a DSL for online field experiments. I had the great fortune to intern with him at Facebook, where we investigated possible applications of static analysis to the experimental design space. While we started out focusing on generating contrasts and inferring probability of treatment assignment, we found errors and threats to the validity of experiments that exist only at the intersection of programs and experiments. At various points in this project, I worked with UMass folks Emery Berger and Eliot Moss on the PL side of things, and had a wonderful resource on causal inference in David Jensen. I presented this work at NEPLS in 2016 and at OOPSLA in 2019.
SurveyMan: Programming and Debugging Surveys
I worked with Emery Berger on a programming language and runtime system to design, debug, and deploy scientific surveys on the web. We collaborated with Joe Pater from Linguistics; this work became my Synthesis Project, which earned an Outstanding Synthesis Project award. We first presented SurveyMan at the 2014 Off the Beaten Track workshop. This work won first place at the PLDI Student Research Competition. We later presented the full paper at OOPSLA 2014, where it won a best paper award (one of three awarded).
COSMOS: A New Experimental Criterion for Genetic Programming
Typical Genetic Programming experiments focus on the convergence of the best candidate program in a population. Practitioners often use both parametric and non-parametric hypothesis tests to compare metrics such as computational effort or mean-best-fitness. Lee Spector and I found that the number of independent runs needed to ensure the reliability of such tests varied greatly across problems. We proposed a new criterion for determining the number of independent runs required to make inferences about a set of Genetic Programming techniques. We presented this work at the First Workshop for Understanding Problems, GECCO 2012.
Creating Conversational Characters Using Question Generation Tools. Xuchen Yao, Emma Tosch, Grace Chen, Elnaz Nouri, Ron Artstein, Anton Leuski, Kenji Sagae, David Traum. Dialogue and Discourse, Volume 3, Issue 2.
Evaluating Conversational Characters Created through Question Generation. Grace Chen, Emma Tosch, Ron Artstein, Anton Leuski, David Traum. Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference.