Config File Input Schema
We would like to add assertions, annotations, and maybe even types to the PlanOut language. However, since it is infeasible to rewrite our corpus of scripts in this updated language, we have a temporary solution: we parse YAML config files that provide supplementary type information for inference.
Each datum in the config file associates a variable name with its properties.
Valid Properties
-
card We can annotate any variable’s cardinality as high or low. We use cardinality in two places: when analyzing experimental validity and when checking the values being logged. For experimental validity, we need to know whether the units of randomization have sufficiently high cardinality; low-cardinality units will likely cause imbalance in randomization. For logging, high-cardinality logged variables will tax the experimentation system, in the worst case causing it to crash.
Here, ⊕ denotes any of the operators in {(), ×, +}, where () is the tuple operator. When we combine two variables using one of these operators, the result has high cardinality so long as at least one of the operands has high cardinality [1]. The most common use case we have for this is when the unit of randomization is a tuple. Conversely, the modulus operator demotes the result to the lowest cardinality among its operands. This makes the ⊕ operator the meet and the % operator the join:
When we examine the behavior of the operators, we observe relations that don’t fit in as well into the above framework:
Subtraction looks a lot like XOR, but division is totally weird and not commutative. We’ll see later how these properties interact to help us infer the soundness of our estimator generation algorithm.
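The two rules stated in prose above (⊕ yields high cardinality if either operand is high; % yields the lowest cardinality of its operands) can be sketched as follows. This is a minimal illustration, not the actual implementation: the names `Card`, `combine`, and `modulus` are hypothetical, and subtraction and division are deliberately left out because their behavior is not spelled out in the text.

```python
from enum import IntEnum


class Card(IntEnum):
    """Two-point cardinality annotation; LOW < HIGH is an assumed encoding."""
    LOW = 0
    HIGH = 1


def combine(a: Card, b: Card) -> Card:
    """Tuple, product, or sum: high if at least one operand is high."""
    return Card(max(a, b))


def modulus(a: Card, b: Card) -> Card:
    """Modulus: demoted to the lowest cardinality of the two operands."""
    return Card(min(a, b))


if __name__ == "__main__":
    # A (userid, country) tuple as the unit of randomization (hypothetical names).
    print(combine(Card.HIGH, Card.LOW).name)
    # userid % 10 buckets a high-cardinality variable into a low-cardinality one.
    print(modulus(Card.HIGH, Card.LOW).name)
```

Note that `combine(Card.LOW, Card.LOW)` stays low, which matches the conservative assumption about products of low-cardinality variables.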
-
dynamism In a given experiment, we generally consider the factors and the covariates to have the same value throughout the duration of the experiment. Holding these values constant allows us to reason about the outcome of interest (\(Y\)). What happens when the variables in our experiment do vary? In this case, we need to consider how they vary and how they interact with the assignment mechanism. Since we are only looking at the average treatment effect (ATE) at this point, we cannot consider dynamic variables in our estimator generation: dynamism pollutes assignment. This property can take on the values constant and tv (for time-varying).
-
randomness Variables can be marked as random or nonrandom. Since it is possible to refer to an external source of randomness, we would like to be able to mark variables in the system as random when this happens. If a variable has been marked as random but is not, then we should alert the user, since it will reduce the number of valid estimators in the system. If a variable has been marked as nonrandom but is random, then we will also need to alert the user.
-
domain We can specify the domain of external random variables and use this to help typecheck. The domain cannot include types outside of our system: they must be booleans, numbers, strings, or one of the container types (json, array, or map).
-
codomain If the variable listed is a function, we may also want to specify its codomain. Again, the codomain must be one of our known types.
-
domain_y The estimator inference algorithm assumes that the unit of randomization is correlated with \(Y\) – if it weren’t, we would never be able to detect an effect! However, there are cases when it may appear that our unit of randomization allows us to compute a between-subjects estimator, but we cannot.
-
corr We should allow users to list other variables we expect to be correlated with this variable. Then when looking at independence assumptions, we can add latent causal variables to account for unobserved causal variables.
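As a rough sketch of how the properties listed above might be checked once a config file has been parsed, the function below validates one variable's entry. Everything about the shape of the data is an assumption made for illustration: the entry is modeled as a plain dict (standing in for a parsed YAML mapping), the type spellings in `KNOWN_TYPES` are guesses at how the known types would be written, `corr` is assumed to be a list of variable names, and `domain_y` is accepted without checks because its value format is not described here.

```python
# Allowed values for the enumerated properties, taken from the descriptions above.
ENUM_PROPERTIES = {
    "card": {"high", "low"},
    "dynamism": {"constant", "tv"},
    "randomness": {"random", "nonrandom"},
}

# Assumed spellings for the types known to the system.
KNOWN_TYPES = {"boolean", "number", "string", "json", "array", "map"}


def validate_entry(name, props):
    """Return a list of error messages for one variable's properties."""
    errors = []
    for key, value in props.items():
        if key in ENUM_PROPERTIES:
            if value not in ENUM_PROPERTIES[key]:
                errors.append(f"{name}: invalid {key} value {value!r}")
        elif key in ("domain", "codomain"):
            if value not in KNOWN_TYPES:
                errors.append(f"{name}: unknown type {value!r} for {key}")
        elif key == "corr":
            # Assumption: corr is a list of other variable names.
            if not (isinstance(value, list)
                    and all(isinstance(v, str) for v in value)):
                errors.append(f"{name}: corr must be a list of variable names")
        elif key == "domain_y":
            # Value format is not specified in the text; accepted as-is.
            pass
        else:
            errors.append(f"{name}: unknown property {key!r}")
    return errors


if __name__ == "__main__":
    # Hypothetical entry for a variable called "userid".
    entry = {"card": "high", "randomness": "nonrandom", "domain": "number"}
    print(validate_entry("userid", entry))
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in a config file in one pass.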
-
[1] Note that although multiplying two low-cardinality variables could in principle yield a high-cardinality one, this is unlikely. Since our static analyses are conservative, we assume that the product of two low-cardinality operands also has low cardinality.