About a month ago, Charles Sutton stopped by UMass to give a talk called "Statistical Analysis Of Computer Programs." Here are my lightly-edited, cleaned-up (ish) notes on the talk (this approach inspired by ezyang's amazing note-taking abilities):

conceit -- program text as language
(bad metaphor: a computer program is just a precise set of instructions)
people bring more aspects -- social interactions
language is a good metaphor because code is a means of human communication

when writing code, others may want to use it later
next person might be you!
research angle:
-- lots of open source code online
-- lots of implicit knowledge in how to write software -- library APIs, avoiding bugs
-- keeping code easy to read, easy to maintain
take that implicit knowledge and make it explicit
writing code against a new library? look at what people did in the past
suggest patterns
key insight: means of communication
regularities of natural language (NL) may be found in programming languages (PL)
no new machine learning
apply existing techniques from statistical NLP to find new patterns
coding conventions
-- names, formatting
summarization
-- get a compressed representation of long verbose source files
mining idioms
-- small syntactic patterns in code
describe how we use them

NL coding conventions
local->global movement of abstraction
what kinds of coding conventions?
examples:
junit example -- creating an input stream in java
create an identifier: the name of the input stream
you might know the kind of name someone who contributed to the junit project would use
formatting--e.g. braces
a coding convention is a syntactic constraint beyond those imposed by the language's grammar
programmers themselves decide to impose it on top of what the compiler requires them to do
developers care a lot
style checkers, style guides

“small amount” of research in software engineering
how they use these in software engineering...

went through code reviews in big commercial (microsoft) projects to find out which conventions matter

review threads discuss different aspects of the committed code

38 percent of discussion related to conventions rather than functionality
(why i hate code review!)

why not just run a formatter over the code?
corner cases it doesn't handle for you
renaming variables to be more consistent → (jsnice)
review time -- can talk more about the functionality
where do conventions come from?
- implicit from code base
one programmer starts, others pick it up
emergent quality
mores, rather than laws
large number of software constraints, well modeled with statistical machine learning
even with a lot of programming experience, you won't know how things are named
coding convention inference problem --> why not use machine translation to take my conventions and change them to your conventions?

eclipse plugin called devstyle

click on your identifier, will give you a list of other names
renaming suggestions
how should class objects be named
some disagreement with conventions
we can suggest a name
go through a region and rank names
block of code someone wants to add to the project

how large of a corpus would you need?

what kind of technology do you use inside the scoring function?
n-gram language model

smoothing, taking into account the constraints of the compiler and language conventions
constrained to a library of changes
only renaming, not generating code (generated code would not be syntactically correct)

pull together all uses of the identifier

look at the set of all other names that have been used anywhere in previous or succeeding context
ask the n-gram language model for the joint probability of the entire file -- sounds really expensive
-- actually only uses the n-grams centered on the thing we want to rename
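
A minimal sketch of that scoring step, assuming a generic trained n-gram model behind a hypothetical `NgramModel` interface (the actual Naturalize implementation differs in details like smoothing and candidate generation):

```java
import java.util.*;

/** Hypothetical interface standing in for the trained n-gram language model. */
interface NgramModel {
    double logProb(List<String> ngram);  // log-probability of a single n-gram
}

class RenameScorer {
    private static final int N = 3;      // trigram model assumed
    private final NgramModel model;

    RenameScorer(NgramModel model) { this.model = model; }

    /**
     * Score a candidate name by summing the log-probabilities of every
     * n-gram window that overlaps a use of the identifier. This differs
     * from the joint probability of the whole file only by terms that do
     * not depend on the name, so it ranks candidates identically and is
     * far cheaper. (Deduplication of windows shared by nearby uses is
     * omitted for brevity.)
     */
    double score(List<String> fileTokens, Set<Integer> usePositions, String candidate) {
        List<String> renamed = new ArrayList<>(fileTokens);
        for (int pos : usePositions) renamed.set(pos, candidate);
        double logProb = 0.0;
        for (int pos : usePositions) {
            for (int start = Math.max(0, pos - N + 1);
                 start <= Math.min(pos, renamed.size() - N); start++) {
                logProb += model.logProb(renamed.subList(start, start + N));
            }
        }
        return logProb;
    }
}
```

Candidates whose score beats the current name by less than some confidence margin would be suppressed, which is the thresholding mentioned below.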

laura: coreference analysis on the code
-- knowing that i and i are the same gives you a nonlinear language model
-- can you get something more robust
-- tapping the compiler for name resolution
-- how do you incorporate that into the model? (only incorporated into the suggestions -- done post hoc)

score by ngram model, threshold so user doesn't see terrible suggestions
side effect of architecture:
don't want names to be very common
system does not choose really common names
sympathetic uniqueness principle
if a program entity is unusual, give it an unusual name
for a very domain-specific thing, such as smoothing, choose an appropriate name -- appropriateness judged by its statistical properties

in the training set, if an identifier occurs infrequently, replace it with an unknown token (deals with the tail of the distribution)
in the suggestion process, only propose known tokens
if the context uses a rare word, don't suggest a change
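
A sketch of that vocabulary handling; the minimum-count cutoff is a made-up parameter, not from the talk:

```java
import java.util.*;

/** Sketch: collapse rare identifiers to an UNK token at training time, and
 *  refuse to touch identifiers that were rare in training. */
class Vocabulary {
    static final String UNK = "<UNK>";
    private final Map<String, Integer> counts = new HashMap<>();
    private final int minCount;          // cutoff is an assumption, not from the talk

    Vocabulary(int minCount) { this.minCount = minCount; }

    void observe(String token) { counts.merge(token, 1, Integer::sum); }

    /** Training-time mapping: rare tokens become UNK (handles the tail of the distribution). */
    String canonical(String token) {
        return counts.getOrDefault(token, 0) >= minCount ? token : UNK;
    }

    /** Suggestion-time guard: if the current name was rare in training, leave it alone. */
    boolean mayRename(String currentName) {
        return counts.getOrDefault(currentName, 0) >= minCount;
    }
}
```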

what happens if you have a common word where an unusual one belongs? -- don't know if there is an answer to this question

like adding a new table in a Chinese restaurant process (CRP)

formatting conventions
encoding spacing decisions as tokens (indexed by location)

use the same framework for suggestions on this basis
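
For example, whitespace runs can be turned into explicit tokens so the same n-gram machinery scores formatting choices; the token scheme below is made up for illustration:

```java
import java.util.*;

/** Sketch: interleave explicit whitespace tokens with code tokens. */
class FormatTokenizer {
    static List<String> tokenize(String line) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            int j = i;
            if (line.charAt(i) == ' ' || line.charAt(i) == '\t') {
                while (j < line.length() && (line.charAt(j) == ' ' || line.charAt(j) == '\t')) j++;
                out.add("<WS_" + (j - i) + ">");   // a spacing decision, now a token
            } else {
                while (j < line.length() && line.charAt(j) != ' ' && line.charAt(j) != '\t') j++;
                out.add(line.substring(i, j));      // an ordinary code token
            }
            i = j;
        }
        return out;
    }
}
```

So `tokenize("if (x) {")` yields `[if, <WS_1>, (x), <WS_1>, {]`, and the language model can then prefer whichever spacing the codebase actually uses.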

does this thing work?
evaluation methodology
automatic evaluation:
-- doesn't say what this is (is it the generative thing, idempotency? -- that's how you would explain it)
-- should really be doing a case study or user study or human evaluation or whatever

don't want low precision in this tool because people won't use it
95% accuracy -- basically, what google does
can do this for other types of identifiers, but it's harder
sympathetic uniqueness -- do we rename everything as i?
x axis: how often do we suggest revisions
y axis: element of surprise -- of the things that were rare, what percentage did we incorrectly try to rename
set threshold high, no suggestions, no new names
set threshold low, rename everything

methods, variables, and types are very different
variables and types back off
methods are much more surprising all of the time

the Naturalize tools

final thing -- test on github, submit patches
looked at the top 5 suggestions
submitted 18 patches
14 accepted
do programmers really care about this kind of naming?
e.g., suggest that their exceptions be renamed from e to t, or a throwable from t to e?
people accepted t
evidence that programmers think about and care about names
paper on arXiv -- Fowkes, Ranca, Allamanis, Lapata, Sutton

question: what about users with bad style?
- an AI-complete question

question: hard metrics for successful renaming -- most people like it better?
- programmers are picky
- does it actually make programs easier to maintain

question: fancier language model better?
- yes, think so
- type names are recoverable -- very conventional (very GP)
- java -- type names are in 1:1 correspondence with classes

question: run on dynamic languages?
-- no results
-- corpus of java, c, python
don't know if it's been run on python -- i don't think it would work as well on python because it is not as redundant as java

new topic:
autofolding to summarize code

summarize, compress out java boilerplate
use code folding (which obviously uses the fact that we know blocks are denoted by braces)

is the summary just folding?

different audiences
task based vs non task based
experience versus novice
expert in project
first look problem -- opening a single file for the first time, get an overview

TASSAL -- tree based autofolding software summarization algorithm
start with file, parse, say we want to compress certain types -- block statements, comments

foldable tree -- subset of the AST that contains the nodes we could consider folding

file → bag of words -- split identifiers by camel case
some of these are going to be generic java stuff
some are concepts used throughout the project
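
A sketch of the camel-case split above (underscore splitting added as an assumption):

```java
import java.util.*;

/** Sketch: split identifiers into the word tokens used in the bag of words. */
class IdentifierSplitter {
    static List<String> split(String identifier) {
        List<String> words = new ArrayList<>();
        // insert a boundary before each uppercase letter, then split on boundaries and underscores
        String spaced = identifier.replaceAll("([a-z0-9])([A-Z])", "$1 $2");
        for (String w : spaced.split("[ _]+"))
            if (!w.isEmpty()) words.add(w.toLowerCase());
        return words;
    }

    public static void main(String[] args) {
        System.out.println(split("createInputStream"));   // [create, input, stream]
        System.out.println(split("DataSourceUtils"));     // [data, source, utils]
    }
}
```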

topic models -- find characterizing words for files and packages; tried method as leaves, but this was too sparse

for each node, pick a mixing distribution

single topic per file -- packages or other levels of abstraction <-- wonder if you could use this for refactoring

-- fit the model
this gives us, for each token in the source file, an indicator variable for whether it is generated from the java, package, or file level (how characteristic it is)
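
My reading of that step, as a per-token mixture responsibility; the paper's exact model may differ:

```java
/** Sketch: posterior responsibility of each source (java background, project/package
 *  topic, file topic) for one token, given a fitted mixture. A token whose mass sits
 *  on the file topic is "characteristic" and worth keeping in the summary. */
class TokenSource {
    // p(source | token) is proportional to p(source) * p(token | source)
    static double[] responsibilities(double[] mixingWeights, double[] tokenProbGivenSource) {
        double[] r = new double[mixingWeights.length];
        double norm = 0.0;
        for (int s = 0; s < r.length; s++) {
            r[s] = mixingWeights[s] * tokenProbGivenSource[s];
            norm += r[s];
        }
        for (int s = 0; s < r.length; s++) r[s] /= norm;   // normalize to a distribution
        return r;
    }
}
```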

think of it as an optimization problem
binary vector u
each element indicates whether a node is folded
-- okay, so topic models for summarization
look at all tokens assigned to the file via the generative process
empirical distribution over the nodes included in the summary

constraint: stay within a budget on the length of the summary

tree consistency
-- if a node is included in the summary, must include parents

optimise via greedy algorithm
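
A sketch of how such a greedy algorithm could look, with the scoring left abstract; this is my reconstruction of the setup described above, not TASSAL's actual code:

```java
import java.util.*;

/** Sketch: greedily unfold the node with the best score-per-token until the
 *  summary budget is spent, unfolding ancestors first (tree consistency). */
class GreedySummarizer {
    static class Node {
        Node parent;        // null at the root
        int length;         // tokens this node adds to the summary (assumed > 0)
        double score;       // e.g. mass of file-topic ("characteristic") tokens
        boolean unfolded;
    }

    static Set<Node> summarize(List<Node> foldable, int budget) {
        Set<Node> summary = new HashSet<>();
        int used = 0;
        while (true) {
            Node best = null;
            double bestRatio = 0.0;
            for (Node n : foldable) {
                if (n.unfolded) continue;
                int cost = costWithFoldedAncestors(n);
                if (used + cost > budget) continue;           // budget constraint
                double ratio = n.score / cost;
                if (ratio > bestRatio) { bestRatio = ratio; best = n; }
            }
            if (best == null) break;                          // nothing affordable remains
            used += costWithFoldedAncestors(best);
            for (Node m = best; m != null; m = m.parent) {    // tree consistency:
                if (!m.unfolded) { m.unfolded = true; summary.add(m); }  // parents come along
            }
        }
        return summary;
    }

    /** Cost of unfolding a node, counting any still-folded ancestors it drags in. */
    private static int costWithFoldedAncestors(Node n) {
        int cost = 0;
        for (Node m = n; m != null; m = m.parent)
            if (!m.unfolded) cost += m.length;
        return cost;
    }
}
```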

question: what about naming or other conventions that are drawn from different natural language distributions?
-- clusters of developers follow different conventions -- cluster developers together using both formatting and naming -- otherwise the models don't work as well

question: what if i have multiple devs collaborating ON THE SAME FILE?
-- run a topic per contributor
-- where does z come from?

taking topic models and applying them to naming? not done yet

look at example topics
columns in the example are topics
three background topics (e.g. get, string, value, name, type, object, i)
projects: spring, bigbluebutton
files: datasourceutils, qualsp

to evaluate:
create gold standard
folded files manually to measure precision and recall
compare with
javadocs -- always include, add random
shallowest to deep
expand nodes in order of length
heuristic, but that's all eclipse is doing -- so this is comparing with the state of the art

second:
show summaries to developers
6 developers, avg. 4 years industrial experience
-- rate conciseness and usefulness
these are more concise and useful
automatic summaries from TASSAL were better than any of the other baselines

third thing:
mining idioms from code using existing NLP tools

what are code idioms?

example: reading into a buffer, iterating over an array
-- are these all things that are encapsulated by other abstractions?
opening resources/context -- common pattern
need metavariables

a code idiom is a syntactic code fragment that recurs frequently across software projects and has a single semantic purpose
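
For instance, the resource-opening pattern as a template; `$TYPE`, `$var`, `$EXPR`, `$BODY` are the metavariables (placeholder names mine):

```java
// The idiom fixes the syntactic skeleton; instances fill in the metavariables:
//
//     try ($TYPE $var = $EXPR) {
//         $BODY
//     }
//
// One concrete instance of that template:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

class IdiomInstance {
    static String firstLine(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            return reader.readLine();   // the resource is closed automatically
        }
    }
}
```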

-- wondering if you could learn to match ASTs from one language to another (from one that has these higher level abstractions to one that doesn't)

idiom-related tools -- intellij and eclipse

no way of identifying which idioms are useful (presumably to add them to the IDEs -- how do you find new ones)

other types of code patterns:
-- surface level -- code clones: copy-pasted code fragments
-- API patterns: usage patterns of methods

idiom mining problem --
can i find these templates?

use a probabilistic grammar
CFG slides, pCFG slide

use tree substitution grammar -- a relative of tree adjoining grammar
non-terminal can expand into a tree instead of a list of terminals and nonterminals

can make a probabilistic version

tree substitution grammar over tree nodes and regular expansions
represents a family of idioms

this allows us to represent these idioms

input: a corpus of ASTs
learn a probabilistic tree substitution grammar
every tree rule in the learned TSG is treated as an idiom

convert into a textual representation

build a library of idioms to show developers
how do we infer the grammar?

maximum likelihood conditioned on the pTSG rules

previous work from sharon goldwater et al
-- infer what these trees are, pick list of trees that best explain the corpus
-- the number of possible trees that could go in theta is intractable; maximum likelihood is degenerate -- it picks rules in 1:1 correspondence with the trees

don't make tree fragments too big
put a prior on probabilistic grammars
if you're going to add another tree, this is what the idiom would look like
get a joint distribution over pCFGs and source files
dist. over dist. of parse trees
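
As a sketch of the setup being described, assuming a Dirichlet process prior per nonterminal in the style of the Goldwater-line TSG work (the talk's exact prior may differ):

$$
p(\theta \mid \mathcal{T}) \;\propto\; p(\theta) \prod_{t \in \mathcal{T}} p(t \mid \theta),
\qquad \theta_c \sim \mathrm{DP}\big(\alpha_c,\, P_0(\cdot \mid c)\big)
$$

where $\mathcal{T}$ is the corpus of ASTs, $\theta_c$ is the distribution over fragments rooted at nonterminal $c$, and the base distribution $P_0$ (derived from a pCFG) assigns larger fragments exponentially smaller probability -- which is what stops the tree fragments from getting too big.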

given a corpus of code, get a distribution over probabilistic grammars over the trees i've inferred
type-based MCMC from Liang et al. (think this is from Liang's GP-like work)

some questions i didn't catch

mined idioms
iterator, loop through lines, logger for class, string constant
get patterns you would actually find
get something from actual APIs
e.g. database transaction (opening a resource and cleaning up properly)
get the distance between two points in ??
jsoup get html

lots of work in SE on API mining
-- but no syntactically nested things

take a held-out set of files
percentage of AST nodes explained by the mined idioms
compare with an existing method for clone detection
completely duplicated code?

copy paste phenomenon

idioms we find occur across projects much more often…?
SE perspective -- dozens of papers in SE about copy-paste clones

if these things are really idioms, maybe they will occur more often in example code -- that is actually what happens
from a data set of regular projects on github, we find 22% of idioms found are actually used in examples, higher in stack overflow

finally, can do a co-occurrence matrix -- how are idioms used across different projects
eclipse snipmatch
-- open source addition to eclipse -- manually curated 44 snippets
can't put all the idioms in the tool -- so many were found
considered this as validation that the thing works
interesting that there was one idiom used that is considered bad practice

exploiting that source code is a means of human communication

maybe surprising to people from a different background that you would need to train the model