Woah, it's been a while since I actually made a blog post. I'm going to to try to make a blog post every day until I head off for my Facebook internship in June (super psyched!). I'm putting it in writing, so that's how you know I'm definitely going to break my word...
Anyway, back to the complaint at hand. Today was the final machine learning class of the semester and people were presenting their projects. One of the groups wasn't really a group, so much as two students working loosely on the same approach to different problems. At the end of their presentation, I asked what commonalities there were between the two projects, other than both using CRFs to model the problem. They said that they were both trying to show that some kind of regularization would cause improvement on the state of the art of their respective data sets. I then asked if they were using the same base code. The students responded that they were not: one was implemented in Matlab and the other in Scala. They had reasons for this (mainly, the Scala student uses Scala in his lab, and he seemed to have scared off the Matlab student from using Scala because of its "learning curve" -- see, Scala, you seriously need some better PR). I was troubled that they weren't using comparable backends, since this meant they would definitely not be able to generalize without modeling the impact of measurement on their results. I didn't ask if they were implementing the state of the art in their systems, so as to make them comparable (although I should have). In the end, I thought maybe it wouldn't have that much of an impact.
At the end of the class, I picked up my final "mini-project." We were given some sample output to test against and for my implementation of a one-vs-all multiclass classifier using a polynomial kernel, I was only able to get a test accuracy of 78.70%, versus the expected test accuracy of 87.3%. The TA (whom I don't mean to criticize -- he's done a great job in this class, and I really appreciate it!) had written "too low, but I couldn't find errors in your code. May be some small bugs." Now, I had originally implemented other versions of other parts of this project; they too were also too low. Those older versions of the code were written using for-loops. After porting my loops over to matrix multiplications, I was able to improve the accuracy of my test set to equal that of the provided checkpoints. All this is to say that I strongly suspect the offending code (then one-vs-all multiclass classifier using a polynomial kernel) is semantically equivalent to a matrix-operation-only version, but will produce "better" results. This (of course!) is an exercise in dealing with numeric stability.
The futility of comparisons?
So of course looking at my homework and reflecting on my question for my colleagues, I felt confirmation that since we can't even compare semantically equivalent computations when written in the same language, how can we compare across languages? Part of the reason why people use libraries like BLAS is because they are very efficient, but it also has the effect of determinism when comparing against other implementations -- we don't have faithful models of computation to provide meaningful comparisons between implementations when people use different floating point operations on different architectures. These issues are clearly tied to problems of reproducibility that have been getting a lot of buzz in our community lately.
Immediately before class I had seen Andrew Gelman's post on how we undervalue measurement in statistics. Although numerical analysis has been tackled by a lot of people who are way smarter than I am, it still causes problems for practitioners. With the rise of deep learning, big data, and data science, more and more people are using tools whose shortcomings are non-obvious. As a computer scientist, I am aware that floating point numbers are problematic, and that compounded floating point error can be a problem. I was still shocked by how much of a problem it was. I expected a small amount of variation, but was surprised that my result deviated as much as is did. Of course, the problem could still be in the code itself -- after all, it isn't statically verified. However, I still noticed differences in test accuracy for the other code that I converted from loops to matrix operations. These differences were on the order of a few percentage points. Although an order of magnitude smaller than the difference in the kernelized one-vs-all classifier, the difference of two percentage points or so actually makes a result publishable (at least in some ML/NLP conferences I've seen).
I wonder if we would get some variation in the test accuracy using loops just from randomizing the layout of some junk under the surface. Lisps have been partially addressing numerical stability by having ratio types. There must be some languages that do some semi-symbolic computations, simplifying and deferring floating point operations until a value is needed. If there were true, they would be able to return some bound on the error of that computation.