30 October 14.

The difference between Statistics and Data Science


An academic field is a social network built around an accepted set of methods.

Economics has grown into the study of human decision making in all its aspects. At this point, nobody finds it weird that some of the most heavily cited papers in the Economics journals are about the decision to commit crimes or even suicide. These papers use methods accepted by economists: writing down utility functions and using certain mathematical techniques to extract information from those utility functions. Anthropologists also study suicide and crime, but using entirely different methods. So do sociologists, using yet another set of tools. Which journal you submit your paper on crime to depends on matching the methods you use to the methods readers will be familiar with, not on the subject.

A notational digression: I hate the term "data science". First, there's a general rule (with exceptions) that anybody who describes what they're doing as "science" is not a scientist—self-labelling like that is just trying too hard. And are we implying that other scientists don't use data? Is it the data or the science that statisticians are lacking? Names are just labels, but I'll hide this one under an acronym for the rest of this post. I try to do the same with the United States DHS.

I stress that the distinction is about the set of salient tools because I think it's important to reject other means of cleaving apart the Statistics and DS networks. Some just don't work well, and some are as odious as any other "our people do it like this, but those people do it like that" kind of generalization. These are claims about how statisticians are too interested in theory and too willing to assume a spherical cow, or that DSers are too obsessed with hype and aren't careful with hypothesis testing. Hadley explains that "...there is little work [in Statistics] on developing good questions, thinking about the shape of data, communicating results or building data products", which is a broad statement about the ecosystem that a lot of statisticians would dispute, and a bit odd given that he is best known for building tools to help statisticians build data products. It's not hard to find people who say that DS is more applied than Stats, which is a claim about the environment that is hard to quantify and prone to observation bias. From the comment thread of this level-headed post: "I think the key differentiator between a Data Scientist and a Statistician is in terms of accountability and commitment."

Whatever.

We can instead focus on characterizing the two sets of tools. What is common knowledge among readers of a Stats journal and what is common knowledge among readers of a DS journal?

It's a subjective call, but I think it's uncontroversial to say that the abstract methods chosen by the DSers rely more heavily on modern computing technique than commonly-accepted stats methods, which tend to top out in computational sophistication around Markov Chain Monte Carlo.

One author went to the extreme of basically defining DS as the practical problems of data shunting and building Hadoop clusters. I dispute that any DSer would really accept such a definition, and even the same author effectively retracted his comment a week later after somebody gave him an actual DS textbook.

If you want to talk about tools in the sense of using R versus using Apache Hive, the conversation won't be very interesting to me but will at least be a consistent comparison on the same level. If we want to talk about generalized linear models versus support vector machines, that's also consistent and closer to what the journals really care about.

The basic asymmetry, that the price of admission for using DS techniques is greater computational sophistication, will indeed have an effect on the people involved. If we threw a random bunch of people at these fields, those who are more comfortable with computing would sort themselves into DS and those less comfortable into Stats. We wind up with two overlapping bell curves of computing ability, such that it is not hard to find a statistician-DSer pair where the statistician is the better programmer, but in expectation a randomly drawn DSer writes better code than a randomly drawn statistician. So there's one direct corollary of the two accepted sets of methods.

Three Presidents of the ASA wrote on the Stats vs DS thing, and eventually faced the same technical asymmetry:

Ideally, statistics and statisticians should be the leaders of the Big Data and data science movement. Realistically, we must take a different view. While our discipline is certainly central to any data analysis context, the scope of Big Data and data science goes far beyond our traditional activities.

This technical asymmetry is a real problem for the working statistician, and statisticians are increasingly fretting about losing funding—and for good reason. Methods we learned in Econ 101 tell us that an unconstrained set leads to an unambiguously (weakly) better outcome than a constrained set.

If you're a statistician who is feeling threatened, the policy implications are obvious: learn Python. Heck, learn C—it's not that hard, especially if you're using my C textbook, whose second edition was just released (or Modeling with Data, which this blog is ostensibly based on). If you have the grey matter to understand how the F statistic relates to SSE and SSR, a reasonable level of computing technique is well within your reach. It won't directly score you publications (DSers can be as snobby about writing code being a "mere clerical function" as the statisticians and the US Federal Circuit can be), but you'll have available a less constrained set of abstract tools.

If you are in the DS social network, an unconstrained set of tools is still an unambiguous improvement over a constrained set, so it's worth studying what the other social network takes as given. Some techniques from the 1900s are best left in the history books, but now and then you find ones that are exactly what you need—you won't know until you look.

By focusing on a field as a social network built around commonly accepted tools, we see that Stats and DS have more in common than differences, and can (please) throw out all of the bigotry that comes with searching for differences among the people or whatever environment is prevalent this week. What the social networks will look like and what the labels will be a decade from now is not something we can write a policy for (though, srsly, we can do better than "data science"). But as individuals we can strive to be maximally inclusive by becoming conversant in the techniques that the other social networks are excited by.

Next time, I'll have more commentary derived from the above definition of academic fields, then it'll be back to the usual pedantry about modeling technique.


23 October 14.

Bayes v Kolmogorov


$\def\Re{{\mathbb R}} \def\datas{{\mathbb D}} \def\params{{\mathbb P}} \def\models{{\mathbb M}} \def\mod#1{M_{#1}}$

We have a likelihood function that takes two inputs, which we will name the data and the parameter, and which gives the nonnegative likelihood of that combination, $L: d, p \to \Re^+$. [I wrote a lot of apropos things about this function in an early blog post, by the way.]

The two inputs are symmetric in the sense that we could slice the function either way. Fixing $p=\rho$ defines a one-argument function $L_\rho: d\to \Re^+$; fixing $d=\delta$ defines a one-argument function $L_\delta: p \to \Re^+$.

But the inputs are not symmetric in a key way, which I will call the unitary axiom (it doesn't seem to have a standard name). It's one of Kolmogorov's axioms for constructing probability measures. The axiom states that, given a fixed parameter, some value of $d$ will be observed with probability one. That is, \begin{equation} \int_{\forall \delta} L_\rho(\delta) d\delta = 1, \forall \rho. \end{equation} In plain language, when we live in a world where there is one fixed underlying parameter, one data point or another will be observed with probability one.
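
A quick numeric illustration, with a Normal$(\rho, 1)$ density standing in for $L_\rho$ (my choice of example; nothing special about it): whatever value we fix for the parameter, integrating over all possible data gives one.

# the unitary axiom, checked numerically for a few fixed parameter values
for (rho in c(0, 3, 4))
    print(integrate(function(d) dnorm(d, mean=rho, sd=1), -Inf, Inf)$value)
# each line prints 1 (up to integration error)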

This is a strong statement, because we read the total density as an indication of the likelihood of the parameter taking on the given value. I tell you that $p=3$, and we check the likelihood and see that the total density on that state of the world is one. Then you tell me that, no, $p=4$, and we refer to $L(d, 4)$, and see that it integrates to one as well.

Somebody else comes along and points out that this may work for discrete-valued $p$, but a one-dimensional slice isn't the right way to read a continuous density, insisting that we consider only ranges of parameters, such as $p\in[2.75,3.25]$ or $p \in [3.75,4.25]$. But if the integral over a single slice is always one, then the double integral is easy: $\int_{\rho\in[2.75,3.25]}\int_{\forall \delta} L(\delta, \rho) d\delta d\rho$ $=\int_{\rho\in[2.75,3.25]} 1 d\rho$ $=.5$, and the same holds for $p \in [3.75,4.25]$. We're in the same bind, unable to use the likelihood function to put more density on one set of parameters compared to any other of the same size.

This rule is asymmetric, by the way, because if we had all the parameters in the universe, whatever that might mean, and a fixed data set $\delta$, then $\int_{\forall \rho} L_\delta(\rho) d\rho$ could be anything.

Of course, we don't have all the data in the universe. Instead, we gather a finite quantity of data, and find the most likely parameter given that subset of the data. For example, we might observe the data set $\Delta=\{2, 3, 4\}$ and use that to say something about a parameter $\mu$. I don't want to get into specific functional forms, but for the sake of discussion, say that $L(\Delta, 2)=.1$; $L(\Delta, 3)=.15$; $L(\Delta, 4)=.1$. We conclude that three is the most likely value of $\mu$.

What if we lived in an alternate universe where the unitary axiom didn't hold? Given a likelihood function $L(d, p)$ that conforms to the unitary axiom, let $$L'(d, p)\equiv L(d, p)\cdot f(p),$$ where $f(p)$ is nonnegative and finite but otherwise anything. Then the total density on $\rho$ given all the data in the universe is $\int_{\forall \delta} L_{\rho}(\delta)f(\rho) d\delta = f(\rho)$.

For the sake of discussion, let $f(2)=.1$, $f(3)=.2$, $f(4)=.4$. Now, when we observe $\Delta=\{2, 3, 4\}$, $L'(\Delta, 2)=.01$, $L'(\Delta, 3)=.03$, $L'(\Delta, 4)=.04$, and we conclude that $\mu=4$ is the most likely value of $p$.
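
Here is that toy calculation transcribed into a few lines of R, in case you want to fiddle with the numbers:

L <- c("2"=.10, "3"=.15, "4"=.10)   # L(Delta, p) for p = 2, 3, 4
f <- c("2"=.10, "3"=.20, "4"=.40)   # the weighting function f(p)
names(which.max(L))        # "3": the most likely value under L alone
Lprime <- L * f            # .01, .03, .04
names(which.max(Lprime))   # "4": the most likely value once f breaks the unitary axiom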

Bayesian updating is typically characterized as a composition of two functions, customarily named the prior and the likelihood. In the notation here, these are $f(p)$ and $L(d, p)$. Without updating, all values of $p$ are equally likely in the world described by $L$, until data is gathered. The prior breaks the unitary axiom, and specifies that, even without gathering data, some values of $p$ are more likely than others. When we do gather data, that prior belief that some values of $p$ are more likely than others informs the conclusions we draw from the data.

Our belief about the relative preferability of one value of $p$ over another could be summarized into a proper distribution, but once again, there is no unitary axiom requiring that a distribution over the full parameter space integrate to one. For example, the bridge from the Bayesian-updated story to the just-a-likelihood story is the function $f(\rho)=1, \forall \rho$. This is an improper distribution, but it does express that each value of $p$ has the same relative weight.

In orthodox practice, everything we write down about the data follows the unitary axiom. For a given observation, $L'(\delta, p)$ is a function of one variable, sidestepping any issues about integrating over the space of $d$. We may require that this univariate function integrate to one, or just stop after stating that $L'(\delta, p) \propto f(p)L(\delta, p)$, because we usually only care about ratios of the form $L'(\delta, \rho_1)/L'(\delta, \rho_2)$, in which case rescaling is a waste of time.

In a world where all parameters are observable and fixed, the unitary axiom makes so much sense it's hard to imagine not having it. But in a meta-world where the parameter has different values in different worlds, the unitary axiom implies that all worlds have an equal slice of the likelihood's density. We usually don't believe this implication, and Bayesian updating is our way of side-stepping it.


14 July 14.

Microsimulation games, table top games


I wrote a game. It's called Bamboo Harvest, and you can see the rules at this link. You can play it with a standard deck of cards and some counters, though it's much closer to the sort of strategic games I discuss below than poker or bridge. I've played it with others and watched others play it enough to say it's playable and pretty engaging. Ms NGA of Baltimore, MD gets really emotional when she plays, which I take as a very good sign.

Why am I writing about a game on a web page about statistical analysis and microsimulation? I will leave to others the topic of Probability theory in table top games, but there is also a lot that we who write economic models and microsimulations of populations can learn from game authors. After all, the designers of both board games and agent-based models (ABMs) have the same problem: design a set of rules such that the players in the system experience an interesting outcome.

Over the last few decades, the emergent trend among board games has been so-called Eurogames, which are aimed at an adult audience, seek greater interaction among players, and typically include an extensive set of rules regarding resource trading and development. That is, the trend has been toward exactly the sort of considerations that are typical of agent-based models.

A game whose resource-exchange rules are too complex, or that is simple enough to be easily "solved", will not have much success in the market. In most games, the optimal move in any given situation could theoretically be solved for by a hyperrational player, but the fact that players find them challenging demonstrates that the designers have found the right level of rule complexity for a rational but not hyperrational adult. We seek a similar complexity sweet spot in a good ABM. Readers shouldn't get lost in all the moving parts, but if the model is so simple that readers know what it will do before it is run—if there's no surprise—then it isn't worth running.

Of course, we are unconcerned as to whether our in silico agents are having any fun or not. Also, we get to kill our agents at will.

Simulation designers sometimes have a sky's-the-limit attitude, because processor time is cheap, but game designers are forced by human constraints to abide by the KISSWEA principle (keep it simple, stupid, without extraneous additions). It's interesting to see what game designers come up with to resolve issues of simultaneity, information provision and hiding, and other details of implementation, when the players have only counters and pencil and paper.

Market and supply chain

Settlers of Catan is as popular as this genre of games gets—I saw it at a low-end department store the other day on the same shelf as Monopoly and Jenga. It is a trading game. Each round a few random resources—not random players—are productive, which causes gluts and droughts for certain resources, affecting market prices. The mechanics of the market for goods are very simple: each player has a turn, and on it they can offer trades to other players (or to all players). This already creates interesting market dynamics, without the need for a full open-outcry marketplace or bid-ask book, which would be much more difficult to implement at the table or in code. How an agent decides to trade can also be coded into an artificial player, as demonstrated by the fact that there are versions of Settlers you can play against the computer.

Some games, like Puerto Rico, Race for the Galaxy, Bootleggers, and Settlers again, are supply chain games. To produce a victory point in Puerto Rico, you have to get fields, then get little brown immigrants to work the fields (I am not making this up), then get a factory to process the crops, then sell the final product or ship it to the Old World. There may be multiple supply chains (corn, coffee, tobacco). The game play is basically about deciding which supply chains to focus on and where in the supply chain to put more resources this round. The game design is about selecting a series of relative prices so that the cost (in time and previous supply-chain items) makes nothing stand out as a clear win.

One could program simple artificial agents to play simple strategies, and if one is a runaway winner with a strategy (produce only corn!) then that is proof that a relative price needs to be adjusted and the simulation redone. That is, the search over the space of relative prices maximizes an objective function regarding interestingness and balance. ABMers will be able to immediately relate, because I think we've all spent time trying to get a simple model to not run away with too many agents playing the same strategy.

I'm not talking much about war games, which seem to be out of fashion. The central mechanism of a war game is an attack, wherein one player declares that a certain set of resources will try to eliminate or displace a defending resource, and the defender then declares what resources will be brought to defense. By this definition, Illuminati is very much a war game; Diplomacy barely is. Design here is also heavily about relative prices, because so much of the game is about which resources will be effective when allocated to which battles.

Timing

How does simultaneous action happen when true simultaneity is impossible? The game designers have an easy answer to simultaneously picking cards: both sides pick a card at a leisurely pace, put the card on the table, and when all the cards are on the table, everybody reveals. There are much more complicated means of resolving simultaneous action in an agent-based model, but are they necessary? Diplomacy has a similar simultaneous-move arrangement: everybody picks a move, and an arbitration step uses all information to resolve conflicting moves.

Puerto Rico, San Juan, and Race for the Galaxy have a clever thing where players select the step in the production chain to execute this round, so the interactive element is largely in picking production chain steps that benefit you but not opponents. Setting aside the part where agents select steps, the pseudocode would look like this:

for each rôle:
    for each player:
        player executes rôle

Typical program designs make it really easy to apply a rôle function to an array of players. Josh Tokle implements a hawk and dove game via Clojure. His code has a game-step where all the birds play a single hawk-and-dove game from Game Theory 101, followed by all executing the death-and-birth-step, followed by all taking a move-step.

It's interesting when Puerto Rico and Race for the Galaxy have this form, because it's not how games usually run. The usual procedure is that each player takes a full turn executing all phases:

for each player:
    for each rôle:
        player executes rôle

I'd be interested to see cases where the difference in loop order matters or doesn't.
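
Here is one toy case, in R rather than at the table, where it does matter. The rules are mine and deliberately minimal: one rôle takes a counter from a shared pool, and the other scores the pool's current size.

take <- function(i) pool <<- pool - 1              # rôle 1: deplete the shared pool
gain <- function(i) score[i] <<- score[i] + pool   # rôle 2: score the pool's current size

pool <- 5; score <- c(0, 0, 0)
for (f in list(take, gain)) for (i in 1:3) f(i)    # for each rôle, for each player
score   # 2 2 2: every player scores the same depleted pool

pool <- 5; score <- c(0, 0, 0)
for (i in 1:3) for (f in list(take, gain)) f(i)    # for each player, for each rôle
score   # 4 3 2: earlier players see a fuller pool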

Topology

One short definition of topology is that it is the study of what is adjacent to what.

The Eurogamers seem to refer to the games with very simple topologies as abstracts—think Go or Chess. Even on a grid, the center is more valuable in Chess (a center square is adjacent to more squares than an edge square) and the corners are more valuable in Go (being adjacent to fewer squares $\Rightarrow$ easier to secure).

Other games with a board assign differential value to areas via other means. War games typically have maps drawn with bottlenecks, so that some land is more valuable than others. Small World has a host of races, and each region is a valuable target for some subset of races.

I'm a fan of tile games, where the map may grow over time (check out Carcassonne), or what is adjacent to what changes over the course of the game (Infinite City or Illuminati).

Other games have a network topology; see Ticket to Ride, where the objective is to draw long edges on a fixed graph.

War games often extol complexity for the sake of complexity in every aspect of the game, so I'm going to set those aside. But the current crop of Eurogames tends to focus on one aspect (topology or resource management or attack dynamics) and leave the other aspects to a barebones minimum of complicatedness. Settlers has an interesting topology and bidding rules, and the rest of the game is basically just mechanics. Carcassonne has the most complex (and endogenous) topology of anything I'm discussing here, so the resource management is limited to counting how many identical counters you have left. Race for the Galaxy, Puerto Rico, and Dominion have crazy long lists of goods and relative prices, so there is no topology and only very limited player-interaction rules—they are almost parallel solitaire. A lot of card games have a complete topology, where every element can affect every other.

An example: Monopoly

Back up for a second to pure race games, like Pachisi (I believe Sorry! is a rebrand of a Pachisi variant). Some have an interactive element, like blocking other opponents. Others, aimed at pre-literate children, like Chutes and Ladders or Candyland, are simply a random walk. Ideally, they are played without parental involvement, because adults find watching a pure random walk to be supremely dull. Adults who want to ride a random walk they have no control over can invest in the stock market.

Monopoly is a parallel supply chain game: you select assets to buy, which are bundled into sets, and choose which sets you want to build up with houses and hotels. On top of this is a Chutes and Ladders sort of topology, where you go around a board in a generally circular way at random speed, but Chance cards and a Go to Jail square may cause you to jump position.

The original patent has an explanation for some of these details—recall that Monopoly was originally a simulation of capital accumulation in the early 20th century:

Mother earth: Each time a player goes around the board he is supposed to have performed so much labor upon mother earth, for which after passing the beginning-point he receives his wages, one hundred dollars[...].

Poorhouse: If at any time a player has no money with which to meet expenses and has no property upon which he can borrow, he must go to the poorhouse and remain there until he makes such throws as will enable him to finish the round.

You have first refusal on unowned properties that your token lands on (then they go up for auction, according to the official rules that a lot of people ignore), and you owe rent when your token lands on owned properties, and Mother earth periodically pays you \$200. All of these cash-related events are tied to the board movement, which is not the easiest or most coherent way to cause these events to occur. E.g., how would the game be different if you had a 40-sided die and randomly landed on squares all around the board? Would the game be more focused if every player had a turn consisting of [income, bid on available land, pay rent to sleep somewhere] phases?

The confounding of a supply chain game with randomization via arbitrary movement is what makes it successful, because the Chutes and Ladders part can appeal to children (the box says it's for ages 8 and up), while the asset-building aspects are a reasonable subgame for adults (although it is unbalanced: a competent early leader can pull unsurpassably ahead). But that confounding is also the death of Monopoly as a game for adults, because there are too many arbitrary moving parts about going around an arbitrary track.

I can't picture a modern game designer putting together this sort of combination of elements. I sometimes wonder if the same sort of question could be asked of many spatial ABMs (including ones I've written): is the grid a key feature of the model, or just a mechanism to induce random interactions with a nice visualization?

Conclusion

Microsimulation designers and for-fun game designers face very similar problems, and if you're writing microsimulations, it is often reasonable to ask: how would a board game designer solve this problem? I discussed several choices for turn order, trading, topology, and other facets, and in each case different choices can have a real effect on outcomes. In these games that are engaging enough to sell well, the game designers could only select a nontrivial choice for one or two facets, which become the core of the game; the other facets are left to the simplest possible mechanism, to save cognitive effort by players.

Also, now that you've read all that, I can tell you that Bamboo Harvest focuses on a shifting-tiles topology, with a relatively simple supply chain. We decided against marketplace/trading rules.


12 July 14.

Intercoder agreement: the R script and its tests


Here, I will present an R script to calculate an information-based measure of intercoder agreement.

The short version: we have two people putting the same items into bins, and want to know how often they are in agreement about the bins. It should be complexity-adjusted, because with only two bins, binning at random achieves 50% agreement, while with 100 bins binning at random produces 1% agreement. We can use mutual information as a sensible measure of the complexity-adjusted agreement rate. A few more steps of logic, and we have this paper in the Journal of Official Statistics describing $P_i$, a measure of intercoder agreement via information in agreement. I also blogged this paper in a previous episode.

There are two features of the paper that are especially notable for our purposes here. The first is that I said that the code is available upon request. Somebody called me out on that, so I sent him the code below. Second, the paper has several examples, each with two raters and a list of their choices, and a carefully verified calculation of $P_i$. That means that the tests are already written.

The code below has two functions. We could turn it into a package, but it's not even worth it: just source("p_i.R") and you've got the two defined functions. The p_i function does the actual calculation, and test_p_i runs tests on it. As in the paper, some of the tests are extreme cases like full agreement or disagreement, and others are average tests that I verified several times over the course of writing the paper.

Could it be better? Sure: I don't do a very good job of testing the code for really pathological cases, like null inputs or something else that isn't a matrix. But the tests give me a lot of confidence that the p_i function does the correct thing given well-formed inputs. It's not mathematically impossible for a somehow incorrect function to give correct answers for all six tests, but with each additional test the odds diminish.

Here is the code. Feel free to paste it into your projects, or fork it from Github and improve upon it—I'll accept pull requests with improvements.


p_i <- function(dataset, col1=1, col2=2){

    entropy <- function(inlist){
        -sum(sapply(inlist, function(x){log2(x)*x}), na.rm=TRUE)
    }

    information_in_agreement <- function(diag, margin1, margin2){
        sum <- 0
        for (i in 1:length(diag))
            if (diag[i] != 0)
                sum <- sum + diag[i]*log2(diag[i]/(margin1[i]*margin2[i]))
        return (sum)
    }

    dataset <- as.data.frame(dataset) #in case user provided a matrix.
    crosstab <- table(as.data.frame(cbind(dataset[,col1],dataset[,col2])))
    d1tab <- table(dataset[,col1])
    d2tab <- table(dataset[,col2])
    d1tab <- d1tab/sum(d1tab)
    d2tab <- d2tab/sum(d2tab)
    crosstab <- crosstab/sum(crosstab)

    entropy_1 <- entropy(d1tab)
    entropy_2 <- entropy(d2tab)
    ia <- information_in_agreement(diag(crosstab), d1tab, d2tab)
    return (2*ia/(entropy_1+entropy_2))
}

test_p_i <- function(){
    fullagreement <- matrix(
                    c(1,1,1,1,2,2,2,2,3,3,
                      1,1,1,1,2,2,2,2,3,3),
                    ncol=2, byrow=FALSE
                )

    stopifnot(p_i(fullagreement)==1)

    noagreement <- matrix(
                    c(1,2,1,2,1,2,3,1,3,2,
                      2,1,3,1,2,3,2,2,1,3),
                    ncol=2, byrow=FALSE
                )

    stopifnot(p_i(noagreement)==0)

    constant <- matrix(
                    c(1,1,1,1,1,1,
                      1,1,2,2,2,3),
                    ncol=2, byrow=FALSE
                )

    stopifnot(p_i(constant)==0)

    neg_corr <- matrix(
                    c(1,1,1,1,1,2,2,2,2,2,
                      1,2,2,2,2,1,1,1,1,2),
                    ncol=2, byrow=FALSE
                )

    stopifnot(abs(p_i(neg_corr)- -.2643856) < 1e-6)

    rare_agreement <- matrix(
                    c(1,1,1,2,1,2,2,2,3,3,
                      1,1,1,1,2,2,2,2,3,3),
                    ncol=2, byrow=FALSE
                )

    stopifnot(abs(p_i(rare_agreement)- .6626594) < 1e-6)

    common_agreement <- matrix(
                    c(1,1,1,1,2,2,2,3,2,3,
                      1,1,1,1,2,2,2,2,3,3),
                    ncol=2, byrow=FALSE
                )

    stopifnot(abs(p_i(common_agreement)- 0.6130587) < 1e-6)
}
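
For completeness, a quick usage sketch; the ratings matrix below is made-up example data, not one of the cases from the paper:

ratings <- matrix(c(1,1,2,2,3,3,1,2,3,1,
                    1,1,2,2,3,3,1,2,3,2),
                  ncol=2, byrow=FALSE)   # one column per rater, one row per item
p_i(ratings)   # complexity-adjusted agreement: 1 = full agreement, 0 = chance level
test_p_i()     # runs silently if all six checks pass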


9 May 14.

Testing statistical software III: the contract


So far, I've given a brief intro to the mechanics of assertions and tests, which you can use to increase your own and others' confidence in your code, and I gave some examples of brainstorming theorems to provide constraints that your function's output has to meet.

The thesis of this part is that the tests are the embodiment of a contract you have with your users. Last time, I gave some suggestions about how to test a matrix inversion function, which a bureaucrat would write as a bullet-pointed contract:

  • The output will be such that $X\cdot Inv(X)=I$ to within some tolerance (see below).
  • If the input is $I$, the output will be $I$.
  • If the input is symmetric, the output will be symmetric.

There it is in English; the test is the contract in executable form.
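
Here's a sketch of that contract as executable checks. I'm using R's solve() as a stand-in for the inversion function under test, a well-conditioned random input, and an arbitrary tolerance; the point is the shape of the test, not the specifics.

eps <- 1e-6                                  # the arbitrary tolerance clause
X <- matrix(rnorm(9), 3, 3)                  # a generic non-degenerate input
stopifnot(max(abs(X %*% solve(X) - diag(3))) < eps)   # X . Inv(X) = I, within tolerance
stopifnot(max(abs(solve(diag(3)) - diag(3))) < eps)   # identity in, identity out
S <- crossprod(X) + diag(3)                  # a symmetric (and well-conditioned) input
stopifnot(max(abs(solve(S) - t(solve(S)))) < eps)     # symmetric in, symmetric out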

Writing a numeric routine isn't expressionist improvisation: you've got to conceive of what the function does before you write it. And the test routine is the formalization of what the function promises. The typical workflow is to write the tests after you write the function to be tested, but that makes no sense here. Because the contract and test are siamese twins, and you wrote the contract before writing the function, it makes sense to write the test before writing the function as well. The term for writing the test/contract first and then writing a function that conforms to it is test-driven development, and it's a sensible work habit that should probably be used more (even by myself).

You are going to sit down and write a routine or three to read in a data set, run a regression, and extract output. Same questions: what's the contract you expect, and how much of it can you express as a test ahead of time? Yes, I know that statistical analysis really is expressionist improvisation, and if we knew what we were doing we wouldn't call it science, and exploration is an art upon which we mustn't impose structure. But it's much more fruitful when you explore a data set you can state with confidence was read in correctly, and when an alarm rings if you regress $Y$ on $Y$. Lewis and Clark kept a log of problems they ran into and surmounted; data explorers can too. The big technological plus we modern explorers have is that we can re-execute our log when the next data set comes in.

Also, once you have the contract/test, the documentation almost writes itself. For the special cases you worked out, show them as examples users can cut/paste/verify; for the relatively odd things about symmetry and such, present them as additional useful facts. For some cases you may need more; I suggested a lot of tests for a Bayesian updating routine last time, but the sum of them really aren't enough to fully describe what Bayesian updating is about.

Testing the special cases

And don't forget the weird special cases. When I started writing numeric software, I'd sometimes brush off the one-in-a-million edge cases, because, hey, what're the odds that something like that could happen. Which is an easy question to answer: if you send five million independent observations to each be run through a function, a one-in-a-million event has a 99.3% chance of occurring at least once.
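
If you want to check that figure, it's one line of R:

1 - (1 - 1e-6)^5e6   # = 0.9932621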

Even if you are only sending in a handful of data points to a function that doesn't handle the special cases, the Law of Irony Maximization states that you'll hit that one-in-a-million case anyway.

I hope you find a mathematically clean rule for handling all the edge cases. Hint: your math platform has a floating-point representation for Not-a-Number (NaN), infinity, and -infinity. But in some cases, you just have to come up with an arbitrary rule. Then, add clauses to the contract/tests/documentation. I dunno, how about:

  • If the input is a zero matrix, I return a square of infinities.
  • If the input isn't a square matrix, I return NaN.

To pull a line from a pop song about the endogeneity of social relations, you alone decide what's real. But if you express the rules you wrote in the contract/tests/documentation, then users know exactly what they're getting.

As with legal contracts, the special cases sometimes take many more lines to detail than the simple basic idea. So it goes.

Tolerance

Of course, $X\cdot Inv(X)$ is exactly the identity matrix to machine precision only when we are very lucky. All of my tests wind up with a series of assertions making use of this Diff macro:

#define Diff(L, R, eps) {\
    if(fabs((L)-(R))>(eps)) {    \
        printf(#L "=%g is too different from " #R "=%g (arbitrary limit=%g).\n", (double)(L), (double)(R), eps); \
        abort(); \
    }\
}

//sample usage:
Diff(2./3, .6666, 1e-2); //pass
Diff(2./3, .6666, 1e-8); //fail

The tolerances to which your tests fit are part of the contract, a bit of documentation for those of us who read tests to see what to expect from these functions. Again, you alone decide the tolerance level to test to, and then it's there in the contract for the user to work with.

I used to try really hard to get eps as small as possible, but Apophenia's tests ship to the public, and I'd get emails from people telling me that on their machine, some test produces a result where L-R in the above was 1e-4 while the test listed the max tolerance as 5e-5. It would be nice if that could be improved, but it's very low on my to-do list: I'd typically just change the internal variables from double to long double and consider the case closed. These days I leave the tolerances looser.

Anyway, once you've achieved a reasonable level of accuracy, nobody really cares. In all the endless promotion and bragging about how awesome the latest awesome math package is, how often do you see anybody brag that their package is more precise than the others? We can't really do statistics on arbitrary-precision systems (yet), and those people who truly rely on high precision so they can send their probe to Mars need to do the due diligence of retesting the code anyway.

But tolerances are not to be brushed off entirely, because sometimes these little things really do indicate errors. Maybe you should have divided by $n-1$ instead of $n$.

I may be the only person to have calculated the exact formula for unbiased sample kurtosis (PDF) without recourse to Taylor expansions—it is not a pleasant traipse through mathemagic land. There was a draft of the apop_vector_kurtosis function where the best tolerance I could get was around $10^{-2}$. In this case, the loose tolerance really was telling me something: there was a math typo in my transcription of the messy formula. The test cases now pass to within a tolerance around $10^{-5}$, which is stellar, given that everybody else uses an approximate formula from the start. Tell all your friends: Apophenia has the most precise unbiased sample kurtosis around.

Coverage

When I run my main test script, I cover over 90% of the code, which I know thanks to gcov, a coverage checker that accompanies the GNU Compiler Collection. It lists which lines get hit and how often, which means I know which lines are still completely untested. Because if every piece of code is covered by a contract/test, what does it mean when we find code that isn't tested?

Once I've got the list of untested lines, I've got decisions to make about whether they should be tested and how (or whether to cut them entirely, because they're not part of a contract). There's a sort of consensus that 85% is about the right amount of coverage, and it's not worth shooting for 100%. And I concur: being endowed with only finite time, writing tests to cover that last 15% is typically low priority. Anyway, given that line count is meaningless, percent coverage can also be gamed and is not a statistic worth really obsessing over.

Many platforms have a coverage checker of some sort that behaves analogously to gcov for C. However, some computing platforms have no coverage checker at all, even though rather complex packages are available for them. This freaks me out. If the authors of a complex package have no assurance that every part of the package was tested, then you as a user of that package have no such assurance.

Regression testing

Every time I make a change in Apophenia, I re-compile and re-run every piece of public code I have written that relies on it, including the ships-with-the-library tests, the examples and semi-public solutions for Modeling with Data, the githubbed code for this site, the examples from the documentation, and a few other sundry items. If the tests pass, I know that my change has not broken any of the contracts to date.

If I had stuck to ad hoc tests and eyeballing things, I'd have nothing. But as the code base grew, the size of the little side-tests I added to the master test script grew at a corresponding rate. Small side projects have a few small side-tests; the magnum opera have a formidable automated list of checks.

Back at the beginning of part one, I recommended some simple checks, like whether all the entries in the age category are positive and under 120. Here at the end, those tests have added up to a contract that not only covers everything in the code base, but that can actually guide how the code is written to begin with. The work is much more challenging here in floating-point land, because the larger units can be very difficult to test, and even small blocks require some creativity in coming up with little theorems to assert about outputs, and we will always be only within some tolerance of the true value anyway. But don't let that daunt you: because it is rarely obvious when a calculated number is incorrect, we should be especially vigilant in adding tests and assertions to raise our confidence.

Next time I'll give a relatively simple example, which is mostly an excuse to post some code that I've been meaning to post for a long time.


5 May 14.

Testing statistical software II: there's a theorem somewhere


Last time, I discussed the value of adding assertions to your code, to check guarantees that the code should follow.

I talked about ad hoc assertions as part of a program, which thus run every time the program runs. This flows into unit testing proper, wherein a given unit of code is accompanied by a separate unit of tests, which is typically run only when something changes. Which approach you take is mostly a question of logistics. You've got options.

This time I'll make some suggestions on testing numeric and statistical results. This is a lot more difficult than testing typical CompSci functions, where there is rarely a stochastic element and a broken function is blatantly obvious. When you run your Bayesian updating via MCMC search and get a posterior with mean 3.218, is that right? In situations like this, there's no direct way to prove that the functions used to produce that result are correct, but there are always many means of increasing confidence.

Brainstorming some theorems

The reason that you are writing code to begin with is that there is a theorem somewhere indicating that the method you are using to calculate a value is somehow useful. That theorem probably says that the result you're getting equals some known and useful value or has some useful property. For example, if you wrote a matrix-inverse function, you know there is a theorem that states that $X\cdot Inv(X) = I$ for any non-degenerate $X$ (but see next time about tolerances).

So let's brainstorm some theorems in some typical situations. Few would be something in a textbook with the heading Theorem, but most will be something you can prove or verify relatively easily.

We can start with the special cases: if the input is the identity matrix or zero, it is often trivial to calculate the output. So there's a petit theorem, that the identity matrix or zero will produce a given output, which we can write up as our first test. Such easy and obvious things are a good first place to start, because it is especially easy to set up the inputs and outputs for the test.

For simpler calculations, you may be able to go beyond trivial inputs to a haphazard example you calculated by hand, in which case you have another petit theorem looking something like if the input is $[2, 3, 4]$, then the output is $12.2$.

Can you say something about all possible outputs? Maybe they should all be greater than zero, or otherwise bounded. Maybe the output can't be greater than the sum of the inputs. If the input has some sort of symmetry, will the output be symmetric? With these little theorems, we can write a test that generates a few thousand random inputs and checks that the condition holds for all of them.
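
As a sketch of that pattern, here's the random-input loop with sample variance standing in for the function under test; the little theorem is that the output is nonnegative and can never exceed the squared range of the input.

for (i in 1:1000) {
    x <- rnorm(sample(2:50, 1))                           # a random input of random length
    stopifnot(var(x) >= 0, var(x) <= (max(x) - min(x))^2)
}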

John D Cook wrote a book chapter on testing random number generators. The chapter is basically a brainstorming session about things that can be confidently stated about a distribution. For example, there is a theorem that a CDF constructed from draws from a distribution should be "close" to the original distribution, so a Kolmogorov-Smirnov test comparing the two should give a good test statistic. He has posted lots of other interesting little tidbits on testing.

Delta

Sometimes we know how the output should change given changes in the input. The same Kolmogorov-Smirnov test can be used to express the confidence with which we believe a data set was drawn from a given theoretical distribution. Draw 1,000 values from a Normal$(0, 1)$. I may not know exactly what the test statistic will be for a test comparing that sample to a proper Normal$(0, 1)$, but I do know that the statistic will get larger (and the $p$-value smaller) when comparing to a Normal$(.1, 1)$, and larger still when compared to a Normal$(.2, 1)$. If outputs from my K-S test statistic calculator don't follow that pattern, something is wrong.
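
A sketch of that check, using R's ks.test as the calculator:

set.seed(42)
x <- rnorm(1000)    # 1,000 draws from a Normal(0, 1)
sapply(c(0, .1, .2), function(m) ks.test(x, "pnorm", mean=m, sd=1)$statistic)
# expect the D statistic to rise, and the p-value to fall, as the comparison
# mean moves away from the true value of zero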

For that introductory example where the posterior mean is 3.218, you probably have control of the prior mean, and the posterior mean should rise as your prior mean rises (but not as much as you raised the prior mean).

If I run an ordinary linear regression on random data, I'll get some set of results. I may not know what they look like, but I do know that doubling all of the values ($X$ and $Y$, rescaling the whole problem) shouldn't change any of the slopes, and will double the intercept. This is another test I can run on a thousand random inputs.
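
Here is that test sketched in R, with lm standing in for the regression routine under test:

for (i in 1:1000) {
    x <- rnorm(100);  y <- 1 + 2*x + rnorm(100)   # random data with some arbitrary trend
    c1 <- coef(lm(y ~ x))
    c2 <- coef(lm(2*y ~ I(2*x)))                  # the same problem, rescaled by two
    stopifnot(abs(c2[2] - c1[2]) < 1e-8,          # the slope doesn't change
              abs(c2[1] - 2*c1[1]) < 1e-8)        # the intercept doubles
}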

Inverses

We often have theorems about inverses: given your function $f$, there is a function $g$ such that we are guaranteed that $g(f(x)) = x$. If such a function exists, then there's another test. For example, if I wrote a matrix-inversion function, I expect that in non-pathological cases Inv(Inv($X$))=$X$ (again, see next time about tolerances).

Much of Apophenia's model testing looks like this. I make a thousand random draws from a model given parameters $P$, that produces a data set $D$. I know that if I estimate the parameters of the same model using the data set $D$, then the calculated parameters should be close to $P$. I basically define the RNG to be the function such that this (fix parameters--draw--estimate parameters) round trip works. So there's a test.

Now map that same set of a thousand random draws $d_1, d_2, …$ to a set of a thousand CDF values, $CDF(d_1), CDF(d_2), …$. This will be Uniformly distributed, so there's another test of RNG + CDF.
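
Here are both round trips sketched with a Normal model standing in for the Apophenia machinery:

set.seed(1)
draws <- rnorm(1000, mean=3, sd=2)             # fix parameters, then draw
c(mean(draws), sd(draws))                      # estimate: should land near (3, 2)
ks.test(pnorm(draws, mean=3, sd=2), "punif")   # RNG + CDF: the CDF values look Uniform(0, 1)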

Especially with these round-trips, I find brainstorming for petit theorems to be pretty fun. Writing code that works reliably requires a lot of time doing mechanical things closer to transforming square pegs so they'll fit in round holes than actual math. A good test takes a function that does some nontrivial calculation, then does another completely unrelated calculation, and produces exactly the expected result. When one of these sets of transformations works for the first time, it feels less like a petit theorem and more like a petit miracle. For a few moments, I feel like the world works. I get that buzz that drives us to do more math.

Write everything twice

Sometimes, the theorem proves that the efficient and graceful method you just implemented is equivalent to a tedious and computationally intensive alternative. So, there's a test. It's unfortunate, because now you have to sit down and code the ungraceful long way of doing things. At least it only has to work for a handful of test cases. E.g., for exponential families, you can Bayesian update by looking up the posterior on the conjugate tables, or you can do MCMC, and the output distributions should approach identical.
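
As a sketch of what that double-coding looks like, here is a toy Beta-Binomial example of my own (a Beta(2,2) prior and seven successes in ten trials): the conjugate table gives the posterior mean directly, and a bare-bones Metropolis sampler should land in the same place.

set.seed(7)
a <- 2; b <- 2; k <- 7; n <- 10
conjugate_mean <- (a + k)/(a + b + n)            # posterior is Beta(9, 5); mean = 9/14

logpost <- function(p) dbeta(p, a, b, log=TRUE) + dbinom(k, n, p, log=TRUE)
p <- 0.5; chain <- numeric(50000)
for (i in seq_along(chain)) {
    prop <- p + rnorm(1, 0, 0.1)                 # symmetric random-walk proposal
    if (prop > 0 && prop < 1 && log(runif(1)) < logpost(prop) - logpost(p)) p <- prop
    chain[i] <- p
}
c(conjugate=conjugate_mean, mcmc=mean(chain[-(1:5000)]))   # the two should nearly match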

At my workplace, we have a lot of people who are PhDs in their field but amateurs in the coding world, so they don't have a concept of unit testing. Instead, everything reasonably important is double-coded, wherein two independent parties independently write the whole darn program. I tell you this for perspective, so you see that double-coding the tedious version of a specific routine to check specific inputs is not so bad.

We might have another code base somewhere else that calculates the same thing, and can compare our results to theirs. Or, we may have a complete run with a certain data set that we've eyeballed closely and are confident is correct. I'm reluctant to rely on this for testing. There is no theorem that the other guy did the math right. Such compare-the-printout tests work as a regression test (see next time), but if you make legitimate changes, like improving precision or rearranging the output, now you've gotta rewrite the checks. I personally rely on such argument by authority only for eyeballing things and for checking print output routines, and even that much has been a headache for me.

Anything sufficiently well understood can have assertions written about it. I have no idea what sort of mathematical result inspired you to write the code you're writing, but I hope that some of the examples above will help you as you brainstorm petit theorems about your own work. As noted last time, none of these tests prove that your code is Correct, but each shows that you wrote something that meets increasingly tight constraints, and gives users a little more confidence that your code is reliable.