14 July 14.

Microsimulation games, table top games

link PDF version

I wrote a game. It's called Bamboo Harvest, and you can see the rules at this link. You can play it with a standard deck of cards and some counters, though it's much closer to the sort of strategic games I discuss below than poker or bridge. I've played it with others and watched others play it enough to say it's playable and pretty engaging. Ms NGA of Baltimore, MD gets really emotional when she plays, which I take as a very good sign.

Why am I writing about a game on a web page about statistical analysis and microsimulation? I will leave to others the topic of Probability theory in table top games, but there is also a lot that we who write economic models and microsimulations of populations can learn from game authors. After all, the designers of both board games and agent-based models (ABMs) have the same problem: design a set of rules such that the players in the system experience an interesting outcome.

Over the last few decades, the emergent trend among board games have been so-called Eurogames, which are aimed at an adult audience, seek greater interaction among players, and typically include an extensive set of rules regarding resource trading and development. That is, the trend has been toward exactly the sort of considerations that are typical to agent-based models.

A game that has resource exchange rules that are too complex, or is simple enough to be easily `solved' will not have much success in the market. In most games, the optimal move in any given situation could theoretically be solved for by a hyperrational player. But the fact that players find them to be challenging demonstrates that the designers have found the right level of rule complexity for a rational but not hyperrational adult. We seek a similar complexity sweet spot in a good ABM. Readers can't get lost in all the moving parts, but if the model is so simple that readers know what your model will do before it is run—if there's no surprise—then it isn't worth running.

Of course, we are unconcerned as to whether our in silico agents are having any fun or not. Also, we get to kill our agents at will.

Simulation designers sometimes have a sky's-the-limit attitude, because processor time is cheap, but game designers are forced by human constraints to abide by the KISSWEA principle (keep it simple, stupid, without extraneous additions). It's interesting to see what game designers come up with to resolve issues of simultaneity, information provision and hiding, and other details of implementation, when the players have only counters and pencil and paper.

Market and supply chain

Settlers of Catan is as popular as this genre of games get—I saw it at a low-end department store the other day on the same shelf as Monopoly and Jenga. It is a trading game. Each round a few random resources—not random players—are productive, which causes gluts and droughts for certain resources, affecting market prices. The mechanics of the market for goods are very simple. Each player has a turn, and they can offer trades to other players (or all players) on their turn. This already creates interesting market dynamics, without the need for a full open-outcry marketplace or bid-ask book, which would be much more difficult to implement at the table or in code. How an agent decides to trade can also be coded into an artificial player, as demonstrated by the fact that there are versions of Settlers you can play against the computer.

Some games, like Puerto Rico, Race for the Galaxy, Bootleggers, and Settlers again, are supply chain games. To produce a victory point in Puerto Rico, you have to get fields, then get little brown immigrants to work the fields (I am not making this up), then get a factory to process the crops, then sell the final product or ship it to the Old World. There may be multiple supply chains (corn, coffee, tobacco). The game play is basically about deciding which supply chains to focus on and where in the supply chain to put more resources this round. The game design is about selecting a series of relative prices so that the cost (in time and previous supply-chain items) makes nothing stand out as a clear win.

One could program simple artifical agents to play simple strategies, and if one is a runaway winner with a strategy (produce only corn!) then that is proof that a relative price needs to be adjusted and the simulation redone. That is, the search over the space of relative prices maximizes an objective function regarding interestingness and balance. ABMers will be able to immediately relate, because I think we've all spent time trying to get a simple model to not run away with too many agents playing the same strategy.

I'm not talking much about war games, which seem to be out of fashion. The central mechanism of a war game is an attack, wherein one player declares that a certain set of resources will try to eliminate or displace a defending resource, and the defender then declares what resources will be brought to defense. By this definition, Illuminati is very much a war game; Diplomacy barely is. Design here is also heavily about relative prices, because so much of the game is about which resources will be effective when allocated to which battles.


How does simultaneous action happen when true simultaneity is impossible? The game designers have an easy answer to simultaneously picking cards: both sides pick a card at a leisurely pace, put the card on the table, and when all the cards are on the table, everybody reveals. There are much more complicated means of resolving simultaneous action in an agent-based model, but are they necessary? Diplomacy has a similar simultaneous-move arrangement: everybody picks a move, and an arbitration step uses all information to resolve conflicting moves.

Puerto Rico, San Juan, and Race for the Galaxy have a clever thing where players select the step in the production chain to execute this round, so the interactive element is largely in picking production chain steps that benefit you but not opponents. Setting aside the part where agents select steps, the pseudocode would look like this:

for each rôle:
    for each player:
        player executes rôle

Typical program designs make it really easy to apply a rôle function to an array of players. Josh Tokle implements a hawk and dove game via Clojure. His code has a game-step where all the birds play a single hawk-and-dove game from Game Theory 101, followed by all executing the death-and-birth-step, followed by all taking a move-step.

It's interesting when Puerto Rico and Race for the Galaxy have this form, because it's not how games usually run. The usual procedure is that each player takes a full turn executing all phases:

for each player:
    for each rôle:
        player executes rôle

I'd be interested to see cases where the difference in loop order matters or doesn't.


One short definition of topology is that it is the study of what is adjacent to what.

The Eurogamers seem to refer to the games with very simple topologies as abstracts—think Go or Chess. Even on a grid, the center is more valuable in Chess (a center square is adjacent to more squares than an edge square) and the corners are more valuable in Go (being adjacent to fewer squares $\Rightarrow$ easier to secure).

Other games with a board assign differential value to areas via other means. War games typically have maps drawn with bottlenecks, so that some land is more valuable than others. Small World has a host of races, and each region is a valuable target for some subset of races.

I'm a fan of tile games, where the map may grow over time (check out Carcassonne), or what is adjacent to what changes over the course of the game (Infinite City or Illuminati).

Other games have a network topology; see Ticket to Ride, where the objective is to draw long edges on a fixed graph.

War games often extol complexity for the sake of complexity in every aspect of the game, so I'm going to set those aside. But the current crop of Eurogames tend to focus on one aspect (topology or resource management or attack dynamics) and leave the other aspects to a barebones minimum of complicatedness. Settlers has an interesting topology and bidding rules, and the rest of the game is basically just mechanics. Carcasonne has the most complex (and endogenous) topology of anything I'm discussing here, so the resource management is limited to counting how many identical counters you have left. Race for the Galaxy, Puerto Rico, and Dominion have crazy long lists of goods and relative prices, so there is no topology and very limited player interaction rules—they are almost parallel solitaire. A lot of card games have a complete topology, where every element can affect every other.

An example: Monopoly

Back up for a second to pure race games, like Pachisi (I believe Sorry! is a rebrand of a Pachisi variant). Some have an interactive element, like blocking other opponents. Others, aimed at pre-literate children, like Chutes and Ladders or Candyland, are simply a random walk. Ideally, they are played without parental involvement, because adults find watching a pure random walk to be supremely dull. Adults who want to ride a random walk they have no control over can invest in the stock market.

Monopoly is a parallel supply chain game: you select assets to buy, which are bundled into sets, and choose which sets you want to build up with houses and hotels. On top of this is a Chutes and Ladders sort of topology, where you go around a board in a generally circular way at random speed, but Chance cards and a Go to Jail square may cause you to jump position.

The original patent has an explanation for some of these details—recall that Monopoly was originally a simulation of capital accumulation in the early 20th century:

Mother earth: Each time a player goes around the board he is supposed to have performed so much labor upon mother earth, for which after passing the beginning-point he receives his wages, one hundred dollars[...].

Poorhouse: If at any time a player has no money with which to meet expenses and has no property upon which he can borrow, he must go to the poorhouse and remain there until he makes such throws as will enable him to finish the round.

You have first refusal on unowned properties that your token lands on (then they go up for auction, according to the official rules that a lot of people ignore), and you owe rent when your token lands on owned properties, and Mother earth periodically pays you \$200. All of these cash-related events are tied to the board movement, which is not the easiest or most coherent way to cause these events to occur. E.g., how would the game be different if you had a 40-sided die and randomly landed on squares all around the board? Would the game be more focused if every player had a turn consisting of [income, bid on available land, pay rent to sleep somewhere] phases?

The confounding of supply chain game with randomization via arbitrary movement is what makes it succesful, because the Chutes and Ladders part can appeal to children (the box says it's for 8 year-old and up), while the asset-building aspects are a reasonable subgame for adults (although it is unbalanced: a competent early leader can pull unsurpassably ahead). But it is the death of Monopoly as a game for adults, because there are too many arbitrary moving parts about going around an arbitrary track.

I can't picture a modern game designer putting together this sort of combination of elements. I sometimes wonder if the same sort of question could be asked of many spatial ABMs (including ones I've written): is the grid a key feature of the game, or just a mechanism to induce random interactions with a nice visualization?


Microsimulation designers and for-fun game designers face very similar problems, and if you're writing microsimulations, it is often reasonable to ask how would a board game designer solve this problem?. I discussed several choices for turn order, trading, topology, and other facets, and in each case different choices can have a real effect on outcomes. In these games that are engaging enough to sell well, the game designers could only select a nontrivial choice for one or two facets, which become the core of the game, and other facets are left to the simplest possible mechanism, to save cognitive effort by players.

Also, now that you've read all that, I can tell you that Bamboo Harvest focuses on a shifting-tiles topology, with a relatively simple supply chain. We decided against marketplace/trading rules.

12 July 14.

Intercoder agreement: the R script and its tests

link PDF version

Here, I will present an R script to calculate an information-based measure of intercoder agreement.

The short version: we have two people putting the same items into bins, and want to know how often they are in agreement about the bins. It should be complexity-adjusted, because with only two bins, binning at random achieves 50% agreement, while with 100 bins binning at random produces 1% agreement. We can use mutual information as a sensible measure of the complexity-adjusted agreement rate. A few more steps of logic, and we have this paper in the Journal of Official Statistics describing $P_i$, a measure of intercoder agreement via information in agreement. I also blogged this paper in a previous episode.

There are two features of the paper that are especially notable for our purposes here. The first is that I said that the code is available upon request. Somebody called me out on that, so I sent him the code below. Second, the paper has several examples, each with two raters and a list of their choices, and a carefully verified calculation of $P_i$. That means that the tests are already written.

The code below has two functions. We could turn it into a package, but it's not even worth it: just source("p_i.R") and you've got the two defined functions. The p_i function does the actual calculation, and test_p_i runs tests on it. As in the paper, some of the tests are extreme cases like full agreement or disagreement, and others are average tests that I verified several times over the course of writing the paper.

Could it be better? Sure: I don't do a very good job of testing the code for really pathological cases, like null inputs or something else that isn't a matrix. But the tests give me a lot of confidence that the p_i function does the correct thing given well-formed inputs. It's not mathematically impossible for a somehow incorrect function to give correct answers for all six tests, but with each additional test the odds diminish.

Here is the code. Feel free to paste it into your projects, or fork it from Github and improve upon it—I'll accept pull requests with improvements.

p_i <- function(dataset, col1=1, col2=2){

    entropy <- function(inlist){
        -sum(sapply(inlist, function(x){log2(x)*x}), na.rm=TRUE)

    information_in_agreement <- function(diag, margin1, margin2){
        sum <- 0
        for (i in 1:length(diag))
            if (diag[i] != 0)
                sum <- sum + diag[i]*log2(diag[i]/(margin1[i]*margin2[i]))
        return (sum)

    dataset <- as.data.frame(dataset) #in case user provided a matrix.
    crosstab <- table(as.data.frame(cbind(dataset[,col1],dataset[,col2])))
    d1tab <- table(dataset[,col1])
    d2tab <- table(dataset[,col2])
    d1tab <- d1tab/sum(d1tab)
    d2tab <- d2tab/sum(d2tab)
    crosstab <- crosstab/sum(crosstab)

    entropy_1 <- entropy(d1tab)
    entropy_2 <- entropy(d2tab)
    ia <- information_in_agreement(diag(crosstab), d1tab, d2tab)
    return (2*ia/(entropy_1+entropy_2))

test_p_i <- function(){
    fullagreement <- matrix(
                    ncol=2, byrow=FALSE


    noagreement <- matrix(
                    ncol=2, byrow=FALSE


    constant <- matrix(
                    ncol=2, byrow=FALSE


    neg_corr <- matrix(
                    ncol=2, byrow=FALSE

    stopifnot(abs(p_i(neg_corr)- -.2643856) < 1e-6)

    rare_agreement <- matrix(
                    ncol=2, byrow=FALSE

    stopifnot(abs(p_i(rare_agreement)- .6626594) < 1e-6)

    common_agreement <- matrix(
                    ncol=2, byrow=FALSE

    stopifnot(abs(p_i(common_agreement)- 0.6130587) < 1e-6)

9 May 14.

Testing statistical software III: the contract

link PDF version

So far, I've given a brief intro to the mechanics of assertions and tests, which you can use to increase your own and others' confidence in your code, and I gave some examples of brainstorming theorems to provide constraints that your function's output has to meet.

The thesis sentence to this part is that the tests are the embodiment of a contract you have with the users. Last time, I gave some suggestions about how to test a matrix inversion function, which the bureaucrat would write as a bullet-pointed contract:

  • The output will be such that $X\cdot Inv(X)=I$ to within some tolerance (see below).
  • If the input is $I$, the output will be $I$.
  • If the input is symmetric, the output will be symmetric.

There it is in English; the test is the contract in executable form.

Writing a numeric routine isn't expressionist improvisation: you've got to conceive of what the function does before you write it. And the test routine is the formalization of what the function promises. The typical workflow is to write the tests after you write the function to be tested, but that makes no sense here. Because the contract and test are siamese twins, and you wrote the contract before writing the function, it makes sense to write the test before writing the function as well. The term for writing the test/contract first and then writing a function that conforms to it is test-driven development, and it's a sensible work habit that should probably be used more (even by myself).

You are going to sit down and write a routine or three to read in a data set, run a regression, and extract output. Same questions: what's the contract you expect, and how much of it can you express as a test ahead of time? Yes, I know that statistical analysis really is expressionist improvisation, and if we knew what we were doing we wouldn't call it science, and exploration is an art upon which we mustn't impose structure. But it's much more fruitful when you explore a data set you can state with confidence was read in correctly, and when an alarm rings if you regress $Y$ on $Y$. Lewis and Clark kept a log of problems they ran into and surmounted; data explorers can too. The big technological plus we modern explorers have is that we can re-execute our log when the next data set comes in.

Also, once you have the contract/test, the documentation almost writes itself. For the special cases you worked out, show them as examples users can cut/paste/verify; for the relatively odd things about symmetry and such, present them as additional useful facts. For some cases you may need more; I suggested a lot of tests for a Bayesian updating routine last time, but the sum of them really aren't enough to fully describe what Bayesian updating is about.

Testing the special cases

And don't forget the weird special cases. When I started writing numeric software, I'd sometimes brush off the one-in-a-million edge cases, because, hey, what're the odds that something like that could happen. Which is an easy question to answer: if you send five million independent observations to each be run through a function, a one-in-a-million event has a 99.3% chance of occurring at least once.

Even if you are only sending in a handful of data points to a function that doesn't handle the special cases, the Law of Irony Maximization states that you'll hit that one-in-a-million case anyway.

I hope you find a mathematically clean rule for handling all the edge cases. Hint: your math platform has a floating-point representation for Not-a-Number (NaN), infinity, and -infinity. But in some cases, you just have to come up with an arbitrary rule. Then, add clauses to the contract/tests/documentation. I dunno, how about:

  • If the input is a zero matrix, I return a square of infinities.
  • If the input isn't a square matrix, I return NaN.

To pull a line from a pop song about the endogeneity of social relations, you alone decide what's real. But if you express the rules you wrote in the contract/tests/documentation, then users know exactly what they're getting.

As with legal contracts, the special cases sometimes take many more lines to detail than the simple basic idea. So it goes.


Of course, $X\cdot Inv(X)$ is exactly the identity matrix to machine precision only when we are very lucky. All of my tests wind up with a series of assertions making use of this Diff macro:

#define Diff(L, R, eps) {\
    if(fabs((L)-(R))>(eps)) {    \
        printf(#L "=%g is too different from " #R "=%g (arbitrary limit=%g).\n", (double)(L), (double)(R), eps); \
        abort(); \

//sample usage:
Diff(2./3, .6666, 1e-2); //pass
Diff(2./3, .6666, 1e-8); //fail

The tolerances to which your tests fit are part of the contract, a bit of documentation for those of us who read tests to see what to expect from these functions. Again, you alone decide the tolerance level to test to, and then it's there in the contract for the user to work with.

I used to try really hard to get eps as small as possible, but Apophenia's tests ship to the public, and I'd get emails from people telling me that on their machine, some test produces a result where L-R in the above was 1e-4, but the test listed the max tolerance as 5e-5. It would be nice if that could be improved, but it's very low on my to-do list. I'd typically change all the internal variables from double to long double and consider that case closed. I leave the tolerances looser now.

Anyway, once you've achieved a reasonable level of accuracy, nobody really cares. In all the endless promotion and bragging about how awesome the latest awesome math package is, how often do you see anybody brag that their package is more precise than the others? We can't really do statistics on arbitrary-precision systems (yet), and those people who truly rely on high precision so they can send their probe to Mars need to do the due diligence of retesting the code anyway.

But tolerances are not to be brushed off entirely, because sometimes these little things really do indicate errors. Maybe you should have divided by $n-1$ instead of $n$.

I may be the only person to have calculated the exact formula for unbiased sample kurtosis (PDF) without recourse to Taylor expansions—it is not a pleasant traipse through mathemagic land. There was a draft of the apop_vector_kurtosis function where the best tolerance I could get was around $1e-2$. In this case, the low tolerance really was telling me something: there was a math typo in my transcription of the messy formula. The test cases now pass to within a tolerance around $1e-5$ which is stellar, given that everybody else uses an approximate formula from the start. Tell all your friends: Apophenia has the most precise unbiased sample kurtosis around.


When I run my main test script, I cover over 90% of the code, which I know thanks to gcov, a coverage checker that accompanies the GNU Compiler Collection. It lists which lines get hit and how often, which means I know which lines are still completely untested. Because if every piece of code is covered by a contract/test, what does it mean when we find code that isn't tested?

Once I've got the list of untested lines, I've got decisions to make about whether they should be tested and how (or whether to cut them entirely, because they're not part of a contract). There's a sort of consensus that 85% is about the right amount of coverage, and it's not worth shooting for 100%. And I concur: being endowed with only finite time, writing tests to cover that last 15% is typically low priority. Anyway, given that line count is meaningless, percent coverage can also be gamed and is not a statistic worth really obsessing over.

Many platforms have a coverage checker of some sort, that behave analogously to gcov for C. However, no coverage checker is available for some computing platforms for which there are rather complex packages available. This freaks me out. If the authors of a complex package have no assurance that every part of the package was tested, then you as a user of that package have no such assurance.

Regression testing

Every time I make a change in Apophenia, I re-compile and re-run every piece of public code I have written that relies on it, including the ships-with-the-library tests, the examples and semi-public solutions for Modeling with Data, the githubbed code for this site, the examples from the documentation, and a few other sundry items. If the tests pass, I know that my change has not broken any of the contracts to date.

If I had stuck to ad hoc tests and eyeballing things, I'd have nothing. But as the code base grew, the size of the little side-tests I added to the master test script grew at a corresponding rate. Small side projects have a few small side-tests; the magnum opera have a formidable automated list of checks.

Back at the beginning of part one, I recommended some simple checks, like whether all the entries in the age category are positive and under 120. Here at the end, those tests have added up to a contract that not only covers everything in the code base, but that can actually guide how the code is written to begin with. The work is much more challenging here in floating-point land, because the larger units can be very difficult to test, and even small blocks require some creativity in coming up with little theorems to assert about outputs, and we will always be only within some tolerance of the true value anyway. But don't let that daunt you: because it is rarely obvious when a calculated number is incorrect, we should be especially vigilant in adding tests and assertions to raise our confidence.

Next time I'll give a relatively simple example, which is mostly an excuse to post some code that I've been meaning to post for a long time.

5 May 14.

Testing statistical software II: there's a theorem somewhere

link PDF version

Last time, I discussed the value of adding assertions to your code, to check guarantees that the code should follow.

I talked about ad hoc assertions as part of a program, which thus run every time the program runs. This flows into unit testing proper, wherein a given unit of code is accompanied by a separate unit of tests, which is typically run only when something changes. Which approach you take is mostly a question of logistics. You've got options.

This time I'll make some suggestions on testing numeric and statistical results. This is a lot more difficult than typical CompSci functions, where there is rarely a stochastic element, and if it's broken it's blatantly obvious. When you run your Bayesian updating via MCMC search and get a posterior with mean 3.218, is that right? In situations like this, there's no direct way to prove that the functions used to produce that result are correct, but there are always many means of increasing confidence.

Brainstorming some theorems

The reason that you are writing code to begin with is that there is a theorem somewhere indicating that the method you are using to calculate a value is somehow useful. That theorem probably says that the result you're getting equals some known and useful value or has some useful property. For example, if you wrote a matrix-inverse function, you know there is a theorem that states that $X\cdot Inv(X) = I$ for any non-degenerate $X$ (but see next time about tolerances).

So let's brainstorm some theorems in some typical situations. Few would be something in a textbook with the heading Theorem, but most will be something you can prove or verify relatively easily.

We can start with the special cases: if the input is the identity matrix or zero, it is often trivial to calculate the output. So there's a petit theorem, that the identity matrix or zero will produce a given output, which we can write up as our first test. Such easy and obvious things are a good first place to start, because it is especially easy to set up the inputs and outputs for the test.

For simpler calculations, you may be able to go beyond trivial inputs to a haphazard example you calculated by hand, in which case you have another petit theorem looking something like if the input is $[2, 3, 4]$, then the output is $12.2$.

Can you say something about all possible outputs? Maybe they should all be greater than zero, or otherwise bounded. Maybe the output can't be greater than the sum of the inputs. If the input has some sort of symmetry, will the output be symmetric? With these little theorems, we can write a test that generates a few thousand random inputs and checks that the condition holds for all of them.

John D Cook wrote a book chapter on testing random number generators. The chapter is basically a brainstorming session about things that can be confidently stated about a distribution. For example, there is a theorem that a CDF constructed from draws from a distribution should be `close' to the original distribution, so a Kolmogorov-Smirnov test comparing the two should give a good test statistic. He has posted lots of other interesting little tidbits on testing.


Sometimes we know how the output should change given changes in the input. The same Kolmogorov-Smirnov test can be used to express the confidence with which we believe a data set was drawn from a given theoretical distribution. Draw 1,000 values from a Normal$(0, 1)$. I may not know exactly what the $p$-value will be for a test comparing that to a proper Normal$(0, 1)$, but I do know that that $p$-value will get larger when looking at a Normal$(.1, 1)$, and it will get larger still when compared to a Normal$(.2, 1)$. If outputs from my K-S test statistic calculator don't follow that pattern, something is wrong.

For that introductory example where the posterior mean is 3.218, you probably have control of the prior mean, and the posterior mean should rise as your prior mean rises (but not as much as you raised the prior mean).

If I run an ordinary linear regression on random data, I'll get some set of results. I may not know what they look like, but I do know that doubling all of the values ($X$ and $Y$, rescaling the whole problem) shouldn't change any of the slopes, and will double the intercept. This is another test I can run on a thousand random inputs.


We often have theorems about inverses: given your function $f$, there is a function $g$ such that we are guaranteed that $g(f(x)) = x$. If such a function exists, then there's another test. For example, if I wrote a matrix-inversion function, I expect that in non-pathological cases Inv(Inv($X$))=$X$ (again, see next time about tolerances).

Much of Apophenia's model testing looks like this. I make a thousand random draws from a model given parameters $P$, that produces a data set $D$. I know that if I estimate the parameters of the same model using the data set $D$, then the calculated parameters should be close to $P$. I basically define the RNG to be the function such that this (fix parameters--draw--estimate parameters) round trip works. So there's a test.

Now map that same set of a thousand random draws $d_1, d_2, …$ to a set of a thousand CDF values, $CDF(d_1), CDF(d_2), …$. This will be Uniformly distributed, so there's another test of RNG + CDF.

Especially with these round-trips, I find brainstorming for petit theorems to be pretty fun. Writing code that works reliably requires a lot of time doing mechanical things closer to transforming square pegs so they'll fit in round holes than actual math. A good test takes a function that does some nontrivial calculation, then does another completely unrelated calculation, and produces exactly the expected result. When one of these sets of transformations work for the first time, it feels less like a petit theorem and more like a petit miracle. For a few moments, I feel like the world works. I get that buzz that drives us to do more math.

Write everything twice

Sometimes, the theorem proves that the efficient and graceful method you just implemented is equivalent to a tedious and computationally intensive alternative. So, there's a test. It's unfortunate, because now you have to sit down and code the ungraceful long way of doing things. At least it only has to work for a handful of test cases. E.g., for exponential families, you can Bayesian update by looking up the posterior on the conjugate tables, or you can do MCMC, and the output distributions should approach identical.

At my workplace, we have a lot of people who are PhDs in their field but amateurs in the coding world, so they don't have a concept of unit testing. Instead, everything reasonably important is double-coded, wherein two independent parties independently write the whole darn program. I tell you this for perspective, so you see that double-coding the tedious version of a specific routine to check specific inputs is not so bad.

We might have another code base somewhere else that calculates the same thing, and can compare our results to theirs. Or, we may have a complete run with a certain data set that we've eyeballed closely and are confident is correct. I'm reluctant to rely on this for testing. There is no theorem that the other guy did the math right. Such compare-the-printout tests work as a regression test (see next time), but if you make legitimate changes, like improving precision or rearranging the output, now you've gotta rewrite the checks. I personally rely on such argument by authority only for eyeballing things and for checking print output routines, and even that much has been a headache for me.

Anything sufficiently well understood can have assertions written about it. I have no idea what sort of mathematical result inspired you to write the code you're writing, but I hope that some of the examples above will help you as you brainstorm petit theorems about your own work. As noted last time, none of these tests prove that your code is Correct, but each shows that you wrote something that meets increasingly tight constraints, and gives users a little more confidence that your code is reliable.

3 May 14.

Testing statistical software I: units

link PDF version

Your trust in anything is an emotional state, not something that can be proven by some routine. There is a literature on proving that a given piece of code does what it claims, but that literature works at a much lower level than we do here in the world of statistical calculation, and tends to avoid floating-point math anyway. Even the best proof of code correctness doesn't check that the input to the code is in the right format, that NaNs and missing data are handled correctly, or that theorem assumptions that you were supposed to check got checked.

So, what makes us feel good about code? The next few entries will cover that question. The first will cover some basic mechanics, the second will cover brainstorming ideas for testing numeric results, and the third will cover some additional details like tolerance and regression.

For a simple statistical problem, you read in the data, run the regression, and read the output. A lot of the people who do this sort of thing all day insist that they aren't really writing code, thus making some sort of distinction between calling functions to do math and what the package authors do—which, of course, consists of calling fucntions to do math. Good habits that the package authors developed are directly applicable to code that just loads data/runs regression/reads output.

Unit testing is the norm in the programming world, but not the statistical world. The simple idea is to break your work down into the smallest parts possible, then write a test for each part. Every added test will raise confidence in your overall results, because every step is that much more under control.

We already have some units: load data, run regression, check output. Typically, most testing for these units happens by eyeballing. Are the ages all under 100, and the sexes all either 0/1 or "M"/"F"? This is typically verified by printing the first dozen lines to the screen and looking at it, but why not automate this? We've written an R package for survey processing, Tea, that checks and corrects survey data, but it's probably overkill for most needs.

Every platform has some variant of the assert function/macro, which checks whether a claim is true, and prints an error or halts if the claim fails. In R, you could write

stopifnot(max(age) < 100 && min(age) > 0)

If you have a few of these, you might as well put them into a function. Where before you may have had

frame <- read.csv("indata.csv")

now you have

get_data <- function(filename){
    frame <- read.csv(filename)
    stopifnot(max(frame[,"age"]) < 100 && min(frame[,"age"]) > 0)
    sexes <- unique(frame[,"sex"])
    stopifnot(length(sexes) < 3 && sexes[0] %in% c("M", "F") && sexes[1] %in% c("M", "F"))

frame <- get_data("indata.csv")

I don't know about you, but I already feel better about the second form, becaue I know reading in data is one of the most failure-prone parts of the process. People do stupid things like save the wrong data to the right file name, or misread the units, or get delimiters wrong and read in sex = "100|M|East Lansing". The get_data function barks when any of these common sort of snafus happen.

But, you argue, my statistical calculation will only be run once, so why go through all this effort?. Except even a one-shot calculation is verified, by yourself in a month, by colleagues, by the Platonic ideal of the referee. Next week, when you get an updated data set, all that eyeballing you did before happens automatically. Next month, when you start a new project on similar data, you can cut/paste a lot of your assertions from one project to the next.

In 21st Century C, I spend five pages discussing what makes a good assert macro [I'm having trouble putting together a link to Google Books, but search for "example 2-5".] Here's what I wound up with for C, including a logging option, an option to continue or halt the program (very useful when debugging), or run any arbitrary code on error (return -1 is common, but error_action can also free pointers, goto a cleanup segment, do nothing and just print the error message, et cetera).

I reversed the sense of the standard assert and R's stopifnot to a Stopif, because the first two are a sort of double negative: if this claim is not true, then do not continue. From my own experience and anecdotal comments from colleagues, it's clearer to cut one negative: if this happens, do not continue.

All that said, adding assertions of things that seem obvious is a fast way to

  • document your expectations for the data and results at this point, to yourself and other readers of your routine.
  • clarify your own thinking. What do you expect that the regression results would look like, and what would indicate something so out of bounds that something is really broken? If you can give a thought-out answer those questions, then you can codify them in assertions.
  • Cut time debugging when your source sends the revised data set and all the field descriptions change or you start getting ages with values like "M" and "F".
  • Feel better about the validity of your code.

So even if you are the sort of person who only writes code to read in a data set and run canned procedures (would $\beta_1=1000$ be valid?), there is already room to include testing.

There are debates about whether to ship code with active assertions to users, but for most statistical software the question is moot because the end-user is you, the author. In this case, there is no downside to having lots of assertions, because you wrote assertions to cover those things that you the author would want to know about if they don't hold.

Next time I'll get to unit testing mathematical results that are not as easy to eyeball.

9 March 14.

Regular Expression parsing in C

link PDF version

Regular expressions are a means of expressing a pattern in text, like (a number followed by one or more letters) or (number-comma-space-number, with nothing else on the line); in basic regex-ese, these could be expressed as [0-9]\+[[:alpha:]]\+ and ^[0-9]\+, [0-9]\+\$. The POSIX standard specifies a set of C functions to parse the regular expressions whose grammar it defines, and those functions have been wrapped by hundreds of tools. I think it is literally true that I use them every day, either at the command line via POSIX-standard tools like sed, awk, and grep, or to deal with little text-parsing details in code. Maybe I need to find somebody's name in a file, or somebody sent me date ranges in single strings like 04Apr2009-12Jun2010 and I want to break that down into six usable fields, or I have a fictionalized treatise on cetology and need to find the chapter markers.

A reminder: if you want to break a string down into tokens before and after a single-character delimiter, strtok will work for you.

However, I resolved to not include a regular expression tutorial in this book. My Internet search for “regular expression tutorial" gives me 12,900 hits. On a Linux box, man 7 regex should give you a rundown, and if you have Perl installed, you have man perlre summarizing Perl-compatible regular expressions (PCREs). Or, Mastering Regular Expressions gives an excellent book-length discussion of the topic. Here, I will cover how they work in POSIX's C library.

There are three major types of regular expression:

  • Basic regular expressions (BREs) were the first draft, with only a few of the more common symbols having special meaning, like the * meaning zero or more of the previous atom, as in [0-9]* to represent an optional integer. Additional features required a backslash to indicate a special character: one or more digits is expressed via \+, so an integer preceded by a plus sign would be +[0-9]\+.

  • Extended regular expressions (EREs) were the second draft, mostly taking special characters to be special without the backslashes, and plain text with a backslash. Now an integer preceded by a plus sign is \+[0-9]+.

  • Perl has regular expressions at the core of the language, and its authors made several significant additions to the regex grammar, including a lookahead/lookbehind feature, non-greedy quantifiers that match the smallest possible match, and in-regex comments.

The first two types of regular expression are implemented via a small set of functions. Being defined in POSIX, they are probably part of your standard library. PCREs are available via libpcre, which you can download from online or via your package manager. See man pcreapi for details of its functions. Glib provides a convenient wrapper for libpcre.

The example below, for example, compiles on Linux and Mac without any compiler flags beyond the usual necessities:

CFLAGS="-g -Wall -O3 --std=gnu11" make regex

The POSIX and PCRE interfaces have a common four-step procedure:

  • Compile the regex via regcomp or pcre_compile

  • Run a string through the compiled regex via regexec or pcre_exec.

  • If you marked substrings in the regular expression to pull out (see below), copy them from the base string using the offsets returned by the regexec or pcre_exec function.

  • Free the internal memory used by the compiled regex.

The first two steps and the last step can be executed with a single line of code each, so if your question is only whether a string matches a given regular expression, then life is easy. I won't go into great detail about the flags and details of usage of regcomp, regexec, and regfree here because the page of the POSIX standard about them is reprinted in the Linux and BSD man pages (try man regexec), and there are many websites devoted to reprinting those man pages.

If you need to pull substrings, things get a little more complicated. Parens in a regex indicate that the parser should retrieve the match within that string (even if it only matches the null string). Thus, the ERE pattern "(.*)o" matches the string "hello", and as a side-effect, stores the largest possible match for the .*, which is hell. BREs work the same but with backslashes preceding the parens. The third argument to the regexec function is the number of parenthesized subexpressions in the pattern; I call it matchcount in the example below. The fourth argument to regexec is an array of matchcount+1 regmatch_t elements. The regmatch_t has two elements: rm_so marking the start of the match and rm_eo marking the end. The zeroth element of the array will have the start and end of the match of the entire regex (imagine parens around the entire pattern), and subsequent elements have the start and end of each parenthesized subexpression, ordered by where their open-parens are in the pattern.

By way of foreshadowing, here is the header, listing the two utility functions provided by the sample code at the end of this segment. The regex_match function+macro+struct allows named, optional arguments, as per a previous blog post.

typedef struct {
    const char *string;
    const char *regex;
    char ***substrings;
    _Bool use_case;
} regex_fn_s;

#define regex_match(...) regex_match_base((regex_fn_s){__VA_ARGS__})

int regex_match_base(regex_fn_s in);

char * search_and_replace(char const *base, char const*search, char const *replace);

We need a separate search and replace function, because POSIX doesn't provide one. Unless the replacement is exactly the same length as what it is replacing, the operation requires reallocating the original string. But we already have the tools to break a string into substrings, so the search_and_replace uses parenthesized substrings to break down a function into substrings, and then rebuilds a new string, inserting the replacement part at the appropriate point.

It returns NULL on no match, so you could do a global search and replace via

char *s2;
while((s2 = search_and_replace(long_string, pattern))){
    char *tmp = long_string;
    long_string = s2;

There are inefficiencies here: the regex_match function recompiles the string every time, and the global search-and-replace would be more efficient if it used the fact that everything up to result[1].rm_eo does not need to be re-searched. In this case, we can use C as a prototyping language for C: write the easy version, and if the profiler shows that these inefficiencies are a problem, replace them with more efficient code.

Here is the code. The lines where key events in the above discussion occur are marked, with some additional notes at the end. There is a test function at the end showing some simple uses of the provided functions.

To compile this, you'll need stopif.h; get it from Github.

#define _GNU_SOURCE //cause stdio.h to include asprintf
#include "stopif.h"
#include <regex.h>
#include "regex_fns.h"
#include <string.h> //strlen
#include <stdlib.h> //malloc, memcpy

static int count_parens(const char *string){                //(1)
    int out = 0;
    int last_was_backslash = 0;
    for(const char *step=string; *step !='\0'; step++){
        if (*step == '\\' && !last_was_backslash){
            last_was_backslash = 1;
        if (*step == ')' && !last_was_backslash)
        last_was_backslash = 0;
    return out;

int regex_match_base(regex_fn_s in){
    Stopif(!in.string, return -1, "NULL string input");
    Stopif(!in.regex, return -2, "NULL regex input");

    regex_t re;
    int matchcount = 0;
    if (in.substrings) matchcount = count_parens(in.regex);
    regmatch_t result[matchcount+1];
    int compiled_ok = !regcomp(&re, in.regex, REG_EXTENDED                       //(2)
                                            + (in.use_case ? 0 : REG_ICASE)
                                            + (in.substrings ? 0 : REG_NOSUB) );
    Stopif(!compiled_ok, return -3, "This regular expression didn't compile: \"%s\"", in.regex);

    int found = !regexec(&re, in.string, matchcount+1, result, 0);               //(3)
    if (!found) return 0;
    if (in.substrings){
         *in.substrings = malloc(sizeof(char*) * matchcount);
         char **substrings = *in.substrings;
        //match zero is the whole string; ignore it.
        for (int i=0; i< matchcount; i++){
            if (result[i+1].rm_eo > 0){//GNU peculiarity: match-to-empty marked with -1.
                int length_of_match = result[i+1].rm_eo - result[i+1].rm_so;
                substrings[i] = malloc(strlen(in.string)+1);
                memcpy(substrings[i], in.string + result[i+1].rm_so, length_of_match);
                substrings[i][length_of_match] = '\0';
            } else { //empty match
                substrings[i] = malloc(1);
                substrings[i][0] = '\0';
        in.string += result[0].rm_eo; //end of whole match;
    regfree(&re);                                                                //(4)
    return matchcount;

char * search_and_replace(char const *base, char const*search, char const *replace){
    char *regex, *out;
    asprintf(®ex, "(.*)(%s)(.*)", search);                                   //(5)
    char **substrings;
    int match_ct =  regex_match(base, regex, &substrings);
    if(match_ct < 3) return NULL;
    asprintf(&out, "%s%s%s", substrings[0], replace, substrings[2]);
    for (int i=0; i< match_ct; i++)
    return out;

#ifdef test_regexes
int main(){
    char **substrings;

    int match_ct = regex_match("Hedonism by the alps, savory foods at every meal.",
            "([He]*)do.*a(.*)s, (.*)or.* ([em]*)al", &substrings);
    printf("%i matches:\n", match_ct);
    for (int i=0; i< match_ct; i++){
        printf("[%s] ", substrings[i]);

    match_ct = regex_match("", "([[:alpha:]]+) ([[:alpha:]]+)", &substrings);
    Stopif(match_ct != 0, return 1, "Error: matched a blank");

    printf("%s", search_and_replace("Plants\n", "l", ""));

(1) You need to send regexec an allocated array to hold substring matches, and its length, meaning that you need to know how many substrings there are. This function takes in an ERE and counts open-parens that aren't escaped by a backslash.

(2) Here we compile the regex to a regex_t. The function is inefficient because the regex gets recompiled every time. It is left as an exercise to the reader to cache already-compiled regular expressions.

(3) Here is the regexec use. If you just want to know whether there is a match or not, you can send NULL and 0 as the list of matches and its length.

(4) Don't forget to free the internal memory used by the regex_t.

(5) The search-and-replace works by breaking down the input string into (everything before the match)(the match)(everything after the match). This is the regex representing that.