7 June 15.

Editing survey data (or, how to deal with pregnant men)


This post is partly a package announcement for Tea, a package for editing and imputation. But it mostly discusses the problem of editing, because I've found that a lot of people put no thought into this important step in data analysis.

If your data is a survey filled in by humans, or involves using sensors that humans operated, then your data set will have bad data.

But B, you object, I'm a downstream user of a data set provided by somebody else, and it's a clean set—there are no pregnant men—so why worry about editing? Because if your data is that clean, the provider already edited the data, and may or may not have told you what choices were made. Does nobody have gay parents because nobody surveyed has gay parents, or because somebody saw those responses and decided they must be an error? Part of the intent of Tea is to make it easy to share the specification of how the data was edited and missing data imputed, so end users can decide for themselves whether the edits make sense.

If you want to jump to working with Tea itself, start with the Tutorial/manual.

For editing bad values in accounting-type situations, you may be able to calculate which number in a list of numbers is incorrect—look up Fellegi-Holt imputation for details. But the situations where something can be derived like this are infrequent outside of accounts-based surveys.

So people wing it.

Some people will listwise delete the record with some failure in it, including a record that is missing entirely. This loses information, and can create a million kinds of biases, because you've down-weighted the deleted observation to zero and thus up-weighted all other observations. An analyst who deletes the observations with N/As for question 1 only when analyzing question 1, then restores the data set and deletes the N/As for question 2 when analyzing question 2, ..., is creating a new weighting scheme for every question.

I don't want to push you to use Tea, but I will say this: listwise deletion, such as using the ubiquitous na.rm option in R functions, is almost always the wrong thing to do.
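
To see the weighting problem in miniature, here is a toy R sketch (plain R, not Tea, and the numbers are invented): people with larger incomes are more likely to skip the question, so na.rm quietly down-weights them to zero and the estimated mean drifts low.

set.seed(42)
income   <- rlnorm(1e5, meanlog=10, sdlog=1)    # true values for everybody
p_skip   <- 0.6 * rank(income)/length(income)   # higher income => more likely to skip
observed <- ifelse(runif(1e5) < p_skip, NA, income)

mean(income)                  # the target
mean(observed, na.rm=TRUE)    # listwise deletion: biased downward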

Major Census Bureau surveys (and the decennial census itself) tend to lean on a preference ordering for fields and deterministic fixes. Typically, age gets fixed first, comparing it to a host of other fields, and in a conflict, age gets modified. Then age is deemed edited, and if a later edit involving age fails (it's complicated...), age stays fixed and the other fields are modified. There is usually a deterministic rule that dictates what those modifications should be for each step. This is generally referred to as sequential editing.

Another alternative is to gather evidence from all the edits and see if some field is especially guilty. Maybe the pregnant man in our public health survey also reports using a diaphragm for contraception and getting infrequent pap smears. So of the pair (man, pregnant), it looks like (man) is failing several edits.

If we don't have a deterministic rule to change the declared-bad field, then all we can do is blank out the field and impute, using some model. Tea is set up to be maximally flexible about the model chosen. Lognormal (such as for incomes), Expectation-Maximization algorithm, simple hot deck draw from the nonmissing data, regression model—just set method: lognormal or EM or hot deck or ols in the spec file, and away you go.
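
For intuition, here is the idea of a hot deck in a few lines of plain R. This is not Tea's implementation (Tea records its draws in a database fill-in table, as in the code below); it is just the concept: fill each missing value with a random draw from the observed values.

hot_deck <- function(x){
    missing <- is.na(x)
    x[missing] <- sample(x[!missing], sum(missing), replace=TRUE)
    x
}

hot_deck(c(12, NA, 15, 22, NA, 18))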

Tea is set up to do either sequential or deterministic edits. Making this work was surprisingly complicated. Let me give you a worst-case example to show why.

E.g.

Assume these edits: fail on
(age $<$ 15 and status=married),
(age $<$ 15 and school=PhD),
and (age $>$ 65 and childAge $<$ 10 $\Rightarrow$ change childAge to age-25).
The third one has a deterministic if-then change attached; the first two are just edits.

I think these are reasonable edits to make. It is entirely possible for a 14-year-old to get a PhD, just as there has existed a pregnant man. Nonetheless, when your survey data gives you a 12-year-old PhD or a pregnant man, the odds are much greater that somebody committed an error, intentional or not.

Then, say that we have a record with (age=14, married, PhD, childAge=5), which goes through these steps:

  • Run the record through the edits. Age fails two edits, so we will blank it out.
  • Send age to the imputation system, which draws a new value for the now-blank field. Say that it drew 67.
  • Run the record through the edits, and the deterministic edit hits: we have to change the child's age to 42.

First, this demonstrates why writing a coherent editing and imputation package is not a pleasant walk in the park, as edits trigger imputations which trigger edits, which may trigger more imputations.

Second, it advocates against deterministic edits, which can be a leading cause of these domino-effect outcomes that seem perverse when taken as a whole.

Third, it may advocate against automatic editing entirely. It's often the case that if we could see the record, we could guess what things should have been. It's very plausible that age should have been 41 instead of 14. But this is guesswork, and impossible to automate. From Census experience, I can say that what you get when you add up these little bits of manual wisdom for a few decades is not necessarily the best editing system.

But not all is lost. We've generated consistent micro-data, so no cross-tabulations will show anomalies. If we've selected a plausible model to do the imputations, it is plausible that the aggregate data will have reasonable properties. In many cases, the microdata is too private to release anyway. So I've given you a worst-case example of editing gone too far, but even this has advantages over leaving the data as-is and telling the user to deal with it.

Multiply impute

Imputation is the closely-allied problem of filling in values for missing data. This is a favored way to deal with missing data, if only because it solves the weighting problem. If 30% of Asian men in your sample didn't respond, but 10% of the Black women didn't respond, and your survey oversampled women by 40%, and you listwise delete the nonresponding observations, what reweighting should you do to a response by an Asian woman? The answer: side-step all of it by imputing values for the missing responses to produce a synthetic complete data set and not modifying the weights at all.

Yes, you've used an explicit model to impute the missing data—as opposed to the implicit model generated by removing the missing data. To make the model explicit is to face the reality that if you make any statements about a survey in which there is any missing data, then you are making statements about unobserved information, which can only be done via some model of that information. The best we can do is be explicit about the model we used—a far better choice than omitting the nonresponses and lying to the reader that the results are an accurate and objective measure of unobserved information.

We can also do the imputation multiple times and augment the within-imputation variance with the across-imputation variance that we'd get from different model-based imputations, to produce a more correct total estimate of our confidence in any given statistic. That is, using an explicit model also lets us state how much our confidence changes because of imputation.
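
For the record, the combining rule (due to Rubin) is short: with $m$ imputations, compute your statistic on each completed data set to get estimates $\hat\theta_1, \dots, \hat\theta_m$ and within-imputation variance estimates $\hat\sigma^2_1, \dots, \hat\sigma^2_m$; report the mean of the $\hat\theta_i$s as the point estimate, and $T = \bar\sigma^2 + (1 + 1/m)B$ as the total variance, where $\bar\sigma^2$ is the mean of the within-imputation variances and $B$ is the variance across the $\hat\theta_i$s.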

There are packages to do multiple imputation without any editing, though the actual math is easy. Over in the other window, here's the R function I use to calculate total variance, given that we've already run several imputations and stored each (record,field,draw) combination in a fill-in table. The checkOutImpute function completes a data set using the fill-ins from a single imputation, generated by the imputation routine earlier in the code. There's, like, one line of actual math in there.

library(tea)

get_total_variance <- function(con, tab, col, filltab, draw_ct, statvar){
    v <- 0  #declare v and m
    m <- 0
    try (dbGetQuery(con, "drop table if exists tt"))
    for (i in 1:draw_ct){
        checkOutImpute(dest="tt", origin=tab, filltab=filltab, imputation_number=i-1)
        column <- dbGetQuery(con, paste("select ", col, " from tt"))
        vec <- as.numeric(column[,1]) #type conversions.
        v[i] <- statvar(vec)
        m[i] <- mean(vec)
    }
    total_var <- mean(v) + (1 + 1./draw_ct)*var(m)  # Rubin's rule: within + (1 + 1/m) * between
    return(c(mean(m), sqrt(total_var)))
}

# Here's the kind of thing we'd use as an input statistic: the variance of a Binomial
binom_var <- function(vec){
    p = mean(vec)
    return(p*(1-p)/length(vec))
}
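
A hypothetical call, assuming the imputation step earlier in the (not shown) script made five draws; the database table, column, and fill-in table names here are placeholders, and con is the database connection Tea is using:

get_total_variance(con, tab="dc", col="employed", filltab="fillins",
                   draw_ct=5, statvar=binom_var)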


12 March 15.

Banning the hypothesis test


In case you missed it, a psychology journal, Basic and Applied Social Psychology, has banned the use of hypothesis tests.

Much has already been said about this, very little in support. The ASA points out that this approach may itself have negative effects and a committee is already working on a study. Civil statistician, a person who is true to his nom de blog and never says anything uncivil about anything, is very annoyed. Nature defended the party line.

This is an opportunity for statisticians to remind the world of what these $p$-values mean, exactly, and when they can or can not be trusted. A hypothesis test provides a sense of crisp, pass-or-fail clarity, but this can be a bad thing in situations where there is far too much complexity for crisp anything. How can we get readers to take $p$-values with the grain of salt that they must be taken with?

I agree with the dissenters, in the sense that if I were the editor of this journal, this is not something I would have done. If nothing else, smart people find a way to route around censorship. As noted by some of the commenters above, if you can only provide standard errors, I can do the mental arithmetic to double them to get the approximate 95% confidence intervals. Banning the reporting of the sample size and variance of the data would effectively keep me from solving for the confidence interval, but I doubt even these editors would contemplate such a ban.

The editors claim that $p$-values are `invalid'. In the very narrow sense, this is kinda crazy. The values are based on theorems that work like any other mathematical theorem: given the assumptions and the laws of mathematics that we generally all accept, the conclusion holds with certainty. But once we look further at the context, things aren't so bright-lined:

$\bullet$ A $p$-value is a statement about the likelihood of a given claim about model parameters or a statistic of the data using the probability distribution defined by the model itself. We do not know the true probability distribution of the statistic, and so have to resort to testing the model using the model we are testing.

$\bullet$ A test is in the context of some data gathering or experimental design. Psychology is not Physics, and small details in the experimental design, such as the methods of blinding and scoring, matter immensely and are not necessarily agreed upon. In my experience as a reader, I am far more likely to dismiss a paper because a point of design made the paper too situation-specific or too fallible than because they reject the null with only 89% confidence.

$\bullet$ We are Bayesian when reading papers, in the sense that we come in with prior beliefs about whether a given fact about the world is true, and use the paper to update our belief. At the extreme, a paper on ESP that proves its existence with 99.9% confidence might marginally sway me into thinking something might be there, but in my mind I'd be arguing with the methods and it'll take a hundred such papers before I take it seriously. A paper finding that colorblind people process colors differently from typically-sighted people would get a well, yeah from me even if the hypothesis test finds 85% confidence, and in my mind I'd think about how the experiment could have been improved to get better results next time.

A corollary to this last bullet point is that the editors are also Bayesian, and are inclined to believe some theories more than others. One editor may dislike the theory of rational addiction, for example, and then what keeps the editor from desk rejecting any papers that support the theory? Having only qualitative information means one less check on biases like these.

The full set of bullet points shows how a crisp $p$-value can be misleading, in terms of the modeling, of the experiment as a whole, and the manner in which readers digest the information. Assuming Normally-distributed data, the $p$-value can be derived from first principles, but the statement that the reader should reject the null with probability $1-p$ requires accepting that nothing went wrong in the context that led up to that number. (By the way, $1-p$ is the $q$-value, and I sometimes wish that people reported it instead.)

Psychological problems

Psychology is not physics, where context can be controlled much more easily. To talk meta-context, a psychology study has so many challenges and confounders that the typical $p$-value in a psychology journal is a qualitatively different thing from a $p$-value in a biology or physics journal. A $p$-value can be read as a claim about the odds of getting the same result if you repeat the experiment, but defining what goes into a correct replication is itself a challenge the psych literature is grappling with in the present day. [But yes, there are high-quality, simple, reproducible psych experiments and badly designed physics experiments.]

Second, your revolution in the understanding of drosophila is going to be upstaged in the papers by the most piddling result from a psychology lab, every time. There are press agents, journalists, pop psychologists, and people selling books who have very little interest in the complexities of a study in context, and every interest in finding a study that `proves' a given agendum, and they have a large population of readers who are happy to eat it up.

Maybe you recall the study that obesity is contagious, perhaps because you read about it in Time magazine. With lower likelihood, you saw the follow-up studies that questioned whether the contagion effect was real, or could be explained away by the simple fact that similar people like to hang out together (homophily). Much to their credit, Slate did a write-up of some of the contrary papers. Or maybe you saw the later study that found that obesity contagion is reasonably robust to homophily effects.

I'm not going to wade into whether obesity patterns show contagion effects beyond homophily here, but am going to acknowledge that finding the answer is an imperfect process that can't be summarized by any single statistic. Meanwhile, the journalists looking for the biggest story for Time magazine aren't going to wade into the question either, but will be comfortable stopping at the first step.

So I think it's an interesting counterfactual to ask what the journalists and other one-step authors would do if a psychology journal didn't provide a simple yes-or-no and had to acknowledge that any one study is only good for updating our beliefs by a step.

I commend the editors of the BASP, for being bold and running that experiment. It doesn't take much time in the psychology literature to learn that our brains are always eager to jump on cognitive shortcuts, yet it is the job of a researcher to pave the long road. No, if I were editor I would never ban $p$-values—I've pointed out a few arguments against doing so above, and the links at the head of this column provide many more valid reasons—but these editors have taken a big step in a discussion that has to happen about how we can report statistical results in the social sciences in a manner that accommodates all the uncertainty that comes before we get to the point where we can assume a $t$ distribution.


3 March 15.

Overlapping bus lines


I have at this point become a regular at the Open Data Day hackathon, hosted at the World Bank, organized by a coalition including the WB, Code for DC and the hyperproductive guy behind GovTrack.us.

This year, I worked with the transportation group, which is acting as a sort of outside consultant to a number of cities around the world. My understanding of the history of bus lines in any city is that bus lines start off with some enterprising individual who decides to buy a van and charge for rides. Routes are decided by the individual, not by some central planner. With several profit-maximizing competitors, especially lucrative routes will be overcrowded with redundant lines relative to what a central planner could do, taking into account congestion, pollution, even headways, and system complexity.

Many places see a consolidation. For example, the Washington Metropolitan Area Transit Authority was formed by tying together many existing private lines. Over the course of decades, some changes were made to consolidate. The process of tweaking the lines from the 1900s still slowly continues to this day.

Measuring overlap

Here's where the data comes in. A Bank project [led by Jacqueline Klopp and Sarah Williams] developed a map of Nairobi's bus routes, by sending people out on the bus with a GPS-enabled gadget, to record the position every time the bus stopped. The question the organizers [Holly Krambeck, Aaron Dibner-Dunlap] brought to Open Data Day: how much redundancy is there in Nairobi's system, and how does it compare to that of other systems?

We defined an overlap as having two stops with latitude and longitude each within .0001 degrees of each other: roughly ten meters, which is a short enough walk that you could point and say `go stand over there for your next bus'. It also makes the geographic component of the problem trivial, because we can just ask SQL to find (rounded) numbers that match, without involving Pythagoras.
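
As a sketch, the core of that trick is nothing but a rounded self-join. (The real script at the end of this post does the same thing via a routes_w_latlon table, so that stops are tied to routes.)

select count(*)
    from stops A, stops B
    where round(A.stop_lat, 4) = round(B.stop_lat, 4)
    and round(A.stop_lon, 4) = round(B.stop_lon, 4);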

GTFS data is arranged in routes, which each have one or more trips. We considered only route overlaps, which may have a significant effect on our final results if night bus trips are very different from day bus trips. Modifying the code below to account for time is left as a future direction for now.

The data for Chicago's CTA, Los Angeles, and the DC area's WMATA have both bus and subway/el routes.

On the horizontal axis of this plot, we have the percent overlap between two given routes, and on the vertical axis, we have the density of route pairs, among route pairs that have any overlap at all. In all cities, about 90% ($\pm$2%) of routes have no overlap, and are excluded from this density.

[Figure: for each transit system, the density of route-pair percent overlap, among pairs with any overlap.]
The hunch from our WB transportation experts was right: WMATA, LA, and CTA have pretty similar plots, but Nairobi's plot meanders for a while with a lot of density even up to 30% overlap.

The map and the terrain

In a perfect world, the bus map would form a grid. For example, Chicago is almost entirely a grid, with major streets at regular intervals that go forever (e.g., Western Ave changes names at the North and South ends, but runs for 48 km). The CTA's bus map looks like the city map, with a bus straight down each major street. The overlap for any N-S bus with any E-W bus is a single intersection. The routes that have a lot of overlap are the ones downtown, along the waterfront, and on a few streets like Michigan Ave.

Further East and in older cities, things fall apart and the ideal of one-street-one-bus is simply impossible.

Thad Kerosky fed the above data to QGIS to put the stops that have nonzero overlap on a map:

[Map: Nairobi stops with nonzero overlap, plotted in QGIS.]
The bus overlaps basically produce a map of the major arterial roads and major bottlenecks in the city grid (plus the airport).

So the problem seems to partly be geography, and there's not much that can be done about that. The last time a government had the clout to blow out the historic map to produce a grid was maybe 1870, and there aren't any countries left with Emperors who can mandate this kind of thing. But that doesn't preclude the possibility of coordinating routes along those arteries in a number of ways, such as setting up trunk-and-branch sets of coordinated schedules.

How to

Keeping with the theme of not overengineering, we used a set of command line tools to do the analysis. We had a version in Python until one of the team members pointed out that even that was unnecessary. You will need SQLite, Apophenia, and Gnuplot. We also rely on a GNU sed feature and some bashisms. It processes WMATA's 1.6 million stop times on my netbook in about 70 seconds.

Start off by saving a GTFS feed as a zip file (this is the norm, e.g., from the GTFS Data Exchange), save this script as, e.g., count_overlaps, then run

Zip=cta.zip . count_overlaps

to produce the pairs table in the database and the histogram for the given transit system.

The script produces individual plots via Gnuplot, while the plot above was via R's ggplot, which in this case isn't doing anything that Gnuplot couldn't do.

#!/usr/bin/bash   #uses some bashisms at the end

if [ "$Zip" = "" ] ; then
  echo Please set the Zip environment variable to the zip file holding your GTFS data
else #the rest of this file

base=`basename $Zip .zip`
mkdir $base
cd $base
unzip ../$Zip

DB=${base}.db

for i in *.txt; do sed -i -e 's/|//g' -e "s/'//g"  $i; done
for i in *.txt; do apop_text_to_db $i `basename $i .txt` $DB; done

sqlite3 $DB "create index idx_trips_trip_id on trips(trip_id);"
sqlite3 $DB "create index idx_trips_route_id on trips(route_id);"
sqlite3 $DB "create index idx_stop_times_trip_id on stop_times(trip_id);"

sqlite3 $DB << ——

create table routes_w_latlon as
    select distinct route_id, s.stop_id, round(stop_lat, 4) as stop_lat,
       round(stop_lon, 4) as stop_lon 
       from stops s, stop_times t, trips tr
       where s.stop_id = t.stop_id
       and tr.trip_id=t.trip_id ;

create index idx_trips_rid on routes_w_latlon(route_id);
create index idx_trips_lat on routes_w_latlon(stop_lat);
create index idx_trips_lon on routes_w_latlon(stop_lon);

create table pairs as
  select routea, routeb,
    ((select count(*) from
     (select distinct * from
     routes_w_latlon L, routes_w_latlon R
     where
     L.route_id = routea
     and
     R.route_id = routeb
     and L.stop_lat==R.stop_lat and L.stop_lon==R.stop_lon))
    +0.0)
    / (select count(*) from routes_w_latlon where route_id=routea or route_id=routeb)
     as corr
    from 
    (select distinct route_id as routea from routes),
    (select distinct route_id as routeb from routes)
    where routea+0.0<=routeb+0.0;
——

cat <(echo "set key off;
set xlabel 'Pct overlap';
set ylabel 'Count';
set title '$base' ;
set xrange [0:.6];
set term png size 1024,800;
set out '${base}.png';
plot '-' with impulses lt 3") <(apop_plot_query -f- -H0 $DB "select corr from pairs where corr > 0 and corr < 1"| sed '1,2d') | gnuplot

fi


23 January 15.

m4 without the misery


I presented at the DC Hack and Tell last week. It was fun to attend, and fun to present. I set the bar low by presenting my malcontent management system, which is really just a set of shell and m4 scripts.

I have such enthusiasm for m4, a macro language from the 1970s that is part of the POSIX standard, because there aren't really m4 files, the way there are C or Python or LaTeX files. Instead, you have a C or Python or LaTeX file that happens to have some m4 strewn about. Got something that is repetitive that a macro or two could clean up? Throw an m4 macro right there in the file and make immediate use. And so, m4 is the hammer I use to even out nails everywhere. C macros can't generate other C macros (see example below), and LaTeX macros are often fragile, so even where native macros are available it sometimes makes sense to use m4 macros instead. Even the markup for this column is via m4.





What discussion there was after the hack and tell was about how I can even use m4, given its terrible reputation as a byzantine mess. So an inquiry must be made about what I'm doing differently that makes this tolerable. I was in a similar situation with the C programming language, and my answer to how I use C differently from sources that insist that it's a byzantine mess turned into a lengthy opus.

M4 is simpler, so my answer is only a page or two.

I assume you're already familiar with m4. If not, there are a host of tutorials out there for you. I have two earlier entries, starting here. Frankly, 90% of what you need is m4_define, so if you get that much, you're probably set to start trying things with it.

As above, m4 passes not-m4-special text without complaint, but it is very aggressive in substituting anything that m4 recognizes. This leads to the advice that for every pair of parens, you should have a pair of quote-endquote markers to protect the text, which leads to m4-using files with a million quote-endquote markers.

I've found that this advice is overcautious by far.

In macro definitions, the `laziness' of the expansion is critical (do I evaluate $# when the macro is defined, when it is first called, or by a submacro defined by this macro?), and the quote-endquote markers are the mechanism to control that timing. This is a delicate issue that every macro language capable of macro-defining macros runs into. My only advice is to read the page of the manual on how macro expansion occurs very carefully. The first sentence is a bit misleading, though, because the scan of the text is itself treated as a macro expansion, so one layer of quote-endquote markers is stripped, dnl is handled, et cetera. But because I am focused on writing my other-language text with support from m4, not building a towering m4 edifice, my concern with careful laziness control is not as great.

So my approach, instead of putting hundreds of quotes and endquotes all over my document, is to know what the m4 specials are, and make sure they never appear in my text unless I made an explicit choice to put them there.

The specials

Outside of macro definitions themselves (where dollar signs matter), there are five sets of m4-special tokens. There's a way to handle each of them.

  • quote-endquote markers. How do you get a quote marker or an endquote marker into your text without m4 eating it? The short answer: you can't. So we need to change those markers to something that we are confident will never appear in text. I use <| and |>.

    The longer answer, by the way, is that you would use a sequence like

    m4_changequote(LEFT,RIGHT)<|m4_changequote(<|,|>)

    Easier to just go with something that will never be an issue.

  • Named macros. You are probably using GNU m4—I personally have yet to encounter a non-GNU implementation. GNU m4 has a -P option that isn't POSIX standard, but is really essential. It puts an m4_ before every macro defined by m4: define $\Rightarrow$ m4_define, dnl $\Rightarrow$ m4_dnl, ifdef $\Rightarrow$ m4_ifdef, and so on. We're closer to worry free: there could easily be the word define in plain text, but the string m4_define only appears in macro definitions and blog entries about m4.

    We can also limit the risk of accidentally calling a macro by expanding it to something else iff it is followed by parens. The m4 documentation recommends defining a define_blind macro:

    m4_changequote(<|, |>)
    m4_define(<|define_blind|>, <|_define_blind(<|$1|>, <|$2|>, <|$|><|#|>, <|$|><|0|>)|>)
    m4_define(<|_define_blind|>, <|m4_define(<|$1|>, <|m4_ifelse(<|$3|>, <|0|>, <|<|$4|>|>,
                <|$2|>)|>)|>)
    
    sample usage:
    define_blind(test, Hellooo there.)
    
    test this mic: test()
    

    Start m4 -P from your command line and paste this in if you want to try it. You'll see that when test is used in plain text, it will be replaced with test; if parens follow it, it will be replaced with the macro expansion.

  • #comments. Anything after an octothorp is not expanded by m4, but is passed through to output. I think this is primarily useful for debugging. But especially since the advent of Twitter, hashtags appear in plain text all over the place, so suppress this feature via m4_changecom().

  • Parens. If parens aren't after a macro name, they are ignored. Balanced parens are always OK. The only annoyance is the case when you just want to have a haphazard unbalanced open- or end-paren in your text, ). You'll have to wrap it in quote-endquote markers <|)|> if it's inside of a macro call, or we can't expect m4 to know where the macro is truly supposed to end.

  • Commas. There's no way within m4 to change the comma separator. This can mess up the count of arguments in some cases, and m4 removes the space after any comma inside a macro, which looks bad in human text. My three-step solution:

    • Use sed, the stream editor, to replace every instance of , with <|,|>
    • Use sed to replace every instance of ~~ with a comma.
    • Pipe the output of sed to m4.

    That is, I used sed to turn ~~ into the m4 argument separator, so plain commas in the text are never argument separators themselves.

So, let's reduce the lessons from this list:

  • m4 -P
  • m4_changequote(<|, |>)
  • m4_changecom()
  • Use a unique not-a-comma for a separator, e.g., ~~
  • Use sed to replace all actual commas with <|,|>.

Is that too much to remember? Are you a bash or zsh user? Here's a function to paste onto the command line or your .bashrc or .zshrc:

m5 () { cat $* | sed 's/,/<|,|>/g' | sed 's/\~\~/,/g' | \
               m4 -P <(echo "m4_changecom()m4_changequote(<|, |>)") -
      }

Now you can run things like m5 myfile.m4 > myfile.py.
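
For example, here is a hypothetical myfile.m4: ordinary Python plus one macro, using the conventions above. The getter macro and the file are made up for illustration. Note that ~~ separates the macro's arguments, while the ordinary comma inside the body passes through untouched.

m4_define(<|getter|>~~<|def get_$1(d):
    return d.get("$1", None)
|>)
getter(name)
getter(email)
getter(age)

Running m5 myfile.m4 > myfile.py writes out three near-identical Python functions, get_name, get_email, and get_age.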

At this point, unless you are writing m4 macros to generate m4 macros, you can write your Python or HTML or what-have-you without regard to m4 syntax, because as long as you aren't writing m4_something, <|, |>, or ~~ in your text, m4 via this pipeline just passes your text through either to your defined macros or standard output without incident.

Are there ways to break it? Absolutely. Can you use these steps to more easily build macros upon macros upon macros? Yes, but that's probably a bad idea in any macro system. Can you use this to replace repetitive and verbose syntax with something much simpler, more legible, and maintainable? Yes, when implemented with appropriate common sense.

An example

Here is a sample use. C macro names have to be plain text—we can't use macro tricks when naming macros. But we can use m4 to write C macros without such restrictions. This example is not especially good form (srsly) but gives you the idea. Cut and paste this entire example onto your command line to create pants_src.c, pantsprogram, pants, and octopants.


#The above shell function again:
m5 () { cat $* | sed 's/,/<|,|>/g' | sed 's/\~\~/,/g' |\
             m4 -P <(echo "m4_changecom()m4_changequote(<|, |>)") - 
      }

# Write m4-imbued C code to a file
cat << '-- --' > pants_src.c

#include <stdio.h>

m4_define(Def_print_macro~~
  FILE *f_$1 = NULL;
  #define print_to_$1(expr, fmt)                   \
    {if (!f_$1) f_$1 = fopen("$1", "a+");          \
    fprintf(f_$1, #expr "== " #fmt "\n", (expr));  \
    }
)

int main(){
    Def_print_macro(pants)
    Def_print_macro(octopants)

    print_to_pants(1+1, %i);
    print_to_octopants(4+4, %i);

    char *ko="khaki octopus";
    print_to_octopants(ko, %s);
}

-- --

# compile. Use clang if you prefer.
# Or just call "m5 pants_src.c" to view the post-processed pure C file.
m5 pants_src.c | gcc -xc - -o pantsprogram

#Run and inspect the two output files.
./pantsprogram
cat pants
echo
cat octopants


3 January 15.

A version control tutorial with Git


This is the revision control chapter of 21st Century C, by me, published by O'Reilly Media. I had to sign over all rights to the book—three times over, for some reason I'm still not clear on. But I was clear throughout the contract negotiations of both first and second editions that I retain the right to publish my writing on this blog, and that I retain the movie rights. The great majority of the content in the book is available via the tip-a-day series from this post et seq, or the chapter-long post on parallel processing in C.

The chapter on revision control gets especially positive reviews. One person even offered to translate it into Portuguese; I had to refer him to O'Reilly and I don't know what happened after that. It's in the book because I think it'd be hard to be writing production C code in the present day without knowing how to pull code from a git repository. But in the other direction, this tutorial is not really C-specific at all.

So, here it is, with some revisions, in a free-as-in-beer format, to help those of you who are not yet habitual revision control users to become so. If you like this chapter, maybe let the book buying public know by saying something nice on Goodreads or Amazon. And if you think it'll make a good movie, give me a call.

This chapter is about revision control systems (RCSes), which maintain snapshots of the many different versions of a project as it develops, such as the stages in the development of a book, a tortured love letter, or a program.

Using an RCS has changed how I work. To explain it with a metaphor, think of writing as rock climbing. If you're not a rock climber yourself, you might picture a solid rock wall and the intimidating and life-threatening task of getting to the top. But in the modern day, the process is much more incremental. Attached to a rope, you climb a few meters, and then clip the rope to the wall using specialized equipment (cams, pins, carabiners, and so on). Now, if you fall, your rope will catch at the last carabiner, which is reasonably safe. While on the wall, your focus is not reaching the top, but the much more reachable problem of finding where you can clip your next carabiner.

Coming back to writing with an RCS, a day's work is no longer a featureless slog toward the summit, but a sequence of small steps. What one feature could I add? What one problem could I fix? Once a step is made and you are sure that your code base is in a safe and clean state, commit a revision, and if your next step turns out disastrously, you can fall back to the revision you just committed instead of starting from the beginning.

But structuring the writing process and allowing us to mark safe points is just the beginning:

  • Our filesystem now has a time dimension. We can query the RCS's repository of file information to see what a file looked like last week and how it changed from then to now. Even without the other powers, I have found that this alone makes me a more confident writer.

  • We can keep track of multiple versions of a project, such as my copy and my coauthor's copy. Even within my own work, I may want one version of a project (a branch) with an experimental feature, which should be kept segregated from the stable version that needs to be able to run without surprises.

  • GitHub has about 218,000 projects that self-report as being primarily in C as of this writing, and there are more C projects in other, smaller RCS repository hosts, such as GNU's Savannah. Even if you aren't going to modify the code, cloning these repositories is a quick way to get the program or library onto your hard drive for your own use. When your own project is ready for public use (or before then), you can make the repository public as another means of distribution.

  • Now that you and I both have versions of the same project, and both have equal ability to hack our versions of the code base, revision control gives us the power to merge together our multiple threads as easily as possible.

This chapter will cover Git, which is a distributed revision control system, meaning that any given copy of the project works as a standalone repository of the project and its history. There are others, with Mercurial and Bazaar the other front-runners in the category. There is largely a one-to-one mapping among the features of these systems, and what major differences had existed have merged over the years, so you should be able to pick the others up immediately after reading this chapter.

Changes via diff

The most rudimentary means of revision control is via diff and patch, which are POSIX-standard and therefore most certainly on your system. You probably have two files on your drive somewhere that are reasonably similar; if not, grab any text file, change a few lines, and save the modified version with a new name. Try:

diff f1.c  f2.c

and you will get a listing, a little more machine-readable than human-readable, that shows the lines that have changed between the two files. Piping output to a text file via diff f1.c f2.c > diffs and then opening diffs in your text editor may give you a colorized version that is easier to follow. You will see some lines giving the name of the file and location within the file, perhaps a few lines of context that did not change between the two files, and lines beginning with + and - showing the lines that got added and removed. Run diff with the -u flag to get a few lines of context around the additions and subtractions.
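
For example, if the only change between f1.c and f2.c were one printf line, diff -u f1.c f2.c would print a hunk shaped roughly like this (the file contents here are invented for illustration):

--- f1.c	2015-01-03 10:00:00
+++ f2.c	2015-01-03 10:05:00
@@ -1,4 +1,4 @@
 #include <stdio.h>
 int main(){
-    printf("hello\n");
+    printf("hello, world\n");
 }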

Given two directories holding two versions of your project, v1 and v2, generate a single diff file in the unified diff format for the entire directories via the recursive (-r) option:

diff -ur v1 v2 > diff-v1v2

The patch command reads diff files and executes the changes listed there. If you and a friend both have v1 of the project, you could send diff-v1v2 to your friend, and she could run:

patch < diff-v1v2

to apply all of your changes to her copy of v1.

Or, if you have no friends, you can run diff from time to time on your own code and thus keep a record of the changes you have made over time. If you find that you have inserted a bug in your code, the diffs are the first place to look for hints about what you touched that you shouldn't have. If that isn't enough, and you already deleted v1, you could run the patch in reverse from the v2 directory, patch -R < diff-v1v2, reverting version 2 back to version 1. If you were at version 4, you could even conceivably run a sequence of diffs to move further back in time:

cd v4
patch -R < diff-v3v4
patch -R < diff-v2v3
patch -R < diff-v1v2

I say conceivably because maintaining a sequence of diffs like this is tedious and error-prone. Thus, the revision control system, which will make and track the diffs for you.

Git's Objects

Git is a C program like any other, and is based on a small set of objects. The key object is the commit object, which is akin to a unified diff file. Given a previous commit object and some changes from that baseline, a new commit object encapsulates the information. It gets some support from the index, which is a list of the changes registered since the last commit object, the primary use of which will be in generating the next commit object.

The commit objects link together to form a tree much like any other tree. Each commit object will have (at least) one parent commit object. Stepping up and down the tree is akin to using patch and patch -R to step among versions.

The repository itself is not formally a single object in the Git source code, but I think of it as an object, because the usual operations one would define, such as new, copy, and free, apply to the entire repository. Get a new repository in the directory you are working in via:

git init

OK, you now have a revision control system in place. You might not see it, because Git stores all its files in a directory named .git, where the dot means that all the usual utilities like ls will take it to be hidden. You can look for it via, e.g., ls -a or via a show hidden files option in your favorite file manager.

Alternatively, copy a repository via git clone. This is how you would get a project from Savannah or Github. To get the source code for Git using git:

git clone https://github.com/gitster/git.git

The reader may also be interested in cloning the repository with the examples for this book:

git clone https://github.com/b-k/21st-Century-Examples.git

If you want to test something on a repository in ~/myrepo and are worried that you might break something, go to a temp directory (say mkdir ~/tmp; cd ~/tmp), clone your repository with git clone ~/myrepo, and experiment away. Deleting the clone when done (rm -rf ~/tmp/myrepo) has no effect on the original.

Given that all the data about a repository is in the .git subdirectory of your project directory, the analog to freeing a repository is simple:

rm -rf .git

Having the whole repository so self-contained means that you can make spare copies to shunt between home and work, copy everything to a temp directory for a quick experiment, and so on, without much hassle.

We're almost ready to generate some commit objects, but because they summarize diffs since the starting point or a prior commit, we're going to have to have on hand some diffs to commit. The index (Git source: struct index_state) is a list of changes that are to be bundled into the next commit. It exists because we don't actually want every change in the project directory to be recorded. For example, gnomes.c and gnomes.h will beget gnomes.o and the executable gnomes. Your RCS should track gnomes.c and gnomes.h and let the others regenerate as needed. So the key operation with the index is adding elements to its list of changes. Use:

git add gnomes.c gnomes.h

to add these files to the index. Other typical changes to the list of files tracked also need to be recorded in the index:

git add newfile
git rm oldfile
git mv flie file

Changes you made to files that are already tracked by Git are not automatically added to the index, which might be a surprise to users of other RCSes (but see below). Add each individually via git add changedfile, or use:

git add -u

to add to the index changes to all the files Git already tracks.

At some point you have enough changes listed in the index that they should be recorded as a commit object in the repository. Generate a new commit object via:

git commit -a -m "here is an initial commit."

The -m flag attaches a message to the revision, which you'll read when you run git log later on. If you omit the message, then Git will start the text editor specified in the environment variable EDITOR so you can enter it (the default editor is typically vi; export that variable in your shell's startup script, e.g., .bashrc or .zshrc, if you want something different).
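
For example, one line in the startup script does it (substitute the editor of your choice):

export EDITOR=emacs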

The -a flag tells Git that there are good odds that I forgot to run git add -u, so please run it just before committing. In practice, this means that you never have to run git add -u explicitly, as long as you always remember the -a flag in git commit -a.

A warning: It is easy to find Git experts who are concerned with generating a coherent, clean narrative from their commits. Instead of commit messages like "added an index object, plus some bug fixes along the way," an expert Git author would create two commits, one with the message "added an index object" and one with "bug fixes." These authors have such control because nothing is added to the index by default, so they can add only enough to express one precise change in the code, write the index to a commit object, then add a new set of items to a clean index to generate the next commit object. I found one blogger who took several pages to describe his commit routine: "For the most complicated cases, I will print out the diffs, read them over, and mark them up in six colors of highlighter…" However, until you become a Git expert, this will be much more control over the index than you really need or want. That is, not using -a with git commit is an advanced use that many people never bother with. In a perfect world, the -a would be the default, but it isn't, so don't forget it.

Calling git commit -a writes a new commit object to the repository based on all the changes the index was able to track, and clears the index. Having saved your work, you can now continue to add more. Further—and this is the real, major benefit of revision control so far—you can delete whatever you want, confident that it can be recovered if you need it back. Don't clutter up the code with large blocks of commented-out obsolete routines—delete!

A useful tip: After you commit, you will almost certainly slap your forehead and realize something you forgot. Instead of performing another commit, you can run git commit --amend -a to redo your last commit.

An aside: Diff/Snapshot Duality

Physicists sometimes prefer to think of light as a wave and sometimes as a particle; similarly, a commit object is sometimes best thought of as a complete snapshot of the project at a moment in time and sometimes as a diff from its parent. From either perspective, it includes a record of the author, the name of the object (as we'll see later), the message you attached via the -m flag, and (unless it is the initial commit) a pointer to the parent commit object(s).

Internally, is a commit a diff or a snapshot? It could be either or both. There was once a time when Git always stored a snapshot, unless you ran git gc (garbage collect) to compress the set of snapshots into a set of deltas (aka diffs). Users complained about having to remember to run git gc, so it now runs automatically after certain commands, meaning that Git is probably (but by no means always) storing diffs. [end aside]

Having generated a commit object, your interactions with it will mostly consist of looking at its contents. You'll use git diff to see the diffs that are the core of the commit object and git log to see the metadata.
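
For reference, a single entry in the git log output looks something like this (the author and date are of course invented):

commit fe9c49cddac5150dc974de1f7248a1c5e3b33e89
Author: A. Author <author@example.com>
Date:   Sat Jan 3 12:00:00 2015 -0500

    here is an initial commit.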

The key metadata is the name of the object, which is assigned via an unpleasant but sensible naming convention: the SHA1 hash, a 40-digit hexadecimal number that can be assigned to an object, in a manner that lets us assume that no two objects will have the same hash, and that the same object will have the same name in every copy of the repository. When you commit your files, you'll see the first few digits of the hash on the screen, and you can run git log to see the list of commit objects in the history of the current commit object, listed by their hash and the human-language message you wrote when you did the commit (and see git help log for the other available metadata). Fortunately, you need only as much of the hash as will uniquely identify your commit. So if you look at the log and decide that you want to check out revision number fe9c49cddac5150dc974de1f7248a1c5e3b33e89, you can do so with:

git checkout fe9c4

This does the sort of time-travel via diffs that patch almost provided, rewinding to the state of the project at commit fe9c4.

Because a given commit only has pointers to its parents, not its children, when you check git log after checking out an old commit, you will see the trace of objects that led up to this commit, but not later commits. The rarely used git reflog will show you the full list of commit objects the repository knows about, but the easier means of jumping back to the most current version of the project is via a tag, a human-friendly name that you won't have to look up in the log. Tags are maintained as separate objects in the repository and hold a pointer to a commit object being tagged. The most frequently used tag is master, which refers to the last commit object on the master branch (which, because we haven't covered branching yet, is probably the only branch you have). Thus, to return from back in time to the latest state, use:

git checkout master

Getting back to git diff, it shows what changes you have made since the last committed revision. The output is what would be written to the next commit object via git commit -a. As with the output from the plain diff program, git diff > diffs will write to a file that may be more legible in your colorized text editor.

Without arguments, git diff shows the diff between the index and what is in the project directory; if you haven't added anything to the index yet, this will be every change since the last commit. With one commit object name, git diff shows the sequence of changes between that commit and what is in the project directory. With two names, it shows the sequence of changes from one commit to the other:

git diff               #Show the diffs between the working directory and the index.
git diff --staged      #Show the diffs between the index and the previous commit.
git diff 234e2a        #Show the diffs between the working directory and the given commit object.
git diff 234e2a 8b90ac #Show the changes from one commit object to another.

A useful tip: There are a few naming conveniences to save you some hexadecimal. The name HEAD refers to the last checked-out commit. This is usually the tip of a branch; when it isn't, git error messages will refer to this as a "detached HEAD."

Append ~1 to a name to refer to the named commit's parent, ~2 to refer to its grandparent, and so on. Thus, all of the following are valid:

git diff HEAD~4        #Compare the working directory to four commits ago.
git checkout master~1  #Check out the predecessor to the head of the master branch.
git checkout master~   #Shorthand for the same.
git diff b8097~ b8097  #See what changed in commit b8097.

At this point, you know how to:

  • Save frequent incremental revisions of your project.
  • Get a log of your committed revisions.
  • Find out what you changed or added recently.
  • Check out earlier versions so that you can recover earlier work if needed.

Having a backup system organized enough that you can delete code with confidence and recover as needed will already make you a better writer.

The Stash

Commit objects are the reference points from which most Git activity occurs. For example, Git prefers to apply patches relative to a commit, and you can jump to any commit, but if you jump away from a working directory that does not match a commit you have no way to jump back. When there are uncommitted changes in the current working directory, Git will warn you that you are not at a commit and will typically refuse to perform the operation you asked it to do. One way to go back to a commit would be to write down all the work you had done since the last commit, revert your project to the last commit, execute the operation, then redo the saved work after you are finished jumping or patching.

Thus we employ the stash, a special commit object mostly equivalent to what you would get from git commit -a, but with a few special features, such as retaining all the untracked junk in your working directory. Here is the typical procedure:

git stash # Code is now as it was at last checkin.
git checkout fe9c4

# Look around here.

git checkout master    # Or whatever commit you had started with
# Code is now as it was at last checkin, so replay stashed diffs with:
git stash pop

Another sometimes-appropriate alternative, when you want to check out something else but have changes in your working directory that you don't care to keep, is git reset --hard, which takes the working directory back to the state it was in when you last checked out. The command sounds severe because it is: you are about to throw away all work you have done since the last checkout.

Trees and Their Branches

There is one tree in a repository, which got generated when the first author of a new repository ran git init. You are probably familiar with tree data structures, consisting of a set of nodes, where each node has links to some number of children and a link to a parent (and in exotic trees like Git's, possibly several parents).

Indeed, all commit objects but the initial one have a parent, and the object records the diffs between itself and the parent commit. The terminal node in the sequence, the tip of the branch, is tagged with a branch name. For our purposes, there is a one-to-one correspondence between branch tips and the series of diffs that led to that branch. The one-to-one correspondence means we can interchangeably refer to branches and the commit object at the tip of the branch. Thus, if the tip of the master branch is commit 234a3d, then git checkout master and git checkout 234a3d are entirely equivalent (until a new commit gets written, and that takes on the master label). It also means that the list of commit objects on a branch can be rederived at any time by starting at the commit at the named tip and tracing back to the origin of the tree.

The typical custom is to keep the master branch fully functional at all times. When you want to add a new feature or try a new thread of inquiry, create a new branch for it. When the branch is fully functioning, you will be able to merge the new feature back into the master using the methods to follow.

There are two ways to create a new branch splitting off from the present state of your project:

git branch newleaf       # Create a new branch...
git checkout newleaf     # then check out the branch you just created.
    # Or execute both steps at once with the equivalent:
git checkout -b newleaf

Having created the new branch, switch between the tips of the two branches via git checkout master and git checkout newleaf.

What branch are you on right now? Find out with:

git branch

which will list all branches and put a * by the one that is currently active.

What would happen if you were to build a time machine, go back to before you were born, and kill your parents? If we learned anything from science fiction, it's that if we change history, the present doesn't change, but a new alternate history splinters off. So if you check out an old version, make changes, and check in a new commit object with your newly made changes, then you now have a new branch distinct from the master branch. You will find via git branch that when the past forks like this, you will be on (no branch). Untagged branches tend to create problems, so if ever you find that you are doing work on (no branch), then run git branch -m new_branch_name to name the branch to which you've just splintered.

Sidebar: Visual Aids

There are several graphical interfaces to be had, which are especially useful when tracing how branches diverged and merged. Try gitk or git gui for Tk-based GUIs, tig for a console (curses) based GUI, or git instaweb to start a web server that you can interact with in your browser, or ask your package manager or Internet search engine for several more.

Merging

So far, we have generated new commit objects by starting with a commit object as a starting point and applying a list of diffs from the index. A branch is also a series of diffs, so given an arbitrary commit object and a list of diffs from a branch, we should be able to create a new commit object in which the branch's diffs are applied to the existing commit object. This is a merge. To merge all the changes that occurred over the course of newleaf back into master, switch to master and use git merge:

git checkout master
git merge newleaf

For example, you have used a branch off of master to develop a new feature, and it finally passes all tests; then applying all the diffs from the development branch to master would create a new commit object with the new feature soundly in place.

Let us say that, while working on the new feature, you never checked out master and so made no changes to it. Then applying the sequence of diffs from the other branch would simply be a fast replay of all of the changes recorded in each commit object in the branch, which Git calls a fast-forward.

But if you made any changes to master, then this is no longer a simple question of a fast application of all of the diffs. For example, say that at the point where the branch split off, gnomes.c had:

short int height_inches;

In master, you removed the derogatory type:

int height_inches;

The purpose of newleaf was to convert to metric:

short int height_cm;

At this point, Git is stymied. Knowing how to combine these lines requires knowing what you as a human intended. Git's solution is to modify your text file to include both versions, something like:

<<<<<<< HEAD
int height_inches;
=======
short int height_cm;
>>>>>>> 3c3c3c

The merge is put on hold, waiting for you to edit the file to express the change you would like to see. In this case, you would probably reduce the five-line chunk Git left in the text file to:

int height_cm;

Here is the procedure for committing a non-fast-forward merge, meaning that there have been changes in both branches since they diverged (a condensed command-line version follows the list):

  • Run git merge other_branch.
  • In all likelihood, get told that there are conflicts you have to resolve.
  • Check the list of unmerged files using git status.
  • Pick a file to manually check on. Open it in a text editor and find the merge-me marks if it is a content conflict. If it's a filename or file position conflict, move the file into place.
  • Run git add your_now_fixed_file.
  • Repeat steps 3--5 until all unmerged files are checked in.
  • Run git commit to finalize the merge.
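
Here is a sketch of what that session might look like, assuming the only conflict is the height_inches/height_cm clash in gnomes.c from above:

git merge newleaf    # reports a conflict in gnomes.c
git status           # lists gnomes.c as unmerged
# ...open gnomes.c in your text editor and reduce the <<<<<<</=======/>>>>>>>
#    chunk to the one line you actually want...
git add gnomes.c     # mark the conflict as resolved
git commit           # finalize the merge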

Take comfort in all this manual work. Git is conservative in merging and won't automatically do anything that could, under some storyline, cause you to lose work.

When you are done with the merge, all of the relevant diffs that occurred in the side branch are represented in the final commit object of the merged-to branch, so the custom is to delete the side branch:

git branch -d other_branch

The other_branch tag is deleted, but the commit objects that led up to it are still in the repository for your reference.

The Rebase

Say you have a main branch and split off a testing branch from it on Monday. Then on Tuesday through Thursday, you make extensive changes to both the main and testing branch. On Friday, when you try to merge the test branch back into the main, you have an overwhelming number of little conflicts to resolve.

Let's start the week over. You split the testing branch off from the main branch on Monday, meaning that the last commits on both branches share a common ancestor of Monday's commit on the main branch. On Tuesday, you have a new commit on the main branch; let it be commit abcd123. At the end of the day, you replay all the diffs that occurred on the main branch onto the testing branch:

git checkout testing  # get on the testing branch
git rebase abcd123  # or equivalently: git rebase main

With the rebase command, all the changes made on the main branch since the common ancestor are replayed on the testing branch. You might need to resolve some conflicts by hand, but with only one day's worth of changes to reconcile, we can hope that the task of merging is manageable.
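
If a conflict does come up mid-rebase, the routine is much like resolving a merge; a sketch, again assuming the conflicting file is gnomes.c:

git rebase main          # stops at the first change that does not apply cleanly
# ...edit gnomes.c to resolve the conflict, as with a merge...
git add gnomes.c
git rebase --continue    # replay the rest of the changes
# or, to give up and put the branch back the way it was:
git rebase --abort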

Now that all changes up to abcd123 are present in both branches, it is as if the branches had actually split off from that commit, rather than Monday's commit. This is where the name of the procedure comes from: the testing branch has been rebased to split off from a new point on the main branch.

You also perform rebases at the end of Wednesday, Thursday, and Friday, and each of them is reasonably painless, as the testing branch kept up with the changes on the main branch throughout the week.

Rebases are often cast as an advanced use of Git, because other systems that aren't as capable with diff application don't have this technique. But in practice rebasing and merging are about on equal footing: both apply diffs from another branch to produce a commit, and the only question is whether you are tying together the ends of two branches (in which case, merge) or want both branches to continue their separate lives for a while longer (in which case, rebase). The typical usage is to rebase the diffs from the master into the side branch, and merge the diffs from the side branch into the master, so there is a symmetry between the two in practice. And as noted, letting diffs pile up on multiple branches can make the final merge a pain, so it is good form to rebase reasonably often.
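
To make that symmetry concrete, here is a sketch of the typical cycle, assuming branches named master and side_branch:

git checkout side_branch
git rebase master         # pull master's recent diffs into the side branch; repeat often
git checkout master
git merge side_branch     # when the feature is done, an easy (perhaps fast-forward) merge
git branch -d side_branch # the custom, as above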

Remote Repositories

Everything to this point has been occurring within one tree. If you cloned a repository from elsewhere, then at the moment of cloning, you and the origin both have identical trees with identical commit objects. However, you and your colleagues will continue working, so you will all be adding new and different commit objects.

Your repository has a list of remotes, which are pointers to other repositories related to this one elsewhere in the world. If you got your repository via git clone, then the repository from which you cloned is named origin as far as the new repository is concerned. In the typical case, this is the only remote you will ever use.
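
To see the list of remotes and the URLs behind the names, run:

git remote -v   # show each remote's name and its URL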

When you first clone and run git branch, you'll see one lonely branch, regardless of how many branches the origin repository had. But run git branch -a to see all the branches Git knows about, and you will see the remote branches as well as the local ones. If you cloned a repository from GitHub et al., you can use this to check whether other authors have pushed other branches to the central repository.

Those copies of the remote branches in your local repository are snapshots as of the moment you cloned (or last fetched). Next week, to update them with new information from the origin repository, run git fetch.

Now that you have up-to-date copies of the remote branches in your repository, you could merge one with the local branch you are working on using the full name of the remote branch, for example, git merge remotes/origin/master.

Instead of the two-step git fetch; git merge remotes/origin/master, you can update the branch via

git pull origin master

which fetches the remote changes and merges them into your current branch in one step.

The converse is push, which you'll use to update the remote repository with your last commit (not the state of your index or working directory). If you are working on a branch named bbranch and want to push to the remote with the same name, use:

git push origin bbranch

There are good odds that when you push your changes, applying the diffs from your branch to the remote branch will not be a fast-forward (if it is, then your colleagues haven't been doing any work). Resolving a non-fast-forward merge typically requires human intervention, and there is probably not a human at the remote. Thus, Git will allow only fast-forward pushes. How can you guarantee that your push is a fast-forward?

  • Run git pull origin bbranch to get the changes made since your last pull.
  • Merge as seen earlier, wherein you as a human resolve those changes a computer cannot.
  • Run git commit -a -m "dealt with merges".
  • Run git push origin bbranch; now that your branch already includes everything on the remote, the push is a fast-forward and goes through without human intervention (a concrete sketch follows).
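
In command form, the full cycle for the bbranch example might look like:

git push origin bbranch    # rejected: not a fast-forward
git pull origin bbranch    # fetch and merge your colleagues' new commits
# ...resolve any conflicts and git add the fixed files, as above...
git commit -a -m "dealt with merges"
git push origin bbranch    # accepted: now a fast-forward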

To this point, I have assumed that you are on a local branch with the same name as the remote branch (probably master on both sides). If you are crossing names, give a colon-separated pair of source:destination branch names.

git fetch origin new_changes:master #Fetch the remote new_changes branch into local master.
git push origin my_fixes:version2   #Push the local my_fixes branch to the remote branch version2.
git push origin :prune_me           #Delete a remote branch.
git fetch origin new_changes:       #Fetch to no local branch; the result is referenced as FETCH_HEAD.

None of these operations change your current branch, but some create a new branch that you can switch to via the usual git checkout.

Sidebar: The Central Repository

Despite all the discussion of decentralization, the easiest setup for sharing is still to have a central repository that everybody clones, meaning that everybody has the same origin repository. This is how downloading from GitHub and Savannah typically works. When setting up a repository for this sort of thing, use git init --bare, which produces a repository with no working directory, meaning that nobody can actually do work in that directory, and users will have to clone it to do anything at all. There are also some permissions flags that come in handy, such as --shared=group to allow all members of a POSIX group to read and write to the repository.
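
A minimal sketch of that setup, using a hypothetical path on a hypothetical host named server:

# on the server, once:
git init --bare --shared=group /srv/git/project.git
# each user, from his or her own machine:
git clone server:/srv/git/project.git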

You can't push to a branch in a non-bare remote repository that the repository's owner currently has checked out: doing so would leave the owner's working directory out of sync with the branch, so Git refuses by default. If you hit this, ask your colleague to switch to a different branch via git checkout, then push while the target branch is not the one checked out.

Or, your colleague can set up a public bare repository and a private working repository. You push to the public repository, and your colleague pulls the changes to his or her working repository when convenient. [end sidebar]

The structure of a Git repository is not especially complex: there are commit objects representing the changes since the parent commit object, organized into a tree, with an index gathering together the changes to be made in the next commit. But with these elements, you can organize multiple versions of your work, confidently delete things, create experimental branches and merge them back to the main thread when they pass all their tests, and merge your colleagues' work with your own. From there, git help and your favorite Internet search engine will teach you many more tricks and ways to do these things more smoothly.


9 December 14.

A table of narratives and distributions

link PDF version

Remember your first probability class? At some point, you covered how, if you take sets of independent and identically distributed (iid) draws from a pool of items, it can be proven that the means of those sets will be approximately Normally distributed, with the approximation improving as the sets grow larger.
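
For reference, the claim being recalled is the Central Limit Theorem; in the usual notation, if $X_1, \dots, X_n$ are iid draws with mean $\mu$ and finite variance $\sigma^2$, then the sample mean $\bar X_n$ satisfies

$$\sqrt{n}\,(\bar X_n - \mu) \;\longrightarrow\; \mathcal{N}(0, \sigma^2) \quad \text{in distribution, as } n \to \infty.$$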

But if you're like the great majority of the college-educated population, you never took a probability class. You took a statistics class, where the first few weeks covered probability, and the next few weeks covered other, much more structured models. For example, the derivation of the linear regression formula is based on assumptions regarding an affine relation between variables and the minimization of an objective that is sensible, but no more sensible than many other objectives one could have chosen.

Another way to put it is that the Normal Distribution assumptions are bottom-up, with statements about each draw that lead to an overall shape, while the linear regression is top-down, assuming an overall shape and deriving item-level information like the individual error terms from the global shape.

There are lots of bottom-up models with sound microfoundations, each a story of the form if each observation experienced this process, then it can be proven that you will observe a distribution like this: Polya urns, Poisson processes, orthogonal combinations of the above. In fact, I'm making a list.

Maybe you read the many posts on this blog [this post, et seq] about writing and using functions to transform well-defined models into other well-defined models. A chain of such transformations can lead to an increasingly nuanced description of a certain situation. But you have to start the chain somewhere, and so I started compiling this list.

I've been kicking around the idea of teaching a probability-focused stats class (a colleague who runs a department floated the idea), and the list of narrative/distribution pairs linked above would be the core of the first week or two. You may have some ideas of where you'd take it from here; me, I'd probably have the students code up some examples to confirm that convergence to the named distribution occurs, which leads to discussion of fitting data to closed-form distributions and testing claims about the parameters; and then start building more complex models from these basic models, which would lead to more theoretical issues like decomposing joint distributions into conditional parts, and estimation issues like Markov Chains. Every model along the way would have a plausible micro-story underlying it.

This post is mostly just to let you know that the list of narrative/distribution pairs mentioned above exists for your reference. But it's also a call for your contributions, if your favorite field of modeling includes distributions I haven't yet mentioned, or if you otherwise see the utility in expanding the text further.

I've tried to make it easy to add new narrative/distribution pairs. The project is hosted on GitHub, which makes collaboration pretty easy. You don't really have to know a lot about git (it's been on my to-do list to post the git chapter of 21st Century C on here, but I'm lazy). If you have a GitHub account and fork a copy of the repository underlying the narrative/distribution list, you can edit it in your web browser; just look for the pencil icon.

Technical

The formatting looks stellar on paper and in the web browser, if I may say so, including getting the math and the citations right. There isn't a common language that targets both the screen and the printed page for this sort of application, so I invented one, intended to be as simple as possible:
Items(
∙ Write section headers like Section(Title here)
∙ Write emphasized text like em(something important)
∙ Write citation tags like Citep(fay:herriot)
∙ and itemized lists like this.
)

See the tech guide associated with the project for the full overview.

Pandoc didn't work well for me, and Markdown gets really difficult when you have technical documents. When things like ++ and * are syntactically relevant, mentioning C++ will throw everything off, and the formatting of y = a * b * c will be all but a crapshoot.

On the back end, for those who are interested, these formatting functions are m4 macros that expand to LaTeX or HTML as needed. I wrote the first draft of the macros for this very blog, around June 2013, when I got tired of all the workarounds I used to get LaTeX2HTML to behave, and started entrusting my math rendering to MathJax. The makefiles prep the documents and send them to LaTeX and BibTeX for processing, which means that you'll need to clone the repository to a box with make and LaTeX installed to compile the PDF and HTML.

But the internals are no matter. This is a document that could, with further contributions from you, become a very useful reference for somebody working with probability models—and not just students, because, let's all admit it, working practitioners don't remember all of these models. It is implemented using a simple back-end that could be cloned off and used for generating collaborative technical documents of equal (or better!) quality in any subject.