29 August 16.

D3: a travelogue


Here's my largest project in D3: a visual calculator for your 2015 U.S. taxes. It doesn't look much like a scatterplot, but it uses Data-Driven Documents (D3), a tool used extensively for traditional dots-on-a-grid plots. I'm going to discuss the structure of this system in the context of the last entry, which went over the parts of the standard data viz workflow (SDVW). If you've never used D3 and the acronym DOM means nothing to you, reading this may soften the initial blow of trying to use it.


Document Object Model. The document on your screen is a parent object with a well-defined set of child objects, akin to structs in a standard programming language that hold other structs, or XML documents whose elements hold other XML elements. Each object can hold elements of any sort: scalars, arrays, sub-objects, functions. As a simple implementation of the struct-with-sub-structs idea, the DOM as a whole is a work of clean generality.

At the same time, the DOM is a work of massive hyper-specificity, as each type of object—header, canvas, Scalable Vector Graphic, rectangle, text, whatever—has its own set of special properties associated with it, which your browser or other reader will use to render or act upon the object. Your browser's developer tools will show you the full tree and associated properties. Further, there is no fixed list of what those properties are. If you want to add a falafelness property, and then assign a JavaScript function to the object's deepFry property to modify the object's CSS styles based on its falafelness, all that is entirely valid. This right to assign arbitrary magic words is also open to the author of any library, D3 included.
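
For instance, here is a sketch of that falafel scenario in plain JavaScript—the element id, the falafelness property, and the deepFry function are all made up, which is the point:

let el = document.querySelector("#lunch");   // any element on the page

el.falafelness = 0.9;                        // an arbitrary new property; the DOM doesn't mind

el.deepFry = function () {                   // a function stored on the object, using that property
    this.style.backgroundColor = (this.falafelness > 0.5) ? "sienna" : "wheat";
};

el.deepFry();                                // the element's CSS style now depends on its falafelness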

Further, there are the quirks of history, as these neat objects represent web pages, which come with HTML markers like IDs and spans that tie in to Cascading Style Sheets (CSS). Add it all together and any one object has attributes, tags, styles, properties, content, events. Changing an object is almost always a straightforward tweak—once you win the seek-and-find of determining which attribute, tag, style, property, content, or event to modify.

The HTML Histogram

Stepping back from the potential for massive complexity, here is a simple demo of a horizontal-bar histogram. The table has four rows, each with an object of class bar, where the style characteristics of any bar object are listed in the header. Then in the table, the width style is set for each individual bar, like style="width: 80px;".

.bar {height: 10px; border: 2px solid;  color: #2E9AFE;}

<tr><td>Joe</td> <td><div class="bar"    style="width: 80px;"></div></td></tr>
<tr><td>Jane</td> <td><div class="bar"   style="width:300px;"></div></td></tr>
<tr><td>Jerome</td> <td><div class="bar" style="width:300px;"></div></td></tr>
<tr><td>Janet</td> <td><div class="bar"  style="width:126px;"></div></td></tr>


Using any programming language at all, you could write a loop to produce a row of this table for each observation in your data set (plus the requisite HTML header and footer). Using only HTML and CSS, you've generated an OK data visualization.
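
For instance, here is one way that loop might look in JavaScript, working directly on the DOM; it assumes the table above has been given id="histogram", and it reuses the same four observations:

let data = [ {name: "Joe",    width:  80},
             {name: "Jane",   width: 300},
             {name: "Jerome", width: 300},
             {name: "Janet",  width: 126} ];

let table = document.querySelector("#histogram");
for (let d of data) {
    let row = table.insertRow();                    // one <tr> per observation
    row.innerHTML = "<td>" + d.name + "</td>"
                  + "<td><div class='bar' style='width:" + d.width + "px;'></div></td>";
}

Swap the width rule for a height rule and you have the vertical version described below.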

You've probably already started on the SDVW in your head, and are thinking that the spacing is too big, or the shade of blue is boring, or the label fonts are wrong. And, of course, everything you would need to make those changes is in the DOM somewhere. Want the bars to go up instead of rightward? Set the height style on a per-bar basis instead of the width (and rearrange the table...). Want to give the girls pink bars? Neither I nor the HTML standard can stop you from setting color conditional on another column in the data set.

The Grammar of Graphics

The GoG is a book by Leland Wilkinson, subsequently implemented as various pieces of software, the most popular of which seems to be ggplot2 for R. I have no idea whether the HTML histogram or the GoG came first, but their core concept is the same: the objects on the screen—one per observation—have a set of characteristics (herein æsthetics), and we should be able to vary any of them based on the data. For example, maybe box height represents an observed value and color represents a statistical confidence measure.

Gnuplot and earlier plotting programs don't think of plots as objects with æsthetics. The æsthetic built in to the top-level plot command is (X, Y, Z) position, and others can be linked to data if there is a middle-layer function that was written to do so. ggplot2 is much more flexible, and every(?) æsthetic is set via the top-level command, but each geometry still has a fixed list of æsthetics. After all, if you want to do something unusual like change the axes' thickness based on data, somebody had to write that up using R's base-layer graphics capabilities.

Applying the GoG principle, setting object æsthetics using data, to the object properties of the DOM is natural to the point of being obvious, as per the HTML histogram. D3 just streamlines the process. You provide a data set, and it generates the right number of objects—points in a scatterplot, or bars, or nodes in a graph—and applies your æsthetic rules to each. It provides an event loop so that the points can be redrawn on demand: a button can switch how the drawing is done, censor some points and uncensor others, or update on a modified data set. The workflow starts with a set of top-layer commands corresponding to plot types, including ring plots, certain network plots, and the tried-and-true bar charts and scatterplots. To make the tax graph, I used dagre-d3, a top layer to draw directed graphs.
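
To make that concrete, here is a minimal sketch of the data join at the heart of D3 (assuming D3 version 4 or later is loaded), redrawing the HTML histogram as SVG rectangles; everything beyond the D3 calls themselves is a placeholder:

let data = [ {name: "Joe", width: 80}, {name: "Jane", width: 300},
             {name: "Jerome", width: 300}, {name: "Janet", width: 126} ];

let svg = d3.select("body").append("svg")
            .attr("width", 400).attr("height", 100);

svg.selectAll("rect")
   .data(data)                      // bind one datum to each rectangle-to-be
   .enter().append("rect")          // D3 generates the right number of objects
   .attr("y", (d, i) => i * 24)     // æsthetic: position from the row index
   .attr("height", 18)
   .attr("width", d => d.width)     // æsthetic: width from the data
   .attr("fill", d => d.width > 200 ? "#2E9AFE" : "#AAAAAA");   // æsthetic: color conditional on the data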

So we can get a lot done at the top layer in a GoG-type implementation, because a lot of the little tweaks we want to make turn out to fit this concept of applying data to an æsthetic. For those that don't, we have the ability to modify every property, attribute, tag, and style. I'm not sure if any custom-written base layer of any data viz package will ever be able to achieve that sort of generality, and the SDVW via D3 feels accordingly different from the SDVW in a system with a fixed set of middle-layer commands to tweak a plot.

As a tourist, I found that the difficulties I had with D3 were primarily about getting to know the DOM and its many idiosyncrasies. The documentation clearly expects that you already know how to manipulate objects and that the workings of the HTML histogram are more-or-less obvious to you. But this is also D3's strength. I had a mini-rant last time about the documentation of dataviz packages, which is often lacking in the item-by-item specifics one needs to make item-by-item changes to a plot. Meanwhile, web developers do nothing better than write web pages documenting web page elements, so the problem is not finding hidden information but managing the sense of being overwhelmed. You are empowered to make exactly the visualization you want.

28 August 16.

Dataviz packages: a travelogue


I've toyed with base R, ggplot2, Gnuplot, matplotlib, Bokeh, D3, some D3 add-ons I've already forgotten about, some packages I tried for five minutes and walked away from. This is a post by a tourist about the problems these packages face and how they try to solve them, and why being a tourist like this is such hard going. Next episode will be entitled `D3: a travelogue', because I think D3 is worth extra attention. Keeping to my character, this series will be largely conceptual and have no visuals.

The short version of the plotting problem: so many elements, each of which needs to be tweaked. On the basic scatterplot, you've got axis labels, major and minor tic marks, scale (linear? log?), a grid (if any), the points themselves and their shape, size, positions, color, transparency, a key or legend and its content. And a title.

Meanwhile, everybody wants to be able to type plot(height, weight) and get a plot. A system that doesn't have that capability of one-command plot generation would have such a high hurdle to adoption that it would need regular exposure on the cover of the New York Times to get any traction.

The three-layer model

I think of all of these dataviz packages as three-layer systems:

Bottom layer In the end, we have to get dots on the canvas, be it a Mac/X/Windows/browser window, a png, a gif, an SVG, a LaTeX diagram, or ideally all of the above. As with any effort toward device-independence, every package author works out a set of primitives to draw lines, paint dots, write text sideways. Sometimes they let you see them.

Middle layer Step past abstract drawing and manipulate elements we commonly accept as part of a plot: axes, points, histogram bars.

Top layer Build the entire picture with one command: plot(height, weight).


Let the standard data viz workflow (SDVW) consist of starting with the one-line command from the top layer to generate a plot, then applying a series of middle-layer commands to adjust the defaults to something more appropriate to the situation.

I really do think this is the standard. It is generally how I see people talk about a plot (`It's basically OK, but this one thing looks off'). It is easy to understand and implement: get the basic plot up, then tick through every element that needs modification, one by one. What could go wrong?

The most common other workflow may be the exploratory workflow: add a top layer command to a script or call one from the command line, try to learn something from the plot displayed, then throw it away. When working to a tight deadline for people who think Excel charts look good, the workflow may be to call the top layer command and let all of the defaults stand.

Ease of (initial) use

Ease of initial use means the top-layer command quickly produces something attractive out of the box; it is what the exploratory workflow and the under-a-deadline workflow demand. It is also often what is demanded by people new to the package, who want to see fast positive feedback.

Ease of use is about the middle layer. No matter how stellar the defaults, you are plotting unique data, so it is a probability-zero event that the defaults are exactly correct. For the same reason, the SDVW involves slightly different changes every time.

There's a neat mapping between the SDVW and ease of (initial) use: is the focus on making the top layer or the middle layer fast and easy? In an ideal world, they would both be, but our time on Earth is finite, and package authors have to lean toward one or the other.

To give an example of a system that prefers ease of use over ease of initial use and thus maps nicely to the SDVW, we have that unfashionable workhorse Gnuplot. At the top layer, start with a plot or splot command to generate the plot. At the middle layer, every component either has a set command which you can find in the awkward interactive help system, or can't be modified so stop trying. If you want to layer components, like a regression line through a scatterplot or a Joy Division plot, use replot. So the initial plot is one command, everybody thinks the defaults are ugly or broken, but there is a straightforward mapping between changes you may want to make and set commands to implement them.


Although our ideal in most software engineering is to have orthogonal components where each tweak stands independent of every other, a plot is unified. Text labels should be close to the points they are marking, but close is defined by the context of point density, space on the page, text font. As a user, I don't even want orthogonality: if I expand the text font, I want the definition of close to self-update so nothing gets cut off.

The smarter the top layer gets, the more difficult it is to get the middle layer to behave straightforwardly, and the less likely it is that there's a Gnuplot-like set command that just makes one tweak. For example, in some systems, adding a second $y$-axis is basically orthogonal to everything else and so is trivial; in ggplot2 axes are more closely tied to everything so adding a second $y$-axis is a pain.

Documentation by example

Documentation in this genre leans toward examples of the SDVW. Here are some bl.ocks, now cut/paste/modify them to what you want. I found this example-driven form to be amazingly consistent across packages and authors in this space. It's not just the official documentation and blog entries: I have access to Safari Books (self-promotion disclaimer: every time you read a page from 21st Century C there, they pay me a fraction of a cent), where I went hoping to find full expositions going beyond worked SDVW examples, but instead found more detailed and extensive SDVWs. I'd check a question at Stack Overflow, and the answer would be a complete example preceded by `here, try this.' This is in contrast to other times I've gone to Stack Overflow and gotten the usual paragraphs of mansplaining about a single line of code.

Example-driven documentation is a corollary to the lack of orthogonality. How does it make sense to modify the points in a scatterplot when you haven't even produced the plot? But because each SDVW is different, either the example did what you are trying to do and you've won, or it didn't and you are right where you started.

Every package provides examples of the top three or four things you can do to modify the axes, but surprisingly few take the time to provide a boring page listing all the things you can do with the axes and how, or the page is sparse and descriptions point you back to the examples. Sometimes the answer on Stack Overflow would be a one-setting tweak which is never mentioned in the official documentation. I'm harping on documentation because I found it to be the biggest indicator of whether the authors were shooting for ease of initial use or ease of use, and on a cultural note it shows a clear difference between how dataviz package authors understand their users and how general data analysis package authors see their users.

That concludes part one of my travelogue. These systems do amazing things, but this is my confession that the full SDVW is still a slog to me. The standard data viz workflow is, to the best of my understanding, standard, yet being a tourist across packages required adapting to a new idiosyncratic way of walking through the SDVW every time. This may be because of the vagaries of how different base layers were designed, the (possibly hubris-driven) sales pitch that all you need is a top-layer command and you'll never need to change anything else, or the fundamental non-orthogonality of a data visualization.

8 August 16.

Murphy bed projects


People in sitcoms have jobs. They have a routine that allows similar things to happen every week. If people in movies have jobs, the job is an irrelevance mentioned in passing, or they are full-time spy assassin hunters.

I want my work narrative to be about projects rather than a continuous stream of existence with no set ending. People in movies lead more interesting lives.

I've made a real effort to switch over to project-oriented thinking all the time, and it does feel better. I sit down to work, and I see a set of finite things to build, not a never-ending slog. Everything (including the admin stuff) is in its own repository to check out as needed.

Murphy bed projects

To give another metaphor: the Murphy beds you find in tiny studio apartments. Once the bed is folded down from the wall, it covers the space and there is nothing to do but be in bed; once the bed is folded back up into its closet, you don't think about the bed at all. I've had the joy of sleeping in a few, and I think they're great.

It clearly dovetails with the project-centric life. Work on the project until you hit your stopping point, then put up the Murphy bed, pack up the tool box, fold up the tent, or whatever other physical metaphor you want, and move on.

My home directory, conceptual view.

If you leave the cat on the Murphy bed when you fold it up, you will know. The process of being ready to fold everything away forces the discipline of stopping to ask what needs cleaning away, what threads of thought are still open, and what is to be done about it all.

If I'm going to check out the project fresh every morning, I need a makefile describing every setup detail—that's a good thing. It should have a make clean target to throw out the things I know are temp files or that can be regenerated—that's useful. I'm using revision control, which tracks some files and doesn't track others, so I have to decide early whether a given file is important enough to track, and what to do about it if not—that's so much better than my old routine of getting up to my ears in semi-important files and then feeling overwhelmed and shoving them into a temp directory I never look at again.

There's a definite trend toward being able to fold up even the entire virtual computer, which is stored in The Cloud or on your repository of virtual images. Personally, I'm not there yet, because even on an æsthetic level this isn't doing much for me beyond the usual PC-as-server setup that I have. I mean, you're always going to be typing into something. Without The Cloud, I'm going to have the usual home directory and an archive of projects from which I can pull.

My home

What's in my home directory? An archive directory, a temp directory, and that's it.

My home directory, actual view

The archive directory holds all those project repositories, a library of PDFs and data sets, items from my history.

I'm trying to get rid of the temp directory but can't let go yet. I thought it was weird when Lisa Rost said she used her desktop as her temp directory, but now I see that she makes a lot of sense. At the end of the day, if there are stray temp files in my home, they are blatantly present and I want to destroy them.

As a half-digression, I've taken to keeping one (1) text file with all my little side-notes on all of my projects. The notes use Org mode, which is a common standard for writing outlines. My text editor (vim, with the orgmode plugin) folds the inactive outline segments out of view, so I get the same Murphy bed effect of starting with a near-blank slate, unfolding the current project's notes, and leaving everything else hidden away. This text file of side-notes has massively cut down on my count of stupid two-line text files, and also serves as my index of all in-progress projects.

So, setup is easy: when I want to work on a project, I open a new terminal (actually, I make a new work session in tmux), git clone a copy from the bare repository in the archive directory, unfold the segment in my notes, go. When I want to shut down the project for the day, rm -rf the directory, close the tmux session, fold away the notes, go make some tea.
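
In shell terms, one day's cycle might look something like this (the project name, and the assumption that the bare repositories live under ~/archive, are placeholders):

tmux new -s taxes                          # a fresh work session
git clone ~/archive/taxes.git taxes        # check out today's working copy
cd taxes
#  ...work, commit, push back to the bare repository...
git push origin master
cd .. && rm -rf taxes                      # fold the bed back into the wall
tmux kill-session -t taxes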

Perhaps you tensed up as much as I do at the part where I rm -rf the directory. In fact, with version control there are several ways to lose data beyond just deleting a file.

  • Yes, I could delete an untracked file.
  • I could have a stash that gets deleted.
  • I could have diffs that I haven't committed.
  • I could have everything committed locally but not pushed to the archive.

So I wrote a script to check for all of these things. People have told me that some DVCS GUIs do some number of these things. It's turned out to be useful to have a command that I can call quickly, because I can call it all the time to check the state of things (even when I have terminal-only access).
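
This is not the actual script (that's linked below), but a minimal sketch of the four checks using stock git commands would be:

git status --porcelain                           # non-empty output: untracked files or uncommitted diffs
git stash list                                   # non-empty output: a stash that rm -rf would destroy
git diff --stat HEAD                             # uncommitted changes to tracked files
git log --branches --not --remotes --oneline     # commits that exist nowhere but this clone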

I named the script git-isclean, and posted it on Github at that link. If it gives me green check marks, I'm done; else it will (with the -a flag) automatically help me with the next step in cleanup. It depends on the interactive status script I wrote before, because it makes perfect sense to use it; if you don't want to use that script, change the use of git istatus to git status.

Sample usage.

Given my goal of keeping my home directory empty save for the one project I am working on right now, the need for this script is obvious. But it is useful even in less sparse workflows, because it provides a little to-do list of loose ends, worth having any time.

I hope something there was useful to you or gave you some ideas. Last time I wrote a navel-gazing post about my workflow was in 2013. Maybe I'll do another one in 2019.

1 August 16.

Version control as narrative device


I'm a convert. I went from not getting distributed version control systems (DVCSes) to writing a book chapter on one and forcing it on my friends and coworkers. I originally thought it was about mechanics like easy merging and copying revisions, but have come to realize that its benefits are literary, not mechanical. This matters in how DVCSes are taught and understood.

The competition

One way to see this is to consider the differences among the many distributed version control systems we aren't using.

Off the top of my head, there's Git (by the guy who wrote Linux, for maintaining Linux), Mercurial (adopted as the Python project's DVCS), Fossil (by the guy who wrote SQLite, for SQLite development), and Bazaar (begun at Canonical, later adopted as a GNU project). So, first, in terms of who `invented' DVCS, it seems like everybody saw the same problems with existing version control systems at about the same time, and developed responses at about the same time. All of these are big-ticket projects with a large network and geek mindshare.

But at this point, Git won the popularity contest. Why?

It's not the syntax, which is famously unclear and ad hoc. Want to set up a new branch based on your current work? You might start typing git branch, but you're better off using git checkout -b. There's a command to regraft a branch onto a different base on the parent branch, git rebase, which is of course what you would use if you want to squash two commits into one.
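
For the record, the incantations look like this (branch names are placeholders):

git checkout -b newidea     # create a branch and switch to it in one step
git rebase master           # regraft the current branch onto the tip of master
git rebase -i HEAD~2        # to squash: mark the second commit as "squash" in the editor that pops up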

Did you stage a file to be committed but want to unstage it? You mean

git reset HEAD -- fileA

The git manual advises that you set up an alias for this:

git config --global alias.unstage 'reset HEAD --'

#and now just use

git unstage fileA

I find it telling that the maintainers decided to put these instructions in the manual, instead of just adding a frickin' unstage command.

Yes, there are front-ends that hide most of this, the most popular being a web site and service named github.com, but it's a clear barrier that anything one step beyond what the web interfaces do will be awkward.

Early on, git also had some unique technical annoyances—maybe you remember having to call git gc manually. It has a concept, the index, that other systems don't even bother with.

All that said, it won.

The index and commit squashing

To go into further detail on the index that most DVCSes don't bother with, it is a staging area describing what your next commit will look like. Other systems simplify by just taking all changes in your working directory and bundling them into a snapshot. This is how I work about 98% of the time.

The other 2%, I have a few different things going on, say adding some text on two different parts of a document, and for whatever reason, I want to split them into two commits. The index lets me add one section, commit, then add the next and produce a second commit.
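
In git terms, that 2% case is a few commands (the file name and commit messages are placeholders):

git add -p draft.tex                        # interactively stage only the hunks for the first section
git commit -m "Expand the editing section"
git add draft.tex                           # stage the rest
git commit -m "Add the imputation example"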

By splitting the mess of work into one item with one intent and a second with a new intent, I've imposed a narrative structure on my writing process. Each commit has some small intent attached, and I even have the ability to rearrange those intents to form a more coherent story. Did I know what I was doing and where I was going before I got there? Who knows, but it sure looks like I did in the commit log.

There's some debate about whether these rewritings are, to use a short and simple word, lying. The author of Fossil says (§3.7) that not being able to rewrite the history of the project development is a feature, because it facilitates auditing. You can find others who recommend against using git rebase for similar reasons. On the other side, there are people who insist on using git rebase, because it is rude to colleagues to not clean up after yourself. And I stumble across tutorials on how to remove accidentally-committed passwords often.

One side of the debate is generating an episodic narrative of how the project formed; the other is generating a series of news bulletins.

Personally, I am on the side of the episodic narrative writers. If you could install a keylogger that broadcasts your work, typos and all, would you do it? Would you argue that it is a matter of professional ethics that not doing so and only presenting your cleaned-up work is misleading to your coworkers?

Providing a means of squashing together the error and its fix loses the fact that the author made and caught a mistake, but that makes life more pleasant for the author and has no serious cost to others. Most information is not newsworthy.

You might draw the line in my ethical what-if at publication: once your work is out the door, correcting it should be accompanied by a statement. Git semi-enforces this by making it almost impossible to rebase after pushing to another repository. This seems to be more technical happenstance than a design decision.

On a social scale, revision control is again about building a narrative. Before, we were a bunch of kids running around the field kicking a ball around, and now we're a bunch of kids playing kickball. We don't throw tarballs or semi-functional patches at each other; we send discrete commits to discrete locations under specific rules, and use those to understand what colleagues were doing. We can read their commit history to know what they were doing, because they had tools to make the commit history a structured narrative.

I'd love to see somebody take my claim that revision control develops a narrative literally and actually use a DVCS to write literature. We are all familiar with interactive fiction; why would revision control fiction, with its built-in sense of time and branching, be so unusual?

The evangelical disconnect

This is the setup for so many awkward conversations. The evangelist has such zeal because of the structuring of unstructured work facilitated by the complicated features like the index and the rebase command; the newbie just wants to save snapshots. The evangelist may not be able to articulate his or her side; or the evangelist does talk about building a narrative, but a lot of people just do not care and see it as useless bureaucracy, or don't have a concept of what it means until they are on the other side and have had a chance to try it themselves. I'd be interested to see tutorials that focus on the process of narrative writing, instead of rehashing the basics that are already covered so well.

In other evangelist news, I've saved you hours reading the million Git vs Other DVCS blog entries out there. The other one is easier to use, because it has fewer narrative control features.

If you buy my claim that Git's primary benefit over near-equivalent competitors is tools to rewrite the narrative, then Git's winning over alternatives is a strong collective revealed preference. Some of these same people who reject some programming languages because the semicolons look cluttery enthusiastically embrace a system where git symbolic-ref HEAD refs/heads/baseline is a reasonable thing to type. Can you work out what it does and why I needed it? That's a strong indication of just how high-value narrative control is among geekdom.

Of course, not everybody in a crowd is of like mind. DVCSes are socially interesting because they are something that working adults have to learn, with a real payoff after the learner gets over the hump. Like double-entry bookkeeping or playing the piano, the people who have taken the time to learn have a new means of understanding the world and generating narratives that people pre-hump may not see. It's interesting to see who has the desire to pursue that goal to its conclusion and who gives up after seeing the syntactic mess.

6 December 15.

Apophenia v1.0


Apophenia version 1.0 is out!

What I deem to be the official package announcement is this 3,000 word post at medium.com, which focuses much more on the why than the what or how. If you follow me on twitter then you've already seen it; otherwise, I encourage you to click through.

This post is a little more on the what. You're reading this because you're involved in statistical or scientific computing, and want to know if Apophenia is worth working with. This post is primarily a series of bullet points that basically cover background, present, and future.

A summary

Apophenia is a library of functions for data processing and modeling.

The PDF manual is over 230 pages, featuring dozens of base models and about 250 data and model manipulation functions. So if you're thinking of doing any sort of data analysis in C, there is probably already something there for you to not reinvent. You can start at the manual's Gentle Introduction page and see if anything seems useful to you.

For data processing, it is based on an apop_data structure, which is a lot like an R data frame or a Python Pandas data frame, except it brings the operations you expect to be able to do with a data set to plain C, so you have predictable syntax and minimal overhead.

For modeling, it is based on an apop_model structure, which is different from anything I've seen in any other stats package. In Stats 101, the term statistical model is synonymous with Ordinary Least Squares and its variants, but the statistical world is much wider than that, and is getting wider every year. Apophenia starts with a broad model object, of which ordinary/weighted least squares is a single instance (apop_ols).

By assigning a format for a single model structure:

  • We can give identical treatment to models across paradigms, like microsimulations, or probabilistic decision-tree models, or regressions.
  • We can have uniform functions like apop_estimate and apop_model_entropy that accommodate known models using known techniques and models not from the textbooks using computationally-intensive generic routines. Then you don't have to rewrite your code when you want to generalize from the Normal distributions you started with for convenience to something more nuanced.
  • We can write down transformations of the form f:(model, model) $\to$ model.
    • Want a mixture of an empirical distribution built from observed data (a probability mass function, PMF) and a Normal distribution estimated using that data?
      apop_model *mix = apop_model_mixture(
                                    apop_estimate(your_data, apop_pmf),
                                    apop_estimate(your_data, apop_normal));
    • You want to fix a Normal$(\mu, \sigma)$ at $\sigma=1$? It's a little verbose, because we first have to set the parameters with $\mu=$NaN and $\sigma=1$, then send that to the parameter-fixing function:
      apop_model *N1 = apop_model_fix_parameters(
                                    apop_model_set_parameters(apop_normal, NAN, 1));
    • You want to use your mixture of a PMF+Normal as a prior to the $\mu$ in your one-parameter Normal distribution? OK, sure:
      apop_model *posterior = apop_update(more_data,
                                          .prior=mix, .likelihood=N1);
    • You want to modify your agent-based model via a Jacobian [apop_coordinate_transform], then truncate it to data above zero [apop_model_truncate]? Why not—once your model is in the right form, those transformations know what to do.
  • In short, we can treat models and their transformations as an algebraic system; see a paper I once wrote for details.

What v1 means

  • It means that this is reasonably reliable.
    • Can the United States Census Bureau rely on it for certain aspects of production on its largest survey (the ACS)? Yes, it can (and does).
    • Does it have a test bed that checks for correct data-shunting and good numeric results in all sorts of situations? Yes: I could talk all day about how much the 186 scripts in the test base do.
    • Is it documented? Yes: the narrative online documentation is novella length, plus documentation for every function and model, plus the book from Princeton University Press described on the other tabs on this web site, plus the above-linked Census working paper. There's a lot to cover, but an effort has been made to cover it.
    • Are there still bugs? Absolutely, but by calling this v1, I contend that they're relatively isolated.
    • Is it idiot-proof? Nope. For example, finding the optimum in a 20-dimensional space is still a fundamentally hard problem, and the software won't stop you from doing one optimization run with default parameters and reporting the output as gospel truth. I know somebody somewhere will write me an angry letter about how software that does not produce 100% verifiably correct results is garbage; I will invite that future correspondent to stick with the apop_normal and apop_ols models, which work just fine (and the OLS estimator even checks for singular matrices). Meanwhile, it is easy to write models that don't even have proven properties such as consistency (can we prove that as draw count $\to\infty$, parameter estimate variance $\to 0$?). I am hoping that Apophenia will help a smart model author determine whether the model is or is not consistent, rather than just printing error: problem too hard and exiting.

  • It means that it does enough to be useful. A stats library will never be feature-complete, but as per the series of blog posts starting in June 2013 and, well, the majority of what I've done for the last decade, it provides real avenues for exploration and an efficient path for many of the things a professional economist/statistician faces.

  • It means I'm no longer making compatibility-breaking changes. A lot of new facilities, including the named/optional arguments setup, vtables for special handling of certain models, a decent error-handling macro, better subsetting macros, and the apop_map facilities (see previously) meant that features implemented earlier merited reconsideration, but we're through all that now.

  • It's a part of Debian! See the setup page for instructions on how to get it from the Debian Stretch repository. It got there via a ton of testing (and a lot of help from Jerome Benoit on the Debian Scientific team), so we know it runs on a lot more than just my own personal box.

From here, the code base is in a good position to evolve:

  • The core is designed to facilitate incremental improvements: we can add a new model, or a new optimization method, or another means of estimating the variance of an existing model, or make the K-L divergence function smarter, or add a new option to an existing function, and we've made that one corner of the system better without requiring other changes or work by the user. The intent is that from here on out, every time the user downloads a new version of Apophenia, the interface stays the same but that the results get better and are delivered faster, and new models and options appear.

  • That means there are a lot of avenues for you and/or your students to contribute.

  • Did I mention that you'll find bugs? Report them and we'll still fix them.

  • It's safe to write wrappers around the core. I wrote an entire textbook to combat the perception that C is a scary monster, but if the user doesn't come to the 19,000-line mountain of code that is Apophenia, we've got to bring the mountain to the user.

By the way, Apophenia is free, both as in beer and as in speech. I forget to mention this because it is so obvious to me that software—especially in a research context—should be free, but there are people for whom this isn't so obvious, so there you have it.

A request

I haven't done much to promote Apophenia. A friend who got an MFA from an art school says that she had a teacher who pushed that you should spend 50% of your time producing art, and 50% of your time selling your art.

I know I'm behind on the promotion, so, please: blog it, tweet it, post it on Instagram, make a wikipage for it, invite me to give talks at your department. People will always reinvent already-extant code, but they should at least know that they're doing so.

And my final request: try it! Apophenia doesn't look like every other stats package, and may require seeing modeling from a different perspective, but that just may prove to be a good thing.

7 June 15.

Editing survey data (or, how to deal with pregnant men)


This post is partly a package announcement for Tea, a package for editing and imputation. But it mostly discusses the problem of editing, because I've found that a lot of people put no thought into this important step in data analysis.

If your data is a survey filled in by humans, or involves using sensors that humans operated, then your data set will have bad data.

But B, you object, I'm a downstream user of a data set provided by somebody else, and it's a clean set—there are no pregnant men—so why worry about editing? Because if your data is that clean, the provider already edited the data, and may or may not have told you what choices were made. Does nobody have gay parents because nobody surveyed has gay parents, or because somebody saw those responses and decided they must be an error? Part of the intent of Tea is to make it easy to share the specification of how the data was edited and missing data imputed, so end users can decide for themselves whether the edits make sense.

If you want to jump to working with Tea itself, start with the Tutorial/manual.

For editing bad values in accounting-type situations, you may be able to calculate which number in a list of numbers is incorrect—look up Fellegi-Holt imputation for details. But the situations where something can be derived like this are infrequent outside of accounts-based surveys.

So people wing it.

Some people will listwise delete the record with some failure in it, including a record that is missing entirely. This loses information, and can create a million kinds of biases, because you've down-weighted the deleted observation to zero and thus up-weighted all other observations. An analyst who deletes the observations with N/As for question 1 only when analyzing question 1, then restores the data set and deletes the N/As for question 2 when analyzing question 2, ..., is creating a new weighting scheme for every question.

I don't want to push you to use Tea, but I will say this: listwise deletion, such as using the ubiquitous na.rm option in R functions, is almost always the wrong thing to do.

Major Census Bureau surveys (and the decennial census itself) tend to lean on a preference ordering for fields and deterministic fixes. Typically, age gets fixed first, comparing it to a host of other fields, and in a conflict, age gets modified. Then age is deemed edited, and if a later edit involving age comes up (it's complicated...), age stays fixed and the other fields are modified. There is usually a deterministic rule that dictates what those modifications should be for each step. This is generally referred to as sequential editing.

Another alternative is to gather evidence from all the edits and see if some field is especially guilty. Maybe the pregnant man in our public health survey also reports using a diaphragm for contraception and getting infrequent pap smears. So of the pair (man, pregnant), it looks like (man) is failing several edits.

If we don't have a deterministic rule to change the declared-bad field, then all we can do is blank out the field and impute, using some model. Tea is set up to be maximally flexible about the model chosen. Lognormal (such as for incomes), Expectation-Maximization algorithm, simple hot deck draw from the nonmissing data, regression model—just set method: lognormal or EM or hot deck or ols in the spec file, and away you go.

Tea is set up to do either style of editing. Making this work was surprisingly complicated. Let me give you a worst-case example to show why.


Assume these edits: fail on
(age $<$ 15 and status=married),
(age $<$ 15 and school=PhD),
and (age $>$ 65 and childAge $<$ 10 $\Rightarrow$ change childAge to age$-$25).
The third one has a deterministic if-then change attached; the first two are just edits.

I think these are reasonable edits to make. It is entirely possible for a 14 year old to get a PhD, just as there has existed a pregnant man. Nonetheless, the odds are much greater when your survey data gives you a 12-year old PhD or a pregnant man that somebody committed an error, intentional or not.

Then, say that we have a record with (age=14, married, PhD, childAge=5), which goes through these steps:

  • Run the record through the edits. Age fails two edits, so we will blank it out.
  • Send age to the imputation system, which draws a new value for the now-blank field. Say that it drew 67.
  • Run the record through the edits, and the deterministic edit hits: we have to change the child's age to 42.

First, this demonstrates why writing a coherent editing and imputation package is not a pleasant walk in the park, as edits trigger imputations which trigger edits, which may trigger more imputations.

Second, it advocates against deterministic edits, which can be a leading cause of these domino-effect outcomes that seem perverse when taken as a whole.

Third, it may advocate against automatic editing entirely. It's often the case that if we can see the record, we could guess what things should have been. It's very plausible that age should have been 41 instead of 14. But this is guesswork, and impossible to automate. From Census experience, I can say that what you get when you add up these little bits of manual wisdom for a few decades is not necessarily the best editing system.

But not all is lost. We've generated consistent micro-data, so no cross-tabulations will show anomalies. If we've selected a plausible model to do the imputations, it is plausible that the aggregate data will have reasonable properties. In many cases, the microdata is too private to release anyway. So I've given you a worst-case example of editing gone too far, but even this has advantages over leaving the data as-is and telling the user to deal with it.

Multiply impute

Imputation is the closely-allied problem of filling in values for missing data. This is a favored way to deal with missing data, if only because it solves the weighting problem. If 30% of Asian men in your sample didn't respond, but 10% of the Black women didn't respond, and your survey oversampled women by 40%, and you listwise delete the nonresponding observations, what reweighting should you do to a response by an Asian woman? The answer: side-step all of it by imputing values for the missing responses to produce a synthetic complete data set and not modifying the weights at all.

Yes, you've used an explicit model to impute the missing data—as opposed to the implicit model generated by removing the missing data. To make the model explicit is to face the reality that if you make any statements about a survey in which there is any missing data, then you are making statements about unobserved information, which can only be done via some model of that information. The best we can do is be explicit about the model we used—a far better choice than omitting the nonresponses and lying to the reader that the results are an accurate and objective measure of unobserved information.

We can also do the imputation multiple times and augment the within-imputation variance with the across-imputation variance that we'd get from different model-based imputations, to produce a more correct total estimate of our confidence in any given statistic. That is, using an explicit model also lets us state how much our confidence changes because of imputation.
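
Spelled out, that combining rule is one line of math: with $m$ imputations, a within-imputation variance $W_i$ and point estimate $\bar\theta_i$ from each, the total variance is $T = \bar{W} + (1 + 1/m)B$, where $\bar{W}$ is the mean of the $W_i$ and $B$ is the variance of the $\bar\theta_i$ across imputations. This is the standard formulation usually attributed to Rubin.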

There are packages to do multiple imputation without any editing, though the actual math is easy. Over in the other window, here's the R function I use to calculate total variance, given that we've already run several imputations and stored each (record,field,draw) combination in a fill-in table. The checkOutImpute function completes a data set using the fill-ins from a single imputation, generated by the imputation routine earlier in the code. There's, like, one line of actual math in there.


get_total_variance <- function(con, tab, col, filltab, draw_ct, statvar){
    v <- 0  #declare v and m
    m <- 0
    try (dbGetQuery(con, "drop table if exists tt"))
    for (i in 1:draw_ct){
        # complete the data set using the fill-ins from imputation i (checkOutImpute is from Tea)
        checkOutImpute(dest="tt", origin=tab, filltab=filltab, imputation_number=i-1)
        column <- dbGetQuery(con, paste("select ", col, " from tt"))
        vec <- as.numeric(column[,1]) #type conversions.
        v[i] <- statvar(vec)          # within-imputation variance for this draw
        m[i] <- mean(vec)             # point estimate for this draw
    }
    # Rubin's rule: total variance = mean within-imputation variance
    #                                + (1 + 1/m) * across-imputation variance
    total_var <- mean(v) + (1 + 1/draw_ct)*var(m)
    return(c(mean(m), sqrt(total_var)))
}

# Here's the kind of thing we'd use as an input statistic: the variance of a Binomial
binom_var <- function(vec){
    p = mean(vec)