23 April 17.

Platforms for reliable research


This stems from a Twitter thread about how you can't do replicable science if some package won't be installable five years from now, which is not an unusual case for Python packages. Click on the lower date stamp for more (including another view of the communities I describe here).

Meanwhile, it was observed, C code from the 1980s routinely runs just fine. For example, I work with some large-scale examples that I could expound on at length. Or, when you type grep or ls, you are calling up a C program whose core was likely written in the 1980s or early 1990s. Unless it depends on specific hardware features, I think it's hard to find correctly written C code from the 1990s that doesn't compile and run today.

I've been using R since 2000, and a lot of my code from even five years ago won't run. We had a project that repeatedly hit problems where package A needed version 2.3.1 of the Matrix package but not version 2.3.2 or later, while package B relied on version 2.4 of the same package. In the end, it killed our project: every time we had to re-tweak the install, our potential users perceived the project as a little less reliable.

So I honestly believe it's true that C code is more reliable and more likely to run a decade from now, and there's a wealth of evidence to demonstrate this. So why is that?

I think there are two reasons: the cultural differences that come with a language that has a standard, and the lack of a standard package manager, which has been an inadvertent blessing here. These two features have led to a social system that allows us to build technical systems that don't depend on people.

There's a standard

An interpreted language doesn't need a standard, because if the interpreter gives the right answer, then you've conformed 100%. In the long term, this is terrible. Raymond Chen, a longtime Windows programmer at Microsoft, has a blog that is filled with convoluted stories like this one, whose template is: the authors of a Windows program find an undocumented or underdocumented part of the workings of Windows; years later, the Windows OS people change that part while fully conforming to documented behavior; they then get a million complaints as the undocumented trick breaks. That link is the first example I grabbed, but keep skimming; he's been blogging these stories for a decade or so.

This state is hard to picture in the world of plain C, because having a standard affects culture. I can't count how many people have written to tell me about some piece of code I put out in the public sphere that might compile on every modern compiler, but does not—or even may not—conform to the ISO standard. Not writing to the contract is a bug, whether it works for me or not.
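
To make that concrete, here is a made-up example of my own (not one of the actual bugs readers flagged): it compiles and runs fine on the usual Linux/Mac toolchains, but asprintf is a GNU/BSD extension, not part of the ISO C contract, so a strictly conforming compiler is free to reject it.

#define _GNU_SOURCE   //ask glibc for its extensions
#include <stdio.h>
#include <stdlib.h>

int main(void){
    char *s;
    asprintf(&s, "pi is about %g", 3.14159); //works here, but not in the ISO C contract
    puts(s);
    free(s);
    return 0;
}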

Meanwhile, in the world of R or Python, the documentation could be seen as the contract. But do you really think every single R function documents its behavior when it receives a NaN or -Infinity argument? And with no culture of strict contractual obligation, there is little that prevents the authors from just rewriting the contract. The same holds with even greater force for third-party packages, which are sometimes developed by people who have no experience writing contracts-in-documentation in a way that keeps them reasonably future-proof.

I liked this Twitter exchange. Each tweet is another variant of the statement that the R community has no coherent contract or contract conformance culture. Again, click the date stamp for the two replies.

So, contracts exist in C-world and elsewhere, but from everything I have seen, the cultural norms to take those contracts seriously are far stronger among C authors.

Towers

Azer Koçulu has a Javascript package named Left Pad which provides a function to pad a string or number with white space or zeros. That's the whole package: one function, to do something useful that I wouldn't want to get side-tracked into rewriting and testing, but which is not far from a Javascript 101 exercise.

It was a heavily-used micro-package, and not just by fans of left padding: a data analysis package might use a table-making package, which in turn depends on Left Pad. When Mr Koçulu unpublished all of his Node Package Manager submissions, everything downstream broke.

In practice, R packages tend to have several such microdependencies, while the authors of C packages tend to cut something like a left-pad function from a library's code base and paste it into the code base of the project at hand. The authors of the GSL made an effort to write its functions so they could be cut and pasted. SQLite offers a version in a single file, whose purpose is to allow authors to copy the file into their code base rather than call an external SQLite library.
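
For a sense of scale, here is a minimal sketch (mine, not from any particular library) of the sort of left-pad function a C author would paste in rather than depend on:

#include <stdlib.h>
#include <string.h>

//Pad the input string on the left with padchar out to len characters.
//The caller owns (and must free) the returned string.
char *left_pad(char const *in, size_t len, char padchar){
    size_t inlen = strlen(in);
    size_t padlen = (inlen >= len) ? 0 : len - inlen;
    char *out = malloc(padlen + inlen + 1);
    if (!out) return NULL;
    memset(out, padchar, padlen);
    strcpy(out + padlen, in);
    return out;
}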

I think it is the presence or absence of a standard package manager that led to this difference in culture and behavior. If you have the power to assume that users can download any function, so seamlessly that it may even happen without their knowledge, why wouldn't you use it all the time?

The logical conclusion of having a single, unified package manager is a tower of left-pad-like dependencies, where A depends on B and C, C depends on D and E, and E depends on B as well.

The tower is a more fragile structure than just having four independent package dependencies. What happens when the author of B changes the interface a little, because contracts can change? The author of package A may be able to update to B version 2, but E depends on B version 1. Can you find the author of E, and convince him or her to update to B version 2, and to get it re-posted on CRAN so you can use it in a timely manner?

To get the package onto CRAN, the author of E will have to pass the tests he or she wrote on every machine R runs on. Douglas Bates threw up his hands over the problems this engenders:

I am the maintainer of the RcppEigen package which apparently also makes me the maintainer of an Eigen port to Solaris. When compilers on Solaris report errors in code from Eigen I am supposed to fix them. This is difficult in that I don't have access to any computers running Solaris .... So I have reached the point of saying "goodbye" to R, Rcpp and lme4 ...

It's good that CRAN wants to provide a guarantee that a package will compile on a given set of platforms. But note, first, that it has to do this because there is no culture of contracts, so the only way to know whether an R/C++ package will compile on Solaris is to try it. Second, the guarantee is somewhat limited. Did the package get the right answers on all systems? If it had numeric tests, then those tests passed, but the CRAN testing server knows nothing of how to test eigenvector calculations. I've bristled at more than enough people who have told me they trust the numbers a package spits out because it's on CRAN, so it must have been tested or peer-reviewed.

In the short run, this is great. We've solved a lot of dependency-mismatch problems.

In the long run, we haven't solved anything. You come back in three years to re-run your analysis, and you find that package B was removed, because the author was unable to reply to requests to fix new issues, because she died. Or the latest version of package B is impossibly different, so you try the old one, and you find that none of the guarantees hold: to the best of my knowledge, CRAN doesn't back-test old versions of a package (what would they do if one broke?), so whether your old version of package B works with the 2020 edition of R and its ecosystem is as much a crap shoot as it ever was.

That was a lot of kvetching about package manager issues, but with the intent of comparing to C-world.

With no official package system, your C library has no means of automatically downloading dependencies, so forget about building a tower where you instruct users to chase down a sequence of elements. Users will tolerate about two steps, and most won't even go that far. For code reuse, this has not been good, and we see people re-inventing well-established parts over and over again. But for reliability ten years down the road, we get a lemonade-from-lemons benefit: building unreliable towers of dependencies is impracticable.

GUI-based programs sometimes depend on a tower of dependencies, and a package manager like Red Hat's RPM or Debian's APT to install them. Such programs also typically have command-line versions just in case.

With no package manager, you have to give users the entire source code and let them rebuild. The build tools are standard as well. The norm is the makefile, which is defined in the POSIX standard. If you're using Autoconf, it generates a POSIX-compliant shell script that in turn produces such a standard makefile.
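
So the installation routine for a C library looks the same today as it did decades ago; something like this, where the package name is hypothetical:

tar xzf somelib-1.0.tar.gz   # unpack the source you were handed
cd somelib-1.0
./configure                  # the POSIX shell script Autoconf generated
make                         # POSIX make reads the generated makefile
make install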

There is nobody to check whether your library/package will build on other people's machines, which means that you have to stick to the contract. On the one hand, people screw this up, especially when they have only one compiler on hand to test against. On the other, if they get it right, then you are guaranteed that today's version of the code will compile in a decade.

I'm no Luddite: it's great that package managers exist, and it's great that I can write one line into the metadata to guarantee that all users on an Internet-enabled computer (e.g., those who don't work on a secured system with an appropriate configuration) can pull the dependencies in a second. It's great that there are systems monitoring basic package sanity.

But for the purposes of long-term research replicability, all of these boons are patches over the lack of a system of contracts or a culture that cares deeply about those contracts. Think back to how often I've mentioned people and social systems in my discussion of interpreters and package managers. Conversely, once you've zipped up code written to contract and a makefile-generator written to contract, you're irrelevant, your wishes are irrelevant (assuming the right licenses), and there is no platform that can deliberately or due to failure prevent a user from installing your package. People and their emotions and evolving preferences are out of the picture. In the end, that's really the secret to making sure that a platform can be used to generate code that is replicable years and decades from now.


29 August 16.

D3: a travelogue


Here's my largest project in D3: a visual calculator for your U.S. 2015 taxes. It doesn't look much like a scatterplot, but it uses a tool, Data Driven Documents (D3), used extensively for traditional dots-on-a-grid plots. I'm going to discuss the structure of this system, in the context of the last entry, which went over the parts of the standard data viz workflow (SDVW). If you've never used D3 and the acronym DOM doesn't have meaning to you, reading this may soften the initial blow of trying to use it.

The DOM

Document Object Model. The document on your web page screen is a parent object with a well-defined set of child objects, akin to standard programming language structs that hold other structs, or XML documents whose elements hold other XML elements. Each struct holds elements of any sort: scalars, arrays, sub-objects, functions. Being a simple implementation of the struct with sub-structs, the DOM as a whole is a work of clean generality.

At the same time, the DOM is a work of massive hyper-specificity, as each type of object—header, canvas, Scalable Vector Graphic, rectangle, text, whatever—has its own set of special properties associated with it, which your browser or other reader will use to render or act upon the object. Your browser's developer tools will show you the full tree and associated properties. Further, there is no fixed list of what those properties are. If you want to give an object a falafelness property, and then assign a javascript function to the object's deepFry property to modify the object's CSS styles based on its falafelness, all that is entirely valid. This right to assign arbitrary magic words is also open to the author of any library, D3 included.

Further, there are the quirks of history, as these neat objects represent web pages which include different HTML markers like IDs and spans, which tie in to cascading style sheets (CSS). Add it all together and any one object has attributes, tags, styles, properties, content, events. Changing an object is almost always a straightforward tweak—once you win the seek-and-find of determining which attribute, tag, style, property, content, or event to modify.

The HTML Histogram

Stepping back from the potential for massive complexity, here is a simple demo of a horizontal-bar histogram. The table has four rows, each with an object of class bar, where the style characteristics of any bar object are listed in the header. Then in the table, the width style is set for each individual bar, like style="width:200px;".

<html>
<head><style>
.bar {height: 10px; border: 2px solid;  color: #2E9AFE;}
</style></head>

<body>
<table>
<tr><td>Joe</td> <td><div class="bar"    style="width: 80px;"></div></td></tr>
<tr><td>Jane</td> <td><div class="bar"   style="width:300px;"></div></td></tr>
<tr><td>Jerome</td> <td><div class="bar" style="width:300px;"></div></td></tr>
<tr><td>Janet</td> <td><div class="bar"  style="width:126px;"></div></td></tr>
</table>
</body>
</html>


Using any programming language at all, you could write a loop to produce a row of this table for each observation in your data set (plus the requisite HTML header and footer). Using only HTML and CSS, you've generated an OK data visualization.
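
For example, here is such a loop in C, with made-up names and values, printing the rows of the table above:

#include <stdio.h>

int main(void){
    char *names[] = {"Joe", "Jane", "Jerome", "Janet"};
    int widths[]  = {80, 300, 300, 126};
    printf("<table>\n");
    for (int i=0; i < 4; i++)
        printf("<tr><td>%s</td> <td><div class=\"bar\" style=\"width:%ipx;\"></div></td></tr>\n",
                names[i], widths[i]);
    printf("</table>\n");
    return 0;
}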

You've probably already started on the SDVW in your head, and are thinking that the spacing is too big, or the shade of blue is boring, or the label fonts are wrong. And, of course, everything you would need to make those changes is in the DOM somewhere. Want the bars to go up instead of rightward? Set the height style on a per-bar basis instead of the width (and rearrange the table...). Want to give the girls pink bars? Neither I nor the HTML standard can stop you from setting color conditional on another column in the data set.

The Grammar of Graphics

The GoG is a book by Leland Wilkinson, subsequently implemented as various pieces of software, the most popular of which seems to be ggplot2 for R. I have no idea whether the HTML histogram or the GoG came first, but their core concept is the same: the objects on the screen—one per observation—have a set of characteristics (herein æsthetics), and we should be able to vary any of them based on the data. For example, maybe box height represents an observed value and color represents a statistical confidence measure.

Gnuplot and earlier plotting programs don't think of plots as objects with æsthetics. The æsthetic built in to the top-level plot command is (X, Y, Z) position, and others can be linked to data if there is a middle-layer function that was written to do so. ggplot2 is much more flexible, and every(?) æsthetic is set via the top-level command, but each geometry still has a fixed list of æsthetics. After all, if you want to do something unusual like change the axes' thickness based on data, somebody had to write that up using R's base-layer graphics capabilities.

Applying the GoG principle (setting object æsthetics using data) to the object properties of the DOM is natural to the point of being obvious, as per the HTML histogram. D3 just streamlines the process. You provide a data set, and it generates the right number of objects in a scatterplot, or bars, or graph, and applies your æsthetic rules to each. It provides an event loop so that the points can be redrawn on demand, so a button can switch how the drawing is done, censor some points and uncensor others, or update on a modified data set. The workflow starts with a set of top-layer commands corresponding to plot types, including ring plots, certain network plots, and tried-and-true bar charts and scatterplots. To make the tax graph, I used dagre-d3, a top layer to draw directed graphs.

So we can get a lot done at the top layer in a GoG-type implementation, because a lot of the little tweaks we want to make turn out to fit this concept of applying data to an æsthetic. For those that don't, we have the ability to modify every property, attribute, tag, and style. I'm not sure if any custom-written base layer of any data viz package will ever be able to achieve that sort of generality, and the SDVW via D3 feels accordingly different from the SDVW in a system with a fixed set of middle-layer commands to tweak a plot.

As a tourist, the difficulties I had with D3 were primarily about getting to know the DOM and its many idiosyncrasies. The documentation clearly expects that you already know how to manipulate objects and that the workings of the HTML histogram are more-or-less obvious to you. But this is also D3's strength. I had a mini-rant last time about the documentation of dataviz packages, which is often lacking in the item-by-item specifics one needs to make item-by-item changes to a plot. Meanwhile, web developers do nothing better than write web pages documenting web page elements, so the problem is not about finding hidden information but about managing the sense of being overwhelmed. You are empowered to make exactly the visualization you want.


28 August 16.

Dataviz packages: a travelogue


I've toyed with base R, ggplot2, Gnuplot, matplotlib, Bokeh, D3, some D3 add-ons I've already forgotten about, some packages I tried for five minutes and walked away from. This is a post by a tourist about the problems these packages face and how they try to solve them, and why being a tourist like this is such hard going. Next episode will be entitled `D3: a travelogue', because I think D3 is worth extra attention. Keeping to my character, this series will be largely conceptual and have no visuals.

The short version of the plotting problem: so many elements, each of which needs to be tweaked. On the basic scatterplot, you've got axis labels, major and minor tic marks, scale (linear? log?), a grid (if any), the points themselves and their shape, size, positions, color, transparency, a key or legend and its content. And a title.

Meanwhile, everybody wants to be able to type plot(height, weight) and get a plot. A system that doesn't have that capability of one-command plot generation would have such a high hurdle to adoption that it would need regular exposure on the cover of the New York Times to get any traction.

The three-layer model

I think of all of these dataviz packages as three-layer systems:

Bottom layer In the end, we have to get dots on the canvas, be it a Mac/X/Windows/browser window, a png, a gif, an SVG, a LaTeX diagram, or ideally all of the above. As with any effort toward device-independence, every package author works out a set of primitives to draw lines, paint dots, write text sideways. Sometimes they let you see them.

Middle layer Step past abstract drawing and manipulate elements we commonly accept as part of a plot: axes, points, histogram bars.

Top layer Build the entire picture with one command: plot(height, weight).

The SDVW

Let the standard data viz workflow (SDVW) consist of starting with the one-line command from the top layer to generate a plot, then applying a series of middle-layer commands to adjust the defaults to something more appropriate to the situation.

I really do think this is the standard. It is generally how I see people talk about a plot (`It's basically OK, but this one thing looks off'). It is easy to understand and implement: get the basic plot up, then tick through every element that needs modification, one by one. What could go wrong?

The most common other workflow may be the exploratory workflow: add a top layer command to a script or call one from the command line, try to learn something from the plot displayed, then throw it away. When working to a tight deadline for people who think Excel charts look good, the workflow may be to call the top layer command and let all of the defaults stand.

Ease of (initial) use

Ease of initial use is when the top-layer command quickly produces something attractive out of the box, and is what is demanded by the exploratory workflow and the under-a-deadline workflow. It is also often what is demanded by people new to the package, who want to see fast positive feedback.

Ease of use is about the middle layer. No matter how stellar the defaults, you are plotting unique data, so it is a probability-zero event that the defaults are exactly correct. For the same reason, the SDVW involves slightly different changes every time.

There's a neat mapping between the SDVW and ease of (initial) use: is the focus on making the top layer or the middle layer fast and easy? In an ideal world, they would both be, but our time on Earth is finite, and package authors have to lean toward one or the other.

To give an example of a system that prefers ease of use over ease of initial use and thus maps nicely to the SDVW, we have that unfashionable workhorse Gnuplot. At the top layer, start with a plot or splot command to generate the plot. At the middle layer, every component either has a set command, which you can find in the awkward interactive help system, or can't be modified, so stop trying. If you want to layer components, like a regression line through a scatterplot or a Joy Division plot, use replot. So the initial plot is one command, everybody thinks the defaults are ugly or broken, but there is a straightforward mapping between changes you may want to make and set commands to implement them.

Non-orthogonality

Although our ideal in most software engineering is to have orthogonal components where each tweak stands independent of every other, a plot is unified. Text labels should be close to the points they are marking, but close is defined by the context of point density, space on the page, text font. As a user, I don't even want orthogonality: if I expand the text font, I want the definition of close to self-update so nothing gets cut off.

The smarter the top layer gets, the more difficult it is to get the middle layer to behave straightforwardly, and the less likely it is that there's a Gnuplot-like set command that just makes one tweak. For example, in some systems, adding a second $y$-axis is basically orthogonal to everything else and so is trivial; in ggplot2 axes are more closely tied to everything so adding a second $y$-axis is a pain.

Documentation by example

Documentation in this genre leans toward examples of the SDVW. Here are some bl.ocks; now cut/paste/modify them to what you want. I found this example-driven form to be amazingly consistent across packages and authors in this space. It's not just the official documentation and blog entries: I have access to Safari Books (self-promotion disclaimer: every time you read a page from 21st Century C there, they pay me a fraction of a cent), where I went hoping to find full expositions going beyond worked SDVW examples, but I instead found more detailed and extensive SDVWs. I'd check a question at Stack Overflow, and the answer is a complete example preceded by `here, try this.' This is in contrast to other times I've gone to Stack Overflow and gotten the usual paragraphs of mansplaining about a single line of code.

Example-driven documentation is a corollary to the lack of orthogonality. How does it make sense to modify the points in a scatterplot when you haven't even produced the plot? But because each SDVW is different, either the example did what you are trying to do and you've won, or it didn't and you are right where you started.

Every package provides examples of the top three or four things you can do to modify the axes, but surprisingly few take the time to provide a boring page listing all the things you can do with the axes and how, or the page is sparse and descriptions point you back to the examples. Sometimes the answer on Stack Overflow would be a one-setting tweak which is never mentioned in the official documentation. I'm harping on documentation because I found it to be the biggest indicator of whether the authors were shooting for ease of initial use or ease of use, and on a cultural note it shows a clear difference between how dataviz package authors understand their users and how general data analysis package authors see their users.

That concludes part one of my travelogue. These systems do amazing things, but this is my confession that the full SDVW is still a slog to me. The standard data viz workflow is, to the best of my understanding, standard, yet being a tourist across packages required adapting to a new idiosyncratic way of walking through the SDVW every time. This may be because of the vagaries of how different base layers were designed, the (possibly hubris-driven) sales pitch that all you need is a top-layer command and you'll never need to change anything else, or the fundamental non-orthogonality of a data visualization.


8 August 16.

Murphy bed projects


People in sitcoms have jobs. They have a routine that allows similar things to happen every week. If people in movies have jobs, the job is an irrelevance mentioned in passing, or they are full-time spy assassin hunters.

I want my work narrative to be about projects rather than a continuous stream of existence with no set ending. People in movies lead more interesting lives.

I've made a real effort to switch over to project-oriented thinking all the time, and it does feel better. I sit down to work, and I see a set of finite things to build, not a never-ending slog. Everything (including the admin stuff) is in its own repository to check out as needed.

Murphy bed projects

To give another metaphor: the Murphy beds you find in tiny studio apartments. Once the bed is folded down from the wall, it covers the space and there is nothing to do but be in bed; once the bed is folded back up into its closet, you don't think about the bed at all. I've had the joy of sleeping in a few, and I think they're great.

It clearly dovetails with the project-centric life. Work on the project until you hit your stopping point, then put up the Murphy bed, pack up the tool box, fold up the tent, or whatever other physical metaphor you want, and move on.




My home directory, conceptual view.

If you leave the cat on the Murphy bed when you fold it up, you will know. The process of being ready to fold everything away forces the discipline of stopping to ask what needs cleaning away, what threads of thought are still open, and what is to be done about it all.

If I'm going to check out the project fresh every morning, I need a makefile describing every setup detail—that's a good thing. It should have a make clean target to throw out the things I know are temp files or that can be regenerated—that's useful. I'm using revision control, which tracks some files and doesn't track others, so I have to decide early whether a given file is important enough to track, and what to do about it if not—that's so much better than my old routine of getting up to my ears in semi-important files and then feeling overwhelmed and shoving them into a temp directory I never look at again.

There's a definite trend toward being able to fold up even the entire virtual computer, which is stored in The Cloud or on your repository of virtual images. Personally, I'm not there yet, because even on an æsthetic level this isn't doing much for me beyond the usual PC-as-server setup that I have. I mean, you're always going to be typing into something. Without The Cloud, I'm going to have the usual home directory and an archive of projects from which I can pull.

My home

What's in my home directory? An archive directory, a temp directory, and that's it.




My home directory, actual view

The archive directory holds all those project repositories, a library of PDFs and data sets, items from my history.

I'm trying to get rid of the temp directory but can't let go yet. I thought it was weird when Lisa Rost said she used her desktop as her temp directory, but now I see that she makes a lot of sense. At the end of the day, if there are stray temp files in my home, they are blatantly present and I want to destroy them.

As a half-digression, I've taken to keeping one (1) text file with all my little side-notes on all of my projects. The notes use Org mode, which is a common standard for writing outlines. My text editor, vim with the orgmode plugin, folds the inactive outline segments out of view, so I get the same Murphy bed effect of starting with a near-blank slate, unfolding the current project's notes, and leaving everything else hidden away. This text file of side-notes has massively cut down on my count of stupid two-line text files, and also serves as my index of all in-progress projects.

So, setup is easy: when I want to work on a project, I open a new terminal (actually, I make a new work session in tmux), git clone a copy from the bare repository in the archive directory, unfold the segment in my notes, go. When I want to shut down the project for the day, rm -rf the directory, close the tmux session, fold away the notes, go make some tea.
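
In command form, the daily cycle looks something like this (the project name and archive path are hypothetical):

tmux new -s project                  # a fresh work session
git clone ~/archive/project.git      # check out from the bare repo in the archive
cd project
#  ...work, commit, push...
cd ~ && rm -rf project               # fold the bed back into the wall
tmux kill-session -t project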

Perhaps you tensed up as much as I do at the part where I rm -rf the directory. In fact, with version control there are several ways to lose data beyond just deleting a file.

  • Yes, I could delete an untracked file.
  • I could have a stash that gets deleted.
  • I could have diffs that I haven't committed.
  • I could have everything committed locally but not pushed to the archive.

So I wrote a script to check for all of these things. People have told me that some DVCS GUIs do some number of these things. It's turned out to be useful to have a command that I can call quickly, because I can call it all the time to check the state of things (even when I have terminal-only access).

I named the script git-isclean, and posted it on Github at that link. If it gives me green check marks, I'm done; else it will (with the -a flag) automatically help me with the next step in cleanup. It depends on the interactive status script I wrote before, because it makes perfect sense to use it; if you don't want to use that script, change the use of git istatus to git status.
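
The script itself is at the link above; if you just want the gist, the four checks are roughly these stock git commands:

git status --porcelain                        # untracked or modified files?
git stash list                                # stashes waiting around?
git diff --stat HEAD                          # changes not yet committed?
git log --branches --not --remotes --oneline  # commits not yet pushed?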




Sample usage.

Given my goal of keeping my home directory empty save for the one project I am working on right now, the need for this script is obvious. But it is useful even in less sparse workflows, because it provides a little to-do list of loose ends, worth having any time.

I hope something there was useful to you or gave you some ideas. Last time I wrote a navel-gazing post about my workflow was in 2013. Maybe I'll do another one in 2019.


1 August 16.

Version control as narrative device


I'm a convert. I went from not getting distributed version control systems (DVCSes) to writing a book chapter on one and forcing it on my friends and coworkers. I originally thought it was about mechanics like easy merging and copying revisions, but have come to realize that its benefits are literary, not mechanical. This matters in how DVCSes are taught and understood.

The competition

One way to see this is to consider the differences among the many distributed version control systems we aren't using.

Off the top of my head, there's Git (by the guy who wrote Linux, for maintaining Linux), mercurial (adopted as the Python DVCS), fossil (by the guy who wrote SQLite, for SQLite development), bazaar (by GNU to support GNU projects). So, first, in terms of who `invented' DVCS, it seems like everybody saw the same problems with existing version control systems at about the same time, and developed responses at about the same time. All of these are big ticket projects that have a large network and geek mindshare.

But at this point, Git won the popularity contest. Why?

It's not the syntax, which is famously unclear and ad hoc. Want to set up a new branch based on your current work? You might start typing git branch, but you're better off using git checkout -b. There's a command to regraft a branch onto a different base on the parent branch, git rebase, which is of course what you would use if you want to squash two commits into one.

Did you stage a file to be committed but want to unstage it? You mean

git reset HEAD -- fileA

The git manual advises that you set up an alias for this:

git config --global alias.unstage 'reset HEAD --'

#and now just use

git unstage fileA

I find it telling that the maintainers decided to put these instructions in the manual, instead of just adding a frickin' unstage command.

Yes, there are front-ends that hide most of this, the most popular being a web site and service named github.com, but it's a clear barrier: anything one step beyond what the web interfaces do will be awkward.

Early on, git also had some unique technical annoyances—maybe you remember having to call git gc manually. It has a concept, the index, that other systems don't even bother with.

All that said, it won.

The index and commit squashing

To go into further detail on the index that most DVCSes don't bother with, it is a staging area describing what your next commit will look like. Other systems simplify by just taking all changes in your working directory and bundling them into a snapshot. This is how I work about 98% of the time.

The other 2% of the time, I have a few different things going on, say adding some text in two different parts of a document, and for whatever reason I want to split them into two commits. The index lets me add one section, commit, then add the next and produce a second commit.
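
In command form, that looks something like the following, where the file name and messages are hypothetical:

git add --patch draft.tex            # stage only the hunks for the first section
git commit -m "Expand the introduction"
git add draft.tex                    # stage the rest
git commit -m "Add the robustness discussion"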

By splitting the mess of work into one item with one intent and a second with a new intent, I've imposed a narrative structure on my writing process. Each commit has some small intent attached, and I even have the ability to rearrange those intents to form a more coherent story. Did I know what I was doing and where I was going before I got there? Who knows, but it sure looks like I did in the commit log.

There's some debate about whether these rewritings are, to use a short and simple word, lying. The author of Fossil says (§3.7) that not being able to rewrite the history of the project's development is a feature, because it facilitates auditing. You can find others who recommend against using git rebase for similar reasons. On the other side, there are people who insist on using git rebase, because it is rude to colleagues not to clean up after yourself. And I often stumble across tutorials on how to remove accidentally committed passwords.

One side of the debate is generating an episodic narrative of how the project formed; the other is generating a series of news bulletins.

Personally, I am on the side of the episodic narrative writers. If you could install a keylogger that broadcasts your work, typos and all, would you do it? Would you argue that it is a matter of professional ethics that not doing so and only presenting your cleaned-up work is misleading to your coworkers?

Providing a means of squashing together the error and its fix loses the fact that the author made and caught a mistake, but that makes life more pleasant for the author and has no serious cost to others. Most information is not newsworthy.

You might draw the line in my ethical what-if at publication: once your work is out the door, correcting it should be accompanied by a statement. Git semi-enforces this by making it almost impossible to rebase after pushing to another repository. This seems to be more technical happenstance than a design decision.

On a social scale, revision control is again about building a narrative. Before, we were a bunch of kids running around the field kicking a ball around, and now we're a bunch of kids playing kickball. We don't throw tarballs or semi-functional patches at each other; we send discrete commits to discrete locations under specific rules, and use those to understand what colleagues were doing. We can read their commit history to know what they were doing, because they had tools to make the commit history a structured narrative.

I'd love to see somebody take my claim that revision control develops a narrative literally and actually use a DVCS to write literature. We are all familiar with interactive fiction; why would revision control fiction, with its built-in sense of time and branching, be so unusual?

The evangelical disconnect

This is the setup for so many awkward conversations. The evangelist has such zeal because of the structuring of unstructured work facilitated by complicated features like the index and the rebase command; the newbie just wants to save snapshots. The evangelist may not be able to articulate his or her side, or the evangelist does talk about building a narrative, but a lot of people just do not care and see it as useless bureaucracy, or don't have a concept of what it means until they are on the other side and have had a chance to try it themselves. I'd be interested to see tutorials that focus on the process of narrative writing, instead of rehashing the basics that are already covered so well.

In other evangelist news, I've saved you hours reading the million Git vs Other DVCS blog entries out there. The other one is easier to use, because it has fewer narrative control features.

If you buy my claim that Git's primary benefit over near-equivalent competitors is tools to rewrite the narrative, then Git's winning over alternatives is a strong collective revealed preference. Some of these same people who reject some programming languages because the semicolons look cluttery enthusiastically embrace a system where git symbolic-ref HEAD refs/heads/baseline is a reasonable thing to type. Can you work out what it does and why I needed it? That's a strong indication of just how high-value narrative control is among geekdom.

Of course, not everybody in a crowd is of like mind. DVCSes are socially interesting because they are something that working adults have to learn, with a real payoff after the learner gets over the hump. Like double-entry bookkeeping or playing the piano, the people who have taken the time to learn have a new means of understanding the world and generating narratives that people pre-hump may not see. It's interesting to see who has the desire to pursue that goal to its conclusion and who gives up after seeing the syntactic mess.


6 December 15.

Apophenia v1.0


Apophenia version 1.0 is out!

What I deem to be the official package announcement is this 3,000 word post at medium.com, which focuses much more on the why than the what or how. If you follow me on twitter then you've already seen it; otherwise, I encourage you to click through.

This post is a little more on the what. You're reading this because you're involved in statistical or scientific computing, and want to know if Apophenia is worth working with. This post is primarily a series of bullet points that basically cover background, present, and future.

A summary

Apophenia is a library of functions for data processing and modeling.

The PDF manual is over 230 pages, featuring dozens of base models and about 250 data and model manipulation functions. So if you're thinking of doing any sort of data analysis in C, there is probably already something there for you to not reinvent. You can start at the manual's Gentle Introduction page and see if anything seems useful to you.

For data processing, it is based on an apop_data structure, which is a lot like an R data frame or a Python Pandas data frame, except it brings the operations you expect to be able to do with a data set to plain C, so you have predictable syntax and minimal overhead.

For modeling, it is based on an apop_model structure, which is different from anything I've seen in any other stats package. In Stats 101, the term statistical model is synonymous with Ordinary Least Squares and its variants, but the statistical world is much wider than that, and is getting wider every year. Apophenia starts with a broad model object, of which ordinary/weighted least squares is a single instance (apop_ols).
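
As a quick sketch of the flavor (the data file name is made up, and I'm using the text-reading and printing functions from the manual; treat this as an outline rather than a recipe):

#include <apop.h>
//Compile with something like: gcc thisfile.c -lapophenia -lgsl -lgslcblas -o ols

int main(void){
    apop_data *d = apop_text_to_data("mydata.csv");   //a plain text table
    //apop_ols is one instance of the general apop_model object; apop_estimate
    //works the same way for any other model. OLS reads the dependent variable
    //from the first data column.
    apop_model *est = apop_estimate(d, apop_ols);
    apop_data_print(est->parameters);
    return 0;
}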

By assigning a format for a single model structure:

  • We can give identical treatment to models across paradigms, like microsimulations, or probabilistic decision-tree models, or regressions.
  • We can have uniform functions like apop_estimate and apop_model_entropy that accommodate known models using known techniques and models not from the textbooks using computationally-intensive generic routines. Then you don't have to rewrite your code when you want to generalize from the Normal distributions you started with for convenience to something more nuanced.
  • We can write down transformations of the form f:(model, model) $\to$ model.
    • Want a mixture of an empirical distribution built from observed data (a probability mass function, PMF) and a Normal distribution estimated using that data?
      apop_model *mix = apop_model_mixture(
                                    apop_estimate(your_data, apop_pmf),
                                    apop_estimate(your_data, apop_normal)
                                  );
                
    • You want to fix a Normal$(\mu, \sigma)$ at $\sigma=1$? It's a little verbose, because we first have to set the parameters with $\mu=$NaN and $\sigma=1$, then send that to the parameter-fixing function:
      apop_model *N1 = apop_model_fix_parameters(
                                    apop_model_set_parameters(apop_normal, NAN, 1));
            
    • You want to use your mixture of a PMF+Normal as a prior to the $\mu$ in your one-parameter Normal distribution? OK, sure:
      apop_model *posterior = apop_update(more_data,
                                          .prior=mix, .likelihood=N1);
                
    • You want to modify your agent-based model via a Jacobian [apop_coordinate_transform], then truncate it to data above zero [apop_model_truncate]? Why not—once your model is in the right form, those transformations know what to do.
  • In short, we can treat models and their transformations as an algebraic system; see a paper I once wrote for details.

What v1 means

  • It means that this is reasonably reliable.
    • Can the United States Census Bureau rely on it for certain aspects of production on its largest survey (the ACS)? Yes, it can (and does).
    • Does it have a test bed that checks that for correct data-shunting and good numeric results in all sorts of situations? Yes: I could talk all day about how much the 186 scripts in the test base do.
    • Is it documented? Yes: the narrative online documentation is novella length, plus documentation for every function and model, plus the book from Princeton University Press described on the other tabs on this web site, plus the above-linked Census working paper. There's a lot to cover, but an effort has been made to cover it.
    • Are there still bugs? Absolutely, but by calling this v1, I contend that they're relatively isolated.
    • Is it idiot-proof? Nope. For example, finding the optimum in a 20-dimensional space is still a fundamentally hard problem, and the software won't stop you from doing one optimization run with default parameters and reporting the output as gospel truth. I know somebody somewhere will write me an angry letter about how software that does not produce 100% verifiably correct results is garbage; I will invite that future correspondent to stick with the apop_normal and apop_ols models, which work just fine (and the OLS estimator even checks for singular matrices). Meanwhile, it is easy to write models that don't even have proven properties such as consistency (can we prove that as draw count $\to\infty$, parameter estimate variance $\to 0$?). I am hoping that Apophenia will help a smart model author determine whether the model is or is not consistent, rather than just printing error: problem too hard and exiting.

  • It means that it does enough to be useful. A stats library will never be feature-complete, but as per the series of blog posts starting in June 2013 and, well, the majority of what I've done for the last decade, it provides real avenues for exploration and an efficient path for many of the things a professional economist/statistician faces.

  • It means I'm no longer making compatibility-breaking changes. A lot of new facilities, including the named/optional arguments setup, vtables for special handling of certain models, a decent error-handling macro, better subsetting macros, and the apop_map facilities (see previously) meant that features implemented earlier merited reconsideration, but we're through all that now.

  • It's a part of Debian! See the setup page for instructions on how to get it from the Debian Stretch repository. It got there via a ton of testing (and a lot of help from Jerome Benoit on the Debian Scientific team), so we know it runs on a lot more than just my own personal box.

From here, the code base is in a good position to evolve:

  • The core is designed to facilitate incremental improvements: we can add a new model, or a new optimization method, or another means of estimating the variance of an existing model, or make the K-L divergence function smarter, or add a new option to an existing function, and we've made that one corner of the system better without requiring other changes or work by the user. The intent is that from here on out, every time the user downloads a new version of Apophenia, the interface stays the same but the results get better and are delivered faster, and new models and options appear.

  • That means there are a lot of avenues for you and/or your students to contribute.

  • Did I mention that you'll find bugs? Report them and we'll still fix them.

  • It's safe to write wrappers around the core. I wrote an entire textbook to combat the perception that C is a scary monster, but if the user doesn't come to the 19,000-line mountain of code that is Apophenia, we've got to bring the mountain to the user.

By the way, Apophenia is free, both as in beer and as in speech. I forget to mention this because it is so obvious to me that software—especially in a research context—should be free, but there are people for whom this isn't so obvious, so there you have it.

A request

I haven't done much to promote Apophenia. A friend who got an MFA from an art school says that she had a teacher who pushed the idea that you should spend 50% of your time producing art, and 50% of your time selling your art.

I know I'm behind on the promotion, so, please: blog it, tweet it, post it on Instagram, make a wikipage for it, invite me to give talks at your department. People will always reinvent already-extant code, but they should at least know that they're doing so.

And my final request: try it! Apophenia doesn't look like every other stats package, and may require seeing modeling from a different perspective, but that just may prove to be a good thing.