30 October 14.

The difference between Statistics and Data Science


An academic field is a social network built around an accepted set of methods.

Economics has grown into the study of human decision making in all its aspects. At this point, nobody finds it weird that some of the most heavily cited papers in the Economics journals are about the decision to commit crimes or even suicide. These papers use methods accepted by economists: writing down utility functions and using certain mathematical techniques to extract information from those utility functions. Anthropologists also study suicide and crime, but using entirely different methods. So do sociologists, using yet another set of tools. Which journal you submit your paper on crime to depends on matching the methods you use to the methods readers will be familiar with, not on the subject.

A notational digression: I hate the term "data science". First, there's a general rule (with exceptions) that anybody who describes what they're doing as "science" is not a scientist—self-labelling like that is just trying too hard. And are we implying that other scientists don't use data? Is it the data or the science that statisticians are lacking? Names are just labels, but I'll hide this one under an acronym for the rest of this post. I try to do the same with the United States DHS.

I stress that the distinction is about the set of salient tools because I think it's important to reject other means of cleaving apart the Statistics and DS networks. Some just don't work well, and some are as odious as any other "our people do it like this, but those people do it like that" kind of generalization. These are claims about how statisticians are too interested in theory and too willing to assume a spherical cow, or that DSers are too obsessed with hype and aren't careful with hypothesis testing. Hadley explains that "...there is little work [in Statistics] on developing good questions, thinking about the shape of data, communicating results or building data products", which is a broad statement about the ecosystem that a lot of statisticians would dispute, and a bit odd given that he is best known for building tools to help statisticians build data products. It's not hard to find people who say that DS is more applied than Stats, which is a claim about the environment that is hard to quantify and prone to observation bias. From the comment thread of this level-headed post: "I think the key differentiator between a Data Scientist and a Statistician is in terms of accountability and commitment."

Whatever.

We can instead focus on characterizing the two sets of tools. What is common knowledge among readers of a Stats journal and what is common knowledge among readers of a DS journal?

It's a subjective call, but I think it's uncontroversial to say that the abstract methods chosen by the DSers rely more heavily on modern computing technique than commonly-accepted stats methods, which tend to top out in computational sophistication around Markov Chain Monte Carlo.

One author went to the extreme of basically defining DS as the practical problems of data shunting and building Hadoop clusters. I dispute that any DSer would really accept such a definition, and even the same author effectively retracted his comment a week later after somebody gave him an actual DS textbook.

If you want to talk about tools in the sense of using R versus using Apache Hive, the conversation won't be very interesting to me but will at least be a consistent comparison on the same level. If we want to talk about generalized linear models versus support vector machines, that's also consistent and closer to what the journals really care about.

The basic asymmetry, that the price of admission for using DS techniques is greater computational sophistication, will indeed have an effect on the people involved. If we threw a random bunch of people at these fields, those who are more comfortable with computing would sort themselves into DS and those less comfortable into Stats. We wind up with two overlapping bell curves of computing ability, such that it is not hard to find a statistician-DSer pair where the statistician is the better programmer, but in expectation a randomly drawn DSer writes better code than a randomly drawn statistician. So there's one direct corollary of the two accepted sets of methods.

Three Presidents of the ASA wrote on the Stats vs DS thing, and eventually faced the same technical asymmetry:

Ideally, statistics and statisticians should be the leaders of the Big Data and data science movement. Realistically, we must take a different view. While our discipline is certainly central to any data analysis context, the scope of Big Data and data science goes far beyond our traditional activities.

This technical asymmetry is a real problem for the working statistician, and statisticians are increasingly fretting about losing funding—and for good reason. Methods we learned in Econ 101 tell us that an unconstrained set leads to an unambiguously (weakly) better outcome than a constrained set.

If you're a statistician who is feeling threatened, the policy implications are obvious: learn Python. Heck, learn C—it's not that hard, especially if you're using my C textbook, whose second edition was just released (or Modeling with Data, which this blog is ostensibly based on). If you have the grey matter to understand how the F statistic relates to SSE and SSR, a reasonable level of computing technique is well within your reach. It won't directly score you publications (DSers can be as snobby about writing code being a "mere clerical function" as the statisticians and the US Federal Circuit can be), but you'll have available a less constrained set of abstract tools.

If you are in the DS social network, an unconstrained set of tools is still an unambiguous improvement over a constrained set, so it's worth studying what the other social network takes as given. Some techniques from the 1900s are best left in the history books, but now and then you find ones that are exactly what you need—you won't know until you look.

By focusing on a field as a social network built around commonly accepted tools, we see that Stats and DS have more in common than differences, and can (please) throw out all of the bigotry that comes with searching for differences among the people or whatever environment is prevalent this week. What the social networks will look like and what the labels will be a decade from now is not something we can write a policy for (though, srsly, we can do better than "data science"). But as individuals we can strive to be maximally inclusive by becoming conversant in the techniques that the other social networks are excited by.

Next time, I'll have more commentary derived from the above definition of academic fields, then it'll be back to the usual pedantry about modeling technique.


23 October 14.

Bayes v Kolmogorov


$\def\Re{{\mathbb R}} \def\datas{{\mathbb D}} \def\params{{\mathbb P}} \def\models{{\mathbb M}} \def\mod#1{M_{#1}}$

We have a likelihood function that takes two inputs, which we will name the data and the parameter, and which gives the nonnegative likelihood of that combination, $L: d, p \to \Re^+$. [I wrote a lot of apropos things about this function in an early blog post, by the way.]

The two inputs are symmetric in the sense that we could slice the function either way. Fixing $p=\rho$ defines a one-argument function $L_\rho: d\to \Re^+$; fixing $d=\delta$ defines a one-argument function $L_\delta: p \to \Re^+$.

But the inputs are not symmetric in a key way, which I will call the unitary axiom (it doesn't seem to have a standard name). It's one of Kolmogorov's axioms for constructing probability measures. The axiom states that, given a fixed parameter, some value of $d$ will be observed with probability one. That is, \begin{equation} \int_{\forall \delta} L_\rho(\delta) d\delta = 1, \forall \rho. \end{equation} In plain language, when we live in a world where there is one fixed underlying parameter, one data point or another will be observed with probability one.
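
A quick numeric illustration, with a Normal$(\rho, 1)$ density standing in for $L_\rho$ (my choice of example; nothing special about it): whatever value we fix for the parameter, integrating over all possible data gives one.

# the unitary axiom, checked numerically for a few fixed parameter values
for (rho in c(0, 3, 4))
    print(integrate(function(d) dnorm(d, mean=rho, sd=1), -Inf, Inf)$value)
# each line prints 1 (up to integration error)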

This is a strong statement, because we read the total density as an indication of the likelihood of the parameter taking on the given value. I tell you that $p=3$, and we check the likelihood and see that the total density on that state of the world is one. Then you tell me that, no, $p=4$, and we refer to $L(d, 4)$, and see that it integrates to one as well.

Somebody else comes along and points out that this may work for discrete-valued $p$, but a one-dimensional slice isn't the right way to read a continuous density, insisting that we consider only ranges of parameters, such as $p\in[2.75,3.25]$ or $p \in [3.75,4.25]$. But if the integral over a single slice is always one, then the double integral is easy: $\int_{\rho\in[2.75,3.25]}\int_{\forall \delta} L(\delta, \rho) d\delta d\rho$ $=\int_{\rho\in[2.75,3.25]} 1 d\rho$ $=.5$, and the same holds for $p \in [3.75,4.25]$. We're in the same bind, unable to use the likelihood function to put more density on one set of parameters compared to any other of the same size.

This rule is asymmetric, by the way, because if we had all the parameters in the universe, whatever that might mean, and a fixed data set $\delta$, then $\int_{\forall \rho} L_\delta(\rho) d\rho$ could be anything.

Of course, we don't have all the data in the universe. Instead, we gather a finite quantity of data, and find the most likely parameter given that subset of the data. For example, we might observe the data set $\Delta=\{2, 3, 4\}$ and use that to say something about a parameter $\mu$. I don't want to get into specific functional forms, but for the sake of discussion, say that $L(\Delta, 2)=.1$; $L(\Delta, 3)=.15$; $L(\Delta, 4)=.1$. We conclude that three is the most likely value of $\mu$.

What if we lived in an alternate universe where the unitary axiom didn't hold? Given a likelihood function $L(d, p)$ that conforms to the unitary axiom, let $$L'(d, p)\equiv L(d, p)\cdot f(p),$$ where $f(p)$ is nonnegative and finite but otherwise anything. Then the total density on $\rho$ given all the data in the universe is $\int_{\forall \delta} L_{\rho}(\delta)f(\rho) d\delta = f(\rho)$.

For the sake of discussion, let $f(2)=.1$, $f(3)=.2$, $f(4)=.4$. Now, when we observe $\Delta=\{2, 3, 4\}$, $L'(\Delta, 2)=.01$, $L'(\Delta, 3)=.03$, $L'(\Delta, 4)=.04$, and we conclude that $\mu=4$ is the most likely value of $p$.
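
Here is that toy calculation transcribed into a few lines of R, in case you want to fiddle with the numbers:

L <- c("2"=.10, "3"=.15, "4"=.10)   # L(Delta, p) for p = 2, 3, 4
f <- c("2"=.10, "3"=.20, "4"=.40)   # the weighting function f(p)
names(which.max(L))        # "3": the most likely value under L alone
Lprime <- L * f            # .01, .03, .04
names(which.max(Lprime))   # "4": the most likely value once f breaks the unitary axiom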

Bayesian updating is typically characterized as a composition of two functions, customarily named the prior and the likelihood. In the notation here, these are $f(p)$ and $L(d, p)$. Without updating, all values of $p$ are equally likely in the world described by $L$, until data is gathered. The prior breaks the unitary axiom, and specifies that, even without gathering data, some values of $p$ are more likely than others. When we do gather data, that prior belief that some values of $p$ are more likely than others informs the conclusions we draw from the data.

Our belief about the relative preferability of one value of $p$ over another could be summarized into a proper distribution, but once again, there is no unitary axiom requiring that a distribution over the full parameter space integrate to one. For example, the bridge from the Bayesian-updated story to the just-a-likelihood story is the function $f(\rho)=1, \forall \rho$. This is an improper distribution, but it does express that each value of $p$ has the same relative weight.

In orthodox practice, everything we write down about the data follows the unitary axiom. For a given observation, $L'(\delta, p)$ is a function of one variable, sidestepping any issues about integrating over the space of $d$. We may require that this univariate function integrate to one, or just stop after stating that $L'(\delta, p) \propto f(p)L(\delta, p)$, because we usually only care about ratios of the form $L'(\delta, \rho_1)/L'(\delta, \rho_2)$, in which case rescaling is a waste of time.

In a world where all parameters are observable and fixed, the unitary axiom makes so much sense it's hard to imagine not having it. But in a meta-world where the parameter has different values in different worlds, the unitary axiom implies that all worlds have an equal slice of the likelihood's density. We usually don't believe this implication, and Bayesian updating is our way of side-stepping it.


14 July 14.

Microsimulation games, table top games


I wrote a game. It's called Bamboo Harvest, and you can see the rules at this link. You can play it with a standard deck of cards and some counters, though it's much closer to the sort of strategic games I discuss below than poker or bridge. I've played it with others and watched others play it enough to say it's playable and pretty engaging. Ms NGA of Baltimore, MD gets really emotional when she plays, which I take as a very good sign.

Why am I writing about a game on a web page about statistical analysis and microsimulation? I will leave to others the topic of Probability theory in table top games, but there is also a lot that we who write economic models and microsimulations of populations can learn from game authors. After all, the designers of both board games and agent-based models (ABMs) have the same problem: design a set of rules such that the players in the system experience an interesting outcome.

Over the last few decades, the emergent trend among board games has been so-called Eurogames, which are aimed at an adult audience, seek greater interaction among players, and typically include an extensive set of rules regarding resource trading and development. That is, the trend has been toward exactly the sort of considerations that are typical of agent-based models.

A game whose resource-exchange rules are too complex, or that is simple enough to be easily "solved", will not have much success in the market. In most games, the optimal move in any given situation could theoretically be solved for by a hyperrational player, but the fact that players find them challenging demonstrates that the designers have found the right level of rule complexity for a rational but not hyperrational adult. We seek a similar complexity sweet spot in a good ABM. Readers shouldn't get lost in all the moving parts, but if the model is so simple that readers know what it will do before it is run—if there's no surprise—then it isn't worth running.

Of course, we are unconcerned as to whether our in silico agents are having any fun or not. Also, we get to kill our agents at will.

Simulation designers sometimes have a sky's-the-limit attitude, because processor time is cheap, but game designers are forced by human constraints to abide by the KISSWEA principle (keep it simple, stupid, without extraneous additions). It's interesting to see what game designers come up with to resolve issues of simultaneity, information provision and hiding, and other details of implementation, when the players have only counters and pencil and paper.

Market and supply chain

Settlers of Catan is as popular as this genre of games gets—I saw it at a low-end department store the other day on the same shelf as Monopoly and Jenga. It is a trading game. Each round a few random resources—not random players—are productive, which causes gluts and droughts for certain resources, affecting market prices. The mechanics of the market for goods are very simple: each player has a turn, and on it they can offer trades to other players (or to all players). This already creates interesting market dynamics, without the need for a full open-outcry marketplace or bid-ask book, which would be much more difficult to implement at the table or in code. How an agent decides to trade can also be coded into an artificial player, as demonstrated by the fact that there are versions of Settlers you can play against the computer.

Some games, like Puerto Rico, Race for the Galaxy, Bootleggers, and Settlers again, are supply chain games. To produce a victory point in Puerto Rico, you have to get fields, then get little brown immigrants to work the fields (I am not making this up), then get a factory to process the crops, then sell the final product or ship it to the Old World. There may be multiple supply chains (corn, coffee, tobacco). The game play is basically about deciding which supply chains to focus on and where in the supply chain to put more resources this round. The game design is about selecting a series of relative prices so that the cost (in time and previous supply-chain items) makes nothing stand out as a clear win.

One could program simple artificial agents to play simple strategies, and if one is a runaway winner with a strategy (produce only corn!) then that is proof that a relative price needs to be adjusted and the simulation redone. That is, the search over the space of relative prices maximizes an objective function regarding interestingness and balance. ABMers will be able to immediately relate, because I think we've all spent time trying to get a simple model to not run away with too many agents playing the same strategy.

I'm not talking much about war games, which seem to be out of fashion. The central mechanism of a war game is an attack, wherein one player declares that a certain set of resources will try to eliminate or displace a defending resource, and the defender then declares what resources will be brought to defense. By this definition, Illuminati is very much a war game; Diplomacy barely is. Design here is also heavily about relative prices, because so much of the game is about which resources will be effective when allocated to which battles.

Timing

How does simultaneous action happen when true simultaneity is impossible? The game designers have an easy answer to simultaneously picking cards: both sides pick a card at a leisurely pace, put the card on the table, and when all the cards are on the table, everybody reveals. There are much more complicated means of resolving simultaneous action in an agent-based model, but are they necessary? Diplomacy has a similar simultaneous-move arrangement: everybody picks a move, and an arbitration step uses all information to resolve conflicting moves.

Puerto Rico, San Juan, and Race for the Galaxy have a clever thing where players select the step in the production chain to execute this round, so the interactive element is largely in picking production chain steps that benefit you but not opponents. Setting aside the part where agents select steps, the pseudocode would look like this:

for each rôle:
    for each player:
        player executes rôle

Typical program designs make it really easy to apply a rôle function to an array of players. Josh Tokle implements a hawk and dove game via Clojure. His code has a game-step where all the birds play a single hawk-and-dove game from Game Theory 101, followed by all executing the death-and-birth-step, followed by all taking a move-step.

It's interesting when Puerto Rico and Race for the Galaxy have this form, because it's not how games usually run. The usual procedure is that each player takes a full turn executing all phases:

for each player:
    for each rôle:
        player executes rôle

I'd be interested to see cases where the difference in loop order matters or doesn't.
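
Here is one toy case, in R rather than at the table, where it does matter. The rules are mine and deliberately minimal: one rôle takes a counter from a shared pool, and the other scores the pool's current size.

take <- function(i) pool <<- pool - 1              # rôle 1: deplete the shared pool
gain <- function(i) score[i] <<- score[i] + pool   # rôle 2: score the pool's current size

pool <- 5; score <- c(0, 0, 0)
for (f in list(take, gain)) for (i in 1:3) f(i)    # for each rôle, for each player
score   # 2 2 2: every player scores the same depleted pool

pool <- 5; score <- c(0, 0, 0)
for (i in 1:3) for (f in list(take, gain)) f(i)    # for each player, for each rôle
score   # 4 3 2: earlier players see a fuller pool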

Topology

One short definition of topology is that it is the study of what is adjacent to what.

The Eurogamers seem to refer to the games with very simple topologies as abstracts—think Go or Chess. Even on a grid, the center is more valuable in Chess (a center square is adjacent to more squares than an edge square) and the corners are more valuable in Go (being adjacent to fewer squares $\Rightarrow$ easier to secure).

Other games with a board assign differential value to areas via other means. War games typically have maps drawn with bottlenecks, so that some land is more valuable than others. Small World has a host of races, and each region is a valuable target for some subset of races.

I'm a fan of tile games, where the map may grow over time (check out Carcassonne), or what is adjacent to what changes over the course of the game (Infinite City or Illuminati).

Other games have a network topology; see Ticket to Ride, where the objective is to draw long edges on a fixed graph.

War games often extol complexity for the sake of complexity in every aspect of the game, so I'm going to set those aside. But the current crop of Eurogames tends to focus on one aspect (topology or resource management or attack dynamics) and leave the other aspects to a barebones minimum of complicatedness. Settlers has an interesting topology and bidding rules, and the rest of the game is basically just mechanics. Carcassonne has the most complex (and endogenous) topology of anything I'm discussing here, so the resource management is limited to counting how many identical counters you have left. Race for the Galaxy, Puerto Rico, and Dominion have crazy long lists of goods and relative prices, so there is no topology and only very limited player-interaction rules—they are almost parallel solitaire. A lot of card games have a complete topology, where every element can affect every other.

An example: Monopoly

Back up for a second to pure race games, like Pachisi (I believe Sorry! is a rebrand of a Pachisi variant). Some have an interactive element, like blocking other opponents. Others, aimed at pre-literate children, like Chutes and Ladders or Candyland, are simply a random walk. Ideally, they are played without parental involvement, because adults find watching a pure random walk to be supremely dull. Adults who want to ride a random walk they have no control over can invest in the stock market.

Monopoly is a parallel supply chain game: you select assets to buy, which are bundled into sets, and choose which sets you want to build up with houses and hotels. On top of this is a Chutes and Ladders sort of topology, where you go around a board in a generally circular way at random speed, but Chance cards and a Go to Jail square may cause you to jump position.

The original patent has an explanation for some of these details—recall that Monopoly was originally a simulation of capital accumulation in the early 20th century:

Mother earth: Each time a player goes around the board he is supposed to have performed so much labor upon mother earth, for which after passing the beginning-point he receives his wages, one hundred dollars[...].

Poorhouse: If at any time a player has no money with which to meet expenses and has no property upon which he can borrow, he must go to the poorhouse and remain there until he makes such throws as will enable him to finish the round.

You have first refusal on unowned properties that your token lands on (then they go up for auction, according to the official rules that a lot of people ignore), and you owe rent when your token lands on owned properties, and Mother earth periodically pays you \$200. All of these cash-related events are tied to the board movement, which is not the easiest or most coherent way to cause these events to occur. E.g., how would the game be different if you had a 40-sided die and randomly landed on squares all around the board? Would the game be more focused if every player had a turn consisting of [income, bid on available land, pay rent to sleep somewhere] phases?

The confounding of a supply chain game with randomization via arbitrary movement is what makes it successful, because the Chutes and Ladders part can appeal to children (the box says it's for ages 8 and up), while the asset-building aspects are a reasonable subgame for adults (although it is unbalanced: a competent early leader can pull unsurpassably ahead). But that confounding is also the death of Monopoly as a game for adults, because there are too many arbitrary moving parts about going around an arbitrary track.

I can't picture a modern game designer putting together this sort of combination of elements. I sometimes wonder if the same sort of question could be asked of many spatial ABMs (including ones I've written): is the grid a key feature of the model, or just a mechanism to induce random interactions with a nice visualization?

Conclusion

Microsimulation designers and for-fun game designers face very similar problems, and if you're writing microsimulations, it is often reasonable to ask: how would a board game designer solve this problem? I discussed several choices for turn order, trading, topology, and other facets, and in each case different choices can have a real effect on outcomes. In these games that are engaging enough to sell well, the game designers could only select a nontrivial choice for one or two facets, which become the core of the game; the other facets are left to the simplest possible mechanism, to save cognitive effort by players.

Also, now that you've read all that, I can tell you that Bamboo Harvest focuses on a shifting-tiles topology, with a relatively simple supply chain. We decided against marketplace/trading rules.


12 July 14.

Intercoder agreement: the R script and its tests


Here, I will present an R script to calculate an information-based measure of intercoder agreement.

The short version: we have two people putting the same items into bins, and want to know how often they are in agreement about the bins. It should be complexity-adjusted, because with only two bins, binning at random achieves 50% agreement, while with 100 bins binning at random produces 1% agreement. We can use mutual information as a sensible measure of the complexity-adjusted agreement rate. A few more steps of logic, and we have this paper in the Journal of Official Statistics describing $P_i$, a measure of intercoder agreement via information in agreement. I also blogged this paper in a previous episode.

There are two features of the paper that are especially notable for our purposes here. The first is that I said that the code is available upon request. Somebody called me out on that, so I sent him the code below. Second, the paper has several examples, each with two raters and a list of their choices, and a carefully verified calculation of $P_i$. That means that the tests are already written.

The code below has two functions. We could turn it into a package, but it's not even worth it: just source("p_i.R") and you've got the two defined functions. The p_i function does the actual calculation, and test_p_i runs tests on it. As in the paper, some of the tests are extreme cases like full agreement or disagreement, and others are average tests that I verified several times over the course of writing the paper.

Could it be better? Sure: I don't do a very good job of testing the code for really pathological cases, like null inputs or something else that isn't a matrix. But the tests give me a lot of confidence that the p_i function does the correct thing given well-formed inputs. It's not mathematically impossible for a somehow incorrect function to give correct answers for all six tests, but with each additional test the odds diminish.

Here is the code. Feel free to paste it into your projects, or fork it from Github and improve upon it—I'll accept pull requests with improvements.


p_i <- function(dataset, col1=1, col2=2){

    entropy <- function(inlist){
        -sum(sapply(inlist, function(x){log2(x)*x}), na.rm=TRUE)
    }

    information_in_agreement <- function(diag, margin1, margin2){
        sum <- 0
        for (i in 1:length(diag))
            if (diag[i] != 0)
                sum <- sum + diag[i]*log2(diag[i]/(margin1[i]*margin2[i]))
        return (sum)
    }

    dataset <- as.data.frame(dataset) #in case user provided a matrix.
    crosstab <- table(as.data.frame(cbind(dataset[,col1],dataset[,col2])))
    d1tab <- table(dataset[,col1])
    d2tab <- table(dataset[,col2])
    d1tab <- d1tab/sum(d1tab)
    d2tab <- d2tab/sum(d2tab)
    crosstab <- crosstab/sum(crosstab)

    entropy_1 <- entropy(d1tab)
    entropy_2 <- entropy(d2tab)
    ia <- information_in_agreement(diag(crosstab), d1tab, d2tab)
    return (2*ia/(entropy_1+entropy_2))
}

test_p_i <- function(){
    fullagreement <- matrix(
                    c(1,1,1,1,2,2,2,2,3,3,
                      1,1,1,1,2,2,2,2,3,3),
                    ncol=2, byrow=FALSE
                )

    stopifnot(p_i(fullagreement)==1)

    noagreement <- matrix(
                    c(1,2,1,2,1,2,3,1,3,2,
                      2,1,3,1,2,3,2,2,1,3),
                    ncol=2, byrow=FALSE
                )

    stopifnot(p_i(noagreement)==0)

    constant <- matrix(
                    c(1,1,1,1,1,1,
                      1,1,2,2,2,3),
                    ncol=2, byrow=FALSE
                )

    stopifnot(p_i(constant)==0)

    neg_corr <- matrix(
                    c(1,1,1,1,1,2,2,2,2,2,
                      1,2,2,2,2,1,1,1,1,2),
                    ncol=2, byrow=FALSE
                )

    stopifnot(abs(p_i(neg_corr)- -.2643856) < 1e-6)

    rare_agreement <- matrix(
                    c(1,1,1,2,1,2,2,2,3,3,
                      1,1,1,1,2,2,2,2,3,3),
                    ncol=2, byrow=FALSE
                )

    stopifnot(abs(p_i(rare_agreement)- .6626594) < 1e-6)

    common_agreement <- matrix(
                    c(1,1,1,1,2,2,2,3,2,3,
                      1,1,1,1,2,2,2,2,3,3),
                    ncol=2, byrow=FALSE
                )

    stopifnot(abs(p_i(common_agreement)- 0.6130587) < 1e-6)
}
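
For completeness, a quick usage sketch; the ratings matrix below is made-up example data, not one of the cases from the paper:

ratings <- matrix(c(1,1,2,2,3,3,1,2,3,1,
                    1,1,2,2,3,3,1,2,3,2),
                  ncol=2, byrow=FALSE)   # one column per rater, one row per item
p_i(ratings)   # complexity-adjusted agreement: 1 = full agreement, 0 = chance level
test_p_i()     # runs silently if all six checks pass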


9 May 14.

Testing statistical software III: the contract


So far, I've given a brief intro to the mechanics of assertions and tests, which you can use to increase your own and others' confidence in your code, and I gave some examples of brainstorming theorems to provide constraints that your function's output has to meet.

The thesis of this part is that the tests are the embodiment of a contract you have with your users. Last time, I gave some suggestions about how to test a matrix inversion function, which a bureaucrat would write as a bullet-pointed contract:

  • The output will be such that $X\cdot Inv(X)=I$ to within some tolerance (see below).
  • If the input is $I$, the output will be $I$.
  • If the input is symmetric, the output will be symmetric.

There it is in English; the test is the contract in executable form.
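
Here's a sketch of that contract as executable checks. I'm using R's solve() as a stand-in for the inversion function under test, a well-conditioned random input, and an arbitrary tolerance; the point is the shape of the test, not the specifics.

eps <- 1e-6                                  # the arbitrary tolerance clause
X <- matrix(rnorm(9), 3, 3)                  # a generic non-degenerate input
stopifnot(max(abs(X %*% solve(X) - diag(3))) < eps)   # X . Inv(X) = I, within tolerance
stopifnot(max(abs(solve(diag(3)) - diag(3))) < eps)   # identity in, identity out
S <- crossprod(X) + diag(3)                  # a symmetric (and well-conditioned) input
stopifnot(max(abs(solve(S) - t(solve(S)))) < eps)     # symmetric in, symmetric out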

Writing a numeric routine isn't expressionist improvisation: you've got to conceive of what the function does before you write it. And the test routine is the formalization of what the function promises. The typical workflow is to write the tests after you write the function to be tested, but that makes no sense here. Because the contract and test are siamese twins, and you wrote the contract before writing the function, it makes sense to write the test before writing the function as well. The term for writing the test/contract first and then writing a function that conforms to it is test-driven development, and it's a sensible work habit that should probably be used more (even by myself).

You are going to sit down and write a routine or three to read in a data set, run a regression, and extract output. Same questions: what's the contract you expect, and how much of it can you express as a test ahead of time? Yes, I know that statistical analysis really is expressionist improvisation, and if we knew what we were doing we wouldn't call it science, and exploration is an art upon which we mustn't impose structure. But it's much more fruitful when you explore a data set you can state with confidence was read in correctly, and when an alarm rings if you regress $Y$ on $Y$. Lewis and Clark kept a log of problems they ran into and surmounted; data explorers can too. The big technological plus we modern explorers have is that we can re-execute our log when the next data set comes in.

Also, once you have the contract/test, the documentation almost writes itself. For the special cases you worked out, show them as examples users can cut/paste/verify; for the relatively odd things about symmetry and such, present them as additional useful facts. For some cases you may need more; I suggested a lot of tests for a Bayesian updating routine last time, but the sum of them really aren't enough to fully describe what Bayesian updating is about.

Testing the special cases

And don't forget the weird special cases. When I started writing numeric software, I'd sometimes brush off the one-in-a-million edge cases, because, hey, what're the odds that something like that could happen. Which is an easy question to answer: if you send five million independent observations to each be run through a function, a one-in-a-million event has a 99.3% chance of occurring at least once.
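
If you want to check that figure, it's one line of R:

1 - (1 - 1e-6)^5e6   # = 0.9932621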

Even if you are only sending in a handful of data points to a function that doesn't handle the special cases, the Law of Irony Maximization states that you'll hit that one-in-a-million case anyway.

I hope you find a mathematically clean rule for handling all the edge cases. Hint: your math platform has a floating-point representation for Not-a-Number (NaN), infinity, and -infinity. But in some cases, you just have to come up with an arbitrary rule. Then, add clauses to the contract/tests/documentation. I dunno, how about:

  • If the input is a zero matrix, I return a square of infinities.
  • If the input isn't a square matrix, I return NaN.

To pull a line from a pop song about the endogeneity of social relations, you alone decide what's real. But if you express the rules you wrote in the contract/tests/documentation, then users know exactly what they're getting.

As with legal contracts, the special cases sometimes take many more lines to detail than the simple basic idea. So it goes.

Tolerance

Of course, $X\cdot Inv(X)$ is exactly the identity matrix to machine precision only when we are very lucky. All of my tests wind up with a series of assertions making use of this Diff macro:

#define Diff(L, R, eps) {\
    if(fabs((L)-(R))>(eps)) {    \
        printf(#L "=%g is too different from " #R "=%g (arbitrary limit=%g).\n", (double)(L), (double)(R), eps); \
        abort(); \
    }\
}

//sample usage:
Diff(2./3, .6666, 1e-2); //pass
Diff(2./3, .6666, 1e-8); //fail

The tolerances to which your tests fit are part of the contract, a bit of documentation for those of us who read tests to see what to expect from these functions. Again, you alone decide the tolerance level to test to, and then it's there in the contract for the user to work with.

I used to try really hard to get eps as small as possible, but Apophenia's tests ship to the public, and I'd get emails from people telling me that on their machine, some test produces a result where L-R in the above was 1e-4 while the test listed the max tolerance as 5e-5. It would be nice if that could be improved, but it's very low on my to-do list: I'd typically just change the internal variables from double to long double and consider the case closed. These days I leave the tolerances looser.

Anyway, once you've achieved a reasonable level of accuracy, nobody really cares. In all the endless promotion and bragging about how awesome the latest awesome math package is, how often do you see anybody brag that their package is more precise than the others? We can't really do statistics on arbitrary-precision systems (yet), and those people who truly rely on high precision so they can send their probe to Mars need to do the due diligence of retesting the code anyway.

But tolerances are not to be brushed off entirely, because sometimes these little things really do indicate errors. Maybe you should have divided by $n-1$ instead of $n$.

I may be the only person to have calculated the exact formula for unbiased sample kurtosis (PDF) without recourse to Taylor expansions—it is not a pleasant traipse through mathemagic land. There was a draft of the apop_vector_kurtosis function where the best tolerance I could get was around $10^{-2}$. In this case, the loose tolerance really was telling me something: there was a math typo in my transcription of the messy formula. The test cases now pass to within a tolerance around $10^{-5}$, which is stellar, given that everybody else uses an approximate formula from the start. Tell all your friends: Apophenia has the most precise unbiased sample kurtosis around.

Coverage

When I run my main test script, I cover over 90% of the code, which I know thanks to gcov, a coverage checker that accompanies the GNU Compiler Collection. It lists which lines get hit and how often, which means I know which lines are still completely untested. Because if every piece of code is covered by a contract/test, what does it mean when we find code that isn't tested?

Once I've got the list of untested lines, I've got decisions to make about whether they should be tested and how (or whether to cut them entirely, because they're not part of a contract). There's a sort of consensus that 85% is about the right amount of coverage, and it's not worth shooting for 100%. And I concur: being endowed with only finite time, writing tests to cover that last 15% is typically low priority. Anyway, given that line count is meaningless, percent coverage can also be gamed and is not a statistic worth really obsessing over.

Many platforms have a coverage checker of some sort that behaves analogously to gcov for C. However, some computing platforms have no coverage checker at all, even though rather complex packages are available for them. This freaks me out. If the authors of a complex package have no assurance that every part of the package was tested, then you as a user of that package have no such assurance.

Regression testing

Every time I make a change in Apophenia, I re-compile and re-run every piece of public code I have written that relies on it, including the ships-with-the-library tests, the examples and semi-public solutions for Modeling with Data, the githubbed code for this site, the examples from the documentation, and a few other sundry items. If the tests pass, I know that my change has not broken any of the contracts to date.

If I had stuck to ad hoc tests and eyeballing things, I'd have nothing. But as the code base grew, the size of the little side-tests I added to the master test script grew at a corresponding rate. Small side projects have a few small side-tests; the magnum opera have a formidable automated list of checks.

Back at the beginning of part one, I recommended some simple checks, like whether all the entries in the age category are positive and under 120. Here at the end, those tests have added up to a contract that not only covers everything in the code base, but that can actually guide how the code is written to begin with. The work is much more challenging here in floating-point land, because the larger units can be very difficult to test, and even small blocks require some creativity in coming up with little theorems to assert about outputs, and we will always be only within some tolerance of the true value anyway. But don't let that daunt you: because it is rarely obvious when a calculated number is incorrect, we should be especially vigilant in adding tests and assertions to raise our confidence.

Next time I'll give a relatively simple example, which is mostly an excuse to post some code that I've been meaning to post for a long time.


5 May 14.

Testing statistical software II: there's a theorem somewhere


Last time, I discussed the value of adding assertions to your code, to check guarantees that the code should follow.

I talked about ad hoc assertions as part of a program, which thus run every time the program runs. This flows into unit testing proper, wherein a given unit of code is accompanied by a separate unit of tests, which is typically run only when something changes. Which approach you take is mostly a question of logistics. You've got options.

This time I'll make some suggestions on testing numeric and statistical results. This is a lot more difficult than testing typical CompSci functions, where there is rarely a stochastic element and a broken function is blatantly obvious. When you run your Bayesian updating via MCMC search and get a posterior with mean 3.218, is that right? In situations like this, there's no direct way to prove that the functions used to produce that result are correct, but there are always many means of increasing confidence.

Brainstorming some theorems

The reason that you are writing code to begin with is that there is a theorem somewhere indicating that the method you are using to calculate a value is somehow useful. That theorem probably says that the result you're getting equals some known and useful value or has some useful property. For example, if you wrote a matrix-inverse function, you know there is a theorem that states that $X\cdot Inv(X) = I$ for any non-degenerate $X$ (but see next time about tolerances).

So let's brainstorm some theorems in some typical situations. Few would be something in a textbook with the heading Theorem, but most will be something you can prove or verify relatively easily.

We can start with the special cases: if the input is the identity matrix or zero, it is often trivial to calculate the output. So there's a petit theorem, that the identity matrix or zero will produce a given output, which we can write up as our first test. Such easy and obvious things are a good first place to start, because it is especially easy to set up the inputs and outputs for the test.

For simpler calculations, you may be able to go beyond trivial inputs to a haphazard example you calculated by hand, in which case you have another petit theorem looking something like if the input is $[2, 3, 4]$, then the output is $12.2$.

Can you say something about all possible outputs? Maybe they should all be greater than zero, or otherwise bounded. Maybe the output can't be greater than the sum of the inputs. If the input has some sort of symmetry, will the output be symmetric? With these little theorems, we can write a test that generates a few thousand random inputs and checks that the condition holds for all of them.
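
As a sketch of that pattern, here's the random-input loop with sample variance standing in for the function under test; the little theorem is that the output is nonnegative and can never exceed the squared range of the input.

for (i in 1:1000) {
    x <- rnorm(sample(2:50, 1))                           # a random input of random length
    stopifnot(var(x) >= 0, var(x) <= (max(x) - min(x))^2)
}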

John D Cook wrote a book chapter on testing random number generators. The chapter is basically a brainstorming session about things that can be confidently stated about a distribution. For example, there is a theorem that a CDF constructed from draws from a distribution should be "close" to the original distribution, so a Kolmogorov-Smirnov test comparing the two should give a good test statistic. He has posted lots of other interesting little tidbits on testing.

Delta

Sometimes we know how the output should change given changes in the input. The same Kolmogorov-Smirnov test can be used to express the confidence with which we believe a data set was drawn from a given theoretical distribution. Draw 1,000 values from a Normal$(0, 1)$. I may not know exactly what the test statistic will be for a test comparing that sample to a proper Normal$(0, 1)$, but I do know that the statistic will get larger (and the $p$-value smaller) when comparing to a Normal$(.1, 1)$, and larger still when compared to a Normal$(.2, 1)$. If outputs from my K-S test statistic calculator don't follow that pattern, something is wrong.
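
A sketch of that check, using R's ks.test as the calculator:

set.seed(42)
x <- rnorm(1000)    # 1,000 draws from a Normal(0, 1)
sapply(c(0, .1, .2), function(m) ks.test(x, "pnorm", mean=m, sd=1)$statistic)
# expect the D statistic to rise, and the p-value to fall, as the comparison
# mean moves away from the true value of zero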

For that introductory example where the posterior mean is 3.218, you probably have control of the prior mean, and the posterior mean should rise as your prior mean rises (but not as much as you raised the prior mean).

If I run an ordinary linear regression on random data, I'll get some set of results. I may not know what they look like, but I do know that doubling all of the values ($X$ and $Y$, rescaling the whole problem) shouldn't change any of the slopes, and will double the intercept. This is another test I can run on a thousand random inputs.
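
Here is that test sketched in R, with lm standing in for the regression routine under test:

for (i in 1:1000) {
    x <- rnorm(100);  y <- 1 + 2*x + rnorm(100)   # random data with some arbitrary trend
    c1 <- coef(lm(y ~ x))
    c2 <- coef(lm(2*y ~ I(2*x)))                  # the same problem, rescaled by two
    stopifnot(abs(c2[2] - c1[2]) < 1e-8,          # the slope doesn't change
              abs(c2[1] - 2*c1[1]) < 1e-8)        # the intercept doubles
}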

Inverses

We often have theorems about inverses: given your function $f$, there is a function $g$ such that we are guaranteed that $g(f(x)) = x$. If such a function exists, then there's another test. For example, if I wrote a matrix-inversion function, I expect that in non-pathological cases Inv(Inv($X$))=$X$ (again, see next time about tolerances).

Much of Apophenia's model testing looks like this. I make a thousand random draws from a model given parameters $P$, that produces a data set $D$. I know that if I estimate the parameters of the same model using the data set $D$, then the calculated parameters should be close to $P$. I basically define the RNG to be the function such that this (fix parameters--draw--estimate parameters) round trip works. So there's a test.

Now map that same set of a thousand random draws $d_1, d_2, …$ to a set of a thousand CDF values, $CDF(d_1), CDF(d_2), …$. This will be Uniformly distributed, so there's another test of RNG + CDF.
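
Here are both round trips sketched with a Normal model standing in for the Apophenia machinery:

set.seed(1)
draws <- rnorm(1000, mean=3, sd=2)             # fix parameters, then draw
c(mean(draws), sd(draws))                      # estimate: should land near (3, 2)
ks.test(pnorm(draws, mean=3, sd=2), "punif")   # RNG + CDF: the CDF values look Uniform(0, 1)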

Especially with these round-trips, I find brainstorming for petit theorems to be pretty fun. Writing code that works reliably requires a lot of time doing mechanical things closer to transforming square pegs so they'll fit in round holes than actual math. A good test takes a function that does some nontrivial calculation, then does another completely unrelated calculation, and produces exactly the expected result. When one of these sets of transformations works for the first time, it feels less like a petit theorem and more like a petit miracle. For a few moments, I feel like the world works. I get that buzz that drives us to do more math.

Write everything twice

Sometimes, the theorem proves that the efficient and graceful method you just implemented is equivalent to a tedious and computationally intensive alternative. So, there's a test. It's unfortunate, because now you have to sit down and code the ungraceful long way of doing things. At least it only has to work for a handful of test cases. E.g., for exponential families, you can Bayesian update by looking up the posterior on the conjugate tables, or you can do MCMC, and the output distributions should approach identical.
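
As a sketch of what that double-coding looks like, here is a toy Beta-Binomial example of my own (a Beta(2,2) prior and seven successes in ten trials): the conjugate table gives the posterior mean directly, and a bare-bones Metropolis sampler should land in the same place.

set.seed(7)
a <- 2; b <- 2; k <- 7; n <- 10
conjugate_mean <- (a + k)/(a + b + n)            # posterior is Beta(9, 5); mean = 9/14

logpost <- function(p) dbeta(p, a, b, log=TRUE) + dbinom(k, n, p, log=TRUE)
p <- 0.5; chain <- numeric(50000)
for (i in seq_along(chain)) {
    prop <- p + rnorm(1, 0, 0.1)                 # symmetric random-walk proposal
    if (prop > 0 && prop < 1 && log(runif(1)) < logpost(prop) - logpost(p)) p <- prop
    chain[i] <- p
}
c(conjugate=conjugate_mean, mcmc=mean(chain[-(1:5000)]))   # the two should nearly match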

At my workplace, we have a lot of people who are PhDs in their field but amateurs in the coding world, so they don't have a concept of unit testing. Instead, everything reasonably important is double-coded, wherein two independent parties independently write the whole darn program. I tell you this for perspective, so you see that double-coding the tedious version of a specific routine to check specific inputs is not so bad.

We might have another code base somewhere else that calculates the same thing, and can compare our results to theirs. Or, we may have a complete run with a certain data set that we've eyeballed closely and are confident is correct. I'm reluctant to rely on this for testing. There is no theorem that the other guy did the math right. Such compare-the-printout tests work as a regression test (see next time), but if you make legitimate changes, like improving precision or rearranging the output, now you've gotta rewrite the checks. I personally rely on such argument by authority only for eyeballing things and for checking print output routines, and even that much has been a headache for me.

Anything sufficiently well understood can have assertions written about it. I have no idea what sort of mathematical result inspired you to write the code you're writing, but I hope that some of the examples above will help you as you brainstorm petit theorems about your own work. As noted last time, none of these tests prove that your code is Correct, but each shows that you wrote something that meets increasingly tight constraints, and gives users a little more confidence that your code is reliable.