23 January 15.

m4 without the misery

link PDF version

I presented at the DC Hack and Tell last week. It was fun to attend, and fun to present. I set the bar low by presenting my malcontent management system, which is really just a set of shell and m4 scripts.

I have such enthusiasm for m4, a macro language from the 1970s that is part of the POSIX standard, because there aren't really m4 files, the way there are C or Python or LaTeX files. Instead, you have a C or Python or LaTeX file that happens to have some m4 strewn about. Got something that is repetitive that a macro or two could clean up? Throw an m4 macro right there in the file and make immediate use. And so, m4 is the hammer I use to even out nails everywhere. C macros can't generate other C macros (see example below), and LaTeX macros are often fragile, so even where native macros are available it sometimes makes sense to use m4 macros instead. Even the markup for this column is via m4.

What discussion there was after the hack and tell was about how I can even use m4, given its terrible reputation as a byzantine mess. So an inquiry must be made about what I'm doing differently that makes this tolerable. I was in a similar situation with the C programming language, and my answer to how I use C differently from sources that insist that it's a byzantine mess turned into a lengthy opus.

M4 is simpler, so my answer is only a page or two.

I assume you're already familiar with m4. If not, there are a host of tutorials out there for you. I have two earlier entries, starting here. Frankly, 90% of what you need is m4_define, so if you get that much, you're probably set to start trying things with it.

As above, m4 passes not-m4-special text without complaint, but it is very aggressive in substituting anything that m4 recognizes. This leads to the advice that for every pair of parens, you should have a pair of quote-endquote markers to protect the text, which leads to m4-using files with a million quote-endquote markers.

I've found that this advice is overcautious by far.

In macro definitions, the `laziness' of the expansion is critical (do I evaluate $# when the macro is defined, when it is first called, or by a submacro defined by this macro?), and the quote-endquote markers are the mechanism to control that timing. This is a delicate issue that every macro language capable of macro-defining macros runs into. My only advice is to read the page of the manual on how macro expansion occurs very carefully. The first sentence is a bit misleading, though, because the scan of the text is itself treated as a macro expansion, so one layer of quote-endquote markers is stripped, dnl is handled, et cetera. But because I am focused on writing my other-language text with support from m4, not building a towering m4 edifice, my concern with careful laziness control is not as great.
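
To make that timing concrete, here is a tiny example you can paste into m4 -P (using the default `...' quote markers, since we haven't changed them yet):

m4_define(`greet', `Hello')
m4_define(`eager', greet)    # greet is unquoted, so it expands now: eager is defined as Hello
m4_define(`lazy', `greet')   # greet is quoted, so it expands only when lazy is called
m4_define(`greet', `Goodbye')
eager   # prints Hello
lazy    # prints Goodbye

One layer of quote markers is what defers the expansion of lazy's body; that is the whole mechanism.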

So my approach, instead of putting hundreds of quotes and endquotes all over my document, is to know what the m4 specials are, and make sure they never appear in my text unless I made an explicit choice to put them there.

The specials

Outside of macro definitions themselves (where dollar signs matter), there are five sets of m4-special tokens. There's a way to handle each of them.

  • quote-endquote markers. How do you get a quote marker or an endquote marker into your text without m4 eating it? The short answer: you can't. So we need to change those markers to something that we are confident will never appear in text. I use <| and |>.

    The longer answer, by the way, is that you would use a sequence like

    m4_changequote(LEFT,RIGHT)<|m4_changequote(<|,|>)

    Easier to just go with something that will never be an issue.

  • Named macros. You are probably using GNU m4—I personally have yet to encounter a non-GNU implementation. GNU m4 has a -P option that isn't POSIX standard, but is really essential. It puts an m4_ before every macro defined by m4: define $\Rightarrow$ m4_define, dnl $\Rightarrow$ m4_dnl, ifdef $\Rightarrow$ m4_ifdef, and so on. We're closer to worry free: there could easily be the word define in plain text, but the string m4_define only appears in macro definitions and blog entries about m4.

    We can also limit the risk of accidentally calling a macro by defining it so that it expands to something else only if it is followed by parens. The m4 documentation recommends defining a define_blind macro:

    m4_changequote(<|, |>)
    m4_define(<|define_blind|>, <|_define_blind(<|$1|>, <|$2|>, <|$|><|#|>, <|$|><|0|>)|>)
    m4_define(<|_define_blind|>, <|m4_define(<|$1|>, <|m4_ifelse(<|$3|>, <|0|>, <|<|$4|>|>,
                <|$2|>)|>)|>)
    
    sample usage:
    define_blind(test, Hellooo there.)
    
    test this mic: test()
    

    Start m4 -P from your command line and paste this in if you want to try it. You'll see that when test is used in plain text, it will be replaced with test; if parens follow it, it will be replaced with the macro expansion.

  • #comments. Anything after an octothorp is not expanded by m4, but is passed through to output. I think this is primarily useful for debugging. But especially since the advent of Twitter, hashtags appear in plain text all over the place, so suppress this feature via m4_changecom().

  • Parens. If parens aren't after a macro name, they are ignored. Balanced parens are always OK. The only annoyance is the case when you just want to have a haphazard unbalanced open- or end-paren in your text, ). You'll have to wrap it in quote-endquote markers <|)|> if it's inside of a macro call, or we can't expect m4 to know where the macro is truly supposed to end.

  • Commas. There's no way within m4 to change the comma separator. This can mess up the count of arguments in some cases, and m4 removes the space after any comma inside a macro, which looks bad in human text. My three-step solution:

    • Use sed, the stream editor, to replace every instance of , with <|,|>
    • Use sed to replace every instance of ~~ with a comma.
    • Pipe the output of sed to m4.

    That is, I use sed to turn ~~ into the m4 argument separator, so plain commas in the text are never argument separators themselves.

So, let's reduce the lessons from this list:

  • m4 -P
  • m4_changequote(<|, |>)
  • m4_changecom()
  • Use a unique not-a-comma for a separator, e.g., ~~
  • Use sed to replace all actual commas with <|,|>.

Is that too much to remember? Are you a bash or zsh user? Here's a function to paste onto the command line or your .bashrc or .zshrc:

m5 () { cat $* | sed 's/,/<|,|>/g' | sed 's/\~\~/,/g' | \
               m4 -P <(echo "m4_changecom()m4_changequote(<|, |>)") -
      }

Now you can run things like m5 myfile.m4 > myfile.py.

At this point, unless you are writing m4 macros to generate m4 macros, you can write your Python or HTML or what-have-you without regard to m4 syntax, because as long as you aren't writing m4_something, <|, |>, or ~~ in your text, m4 via this pipeline just passes your text through either to your defined macros or standard output without incident.

Are there ways to break it? Absolutely. Can you use these steps to more easily build macros upon macros upon macros? Yes, but that's probably a bad idea in any macro system. Can you use this to replace repetitive and verbose syntax with something much simpler, more legible, and maintainable? Yes, when implemented with appropriate common sense.

An example

Here is a sample use. C macro names have to be plain text—we can't use macro tricks when naming macros. But we can use m4 to write C macros without such restrictions. This example is not especially good form (srsly) but gives you the idea. Cut and paste this entire example onto your command line to create pants_src.c, pantsprogram, pants, and octopants.


#The above shell function again:
m5 () { cat $* | sed 's/,/<|,|>/g' | sed 's/\~\~/,/g' |\
             m4 -P <(echo "m4_changecom()m4_changequote(<|, |>)") - 
      }

# Write m4-imbued C code to a file
cat << '-- --' > pants_src.c

#include <stdio.h>

m4_define(Def_print_macro~~
  FILE *f_$1 = NULL;
  #define print_to_$1(expr, fmt)                   \
    {if (!f_$1) f_$1 = fopen("$1", "a+");          \
    fprintf(f_$1, #expr "== " #fmt "\n", (expr));  \
    }
)

int main(){
    Def_print_macro(pants)
    Def_print_macro(octopants)

    print_to_pants(1+1, %i);
    print_to_octopants(4+4, %i);

    char *ko="khaki octopus";
    print_to_octopants(ko, %s);
}

-- --

# compile. Use clang if you prefer.
# Or just call "m5 pants_src.c" to view the post-processed pure C file.
m5 pants_src.c | gcc -xc - -o pantsprogram

#Run and inspect the two output files.
./pantsprogram
cat pants
echo
cat octopants


3 January 15.

A version control tutorial with Git

link PDF version

This is the revision control chapter of 21st Century C, by me, published by O'Reilly Media. I had to sign over all rights to the book—three times over, for some reason I'm still not clear on. But I was clear throughout the contract negotiations of both first and second editions that I retain the right to publish my writing on this blog, and that I retain the movie rights. The great majority of the content in the book is available via the tip-a-day series from this post et seq, or the chapter-long post on parallel processing in C.

The chapter on revision control gets especially positive reviews. One person even offered to translate it into Portuguese; I had to refer him to O'Reilly and I don't know what happened after that. It's in the book because I think it'd be hard to be writing production C code in the present day without knowing how to pull code from a git repository. But in the other direction, this tutorial is not really C-specific at all.

So, here it is, with some revisions, in a free-as-in-beer format, to help those of you who are not yet habitual revision control users to become so. If you like this chapter, maybe let the book buying public know by saying something nice on Goodreads or Amazon. And if you think it'll make a good movie, give me a call.

This chapter is about revision control systems (RCSes), which maintain snapshots of the many different versions of a project as it develops, such as the stages in the development of a book, a tortured love letter, or a program.

Using an RCS has changed how I work. To explain it with a metaphor, think of writing as rock climbing. If you're not a rock climber yourself, you might picture a solid rock wall and the intimidating and life-threatening task of getting to the top. But in the modern day, the process is much more incremental. Attached to a rope, you climb a few meters, and then clip the rope to the wall using specialized equipment (cams, pins, carabiners, and so on). Now, if you fall, your rope will catch at the last carabiner, which is reasonably safe. While on the wall, your focus is not reaching the top, but the much more reachable problem of finding where you can clip your next carabiner.

Coming back to writing with an RCS, a day's work is no longer a featureless slog toward the summit, but a sequence of small steps. What one feature could I add? What one problem could I fix? Once a step is made and you are sure that your code base is in a safe and clean state, commit a revision, and if your next step turns out disastrously, you can fall back to the revision you just committed instead of starting from the beginning.

But structuring the writing process and allowing us to mark safe points is just the beginning:

  • Our filesystem now has a time dimension. We can query the RCS's repository of file information to see what a file looked like last week and how it changed from then to now. Even without the other powers, I have found that this alone makes me a more confident writer.

  • We can keep track of multiple versions of a project, such as my copy and my coauthor's copy. Even within my own work, I may want one version of a project (a branch) with an experimental feature, which should be kept segregated from the stable version that needs to be able to run without surprises.

  • GitHub has about 218,000 projects that self-report as being primarily in C as of this writing, and there are more C projects in other, smaller RCS repository hosts, such as GNU's Savannah. Even if you aren't going to modify the code, cloning these repositories is a quick way to get the program or library onto your hard drive for your own use. When your own project is ready for public use (or before then), you can make the repository public as another means of distribution.

  • Now that you and I both have versions of the same project, and both have equal ability to hack our versions of the code base, revision control gives us the power to merge together our multiple threads as easily as possible.

This chapter will cover Git, which is a distributed revision control system, meaning that any given copy of the project works as a standalone repository of the project and its history. There are others, with Mercurial and Bazaar the other front-runners in the category. There is largely a one-to-one mapping among the features of these systems, and what major differences had existed have merged over the years, so you should be able to pick the others up immediately after reading this chapter.

Changes via diff

The most rudimentary means of revision control is via diff and patch, which are POSIX-standard and therefore most certainly on your system. You probably have two files on your drive somewhere that are reasonably similar; if not, grab any text file, change a few lines, and save the modified version with a new name. Try:

diff f1.c  f2.c

and you will get a listing, a little more machine-readable than human-readable, that shows the lines that have changed between the two files. Piping output to a text file via diff f1.c f2.c > diffs and then opening diffs in your text editor may give you a colorized version that is easier to follow. You will see some lines giving the name of the file and location within the file, perhaps a few lines of context that did not change between the two files, and lines beginning with + and - showing the lines that got added and removed. Run diff with the -u flag to get a few lines of context around the additions and subtractions.
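
If you don't have a suitable pair of files handy, a throwaway example is enough to see the format (the file names and contents here are just for illustration):

printf 'int x = 1;\nint y = 2;\n' > f1.c
printf 'int x = 1;\nint y = 3;\n' > f2.c
diff -u f1.c f2.c    # prints the context line, then -int y = 2; and +int y = 3;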

Given two directories holding two versions of your project, v1 and v2, generate a single diff file in the unified diff format for the entire directories via the recursive (-r) option:

diff -ur v1 v2 > diff-v1v2

The patch command reads diff files and executes the changes listed there. If you and a friend both have v1 of the project, you could send diff-v1v2 to your friend, and she could run:

patch < diff-v1v2

to apply all of your changes to her copy of v1.

Or, if you have no friends, you can run diff from time to time on your own code and thus keep a record of the changes you have made over time. If you find that you have inserted a bug in your code, the diffs are the first place to look for hints about what you touched that you shouldn't have. If that isn't enough, and you already deleted v1, you could run the patch in reverse from the v2 directory, patch -R < diff-v1v2, reverting version 2 back to version 1. If you were at version 4, you could even conceivably run a sequence of diffs to move further back in time:

cd v4
patch -R < diff-v3v4
patch -R < diff-v2v3
patch -R < diff-v1v2

I say conceivably because maintaining a sequence of diffs like this is tedious and error-prone. Thus, the revision control system, which will make and track the diffs for you.

Git's Objects

Git is a C program like any other, and is based on a small set of objects. The key object is the commit object, which is akin to a unified diff file. Given a previous commit object and some changes from that baseline, a new commit object encapsulates the information. It gets some support from the index, which is a list of the changes registered since the last commit object, the primary use of which will be in generating the next commit object.

The commit objects link together to form a tree much like any other tree. Each commit object will have (at least) one parent commit object. Stepping up and down the tree is akin to using patch and patch -R to step among versions.

The repository itself is not formally a single object in the Git source code, but I think of it as an object, because the usual operations one would define, such as new, copy, and free, apply to the entire repository. Get a new repository in the directory you are working in via:

git init

OK, you now have a revision control system in place. You might not see it, because Git stores all its files in a directory named .git, where the dot means that all the usual utilities like ls will take it to be hidden. You can look for it via, e.g., ls -a or via a show hidden files option in your favorite file manager.

Alternatively, copy a repository via git clone. This is how you would get a project from Savannah or Github. To get the source code for Git using git:

git clone https://github.com/gitster/git.git

The reader may also be interested in cloning the repository with the examples for this book:

git clone https://github.com/b-k/21st-Century-Examples.git

If you want to test something on a repository in ~/myrepo and are worried that you might break something, go to a temp directory (say mkdir ~/tmp; cd ~/tmp), clone your repository with git clone ~/myrepo, and experiment away. Deleting the clone when done (rm -rf ~/tmp/myrepo) has no effect on the original.

Given that all the data about a repository is in the .git subdirectory of your project directory, the analog to freeing a repository is simple:

rm -rf .git

Having the whole repository so self-contained means that you can make spare copies to shunt between home and work, copy everything to a temp directory for a quick experiment, and so on, without much hassle.

We're almost ready to generate some commit objects, but because they summarize diffs since the starting point or a prior commit, we're going to have to have on hand some diffs to commit. The index (Git source: struct index_state) is a list of changes that are to be bundled into the next commit. It exists because we don't actually want every change in the project directory to be recorded. For example, gnomes.c and gnomes.h will beget gnomes.o and the executable gnomes. Your RCS should track gnomes.c and gnomes.h and let the others regenerate as needed. So the key operation with the index is adding elements to its list of changes. Use:

git add gnomes.c gnomes.h

to add these files to the index. Other typical changes to the list of files tracked also need to be recorded in the index:

git add newfile
git rm oldfile
git mv flie file

Changes you made to files that are already tracked by Git are not automatically added to the index, which might be a surprise to users of other RCSes (but see below). Add each individually via git add changedfile, or use:

git add -u

to add to the index changes to all the files Git already tracks.

At some point you have enough changes listed in the index that they should be recorded as a commit object in the repository. Generate a new commit object via:

git commit -a -m "here is an initial commit."

The -m flag attaches a message to the revision, which you'll read when you run git log later on. If you omit the message, then Git will start the text editor specified in the environment variable EDITOR so you can enter it (the default editor is typically vi; export that variable in your shell's startup script, e.g., .bashrc or .zshrc, if you want something different).

The -a flag tells Git that there are good odds that I forgot to run git add -u, so please run it just before committing. In practice, this means that you never have to run git add -u explicitly, as long as you always remember the -a flag in git commit -a.

A warning: It is easy to find Git experts who are concerned with generating a coherent, clean narrative from their commits. Instead of commit messages like “added an index object, plus some bug fixes along the way,” an expert Git author would create two commits, one with the message “added an index object” and one with “bug fixes.” These authors have such control because nothing is added to the index by default, so they can add only enough to express one precise change in the code, write the index to a commit object, then add a new set of items to a clean index to generate the next commit object. I found one blogger who took several pages to describe his commit routine: “For the most complicated cases, I will print out the diffs, read them over, and mark them up in six colors of highlighter…” However, until you become a Git expert, this will be much more control over the index than you really need or want. That is, not using -a with git commit is an advanced use that many people never bother with. In a perfect world, the -a would be the default, but it isn't, so don't forget it.
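
For concreteness, the two-commit routine looks something like this (the file names here are hypothetical):

git add index.c index.h                  # stage only the files for one logical change
git commit -m "added an index object"    # no -a, so only the staged changes are committed

git add gnomes.c                         # now stage the unrelated fix
git commit -m "bug fixes"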

Calling git commit -a writes a new commit object to the repository based on all the changes the index was able to track, and clears the index. Having saved your work, you can now continue to add more. Further—and this is the real, major benefit of revision control so far—you can delete whatever you want, confident that it can be recovered if you need it back. Don't clutter up the code with large blocks of commented-out obsolete routines—delete!

A useful tip: After you commit, you will almost certainly slap your forehead and realize something you forgot. Instead of performing another commit, you can run git commit --amend -a to redo your last commit.

An aside: Diff/Snapshot Duality

Physicists sometimes prefer to think of light as a wave and sometimes as a particle; similarly, a commit object is sometimes best thought of as a complete snapshot of the project at a moment in time and sometimes as a diff from its parent. From either perspective, it includes a record of the author, the name of the object (as we'll see later), the message you attached via the -m flag, and (unless it is the initial commit) a pointer to the parent commit object(s).

Internally, is a commit a diff or a snapshot? It could be either or both. There was once a time when Git always stored a snapshot, unless you ran git gc (garbage collect) to compress the set of snapshots into a set of deltas (aka diffs). Users complained about having to remember to run git gc, so it now runs automatically after certain commands, meaning that Git is probably (but by no means always) storing diffs. [end aside]

Having generated a commit object, your interactions with it will mostly consist of looking at its contents. You'll use git diff to see the diffs that are the core of the commit object and git log to see the metadata.

The key metadata is the name of the object, which is assigned via an unpleasant but sensible naming convention: the SHA1 hash, a 40-digit hexadecimal number that can be assigned to an object, in a manner that lets us assume that no two objects will have the same hash, and that the same object will have the same name in every copy of the repository. When you commit your files, you'll see the first few digits of the hash on the screen, and you can run git log to see the list of commit objects in the history of the current commit object, listed by their hash and the human-language message you wrote when you did the commit (and see git help log for the other available metadata). Fortunately, you need only as much of the hash as will uniquely identify your commit. So if you look at the log and decide that you want to check out revision number fe9c49cddac5150dc974de1f7248a1c5e3b33e89, you can do so with:

git checkout fe9c4

This does the sort of time-travel via diffs that patch almost provided, rewinding to the state of the project at commit fe9c4.

Because a given commit only has pointers to its parents, not its children, when you check git log after checking out an old commit, you will see the trace of objects that led up to this commit, but not later commits. The rarely used git reflog will show you the full list of commit objects the repository knows about, but the easier means of jumping back to the most current version of the project is via a tag, a human-friendly name that you won't have to look up in the log. Tags are maintained as separate objects in the repository and hold a pointer to a commit object being tagged. The most frequently used tag is master, which refers to the last commit object on the master branch (which, because we haven't covered branching yet, is probably the only branch you have). Thus, to return from back in time to the latest state, use:

git checkout master

Getting back to git diff, it shows what changes you have made since the last committed revision. The output is what would be written to the next commit object via git commit -a. As with the output from the plain diff program, git diff > diffs will write to a file that may be more legible in your colorized text editor.

Without arguments, git diff shows the diff between the index and what is in the project directory; if you haven't added anything to the index yet, this will be every change since the last commit. With one commit object name, git diff shows the sequence of changes between that commit and what is in the project directory. With two names, it shows the sequence of changes from one commit to the other:

git diff               #Show the diffs between the working directory and the index.
git diff --staged      #Show the diffs between the index and the previous commit.
git diff 234e2a        #Show the diffs between the working directory and the given commit object.
git diff 234e2a 8b90ac #Show the changes from one commit object to another.

A useful tip: There are a few naming conveniences to save you some hexadecimal. The name HEAD refers to the last checked-out commit. This is usually the tip of a branch; when it isn't, git error messages will refer to this as a “detached HEAD.”

Append ~1 to a name to refer to the named commit's parent, ~2 to refer to its grandparent, and so on. Thus, all of the following are valid:

git diff HEAD~4        #Compare the working directory to four commits ago.
git checkout master~1  #Check out the predecessor to the head of the master branch.
git checkout master~   #Shorthand for the same.
git diff b8097~ b8097  #See what changed in commit b8097.

At this point, you know how to:

  • Save frequent incremental revisions of your project.
  • Get a log of your committed revisions.
  • Find out what you changed or added recently.
  • Check out earlier versions so that you can recover earlier work if needed.

Having a backup system organized enough that you can delete code with confidence and recover as needed will already make you a better writer.

The Stash

Commit objects are the reference points from which most Git activity occurs. For example, Git prefers to apply patches relative to a commit, and you can jump to any commit, but if you jump away from a working directory that does not match a commit you have no way to jump back. When there are uncommitted changes in the current working directory, Git will warn you that you are not at a commit and will typically refuse to perform the operation you asked it to do. One way to go back to a commit would be to write down all the work you had done since the last commit, revert your project to the last commit, execute the operation, then redo the saved work after you are finished jumping or patching.

Thus we employ the stash, a special commit object mostly equivalent to what you would get from git commit -a, but with a few special features, such as retaining all the untracked junk in your working directory. Here is the typical procedure:

git stash # Code is now as it was at last checkin.
git checkout fe9c4

# Look around here.

git checkout master    # Or whatever commit you had started with
# Code is now as it was at last checkin, so replay stashed diffs with:
git stash pop

Another sometimes-appropriate alternative, when you want to discard the changes in your working directory rather than stash them, is git reset --hard, which takes the working directory back to the state it was in when you last checked out. The command sounds severe because it is: you are about to throw away all work you have done since the last checkout.

Trees and Their Branches

There is one tree in a repository, which got generated when the first author of a new repository ran git init. You are probably familiar with tree data structures, consisting of a set of nodes, where each node has links to some number of children and a link to a parent (and in exotic trees like Git's, possibly several parents).

Indeed, all commit objects but the initial one have a parent, and the object records the diffs between itself and the parent commit. The terminal node in the sequence, the tip of the branch, is tagged with a branch name. For our purposes, there is a one-to-one correspondence between branch tips and the series of diffs that led to that branch. The one-to-one correspondence means we can interchangeably refer to branches and the commit object at the tip of the branch. Thus, if the tip of the master branch is commit 234a3d, then git checkout master and git checkout 234a3d are entirely equivalent (until a new commit gets written, and that takes on the master label). It also means that the list of commit objects on a branch can be rederived at any time by starting at the commit at the named tip and tracing back to the origin of the tree.

The typical custom is to keep the master branch fully functional at all times. When you want to add a new feature or try a new thread of inquiry, create a new branch for it. When the branch is fully functioning, you will be able to merge the new feature back into the master using the methods to follow.

There are two ways to create a new branch splitting off from the present state of your project:

git branch newleaf       # Create a new branch...
git checkout newleaf     # then check out the branch you just created.
    # Or execute both steps at once with the equivalent:
git checkout -b newleaf

Having created the new branch, switch between the tips of the two branches via git checkout master and git checkout newleaf.

What branch are you on right now? Find out with:

git branch

which will list all branches and put a * by the one that is currently active.

What would happen if you were to build a time machine, go back to before you were born, and kill your parents? If we learned anything from science fiction, it's that if we change history, the present doesn't change, but a new alternate history splinters off. So if you check out an old version, make changes, and check in a new commit object with your newly made changes, then you now have a new branch distinct from the master branch. You will find via git branch that when the past forks like this, you will be on (no branch). Untagged branches tend to create problems, so if ever you find that you are doing work on (no branch), then run git branch -m new_branch_name to name the branch to which you've just splintered.
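
In command form, the splinter-and-name sequence described above is:

git checkout fe9c4                 # check out an old commit
# ...edit some files...
git commit -a -m "an alternate history"
git branch                         # reports that you are on (no branch)
git branch -m new_branch_name      # give the splintered branch a proper name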

Sidebar: Visual Aids

There are several graphical interfaces to be had, which are especially useful when tracing how branches diverged and merged. Try gitk or git gui for Tk-based GUIs, tig for a console (curses) based GUI, or git instaweb to start a web server that you can interact with in your browser, or ask your package manager or Internet search engine for several more.

Merging

So far, we have generated new commit objects by starting with a commit object as a starting point and applying a list of diffs from the index. A branch is also a series of diffs, so given an arbitrary commit object and a list of diffs from a branch, we should be able to create a new commit object in which the branch's diffs are applied to the existing commit object. This is a merge. To merge all the changes that occurred over the course of newleaf back into master, switch to master and use git merge:

git checkout master
git merge newleaf

For example, you have used a branch off of master to develop a new feature, and it finally passes all tests; then applying all the diffs from the development branch to master would create a new commit object with the new feature soundly in place.

Let us say that, while working on the new feature, you never checked out master and so made no changes to it. Then applying the sequence of diffs from the other branch would simply be a fast replay of all of the changes recorded in each commit object in the branch, which Git calls a fast-forward.

But if you made any changes to master, then this is no longer a simple question of a fast application of all of the diffs. For example, say that at the point where the branch split off, gnomes.c had:

short int height_inches;

In master, you removed the derogatory type:

int height_inches;

The purpose of newleaf was to convert to metric:

short int height_cm;

At this point, Git is stymied. Knowing how to combine these lines requires knowing what you as a human intended. Git's solution is to modify your text file to include both versions, something like:

<<<<<<< HEAD
int height_inches;
=======
short int height_cm;
>>>>>>> 3c3c3c

The merge is put on hold, waiting for you to edit the file to express the change you would like to see. In this case, you would probably reduce the five-line chunk Git left in the text file to:

int height_cm;

Here is the procedure for committing merges in a non-fast-forward, meaning that there have been changes in both branches since they diverged:

  1. Run git merge other_branch.
  2. In all likelihood, get told that there are conflicts you have to resolve.
  3. Check the list of unmerged files using git status.
  4. Pick a file to manually check on. Open it in a text editor and find the merge-me marks if it is a content conflict. If it's a filename or file position conflict, move the file into place.
  5. Run git add your_now_fixed_file.
  6. Repeat steps 3-5 until all unmerged files are checked in.
  7. Run git commit to finalize the merge.

Take comfort in all this manual work. Git is conservative in merging and won't automatically do anything that could, under some storyline, cause you to lose work.

When you are done with the merge, all of the relevant diffs that occurred in the side branch are represented in the final commit object of the merged-to branch, so the custom is to delete the side branch:

git branch -d other_branch

The other_branch tag is deleted, but the commit objects that led up to it are still in the repository for your reference.

The Rebase

Say you have a main branch and split off a testing branch from it on Monday. Then on Tuesday through Thursday, you make extensive changes to both the main and testing branch. On Friday, when you try to merge the test branch back into the main, you have an overwhelming number of little conflicts to resolve.

Let's start the week over. You split the testing branch off from the main branch on Monday, meaning that the last commits on both branches share a common ancestor of Monday's commit on the main branch. On Tuesday, you have a new commit on the main branch; let it be commit abcd123. At the end of the day, you replay all the diffs that occurred on the main branch onto the testing branch:

git checkout testing  # get on the testing branch
git rebase abcd123  # or equivalently: git rebase main

With the rebase command, all the changes made on the main branch since the common ancestor are replayed on the testing branch. You might need to manually merge things, but by only having one day's work to merge, we can hope that the task of merging is more manageable.

Now that all changes up to abcd123 are present in both branches, it is as if the branches had actually split off from that commit, rather than Monday's commit. This is where the name of the procedure comes from: the testing branch has been rebased to split off from a new point on the main branch.

You also perform rebases at the end of Wednesday, Thursday, and Friday, and each of them is reasonably painless, as the testing branch kept up with the changes on the main branch throughout the week.

Rebases are often cast as an advanced use of Git, because other systems that aren't as capable with diff application don't have this technique. But in practice rebasing and merging are about on equal footing: both apply diffs from another branch to produce a commit, and the only question is whether you are tying together the ends of two branches (in which case, merge) or want both branches to continue their separate lives for a while longer (in which case, rebase). The typical usage is to rebase the diffs from the master into the side branch, and merge the diffs from the side branch into the master, so there is a symmetry between the two in practice. And as noted, letting diffs pile up on multiple branches can make the final merge a pain, so it is good form to rebase reasonably often.
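
Day to day, that symmetry looks something like this (the branch names are placeholders):

git checkout side_branch     # while developing on the side branch...
git rebase master            # ...periodically replay master's new commits onto it

git checkout master          # when the side branch is done...
git merge side_branch        # ...merge it back into master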

Remote Repositories

Everything to this point has been occurring within one tree. If you cloned a repository from elsewhere, then at the moment of cloning, you and the origin both have identical trees with identical commit objects. However, you and your colleagues will continue working, so you will all be adding new and different commit objects.

Your repository has a list of remotes, which are pointers to other repositories related to this one elsewhere in the world. If you got your repository via git clone, then the repository from which you cloned is named origin as far as the new repository is concerned. In the typical case, this is the only remote you will ever use.

When you first clone and run git branch, you'll see one lonely branch, regardless of how many branches the origin repository had. But run git branch -a to see all the branches that Git knows about, and you will see those in the remote as well as the local ones. If you cloned a repository from Github, et al, you can use this to check whether other authors had pushed other branches to the central repository.

Those copies of the branches in your local repository are as of the first time you pulled. Next week, to update those remote branches with the information from the origin repository, run git fetch.

Now that you have up-to-date copies of the remote branches in your repository, you could merge one with the local branch you are working on using the full name of the remote branch, for example, git merge remotes/origin/master.

Instead of the two-step git fetch; git merge remotes/origin/master, you can update the branch via

git pull origin master

which fetches the remote changes and merges them into your current repository all at once.

The converse is push, which you'll use to update the remote repository with your last commit (not the state of your index or working directory). If you are working on a branch named bbranch and want to push to the remote with the same name, use:

git push origin bbranch

There are good odds that when you push your changes, applying the diffs from your branch to the remote branch will not be a fast-forward (if it is, then your colleagues haven't been doing any work). Resolving a non-fast-forward merge typically requires human intervention, and there is probably not a human at the remote. Thus, Git will allow only fast-forward pushes. How can you guarantee that your push is a fast-forward?

  • Run git pull origin bbranch to get the changes made since your last pull.
  • Merge as seen earlier, wherein you as a human resolve those changes a computer cannot.
  • Run git commit -a -m "dealt with merges".
  • Run git push origin bbranch, because now Git only has to apply a single diff, which can be done automatically.

To this point, I have assumed that you are on a local branch with the same name as the remote branch (probably master on both sides). If you are crossing names, give a colon-separated pair of source:destination branch names.

git pull origin new_changes:master #Merge remote new_changes into local master
git push origin my_fixes:version2  #Merge the local branch into a differently named remote.
git push origin :prune_me          #Delete a remote branch.
git pull origin new_changes:       #Pull to no branch; create a commit named FETCH_HEAD.

None of these operations change your current branch, but some create a new branch that you can switch to via the usual git checkout.

Sidebar: The Central Repository

Despite all the discussion of decentralization, the easiest setup for sharing is still to have a central repository that everybody clones, meaning that everybody has the same origin repository. This is how downloading from Github and Savannah typically works. When setting up a repository for this sort of thing, use git init --bare, which means that nobody can actually do work in that directory, and users will have to clone to do anything at all. There are also some permissions flags that come in handy, such as --shared=group to allow all members of a POSIX group to read and write to the repository.
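
As a sketch, setting up such a central repository might look like this (the host and paths are hypothetical):

# On the shared host:
mkdir /srv/git/project.git && cd /srv/git/project.git
git init --bare --shared=group

# Each author then clones and works locally:
git clone ssh://host/srv/git/project.git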

You can't push to a branch in a nonbare remote repository that the repository owner has checked out; doing so will cause chaos. If this happens, ask your colleague to git branch to a different branch, then push while the target branch is in the background.

Or, your colleague can set up a public bare repository and a private working repository. You push to the public repository, and your colleague pulls the changes to his or her working repository when convenient. [end sidebar]

The structure of a Git repository is not especially complex: there are commit objects representing the changes since the parent commit object, organized into a tree, with an index gathering together the changes to be made in the next commit. But with these elements, you can organize multiple versions of your work, confidently delete things, create experimental branches and merge them back to the main thread when they pass all their tests, and merge your colleagues' work with your own. From there, git help and your favorite Internet search engine will teach you many more tricks and ways to do these things more smoothly.


9 December 14.

A table of narratives and distributions

link PDF version

Remember your first probability class? At some point, you covered how, if you take sets of independent and identically distributed (iid) draws from a pool of items, it can be proven that the means of those draws will be Normally distributed.

But if you're like the great majority of the college-educated population, you never took a probability class. You took a statistics class, where the first few weeks covered probability, and the next few weeks covered other, much more structured models. For example, the derivation of the linear regression formula is based on assumptions regarding an affine relation between variables and the minimization of an objective that is sensible, but just as sensible as many other possible objectives.

Another way to put it is that the Normal Distribution assumptions are bottom-up, with statements about each draw that lead to an overall shape, while the linear regression is top-down, assuming an overall shape and deriving item-level information like the individual error terms from the global shape.

There are lots of bottom-up models with sound microfoundations, each a story of the form if each observation experienced this process, then it can be proven that you will observe a distribution like this: Polya urns, Poisson processes, orthogonal combinations of the above. In fact, I'm making a list.

Maybe you read the many posts on this blog [this post, et seq] about writing and using functions to transform well-defined models into other well-defined models. A chain of such transformations can lead to an increasingly nuanced description of a certain situation. But you have to start the chain somewhere, and so I started compiling this list.

I've been kicking around the idea of teaching a probability-focused stats class (a colleague who runs a department floated the idea), and the list of narrative/distribution pairs linked above would be the core of the first week or two. You may have some ideas of where you'd take it from here; me, I'd probably have the students code up some examples to confirm that convergence to the named distribution occurs, which leads to discussion of fitting data to closed-form distributions and testing claims about the parameters; and then start building more complex models from these basic models, which would lead to more theoretical issues like decomposing joint distributions into conditional parts, and estimation issues like Markov Chains. Every model along the way would have a plausible micro-story underlying it.

This post is mostly just to let you know that the list of narrative/distribution pairs mentioned above exists for your reference. But it's also a call for your contributions, if your favorite field of modeling includes distributions I haven't yet mentioned, or if you otherwise see the utility in expanding the text further.

I've tried to make it easy to add new narrative/distribution pairs. The project is hosted on GitHub, which makes collaboration pretty easy. You don't really have to know a lot about git (it's been on my to-do list to post the git chapter of 21st Century C on here, but I'm lazy). If you have a GitHub account and fork a copy of the repository underlying the narrative/distribution list, you can edit it in your web browser; just look for the pencil icon.

Technical

The formatting looks stellar on paper and in the web browser, if I may say so, including getting the math and the citations right. There isn't a common language that targets both screen and paper for this sort of application, so I invented one, intended to be as simple as possible:
Items(
∙ Write section headers like Section(Title here)
∙ Write emphasized text like em(something important)
∙ Write citation tags like Citep(fay:herriot)
∙ and itemized lists like this.
)

See the tech guide associated with the project for the full overview.

Pandoc didn't work well for me, and Markdown gets really difficult when you have technical documents. When things like ++ and * are syntactically relevant, mentioning C++ will throw everything off, and the formatting of y = a * b * c will be all but a crapshoot.

On the back end, for those who are interested, these formatting functions are m4 macros that expand to LaTeX or HTML as needed. I wrote the first draft of the macros for this very blog, around June 2013, when I got tired of all the workarounds I used to get LaTeX2HTML to behave, and started entrusting my math rendering to MathJax. The makefiles prep the documents and send them to LaTeX and BibTeX for processing, which means that you'll need to clone the repository to a box with make and LaTeX installed to compile the PDF and HTML.
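
The real macro definitions live in the repository; as a rough sketch of the mechanism (not the project's actual code), a formatting macro can test a flag and emit either LaTeX or HTML. Paste this into m4 -P to try it:

m4_changequote(<|, |>)
m4_define(<|em|>, <|m4_ifdef(<|HTML|>, <|<em>$1</em>|>, <|\emph{$1}|>)|>)

em(something important)    # gives \emph{something important}
m4_define(<|HTML|>)        # define the flag for the HTML build
em(something important)    # now gives <em>something important</em>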

But the internals are no matter. This is a document that could, with further contributions from you, become a very useful reference for somebody working with probability models—and not just students, because, let's all admit it, working practitioners don't remember all of these models. It is implemented using a simple back-end that could be cloned off and used for generating collaborative technical documents of equal (or better!) quality in any subject.


5 November 14.

Why I capitalize distribution names

link PDF version

There are two ways to think about what a parameterized statistical distribution is.

$\def\Re{{\mathbb R}}$ As a single point: Here, the Normal distribution is a mapping of the form $f:(x, \mu, \sigma) \to \Re^+$. More specifically, it is $f(x, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi} } \exp(-\frac{(x-\mu)^2}{2\sigma^2})$. Within the infinite space of functions, this is a single point. We often fix certain parameters, and get a function of fewer dimensions, like $f(x, \mu=0, \sigma=1) = \frac{1}{\sqrt{2\pi} } \exp(-\frac{x^2}{2})$.

As a family: under this perspective, when we fix, say, $\mu=2$, $\sigma=1$, we get a Normal Distribution. When we fix $\mu=3$, $\sigma=1$, we get a different Normal Distribution. Here, there is a meta-function of the form $N:(\mu, \sigma) \to (f:x\to\Re^+)$, which defines a family of functions, and produces a series of Normal distribution functions depending on the values to which $\mu$ and $\sigma$ have been fixed.

Both of these approaches are coherent, and if you go with either, I respect you fully. Almost any Wikipedia page about a distribution will jump back and forth between these two interpretations, so at the end it's impossible to say whether a Normal Distribution has the form $f(x, \mu, \sigma)$ or $f(x)$. But any given Wikipage is edited by several people, so finding anacoluthons on Wikipedia is something of a fish-in-a-barrel exercise. But my web analytics software tells me that a large percentage of the readers of this blog are individual human beings; if that's you, I recommend picking one interpretation or the other and sticking with it.

I prefer the single-point characterization over the family. At the least, the meta-function is confusing, and implies the two-step estimation process of fixing the parameters, then grabbing a data point. This is a certain type of workflow that may or may not be what we want.

Of course, this gets into the Bayesian versus Frequentist debate. The stereotypical Frequentist believes that there is a true value of $(\mu, \sigma)$, and our job is to find it. This more closely aligns with a search for a single Normal distribution in the family of Normals. The stereotypical Bayesian doesn't know what to believe, and thinks that reality may even be an amalgam of many different values of $(\mu, \sigma)$. Either perspective works under either the single-point or family interpretation—as they say, mathematics is invariant under changes in notation—but the Frequentist approach more closely aligns with the two-step estimation process of the family interpretation, and the Bayesian approach is much easier to express under the single-point interpretation. My earlier post about Bayesian updating, with frequent integrals of $f(x, \mu, \sigma)$ over parameters certainly would have been more awkward via the family interpretation.
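
To illustrate with the single-point notation: given a prior (call it $\pi(\mu, \sigma)$), a Bayesian update is simply

$$\pi(\mu, \sigma | x) = \frac{f(x, \mu, \sigma)\, \pi(\mu, \sigma)}{\int\int f(x, m, s)\, \pi(m, s)\, dm\, ds},$$

where every term is an ordinary function evaluation. Under the family interpretation, the same numerator would have to be written as $N(\mu, \sigma)(x)\,\pi(\mu, \sigma)$, with the meta-function's output then evaluated at $x$, which is an extra layer of indirection.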

Grammatically, this has a clear implication. If the Normal Distribution is the name for that single expression up there, then its name should be capitalized as a proper noun, like London or Jacob Bernoulli, which are also unique entities. If a normal distribution is one of a family of functions, then it is a class of entities, like cities or people, and should be lower case.

By the way, I used to write “Normal distribution,” but none of the style books would be OK with that. The C in London City is capitalized; same with the D in Normal Distribution.

There's a bonus of consistency, because so many statistical models are capitalized anyway:

  • Gaussian
  • Poisson
  • OLS
  • F distribution
  • Normal distribution [because the normal distribution can easily confuse the reader (and I prefer it over Gaussian because I'll always choose descriptive over appellative).]

At which point, the few distributions that would be lower-cased under the family interpretation start to stand out and look funny.


3 November 14.

The formalization of the conversation in a social network

link PDF version

Last time, I ran with the definition of an academic field as a social network built around an accepted set of methods. The intent there was to counter all the dichotomies that are iffy or even detrimental (typically of a form pioneered by Richard Pryor: "Our people do it like this, but their people do it like this".)

This time, I'm going to discuss peer-reviewed journals from this perspective, to clarify all the things journals aren't. The short version: if journals are the formalized discussion of a social network built around a certain set of methods, then we can expect that the choice of what gets published will be based partly on relatively objective quality evaluation and partly on social issues. It's important to acknowledge both.

Originally, journals were literally the formalized discussions within a social network. Peer review was (and still is) a group of peers in a social network deciding whether a piece of formalized discussion is going to be useful and appropriate to the group.

An idea that exists only in my head is worthless—somebody somewhere has to hear it, understand it, and think about using it. Because a journal is a hub for the social network built around a known set of tools, I have a reasonable idea of which journal to pick given the methods I used, and what tools readers will be familiar with; readers who prefer certain methods know where to look to learn new things about those methods. So journals curate and set social norms, both of which are important to the process of communicating research.

Factual validity

Something that is incorrect will be useless or worse; work that is sloppily done is unlikely to be useful. So an evaluation of utility to the social network requires evaluating basic validity.

Among non-academics, I get the sense that this is what the peer review process is perceived to be about: that a paper that is peer reviewed is valid; one that isn't is up for debate.

If you think about this for a few seconds, this is prima facie absurd. The reviewers are one or two volunteers who will only put a few hours into this. Peer reviewers do not visit the lab of the paper author and check all the phosphate was cleaned out of the test tubes. They rarely double-code the statistics work to make sure that there are no bugs in the code. If there is a theorem with a four-page proof in the appendix, you've got low odds that any reviewer read it. I have on at least one occasion directly stated in a review that I did not have time to check the proof in the appendix and this has never seemed to affect the editors' decisions either way.

The most you can expect from a few hours of peer review is a (nontrivial and important) verification that the author hasn't missed anything that a person having ordinary skill in the art would catch. Deeper validity comes from a much deeper inquiry that is more likely to happen outside the formalized discussion of a journal.

Prestige

If a journal is the formalized discussion of a social network built around a certain set of methods, we see why journal publications are the gold standard in tenure reviews and other such very important affairs. Academics don't get hired for their ability to discover Beautiful Truths, they get hired for their ability to convince grant making bodies to give grants, to convince grad students and potential new hires to attend this department, and so on. These things require doing good work that has social sway. Each journal publication is a statement that there is a well-defined group of peers who think of your work positively, and publications in more far-reaching journals indicate a more far-reaching network of peers.

Choice of inquiry

Sorry if that sounds cynical, but even in mathematics, whose infinite expanse exists outside of human society, the choice of which concepts are most salient and which discoveries are truly important is made by people based on what other people also find to be salient.

Maybe you're familiar with the Beauty Contest, which was a story Keynes made up to explain how money works: the newspaper publishes photos of a set of gals, and readers mail in their vote, not for the one who is most beautiful, but for the one who they expect will win the contest. Who you like doesn't matter—it's about who you think others will like. No wait, that isn't it either: what's important is who you think other people will think other people will like. Infinite regress ensues.

When you're chatting with a circle of friends, you don't pick topics that are objectively interesting—that's meaningless. You pick topics of conversation that you expect will be of interest to your friends. Now let's say that you know that after the meeting, your friends will go to RateMyFriends.com and vote on how interesting you would be to other potential friends. Then you will need to pick topics that your friends think will be of interest to other potential friends. You're well on your way to the Beauty Contest (depending on the rating strategy used by raters on RateMyFriends).

The Beauty Contest easily leads to bland least-common-denominator output. You're going to pick the most typically attractive looking gal out of the newspaper, and are going to avoid conversation topics that most would find quirky or odd.

What if day-glo '80s leggings are trendy this year? You might pick the gal in fluorescent lime green not because her attire is objectively attractive (a view which I really can't endorse), but because the setup of the Beauty Contest pushes you to select contestants who follow the current trends. It's not hard to find examples, especially in the social sciences, where a subject takes on its own life, as this quarter's edition publishes papers that respond to last quarter's papers, that are primarily a response to the quarter before.

Diversity

Even the part where we get a fresh pair of eyes to notice the things the author missed or easy-to-spot blunders is limited, because we're still asking peers. If you ask an anthropologist to read an Econ paper, the anthropologist will tear apart the fundamental assumptions; if you ask an economist to read an Anthro paper, she'll tear apart the fundamental assumptions.

But because journals are the formalized discussions of already-formed social networks, we can't expect a lot of cross-paradigm discussion in the journals or in-depth critiques of the social network's fundamental assumptions.

In the software development industry (which often refers to itself as `the tech industry'), you'll find more than enough long essays about the myth of meritocracy. To summarize: even in an industry that is clearly knowledge-heavy and where there are reasonably objective measures of ability, homophily is still a common and relevant factor. Given that fact of life, promoting the network as a meritocracy does a disservice, implying that whoever won out must have done so because they are the best here in this, the best of all possible worlds. If a person didn't get hired, or their code didn't get used, then it must be because the person or the code didn't have as much merit as the winner. The possibility that the person who wasn't picked does better work but wasn't as good a cultural fit as the person who got picked is downplayed.

Academics, in my subjective opinion, are much more likely to be on guard against creeping demographic uniformity. But an academic field is a social network built around an accepted set of tools, and this definition directly constrains the breadth of methodological diversity. Journals will necessarily reflect this.

The fiction of journals as absolute meritocracy still exists, especially among non-academics who have never submitted to a journal and read an actual peer review, and it has the same implications, that if a work doesn't sparkle to the right peers in the right social network, it must be wrong. And it's especially untrue in the present day, when more good work is being done than there is space in traditional paper journals to print it all.

Conclusion segment

I do think that there is much meritocracy behind a journal. A journal editor is the social hub of a network, so you could perhaps socialize your way into such a job, but you're going to kill the journal if you can't hold technical conversations with any author about any aspect of the field. As a journal reviewer, I have seen a good number of papers that can be established as fatally flawed even after a quick skim. But I would certainly like to see a world where the part about improving the quality of inquiry and the part about gaining approval by a predefined set of peers is more separated than it is now.

Social networks aren't going away, so journals supporting them won't go away. But there are many efforts being made to offer alternatives. It's a long list, but the standouts to me are the Arxiv and the SSRN (Social Science Research Network). These are sometimes described as preprint networks, implying that they are just a step along the way to actual peer-reviewed publication, but if the approval of a social network is not essential for your work, then maybe it's not necessary to take that step. Especially in the social sciences, where review times can sometimes be measured in years, these preprint networks are increasingly cited as the primary source. Even the Royal Society, who started this whole journal thing when it was a homophilic society in the 1600s, has an open journal that “...will allow the Society to publish all the high-quality work it receives without the usual restrictions on scope, length or [peer expectations of] impact.''

PS: Did you know I contribute to another blog on social science and public policy? In this entry and its follow-up I discuss other aspects of the journal system. I wrote it during last year's government shutdown, when I had a lot of free time.


30 October 14.

The difference between Statistics and Data Science

link PDF version

An academic field is a social network built around an accepted set of methods.

Economics has grown into the study of human decision making in all sorts of aspects. At this point, nobody finds it weird that some of the most heavily-cited papers in the Economics journals are about the decision to commit crimes or even suicide. These papers use methods accepted by economists, writing down utility functions and using certain mathematical techniques to extract information from these utility functions. Anthropologists also study suicide and crime, but using entirely different methods. So do sociologists, using another set of tools. To which journal you submit your paper on crime depends on matching the methods you use to the methods readers will be familiar with, not on the subject.

A notational digression: I hate the term `data science'. First, there's a general rule (that has exceptions) that anybody who describes what they're doing as “science'' is not a scientist—self-labelling like that is just trying too hard. And are we implying that other scientists don't use data? Is it the data or the science that statisticians are lacking? Names are just labels, but I'll hide this one under an acronym for the rest of this. I try to do the same with the United States DHS.

I push that the distinction is about the set of salient tools because I think it's important to reject other means of cleaving apart the Statistics and DS networks. Some just don't work well, and some are as odious as any other `our people do it like this, but those people do it like that' kind of generalization. These are claims about how statisticians are too interested in theory and too willing to assume a spherical cow, or that DSers are too obsessed with hype and aren't careful with hypothesis testing. Hadley explains that “...there is little work [in Statistics] on developing good questions, thinking about the shape of data, communicating results or building data products'', which is a broad statement about the ecosystem that a lot of statisticians would dispute, and a bit odd given that he is best known for building tools to help statisticians build data products. It's not hard to find people who say that DS is more applied than Stats, which is an environment issue that is hard to quantify and prone to observation bias. From the comment thread of this level-headed post: “I think the key differentiator between a Data Scientist and a Statistician is in terms of accountability and commitment.''

Whatever.

We can instead focus on characterizing the two sets of tools. What is common knowledge among readers of a Stats journal and what is common knowledge among readers of a DS journal?

It's a subjective call, but I think it's uncontroversial to say that the abstract methods chosen by the DSers rely more heavily on modern computing technique than commonly-accepted stats methods, which tend to top out in computational sophistication around Markov Chain Monte Carlo.
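To give a concrete sense of that ceiling, here is a bare-bones random-walk Metropolis sampler, one of the standard MCMC workhorses. This is only an illustrative sketch in Python; the target density, step size, and draw count are arbitrary choices of mine, not taken from any particular paper or package.

    import numpy as np

    rng = np.random.default_rng(7)

    def log_target(x):
        # Unnormalized log-density of a standard normal, as a stand-in target.
        return -0.5 * x**2

    def metropolis(n_draws, step=1.0, x0=0.0):
        # Random-walk Metropolis: propose a jump, then accept it or keep the old point.
        draws = np.empty(n_draws)
        x = x0
        for i in range(n_draws):
            proposal = x + rng.normal(scale=step)
            # Accept with probability min(1, target(proposal)/target(x)),
            # computed on the log scale to avoid overflow.
            if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
                x = proposal
            draws[i] = x
        return draws

    samples = metropolis(10_000)
    print(samples.mean(), samples.std())   # should be roughly 0 and 1

The whole thing is a loop, a random draw, and a comparison; the computational demands that distinguish the DS toolkit start well above this level.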

One author went to the extreme of basically defining DS as the practical problems of data shunting and building Hadoop clusters. I dispute that any DSer would really accept such a definition, and even the same author effectively retracted his comment a week later after somebody gave him an actual DS textbook.

If you want to talk about tools in the sense of using R versus using Apache Hive, the conversation won't be very interesting to me but will at least be a consistent comparison on the same level. If we want to talk about generalized linear models versus support vector machines, that's also consistent and closer to what the journals really care about.

The basic asymmetry, that the price of admission for using DS techniques is greater computational sophistication, will indeed have an effect on the people involved. If we threw a random bunch of people at these fields, those more comfortable with computing would sort themselves into DS and those less comfortable into Stats. We would wind up with two overlapping bell curves of computing ability, such that it is not challenging to find a statistician-DSer pair where the statistician is the better programmer, but in expectation a randomly drawn DSer writes better code than a randomly drawn statistician. So there's one direct corollary of the two accepted sets of methods.
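A quick simulation makes the overlapping-bell-curves point concrete. Everything here is an invented assumption for illustration: normal distributions, equal spread, and a half-standard-deviation gap in mean computing ability.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Hypothetical "computing ability" scores; the numbers are made up.
    stats_ability = rng.normal(loc=0.0, scale=1.0, size=n)
    ds_ability = rng.normal(loc=0.5, scale=1.0, size=n)

    # In expectation, a randomly drawn DSer scores higher...
    print(ds_ability.mean() - stats_ability.mean())   # about 0.5

    # ...yet in a randomly paired statistician and DSer, the statistician
    # is still the better programmer a good chunk of the time.
    print((stats_ability > ds_ability).mean())         # roughly 0.36

Both facts hold at once: the averages differ, and the overlap is large enough that counterexample pairs are everywhere.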

Three Presidents of the ASA wrote on the Stats vs DS thing, and eventually faced the same technical asymmetry:

Ideally, statistics and statisticians should be the leaders of the Big Data and data science movement. Realistically, we must take a different view. While our discipline is certainly central to any data analysis context, the scope of Big Data and data science goes far beyond our traditional activities.

This technical asymmetry is a real problem for the working statistician, and statisticians are increasingly fretting about losing funding—and for good reason. Methods we learned in Econ 101 tell us that an unconstrained set leads to an unambiguously (weakly) better outcome than a constrained set.
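For the record, that Econ 101 point fits on one line. With the obvious notation, if $M$ is the set of methods you can currently deploy, $M'$ is a set you could add, and $f$ scores how well a method handles the problem at hand, then

    \[ \max_{m \in M \cup M'} f(m) \;\geq\; \max_{m \in M} f(m). \]

The inequality is only weak because the added methods might never turn out to be the best choice, but they can never make the best available choice worse.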

If you're a statistician who is feeling threatened, the policy implications are obvious: learn Python. Heck, learn C—it's not that hard, especially if you're using my C textbook, whose second edition was just released (or Modeling with Data, which this blog is ostensibly based on). If you have the grey matter to understand how the F statistic relates to SSE and SSR, a reasonable level of computing technique is well within your reach. It won't directly score you publications (DSers can be as snobby about how writing code is a “mere clerical function'' as statisticians and the US Federal Circuit can be), but you'll have available a less constrained set of abstract tools.
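Since the F statistic just came up: under the common textbook convention that SSR is the regression (explained) sum of squares and SSE is the residual (error) sum of squares, the overall F statistic for a linear model with $p$ predictors and $n$ observations is

    \[ F = \frac{\mathrm{SSR}/p}{\mathrm{SSE}/(n-p-1)}, \]

which tests the null hypothesis that all slope coefficients are zero. (Some texts swap the two acronyms; either way, it is the ratio of explained to unexplained variation, each scaled by its degrees of freedom.)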

If you are in the DS social network, an unconstrained set of tools is still an unambiguous improvement over a constrained set, so it's worth studying what the other social network takes as given. Some techniques from the 1900s are best left in the history books, but now and then you find ones that are exactly what you need—you won't know until you look.

By focusing on a field as a social network built around commonly accepted tools, we see that Stats and DS have more in common than differences, and we can (please) throw out all of the bigotry that comes with searching for differences among the people or whatever environment is prevalent this week. What the social networks will look like and what the labels will be a decade from now is not something we can write a policy for (though, srsly, we can do better than “data science''). But as individuals we can strive to be maximally inclusive by becoming conversant in the techniques that the other social networks are excited by.

Next time, I'll have more commentary derived from the above definition of academic fields, then it'll be back to the usual pedantry about modeling technique.