24 May 17.

### Git subtrees

This continues the last entry about submodules. The two setups are not interchangeable: in the setup there, the repositories are tracked entirely separately, and you won't be able to read changes in the submodule in the parent's commit history. In the subtree setup here, your changes will appear in the parent's history, but we'll have a script that extricates the subtree-only changes should we need to push them back to the source of this subpart.

###### Prefixes

Here's another feature of git you might not have known about: git read-tree. It is normally used in the background as part of the merge process to read the tree of files in a commit and merge them with what's in your directory at the moment. But the cool part is that you can specify a --prefix giving the name of a subdirectory to read into. If you have a side-branch in your extant repository that you want to compare with what you have checked out right now, you could use mkdir tmp; git read-tree otherbranch --prefix=tmp -u to put the contents of otherbranch in a subdirectory for your side-by-side comparison to the main project. Without the -u, the subtree is put in the index but not the working dir, which makes sense for the original purpose of merging trees; with the -u it appears in the working directory as well. If you just want the tree in the working directory, do a git reset to clear out the index after the read-tree step.
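Here's a minimal sketch of that side-by-side comparison, run in a throwaway directory. The repository name `demo`, the file `notes.txt`, and the demo identity are all made up for the illustration; only `otherbranch` and `tmp` come from the text above.

```shell
set -e
cd "$(mktemp -d)"                      # scratch space; nothing here is precious
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q demo && cd demo
echo "version one" > notes.txt
git add notes.txt
git commit -q -m "First version"

# Make a side-branch whose copy of the file differs
git checkout -q -b otherbranch
echo "version two" > notes.txt
git commit -q -a -m "Second version, on otherbranch"
git checkout -q -                      # back to the branch we started on

# Read otherbranch's tree into tmp/, in both the index and working directory
git read-tree --prefix=tmp/ -u otherbranch
git reset -q                           # clear the index; tmp/ stays as untracked files

diff notes.txt tmp/notes.txt || true   # diff exits nonzero when the files differ
```

When you're done comparing, `rm -rf tmp` removes the copy; nothing about it was recorded in the repository.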

So you have another repository elsewhere. You could make it a side-branch by adding a remote (git remote add remotesub http://xxx), and then git fetching all the remote branches. Unlike typical branches, these fetched branches probably have nothing in common with master in your main project beyond the initial null commit. But now that the subproject lives in a side-branch, you can git read-tree --prefix=... the files into a subdirectory as above, and after progressing, merge changes from the subtree back to the side-branch. So the three-part flow of data is subtree <—> branch <—> origin repository.
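A sketch of the first two legs of that flow, using a local directory as a stand-in for the repository that lives elsewhere. The names `subproject`, `parent`, `sub_side`, and `fs_analyze` are hypothetical; with a real remote you would give its URL to `git remote add` instead of a path.

```shell
set -e
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the repository elsewhere
git init -q subproject
echo "ls" > subproject/directory_analyze
git -C subproject add directory_analyze
git -C subproject commit -q -m "Set up subproject"
sub_branch=$(git -C subproject symbolic-ref --short HEAD)   # master or main

# The main project
git init -q parent && cd parent
echo "Main project" > Readme
git add Readme
git commit -q -m "Set up parent"

# Register the other repository, fetch it, and make a local side-branch
git remote add remotesub ../subproject
git fetch -q remotesub
git branch sub_side "remotesub/$sub_branch"

# Read the side-branch into a subdirectory and record it in the parent
git read-tree --prefix=fs_analyze/ -u sub_side
git commit -q -m "Add subproject files under fs_analyze/"
ls fs_analyze
```

Note that the final commit is an ordinary commit in the parent: nothing in it says that fs_analyze came from sub_side.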

You'll see the --prefix option in a lot of manual pages, if you look for it. For merge, specify the subtree via the subtree merge strategy, like git merge -s subtree mergetarget (or name the directory explicitly with the strategy option, git merge -X subtree=thatsubdir mergetarget), though calling it a merge strategy seems like a misnomer to me.

I got this from the git book, which takes you through the details and flags.

With this prefix business, your checkout is recorded by the parent repository itself—there's no subsidiary .git directory. Check-ins will continue to affect the parent, unless you want to get really fancy about only pushing some check-ins to the side-branch and removing them from the parent's history.

Where's the metadata telling you that you built the subtree from a side-branch? It isn't there. If your side-branch is pulled from a remote origin, it knows where that is, as usual. But when you use --prefix to dump the side-branch's contents to a subdirectory, no annotation is made that this subdirectory is in any way special, and it's up to you to remember where it came from. Of course, you could make another script or makefile target to pull and push between the subdirectory and the repository branch.

###### git subtree

On to the git subtree command. It does all of the above, with a little more intelligence. When you pull a distant repository into a subdirectory, all the commits are replayed as part of the main repository, as if you had been working in that subdirectory all along. You will check in new commits to the parent repository, as a unified whole. But if you want to push back to the subproject's origin repository, git subtree will go commit by commit and push only the subtree-relevant parts of each (skipping those commits that didn't touch the subtree). This is a nice bit of work none of us want to replicate, but at the core of the script, it's using git read-tree --prefix=... and a subtree merge, just as we did manually above.

Where's the metadata? Some of it is in the log. Here's the machine-oriented log entry I got when I first added a subtree to one of my projects via git subtree add ...:

Add 'pbox_proof/' from commit '7d2895ab...7f2a5e76'

git-subtree-dir: pbox_proof
git-subtree-mainline: 14e46e69...1037f3
git-subtree-split: 7d2895ab0...77f2a5e76


The mainline and split references will be used by the subtree script internals, but you can see the name of the directory where the sub lives (twice, even), and you can easily grep for it. If you're going to be a heavy subtree user, the Internet recommends adding this to your .gitconfig in your home directory:

[alias]
ls-subtrees = !"git log | grep git-subtree-dir | awk '{ print $2 }'"

With this, git ls-subtrees finds each marker in the log and prints the name. Of course, I have to know to do this, and if you gave this repository to a colleague, they will too. Also, we're missing the metadata about where the origin is that we cloned from, which you'll again have to have in a makefile or readme (or you can clone to a side-branch and then push/pull from the side branch as above, but that seems like more effort than it's worth).

For all the details, the usual man page at man git-subtree can give you the switches for add, merge, push, pull. Split is cool, but I take it to be internals for merge/push/pull.

If we weren't the sort of people who use git, both the subtree and submodule setups would seem to be sufficient: you have all the tools you need to treat a subdirectory as a repository, either with its own .git branch or as a part of the parent's commit history, that could in either case be pushed to its origin as needed. But you have to know that the subtree is special and communicate how to handle it to colleagues, human to human. This sort of un-automation is incongruous to how we, the sort of people who take the trouble to read blogs about git, want things to work. We can partially solve the problem with post-tests like the git isclean script I mentioned last time or pre-commit hooks, but the efficient solution to the many problems discussed above and elsewhere may be to just add a readme file.

###### The demo script

Similar to last time, here's a demo script to cut and paste (probably section by section) to your command line. It builds a parent and sub repository, grafts the sub into the parent as a subtree, and pushes some changes made in the consolidated project back to the sub. Most of it is setup, identical to last time; the subtree action happens after the commit with the message "Set up parent".

mkdir tree_demo #everything will be in here. Clean up with rm -rf tree_demo
cd tree_demo

# Some admin junk; please ignore
bold_cyan="\033[1;36m"
no_color="\033[0m"
Divider="${bold_cyan}―――――${no_color}"
alias Print='echo -e \\n$Divider $*'
# Thanks for your patience

Print "This script demonstrates using git subtree to transfer some changes around
* create submod and larger_work repositories. This matches the setup from last time with submodules.
* add the sub as a subtree to larger_work
* make changes to the subtree
* push those to the subtree repository"
Print

Print Generate a repository that will be the subtree, with two files.

mkdir submod; cd submod
git init
echo "This repository provides a system to analyze the contents of a directory on a POSIX filesystem" > Readme
echo "ls" > directory_analyze
chmod +x directory_analyze
git add *
git commit -a -m "Set up submodule"

# Others can't push to a branch you have checked out, so switch to a fake branch
git checkout -b xx

Print Now set up the parent module. No subs yet.
cd ..
mkdir larger_work; cd larger_work
git init
echo "Find information about a file. Usage: info yr_file" > Readme
echo 'fs_analyze/directory_analyze | grep $1' > info
chmod +x info

git add *
git commit -a -m "Set up parent"

git subtree add -P fs_analyze ../submod master

Print Modify both the parent and child here; commit

echo "ls -l" > fs_analyze/directory_analyze

Print Here are the changes to commit:
git diff
git commit -a -m "Changes made"

Print Push to the original submodule repository
git subtree push -P fs_analyze ../submod master

Print "Go to the sub; see what change(s) got recorded in the sub"
cd ../submod
git checkout master


23 May 17.

### Git submodules

So you have two repositories and you want to treat one as a subpart of the other. Maybe you've segregated your project into the general-use library to distribute widely and the specific case that nobody else will care about. Maybe you have one person in your group who, for whatever reason, you want focused on only one subtask. Or your organization deems part of your project to be sensitive but you really want to work from home on the rest of it. I had one big project a few months ago, and thought it would be less unruly to start work on a new part in a separate repository—but then I had the problem of merging that part into the main while retaining the revision history.

By the way, the world (well, 11 people) wanted me to write this:

I found the abundance of the Web to be confusing for two reasons. First, there are ways to do it that are deemed archaic, whose instructions are still online, in reputable sources which haven't been updated yet. Second, there are two distinct threads, using different methods: do you want the subtree to remain its own repository with its own revision control, or do you want it to be tracked by the parent directory? The second option will be covered next time; here's discussion of the first.

#### Separate subdirectory

As a word of background, remember that which git repository you're in is determined by the first directory (self, parent, grandparent, ...) that has a hidden .git directory holding all the meta-junk. So if you have one git-controlled directory, and you make a subdirectory subtree and run git init in that subdirectory, then everything you do in subtree is tracked by its own git machinery, not that of the parent directory.
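To see that lookup rule in action, here's a small sketch. The directory names `parent`, `plain_dir`, and `subtree` are just for the demo; `git rev-parse --show-toplevel` reports the top of whichever repository git thinks you're in.

```shell
set -e
cd "$(mktemp -d)"

git init -q parent
cd parent
mkdir plain_dir subtree
(cd subtree && git init -q)     # subtree gets its own .git directory

# From an ordinary subdirectory, git walks upward and finds parent/.git
(cd plain_dir && git rev-parse --show-toplevel)

# From subtree, the search stops at subtree/.git and never reaches the parent
(cd subtree && git rev-parse --show-toplevel)
```

The first command prints the path to parent, the second the path to parent/subtree.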

So here's the easiest way to turn a distinct repository into a submodule of your parent repository: leave a note in your parent repository, Hey, thanks for cloning my repository. Now go run git clone http://XXX here in the parent repository, and check out revision abc123. Thx. I would formalize this note into a makefile; here's an example with discussion to follow:

Repository=http://example.com/subrepo.git
Version=abc123

run: sub/somefile.c
	[compilation instructions here]
	[now run]

sub/somefile.c:
	make init

init:
	git clone $(Repository) sub
	cd sub; git checkout $(Version)


The init target clones the subrepository and checks out the appropriate version. Because the run target depends on a file in the sub, make init gets run the first time anybody tries to make run. The cloning gives a subdirectory with its own .git machinery, so we've basically achieved the goal.

If you're not a makefile user, you surely have your own way of running scripts. It would be nice to have the scripts run automatically; hooks don't really do it because you can't check a hook into the repository such that it'll run when you clone the repository. We'll get another partial automation below.

How would you maintain this? If you need exactly revision abc123, and are never going to modify the subtree, then you have few needs and few problems. If you are OK tracking a branch like master, whatever state it is in, you have even fewer needs.

But if you are modifying and checking in the sub, now you have two repositories to worry about. First, when you make a change in the sub, you may forget that it's a sub, and commit changes in the parent. You could maybe install a hook to check in the sub when you check in the parent (and you could maybe have make init install that hook). I wrote a script to check whether a repository is clean (everything checked in, no detached head, no stashes, ...), which goes into submodules and checks their status as well; see this previous post.

Second, once you've checked in the sub, do you need to change the ID in the makefile (or other script) from Version=abc123 to Version=def456? If so, then now you have two check-ins to make: one for the sub you modified, then one to check in the revised makefile.

When you push to the origin repositories after updating the sub and parent's makefile, you'll have to do it in two separate steps. When your colleagues pull the sub, they may have to make sure that they are pulling the right version, or the parent has some way of correctly updating the sub after it gets pulled. If you pushed the parent, now referring to def456, and forgot to push that commit to your group's shared repository for the sub, your colleagues are going to find you and talk to you.

###### submodule

Moving on to formal git tools, there's git submodule. It's evidently an emulation of a feature of Subversion. One of the design principles for git was to do the opposite of whatever Subversion did, and it looks like this time they went with what Subversion did and people hated it.

You call the command with the name of a repository and a subdirectory to clone it into, and git clones the repository and stores the metadata (the origin repository and which subdirectory holds the sub) in a file named .gitmodules in the parent's base directory. This file is part of the repository (after you commit), and when you clone a copy with a .gitmodules file, you can run git submodule update --init to check out the sub as per the stored metadata.

But unlike the makefile, that .gitmodules file only has the paths, not the commit you've checked out. The commit ID, like abc123, is stored in the parent's own tree, as a special entry for the sub's directory, while the sub's git machinery lives under .git/modules/.... That the commit ID is internally stored instead of transparently held in a makefile or .gitmodules has pros and cons: you can't directly modify it, you can't set it to an exotic alternative to a commit ID (like, say, master), but the submodule system knows when the sub changes commit names, and you can just run git commit in the parent to update the internal annotation of the current sub.
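If you want to look at both pieces of metadata yourself, here's a sketch with a local stand-in for the origin repository. The names `submod`, `larger_work`, and `fs_analyze` are demo choices. One assumption worth flagging: recent git versions refuse to clone a submodule from a local path unless the file protocol is explicitly allowed, hence the `-c protocol.file.allow=always`; with a real URL you wouldn't need it.

```shell
set -e
cd "$(mktemp -d)"
top=$(pwd)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the origin repository of the sub
git init -q submod
echo "ls" > submod/directory_analyze
git -C submod add directory_analyze
git -C submod commit -q -m "Set up submodule"

# The parent, with the sub added as a submodule
git init -q larger_work && cd larger_work
git -c protocol.file.allow=always submodule add "$top/submod" fs_analyze
git commit -q -m "Add submodule"

cat .gitmodules               # path and url, but no commit ID
git ls-tree HEAD fs_analyze   # what the parent recorded for the subdirectory
```

The `git ls-tree` line shows an entry of type commit (mode 160000) whose ID is the sub's checked-out commit, which is the annotation that a `git commit` in the parent updates.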

As a digression, let me express one frustration from the submodule suite: git submodule update will "update the registered submodules to match what the superproject expects by cloning missing submodules and updating the working tree of the submodules." So it feels to me like it should've been named git submodule reset. If you think it updates the metadata the parent has, you're wrong (use git commit in the parent); if you think it takes an existing sub and updates it, you're wrong, because it ditches your last commit and pulls the one the parent knows. If you make this mistake (like I did a hundred times), use git reflog to look up the ID you need to get back to.
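Here's a sketch that walks into this trap on purpose, and back out. As before, `submod`, `larger_work`, and `fs_analyze` are demo names, and the `-c protocol.file.allow=always` is only there because the stand-in origin is a local path.

```shell
set -e
cd "$(mktemp -d)"
top=$(pwd)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q submod
echo "ls" > submod/directory_analyze
git -C submod add directory_analyze
git -C submod commit -q -m "Set up submodule"

git init -q larger_work && cd larger_work
git -c protocol.file.allow=always submodule add "$top/submod" fs_analyze
git commit -q -m "Add submodule"
recorded=$(git -C fs_analyze rev-parse HEAD)

# Commit new work in the sub, but don't commit anything in the parent
echo "ls -l" > fs_analyze/directory_analyze
git -C fs_analyze commit -q -a -m "Use long listing"
newwork=$(git -C fs_analyze rev-parse HEAD)

# "update" moves the sub back to the commit the parent recorded
git submodule update
after_update=$(git -C fs_analyze rev-parse HEAD)
echo "recorded: $recorded"
echo "after update: $after_update"

# The newer commit isn't lost: find its ID in the sub's reflog and check it out
git -C fs_analyze reflog
git -C fs_analyze checkout -q "$newwork"
```

Had the parent been committed after the change in the sub, the recorded ID would have been the new one and the update would have left things alone.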

I gave you the makefile example first because I think it's a useful mental model for how the two independent repositories will work, with a parent that tracks an independent sub using only metadata, which has to be kept up-to-date because there is a certain commit name the parent is tracking that needs to be updated, in plain text for the makefile approach and in the tracked metadata for the submodule system. The submodule version hides this behind some git chrome and a few commands that save you the trouble of typing mkdir and git clone and so on, but the model is similar enough that the pitfalls are similar: after you modify the sub, the parent is pointing to the wrong commit ID for the sub until you update/commit the parent, and it's up to you to make sure that colleagues can achieve the same sync.

It seems this is as good as it's going to get given the goods git gave us. We can leave some metadata notes in the parent, but it's still up to you and your colleagues to check that everything is in sync. We hate it when something isn't fully automated, but there it is.

Next time, I'll cover the case where you want the parent's commit history to include all the changes in the sub.

But meanwhile, here is a demo script for you to cut/paste onto the command line (probably piece by piece) to see how some work with these things might go. Or, this post has further tricks and syntax notes—scroll down past the pages of caveats to get to the part where the author describes the commands and options.

mkdir mod_demo  #everything will be in here. Clean up with rm -rf mod_demo
cd mod_demo

bold_cyan="\033[1;36m"
no_color="\033[0m"
Divider="${bold_cyan}―――――${no_color}"
alias Print='echo -e \\n$Divider $*'

Print "This script demonstrates using git submodule to transfer some changes around
* create submod and larger_work repositories
* add the sub as a submodule to larger_work
* clone larger_work to lw2
* make changes to the sub in lw2, push those to the submod repository
* go back to larger_work and try to recover those changes from lw2."
Print

Print Generate a repository that will be the submodule, with two files.

mkdir submod; cd submod
git init
echo "This repository provides a system to analyze the contents of a directory on a POSIX filesystem" > Readme
echo "ls" > directory_analyze
chmod +x directory_analyze

git add *
git commit -a -m "Set up submodule"

# Others can't push to a branch you have checked out, so switch to a fake branch
git checkout -b xx

Print Now set up the parent module. No subs yet.
cd ..
mkdir larger_work; cd larger_work
git init

###### Documentation by example

Documentation in this genre leans toward examples of the SDVW. Here are some bl.ocks, now cut/paste/modify them to what you want. I found this example-driven form to be amazingly consistent across packages and authors in this space. It's not just the official documentation and blog entries: I have access to Safari Books (self-promotion disclaimer: every time you read a page from 21st Century C there, they pay me a fraction of a cent), where I went hoping to find full expositions going beyond worked SDVW examples, but I instead found more detailed and extensive SDVWs. I'd check a question at Stack Overflow, and the answer would be a complete example preceded by 'here, try this.' This is in contrast to other times I've gone to Stack Overflow and gotten the usual paragraphs of mansplaining about a single line of code.

Example-driven documentation is a corollary to the lack of orthogonality. How does it make sense to modify the points in a scatterplot when you haven't even produced the plot? But because each SDVW is different, either the example does what you are trying to do and you've won, or it doesn't and you are right where you started.

Every package provides examples of the top three or four things you can do to modify the axes, but surprisingly few take the time to provide a boring page listing all the things you can do with the axes and how, or the page is sparse and descriptions point you back to the examples. Sometimes the answer on Stack Overflow would be a one-setting tweak which is never mentioned in the official documentation. I'm harping on documentation because I found it to be the biggest indicator of whether the authors were shooting for ease of initial use or ease of use, and on a cultural note it shows a clear difference between how dataviz package authors understand their users and how general data analysis package authors see their users.

That concludes part one of my travelogue. These systems do amazing things, but this is my confession that the full SDVW is still a slog to me. The standard data viz workflow is, to the best of my understanding, standard, yet being a tourist across packages required adapting to a new idiosyncratic way of walking through the SDVW every time. This may be because of the vagaries of how different base layers were designed, the (possibly hubris-driven) sales pitch that all you need is a top-layer command and you'll never need to change anything else, or the fundamental non-orthogonality of a data visualization.

8 August 16.

### Murphy bed projects

People in sitcoms have jobs. They have a routine that allows similar things to happen every week. If people in movies have jobs, the job is an irrelevance mentioned in passing, or they are full-time spy assassin hunters.

I want my work narrative to be about projects rather than a continuous stream of existence with no set ending. People in movies lead more interesting lives.

I've made a real effort to switch over to project-oriented thinking all the time, and it does feel better. I sit down to work, and I see a set of finite things to build, not a never-ending slog. Everything (including the admin stuff) is in its own repository to check out as needed.

###### Murphy bed projects

To give another metaphor: the Murphy beds you find in tiny studio apartments. Once the bed is folded down from the wall, it covers the space and there is nothing to do but be in bed; once the bed is folded back up into its closet, you don't think about the bed at all. I've had the joy of sleeping in a few, and I think they're great.

It clearly dovetails with the project-centric life. Work on the project until you hit your stopping point, then put up the Murphy bed, pack up the tool box, fold up the tent, or whatever other physical metaphor you want, and move on.

My home directory, conceptual view.

If you leave the cat on the Murphy bed when you fold it up, you will know. The process of being ready to fold everything away forces the discipline of stopping to ask what needs cleaning away, what threads of thought are still open, and what is to be done about it all.

If I'm going to check out the project fresh every morning, I need a makefile describing every setup detail—that's a good thing. It should have a make clean target to throw out the things I know are temp files or that can be regenerated—that's useful. I'm using revision control, which tracks some files and doesn't track others, so I have to decide early whether a given file is important enough to track, and what to do about it if not—that's so much better than my old routine of getting up to my ears in semi-important files and then feeling overwhelmed and shoving them into a temp directory I never look at again.

There's a definite trend toward being able to fold up even the entire virtual computer, which is stored in The Cloud or on your repository of virtual images. Personally, I'm not there yet, because even on an æsthetic level this isn't doing much for me beyond the usual PC-as-server setup that I have. I mean, you're always going to be typing into something. Without The Cloud, I'm going to have the usual home directory and an archive of projects from which I can pull.

###### My home

What's in my home directory? An archive directory, a temp directory, and that's it.

My home directory, actual view

The archive directory holds all those project repositories, a library of PDFs and data sets, items from my history.

I'm trying to get rid of the temp directory but can't let go yet. I thought it was weird when Lisa Rost said she used her desktop as her temp directory, but now I see that she makes a lot of sense. At the end of the day, if there are stray temp files in my home, they are blatantly present and I want to destroy them.

As a half-digression, I've taken to keeping one (1) text file with all my little side-notes on all of my projects. The notes use Org mode, which is a common standard for writing outlines. My text editor, vim with the orgmode plugin, folds the inactive outline segments out of view, so I get the same Murphy bed effect of starting with a near-blank slate, unfolding the current project's notes, and leaving everything else hidden away. This text file of side-notes has massively cut down on my count of stupid two-line text files, and also serves as my index of all in-progress projects.

So, setup is easy: when I want to work on a project, I open a new terminal (actually, I make a new work session in tmux), git clone a copy from the bare repository in the archive directory, unfold the segment in my notes, go. When I want to shut down the project for the day, rm -rf the directory, close the tmux session, fold away the notes, go make some tea.
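In shell terms, a day of that routine might look like the sketch below. The directory `archive` and the project name `myproject` are placeholders, and the scratch directory stands in for a home directory.

```shell
set -e
cd "$(mktemp -d)"                 # stand-in for the home directory
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# One-time setup: a bare repository in the archive directory
mkdir archive
git init -q --bare archive/myproject.git

# Start of the session: check out a fresh working copy
git clone -q archive/myproject.git myproject
cd myproject
echo "notes" > Readme
git add Readme
git commit -q -m "Today's work"
git push -q origin HEAD           # send the current branch to the archive
cd ..

# End of the session: fold the project away
rm -rf myproject

# Next session: everything that was pushed is still there
git clone -q archive/myproject.git myproject
cat myproject/Readme
```

The only step that can lose anything is the rm -rf, and only for work that never made it to the archive.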

Perhaps you tensed up as much as I do at the part where I rm -rf the directory. In fact, with version control there are several ways to lose data beyond just deleting a file.

• Yes, I could delete an untracked file.
• I could have a stash that gets deleted.
• I could have diffs that I haven't committed.
• I could have everything committed locally but not pushed to the archive.

So I wrote a script to check for all of these things. People have told me that some DVCS GUIs do some number of these things. It's turned out to be useful to have a command that I can call quickly, because I can call it all the time to check the state of things (even when I have terminal-only access).

I named the script git-isclean, and posted it on Github at that link. If it gives me green check marks, I'm done; else it will (with the -a flag) automatically help me with the next step in cleanup. It depends on the interactive status script I wrote before, because it makes perfect sense to use it; if you don't want to use that script, change the use of git istatus to git status.
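To give a flavor of what such checks involve, here's a rough sketch covering the four cases in the list above. It is not the git-isclean script itself, just one plausible way to ask plain git the same questions; the function name `check_clean` is made up.

```shell
# Report loose ends in the current repository; return nonzero if any are found.
check_clean() {
    loose=0
    # Files git isn't tracking (and isn't told to ignore)
    if [ -n "$(git ls-files --others --exclude-standard)" ]; then
        echo "untracked files present"; loose=1
    fi
    # Stashed work that would vanish with the directory
    if [ -n "$(git stash list)" ]; then
        echo "stashes present"; loose=1
    fi
    # Modifications to tracked files, staged or not
    if [ -n "$(git status --porcelain --untracked-files=no)" ]; then
        echo "uncommitted changes present"; loose=1
    fi
    # Commits on local branches that no remote-tracking branch has
    if [ -n "$(git log --branches --not --remotes --oneline)" ]; then
        echo "commits not yet pushed to any remote"; loose=1
    fi
    return $loose
}
```

Run `check_clean` from inside a working copy; if it prints nothing and returns zero, the rm -rf is probably safe. It doesn't look into submodules or check for a detached head.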

Sample usage.

Given my goal of keeping my home directory empty save for the one project I am working on right now, the need for this script is obvious. But it is useful even in less sparse workflows, because it provides a little to-do list of loose ends, worth having any time.

I hope something there was useful to you or gave you some ideas. Last time I wrote a navel-gazing post about my workflow was in 2013. Maybe I'll do another one in 2019.