m4 without the misery
23 January 15.
I presented at the DC Hack and Tell last week. It was fun to attend, and fun to present. I set the bar low by presenting my malcontent management system, which is really just a set of shell and m4 scripts.
I have such enthusiasm for m4, a macro language from the 1970s that is part of the POSIX standard, because there aren't really m4 files, the way there are C or Python or LaTeX files. Instead, you have a C or Python or LaTeX file that happens to have some m4 strewn about. Got something that is repetitive that a macro or two could clean up? Throw an m4 macro right there in the file and make immediate use. And so, m4 is the hammer I use to even out nails everywhere. C macros can't generate other C macros (see example below), and LaTeX macros are often fragile, so even where native macros are available it sometimes makes sense to use m4 macros instead. Even the markup for this column is via m4.
What discussion there was after the hack and tell was about how I can even use m4, given its terrible reputation as a byzantine mess. So it's worth asking what I'm doing differently that makes this tolerable. I was in a similar situation with the C programming language, and my answer to how I use C differently from sources that insist that it's a byzantine mess turned into a lengthy opus.
M4 is simpler, so my answer is only a page or two.
I assume you're already familiar with m4. If not, there are a host of tutorials out there for you. I have two earlier entries, starting here. Frankly, 90% of what you need is m4_define, so if you get that much, you're probably set to start trying things with it.
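For a minimal picture of what "some m4 strewn about" looks like, here is a sketch. The macro name Greeting and the Python-flavored text are made up for illustration, it uses m4's default quote markers, and it assumes GNU m4 for the -P option.

```shell
# Sketch only: Greeting is a hypothetical macro, not taken from any real project.
demo_define() {
  m4 -P <<'EOF'
m4_define(`Greeting', `print("Hello from $1")')m4_dnl
Greeting(m4)
Greeting(Python)
EOF
}
demo_define
```

The two Greeting lines should come out as ordinary print calls with the argument filled in; the m4_dnl just swallows the newline after the definition.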
As above, m4 passes not-m4-special text without complaint, but it is very aggressive in substituting anything that m4 recognizes. This leads to the advice that for every pair of parens, you should have a pair of quote-endquote markers to protect the text, which leads to m4-using files with a million quote-endquote markers.
I've found that this advice is overcautious by far.
In macro definitions, the 'laziness' of the expansion is critical (do I evaluate $# when the macro is defined, when it is first called, or by a submacro defined by this macro?), and the quote-endquote markers are the mechanism to control that timing. This is a delicate issue that every macro language capable of macro-defining macros runs into. My only advice is to read the page of the manual on how macro expansion occurs very carefully. The first sentence is a bit misleading, though, because the scan of the text is itself treated as a macro expansion, so one layer of quote-endquote markers is stripped, dnl is handled, et cetera. But because I am focused on writing my other-language text with support from m4, not building a towering m4 edifice, my concern with careful laziness control is not as great.
So my approach, instead of putting hundreds of quotes and endquotes all over my document, is to know what the m4 specials are, and make sure they never appear in my text unless I made an explicit choice to put them there.
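To make the timing question concrete, here is a small sketch, assuming GNU m4 and the default quote markers. The macro names setting, at_definition, and at_call are hypothetical. The only difference between the two middle definitions is whether the body is quoted.

```shell
# Sketch only: the macro names are invented for this illustration.
demo_timing() {
  m4 -P <<'EOF'
m4_define(`setting', `old')m4_dnl
m4_define(`at_definition', setting)m4_dnl
m4_define(`at_call', `setting')m4_dnl
m4_define(`setting', `new')m4_dnl
at_definition at_call
EOF
}
demo_timing
```

The unquoted body is expanded while the arguments of m4_define are being collected, so at_definition is stuck with old. The quoted body is stored as the word setting and only expanded on use, so at_call gives new.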
The specials
Outside of macro definitions themselves (where dollar signs matter), there are five sets of m4-special tokens. There's a way to handle each of them.
- quote-endquote markers. How do you get a quote marker or an endquote marker into your text without m4 eating it? The short answer: you can't. So we need to change those markers to something that we are confident will never appear in text. I use <| and |>.
The longer answer, by the way, is that you would use a sequence like
m4_changequote(LEFT,RIGHT)<|m4_changequote(<|,|>)
Easier to just go with something that will never be an issue.
- Named macros. You are probably using GNU m4; I personally have yet to encounter a non-GNU implementation. GNU m4 has a -P option that isn't POSIX standard, but is really essential. It puts an m4_ before every macro defined by m4: define ⇒ m4_define, dnl ⇒ m4_dnl, ifdef ⇒ m4_ifdef, and so on. We're closer to worry free: there could easily be the word define in plain text, but the string m4_define only appears in macro definitions and blog entries about m4.
We can also limit the risk of accidentally calling a macro by expanding it to something else iff it is followed by parens. The m4 documentation recommends defining a define_blind macro:
m4_changequote(<|, |>)
m4_define(<|define_blind|>, <|_define_blind(<|$1|>, <|$2|>, <|$|><|#|>, <|$|><|0|>)|>)
m4_define(<|_define_blind|>, <|m4_define(<|$1|>, <|m4_ifelse(<|$3|>, <|0|>, <|<|$4|>|>, <|$2|>)|>)|>)

sample usage:
define_blind(test, Hellooo there.)
test this mic: test()
Start m4 -P from your command line and paste this in if you want to try it. You'll see that when test is used in plain text, it will be replaced with test; if parens follow it, it will be replaced with the macro expansion.
- #comments. Anything after an octothorp is not expanded by m4, but is passed through to output. I think this is primarily useful for debugging. But especially since the advent of Twitter, hashtags appear in plain text all over the place, so suppress this feature via m4_changecom().
- Parens. If parens aren't after a macro name, they are ignored. Balanced parens are always OK. The only annoyance is the case when you just want to have a haphazard unbalanced open- or end-paren in your text, ). You'll have to wrap it in quote-endquote markers <|)|> if it's inside of a macro call, or we can't expect m4 to know where the macro is truly supposed to end.
- Commas. There's no way within m4 to change the comma separator. This can mess up the count of arguments in some cases, and m4 removes the space after any comma inside a macro, which looks bad in human text. My three-step solution:
- Use sed, the stream editor, to replace every instance of , with <|,|>
- Use sed to replace every instance of ~~ with a comma.
- Pipe the output of sed to m4.
That is, I used sed to turn ~~ into the m4 argument separator, so plain commas in the text are never argument separators themselves.
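Here is a sketch of just the two sed steps, run on one sample line so you can see what m4 would receive. The helper name protect_commas and the Greet macro call are made up for illustration.

```shell
# Sketch only: protect_commas is a hypothetical name for the two sed steps.
protect_commas() {
  # Step 1: wrap every literal comma in <| |> so m4 treats it as quoted text.
  # Step 2: turn the ~~ stand-in into the real argument separator.
  sed 's/,/<|,|>/g' | sed 's/~~/,/g'
}

printf '%s\n' 'Greet(Ann~~Hello, and welcome, friend.)' | protect_commas
```

This prints Greet(Ann,Hello<|,|> and welcome<|,|> friend.), so the only bare comma left is the one separating the two arguments. The order matters: doing the ~~ substitution first would get the new separator wrapped in quote markers too.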
So, let's reduce the lessons from this list:
- m4 -P
- m4_changequote(<|, |>)
- m4_changecom()
- Use a unique not-a-comma for a separator, e.g., ~~
- Use sed to replace all actual commas with <|,|>.
Is that too much to remember? Are you a bash or zsh user? Here's a function to paste onto the command line or into your .bashrc or .zshrc:
m5 () { cat $* | sed 's/,/<|,|>/g' | sed 's/\~\~/,/g' | \
m4 -P <(echo "m4_changecom()m4_changequote(<|, |>)") -
}
Now you can run things like m5 myfile.m4 > myfile.py.
At this point, unless you are writing m4 macros to generate m4 macros, you can write your Python or HTML or what-have-you without regard to m4 syntax, because as long as you aren't writing m4_something, <|, |>, or ~~ in your text, m4 via this pipeline just passes your text through either to your defined macros or standard output without incident.
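As a quick check of that claim, here is a sketch that pushes some deliberately awkward text through the same steps. The function m5_sketch is a hypothetical variant of the function above: it feeds the setup line through the pipe instead of using process substitution, so it should also run in a plain POSIX sh, and it assumes GNU m4 for -P. The Shout macro is invented for the example.

```shell
# Sketch only: m5_sketch and Shout are hypothetical names.
m5_sketch() {
  {
    # Setup line goes to m4 untouched by sed, so its comma survives.
    printf '%s' 'm4_changecom()m4_changequote(<|, |>)'
    # Same two sed steps as before, applied to the files or to stdin.
    cat "$@" | sed 's/,/<|,|>/g' | sed 's/~~/,/g'
  } | m4 -P
}

demo_pipeline() {
  m5_sketch <<'EOF'
m4_define(Shout~~<b>$1</b>)m4_dnl
# define the plan, then run `make`: Shout(done, mostly)
EOF
}
demo_pipeline
```

The output should be the single line # define the plan, then run `make`: <b>done, mostly</b>. The octothorp, the word define, the backticks, and both commas pass through as plain text, and the comma inside Shout stays part of its one argument.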
Are there ways to break it? Absolutely. Can you use these steps to more easily build macros upon macros upon macros? Yes, but that's probably a bad idea in any macro system. Can you use this to replace repetitive and verbose syntax with something much simpler, more legible, and maintainable? Yes, when implemented with appropriate common sense.
An example
Here is a sample use. C macro names have to be plain text: we can't use macro tricks when naming macros. But we can use m4 to write C macros without such restrictions. This example is not especially good form (srsly) but gives you the idea. Cut and paste this entire example onto your command line to create pants_src.c, pantsprogram, pants, and octopants.
#The above shell function again:
m5 () { cat $* | sed 's/,/<|,|>/g' | sed 's/\~\~/,/g' | \
m4 -P <(echo "m4_changecom()m4_changequote(<|, |>)") -
}
# Write m4-imbued C code to a file
cat << '-- --' > pants_src.c
#include <stdio.h>
m4_define(Def_print_macro~~
FILE *f_$1 = NULL;
#define print_to_$1(expr, fmt) \
{if (!f_$1) f_$1 = fopen("$1", "a+"); \
fprintf(f_$1, #expr "== " #fmt "\n", (expr)); \
}
)
int main(){
Def_print_macro(pants)
Def_print_macro(octopants)
print_to_pants(1+1, %i);
print_to_octopants(4+4, %i);
char *ko="khaki octopus";
print_to_octopants(ko, %s);
}
-- --
# compile. Use clang if you prefer.
# Or just call "m5 pants_src.c" to view the post-processed pure C file.
m5 pants_src.c | gcc -xc - -o pantsprogram
#Run and inspect the two output files.
./pantsprogram
cat pants
echo
cat octopants