Object-oriented programming in C
Here are notes on object-oriented programming (OOP) in C, aimed at people who are OK with C but are primarily versed in other fancier languages.
The OO framework is in some ways just a question of philosophy and perspective: you've got these blobs that have characteristics and abilities, and once you've described those blobs in sufficient detail, you can just set them off to go running with a minimum of outside-the-blob procedural code. If you want to be strict about it, the objects only communicate by passing messages to each other. All of this is language-independent, unless you have a serious and firm belief in the Sapir-Whorf hypothesis.
Scope
Much of object-oriented coding is distinguished via a method of scoping. Scope indicates what, out of the thousands of lines of code and dozens of objects you've written down, is allowed to know about a variable. The rule of thumb for sane code-writing is that you should keep the scope of a variable as small as possible to get the job done. Think of a function as a little black box: you want it to have just as many exposed parts as are necessary to interoperate with the outside world.
From the OOP perspective, this translates into dividing variables into private variables that are only internal to the object, such as the internal state of the car's motor, and things that the whole world can use, such as the location of the car. Thus, every OO language I can think of defines public and private keywords.
But wait, there's more: sometimes, you really have to break the rules, just this once, and check the internal status of the motor. You can make the status variable global, defeating the whole mechanism, or you can define a friend function. Below, we'll have inheritance, and will also need protected scope. Sometimes, the :: operator will get you out of a jam.
That is, we can divide the OOP additions to C's syntax into two parts: syntax to give you stricter, finer control over scope, and syntax to override those stricter controls.
How does C do scope, given that it has (depending on how you count) about two keywords for scope control? The scoping rules for C are defined by the file. A variable in a function is visible only to the function; a variable outside the functions, at the top of a file, is visible only in that file.
A typical file.c will have an accompanying file.h that simply declares variables and functions. If another file includes file.h, then that file can see those variables and functions as well. Thus, the private variables are invisible outside the file, and the public variables declared in the header can be used by the other files where you choose to include file.h.
The variables in the header file need to be declared with the extern keyword, e.g. extern float gas_gauge, which indicates that the variable is actually defined in a single .c file, say dashboard.c, but another .c file that includes this header, maybe motor.c, is being made aware that there is a floating-point variable named gas_gauge defined somewhere external to motor.c. As noted, the custom is to put all these externs in a header file and be done with it, but if you want to have especially fine control of scope across files/objects, then you can insert exactly as many extern declarations as you need exactly where you need them; similarly for function declarations.
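A minimal sketch of the one file-one object layout, squeezed into a single listing so it compiles as is. The comments mark which lines would live in dashboard.h, dashboard.c, and motor.c. The names dashboard_burn, dashboard_warning_is_on, warning_light, and motor_run are made up for illustration. Marking the private variable static is an addition here: it asks the compiler to enforce the file-only visibility that the convention relies on.

```c
#include <assert.h>

/* ---- what would go in dashboard.h: the public face ---- */
extern float gas_gauge;             /* declared here, defined in dashboard.c */
void dashboard_burn(float amount);  /* public function                       */
int dashboard_warning_is_on(void);  /* read-only window onto private state   */

/* ---- what would go in dashboard.c: the definitions ---- */
float gas_gauge = 1.0f;             /* the one real definition               */
static int warning_light = 0;       /* static: invisible outside this file   */

void dashboard_burn(float amount){
    gas_gauge -= amount;
    if (gas_gauge < 0.25f) warning_light = 1;
}

int dashboard_warning_is_on(void){ return warning_light; }

/* ---- what would go in motor.c, after #include "dashboard.h" ---- */
float motor_run(void){
    dashboard_burn(0.5f);
    dashboard_burn(0.5f);
    /* warning_light = 0;  <- in a separate motor.c this would not compile,
                              because motor.c never sees the name. */
    assert(dashboard_warning_is_on());
    return gas_gauge;
}
```

Calling motor_run() from a main() drains the public gauge and trips the private warning light, which motor.c can observe only through the accessor.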
The naming thing
Objects let you name functions things like move and add and never worry about interfering with fifty other functions with the same names. This is nice, but there is a simple C custom to take care of that: prepend the object name. Instead of the C++ my_data.move(), where you just understand that this move function refers to an apop_data object, you'd have a function with a name like apop_data_move(my_data). There ya go, crisis averted: no name space clashes. Some readers somewhere may complain that the name-prepending is ugly, to which I respond: care less.
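As a small sketch of the prepend-the-object-name custom, here are two made-up types, car and elevator, that would each have a move() method in C++. The types and function names are hypothetical, chosen only to show that the prefixes keep the two functions from colliding.

```c
#include <assert.h>

typedef struct { double x; } car;
typedef struct { int floor; } elevator;

/* Both of these would be called move() in C++; the prefix keeps them apart. */
void car_move(car *c, double dx)            { c->x += dx; }
void elevator_move(elevator *e, int floors) { e->floor += floors; }

/* Exercise both; returns 1 if each function changed only its own object. */
int naming_demo(void){
    car my_car = {0.0};
    elevator my_lift = {1};
    car_move(&my_car, 2.5);
    elevator_move(&my_lift, 3);
    assert(my_car.x == 2.5);
    assert(my_lift.floor == 4);
    return 1;
}
```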
But seriously, go have a look at Joel the guru for more on how wonderful naming similar to this can be.
C already has a scoping system comparable to that of C++ if you use the one file-one object rule and a few customs in naming. Adding a whole new syntax for scoping on top of this is basically extraneous, and could create confusion now that you've got two simultaneous scoping systems in action.
Inheritance and overloading
Overloading functions and operators is dumb. Joel's article above has a humorous bit about this, which opens: “When you see the code i = j * 5; in C you know, at least, that j is being multiplied by five and the results stored in i. But if you see that same snippet of code in C++, you don't know anything. Nothing.” The problem is that you don't know what * means until you look up the type for j, look through the inheritance tree for j's type to determine which version of * you mean, et cetera.
Say you have a blob object which includes a cleave method that splits the blob in half, and a blobito object that includes a cleave method that binds together internal elements. You have a blob object named my_blob, and, faux pas, think that it is a blobito object. You call the my_blob.cleave() function, expecting that my_blob.size will double, but instead it halves.
This may sound like a silly example, but from my experience, the most common use of OO machinery such as inheritance is where two objects are very similar but subtly different. Textbook examples: accountant and programmer objects that both inherit from the officedrone object, or from the U.S., state and district objects that inherit from the generic us_division, and are identical except that the district object has no senators or representatives.
Those situations where two objects are similar and therefore easily confused are the ones where we most need a syntax that breaks when we make a mistake in guessing the type. If you were doing this in C, you would be notified of your error at compile time (because you'd be calling blobito_cleave(my_blob) when you should be calling blob_cleave(my_blob)). In many interpreted languages, you would be notified of your error at run time, or sooner depending on the language. In C++, with appropriately defined methods, you would never, ever be notified of your error. That is, operator overloading allows you to bypass a large number of safety checks.
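Here is a sketch of the blob and blobito case in plain C. The struct layouts and the size field are assumptions for illustration. The mistaken call is left in a comment: because blob and blobito are distinct types, a conforming compiler has to issue a diagnostic for it (an error or a warning, depending on the compiler and its flags).

```c
#include <assert.h>

typedef struct { double size; } blob;
typedef struct { double size; } blobito;

/* Splits the blob in half. */
void blob_cleave(blob *b)       { b->size /= 2; }

/* Binds internal elements together, doubling the size. */
void blobito_cleave(blobito *b) { b->size *= 2; }

double cleave_demo(void){
    blob my_blob = {8.0};
    /* blobito_cleave(&my_blob);   wrong type: a blob* is being passed where
                                   a blobito* is expected, and the compiler
                                   says so before the program ever runs. */
    blob_cleave(&my_blob);         /* the call that matches the type */
    assert(my_blob.size == 4.0);
    return my_blob.size;
}
```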
I promised you notes on how C does it, not rants about overloading, so let us move on to Option B: inheritance via composition. For example, Apophenia has an apop_data type:
typedef struct apop_data{
    gsl_matrix *matrix;
    apop_name *names;
    char ***categories;
    int catsize[2];
} apop_data;
This essay was originally written in January 2006, and Apophenia has evolved and stabilized since then. So the sample code is not apop-accurate, but is still fine for getting across the principles discussed here.
In OOP-speak, the apop_data structure is a multiple-inheritance child of the gsl_matrix and apop_name structures (plus an array of strings). All of the functions that operate on these parent objects can act on elements of the child apop_data structure, and life is good. To go further with OOP jargon, C lets you extend a structure via a has-a mechanism: we have a gsl_matrix and want to give it names, so we create a structure that has-a matrix and names. Typical OOP languages allow you to extend via is-a, wherein your named_matrix is-a gsl_matrix plus the additional elements to add names. I'm no OO pro, but I think you're supposed to read those sentences like the Italian chef in a Disney movie.
On the one hand, having only has-a to work with means that if a function acts on a gsl_matrix * you can't transparently call, e.g., apop_pca(apop_data_set) [PCA = principal component analysis]; you have to know that there's a gsl_matrix inside the data set and that's what's being operated on: apop_pca(apop_data_set->matrix). On the other hand, you cannot accidentally call the wrong instance of the function and then spend an hour wondering why the function didn't operate the way you'd expected.
So on the minus side, the internals of the object aren't hidden from you--but on the plus side, things aren't hidden from you.
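To make the has-a idea concrete without pulling in the GSL, here is a sketch using hypothetical stand-ins: matrix plays the role of gsl_matrix, names the role of apop_name, and named_matrix the role of apop_data. None of these are the real library types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for gsl_matrix and apop_name. */
typedef struct { size_t rows, cols; double *data; } matrix;
typedef struct { const char **colnames; } names;

/* The "child": it has-a matrix and has-a set of names. */
typedef struct { matrix *m; names *n; } named_matrix;

/* A "parent" function that knows only about plain matrices. */
double matrix_sum(const matrix *m){
    double total = 0;
    for (size_t i = 0; i < m->rows * m->cols; i++)
        total += m->data[i];
    return total;
}

double composition_demo(void){
    double cells[] = {1, 2, 3, 4};
    const char *cols[] = {"pears", "education"};
    matrix m = {2, 2, cells};
    names n = {cols};
    named_matrix d = {&m, &n};
    /* matrix_sum(&d) would be a type mismatch; reach inside explicitly. */
    double total = matrix_sum(d.m);
    assert(total == 10.0);
    return total;
}
```

The explicit d.m is the price of has-a: nothing is done behind your back, so you always know which function is acting on which piece.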
The void, templates
And finally, for when you really don't want to deal with types, there's the void pointer. Here's a snippet from an early draft of Apophenia's apop_model type:
typedef struct apop_model{
    char name[101];
    apop_model * (*estimate)(apop_data * data, void *parameters);
    ...
} apop_model;
Two things to note from this example. First, including a function inside a struct is a-OK. We'll declare a GLS estimation function as static apop_estimate * apop_estimate_GLS(apop_data *set, gsl_matrix *sigma), declare the model via something like: apop_model apop_GLS = {"GLS", apop_estimate_GLS, ...}; and then we can call apop_GLS.estimate(data, sigma); just like we would in C++-land.
If you didn't follow the syntax of declaring hooks for functions, see p 190 of Modeling with Data.
Second, there's the void pointer at the end of the declaration of the estimate method in the structure. Notice that the second argument of apop_estimate_GLS is typed as a gsl_matrix *, even though we're plugging it in where the template asked for a void *. [Non-OOP quiz question for the statisticians: why is this a terrible way to implement GLS?] Other models require different parameters; for example, the MLE functions take parameters for the search algorithm, but they're also called via the same model_instance.estimate(data, params) form.
It's up to you, the user, to remember what types make sense for what models, because the void pointer is your way of saying "Dear C type-checker: leave me alone." The type-checker will still check that you're sending a pointer and not data, but from there you're free to live it up and/or segfault.
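A self-contained sketch of the same pattern, using made-up types so that it compiles without Apophenia: data_set, model, and the two estimators are hypothetical. One deliberate difference from the text: each estimator here takes a void * and casts inside the function, so the function-pointer types match the struct member exactly.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { const double *x; size_t n; } data_set;

typedef struct model {
    char name[101];
    double (*estimate)(const data_set *d, void *parameters);
} model;

/* Expects parameters to point to a double: a scaling factor. */
static double scaled_mean_estimate(const data_set *d, void *parameters){
    double scale = *(double *)parameters;   /* we promise it's a double */
    double sum = 0;
    for (size_t i = 0; i < d->n; i++) sum += d->x[i];
    return scale * sum / d->n;
}

/* Expects parameters to point to a size_t: how many leading items to use. */
static double head_mean_estimate(const data_set *d, void *parameters){
    size_t k = *(size_t *)parameters;       /* we promise it's a size_t */
    if (k > d->n) k = d->n;
    double sum = 0;
    for (size_t i = 0; i < k; i++) sum += d->x[i];
    return k ? sum / k : 0;
}

/* The static functions are reachable from other files only via these. */
model scaled_mean = {"scaled mean", scaled_mean_estimate};
model head_mean   = {"head mean",   head_mean_estimate};

double model_demo(void){
    double obs[] = {1, 2, 3, 6};
    data_set d = {obs, 4};
    double scale = 2.0;
    size_t k = 2;
    double a = scaled_mean.estimate(&d, &scale);  /* 2 * (12/4) */
    double b = head_mean.estimate(&d, &k);        /* (1 + 2)/2  */
    assert(a == 6.0 && b == 1.5);
    return a + b;
}
```

Swap &scale and &k in the two calls and the compiler will not say a word, which is exactly the leave-me-alone bargain described above.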
The void pointer is how you would implement template-like behavior. For example, here is a linked list library (gzipped source) that I wrote when I was avoiding harder work. It links together void pointers, meaning that your list can be a linked list of integers, strings, or objects of any type. How's that for a nice, concrete example.
The primary benefit from C++'s template system over using void pointers is that the template system will still check types. Personally, I've rarely had problems. If I have a list named list_of_data, I know to not add gsl_matrixes to it. Not having type-checking means that it's up to me to make sure that the wrong thing is never the intuitive thing to do.
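This is not the linked list library mentioned above, just a minimal sketch of the idea: a list whose nodes hold void pointers, so one set of functions serves any item type. The names node, list_push, list_length, and list_free are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct node { void *item; struct node *next; } node;

/* Push an item on the front; returns the new head, or NULL if out of memory. */
node *list_push(node *head, void *item){
    node *n = malloc(sizeof *n);
    if (!n) return NULL;
    n->item = item;
    n->next = head;
    return n;
}

size_t list_length(const node *head){
    size_t len = 0;
    for (; head; head = head->next) len++;
    return len;
}

/* Frees the nodes only; the items belong to the caller. */
void list_free(node *head){
    while (head){
        node *next = head->next;
        free(head);
        head = next;
    }
}

size_t list_demo(void){
    int seven = 7;
    char word[] = "pears";
    node *list = list_push(NULL, &seven);
    assert(list);
    list = list_push(list, word);
    assert(list);
    /* The list has no idea what it holds; remembering is our job. */
    assert(strcmp((char *)list->item, "pears") == 0);
    assert(*(int *)list->next->item == 7);
    size_t len = list_length(list);
    list_free(list);
    return len;
}
```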
By the way 1: notice how apop_estimate_GLS is declared to be static, so outside the file it's only accessible as the apop_GLS.estimate() method.
By the way 2: I can't recall ever using this, but if you wanted to, you could even type-cast inside the function:
void move(void *in, char type){
    if (type == 'a')
        a_move((a_type*) in);
    if (type == 'b')
        b_move((b_type*) in);
}
This self
I've only wanted something like the this or self keyword maybe twice, but I have no idea how to gracefully implement it in C, if at all. [Maybe with the preprocessor?] So I'm open to suggestions on this one.
OK, there you have it: most of the basics of object-oriented programming implemented via relatively simple techniques in C. The moral: object-oriented coding is a method and a mindset, not a set of keywords.
Refs
More essays along the same lines:
A full book that goes into great detail about the above simple tricks, and also goes much further in implementing something that looks like C++.
An article that focuses on encapsulation, with some suggestions on hiding data.
Another article that blew way past my attention span, and basically shows you how to write a C++ compiler in C. Given my disdain for overloading and strict inheritance (as opposed to inheritance via composition), I wasn't really into it.
Testing the model using the model.
Three guys are stranded on a desert island
And all they have to eat is a case of canned pears. The joke is that they're all researchers.
The physicist says: `we can mill down these coconut husks into lenses, then focus the heat of the sun on the cans. When their temperature rises enough, the seams will burst!'
The chemist says: `No, that'll take too long. Instead, we can refine sea water into a corrosive that will eventually just melt the can open!'
The biologist cuts him off: `I don't want salty pears! But I've found a yeast that is capable of digesting metals. With care and cultivation, we can get them to eat the cans open.'
The economist finally stands up and smiles: `You are all trying too hard, because it's very simple: assume a can opener.'
Pause for laughter.
I've found that this joke is so commonly told among economists that you can just tell an economist `you're assuming a can opener' and they'll know what you mean. It's also a good joke for parties because people always come up with new ways to open the can. What would the lit major do?
Cracking open a model with no tools
Now back to the real world. You are running the numbers on a model regarding data you have collected. To keep this simple, let's say that you're running an Ordinary Least Squares regression on a data set of canned pear sales and education levels. You have the data set, then run OLS to produce a set of coefficients, β, and p-values indicating the odds that the βs are different from zero.
Those p-values are generated using exactly the procedure listed above: assume a distribution of the βs, write down its CDF, then measure how much of the CDF you assumed lies between zero and β.
We're still assuming a can opener. We used the assumptions of the model--that errors are normally distributed with mean zero and a variance that is a function of the data--to state the confidence with which we believe the very same model.
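Written out for a single coefficient, and assuming a Normal distribution for the estimate (the usual t distribution works the same way), the calculation looks like the following. The symbols \hat\beta, \hat\sigma_\beta, and \Phi (the standard Normal CDF) are notation introduced here, not taken from the original post.

```latex
\begin{align*}
\hat\beta &\sim \mathcal{N}\!\left(\beta,\ \sigma_\beta^2\right)
  && \text{(assumed by the model)} \\
\text{mass between } 0 \text{ and } \hat\beta
  &= \Phi\!\left(\frac{|\hat\beta|}{\hat\sigma_\beta}\right) - \Phi(0)
   = \Phi\!\left(\frac{|\hat\beta|}{\hat\sigma_\beta}\right) - \frac{1}{2}
\end{align*}
```

For example, with \hat\beta = 1.96 and \hat\sigma_\beta = 1, the mass is about 0.475, and doubling it to cover both tails gives the familiar 95%. Every symbol on the right-hand side comes from the model being tested.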
To make this as clear as possible: we used the model assumptions to write down a probability function, then used that probability function to test the model. This is exactly how our blog author turned no information at all into an odds measured to five decimal places. But making an assumption does not add information.
Pick up any empirically-oriented journal, and in every paper, this is how the confidence intervals will be reported, by assuming that the model is true with certainty and can be used to objectively state probabilities about its own veracity.
So ¿why doesn't all of academia fall apart?
First, many of the assumptions of these models are rooted in objective fact: given such-and-such a setup, errors really will be Normally distributed. We could formalize this by writing down tests to test the assumptions of the main model, though for our purposes there's no point--they'll just fall victim to the same eating-your-own-tail problem. Even lacking extensive testing, if the data generation process is within spitting distance of a Central Limit Theorem, we'll give it benefit of the doubt that there is an objective truth to the distribution.
Second, we can generalize that point to say that the typical competently-written journal article's assumptions are usually pretty plausible, or at least do little harm. When they report that one option is more likely than another, that is often later verified to actually be true, though the authors had used subjective tools to state subjective odds.
Third, we shouldn't believe p-values--or any one research study--anyway. A model with fabulous p-values will increase our subjective confidence that something real is going on. But if you read that a p-value is 99.98%, ¿do you really believe that in exactly two out of 10,000 states of the world, the difference is not significant? Probably not: you just get a sense of greater confidence.
So this works because we treat the process as subjective. The authors made up a model,
and used that model to state the odds with which the model is true. But if we agree
that the model seems likely, and if we accept that the output odds are just inputs to
our own subjective beliefs, then we're doing OK. Problems only arise when we pretend
that those p-values are derived from some sort of objective probability distribution
rather than the author's beliefs as formalized by the model.
on Tuesday, March 2nd, Ted Alper said
The following comment won't make sense because the commenter was correct: my original post was wrong. It had a lot right, and the item I was criticizing did have bugs. But on balance, I needed to lighten up. If you'd like to see what Ted is commenting on, I saved a copy here. --BK
Keeping paper current
I had ambitious goals for the textbook--so ambitious, they were impossible. Here, I'll talk about the sort of things that keep the typical textbook author (i.e. me) up at night. It is a sort of apology for the fact that one of these three goals eventually had to give.
Goal 1: Write about merging statistical technique with simulation technique.
I've written about merging modeling paradigms before, and I still think it's something that's truly novel about the book. It surprises people. Folks who come from a simulation background typically sense the paradigm-merging intent quickly, and work out how the stats-oriented parts of the book could readily be applied to their computation-oriented models. I've found that people coming from the stats side have more trouble seeing the connection, mostly if they haven't put much thought into computational modeling. But even a number of people from that side eventually came back to me with comments about how that thing they didn't see the point of in Chapter Four turned out to be really useful.
OK, so far so good. Although it's not something everybody particularly cares about, I am happy with the design decision of writing a textbook on merging stats and simulation technique.
Goal 2: No pseudocode
Pseudocode is good for exposition, but it typically doesn't solve the problem the reader has. First, there are often real problems in handling the data that are hidden under the algorithm--and yes, I've tried it in your favorite language; the quirks were different but the overall effort was not.
For example, say that f(⋅) and g(⋅) are both statistical models, implemented in code of arbitrary complexity (assumed away by the pseudocode). A line of math-oriented pseudocode may quip, calculate f∘g. If your distributions are well-behaved and designed to make this work, then this is trivial; if the outputs and inputs are sometimes incompatible--¿Is the range of g exactly identical to the domain of f? ¿Does the dimension of f's input change with the size of an auxiliary data set not mentioned in the pseudocode?--then you'll spend quality time turning the three characters f∘g into reliable working code.
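A toy version of the problem, with made-up functions: this g can return any real number, while this f only makes sense on [0, 1]. Neither is a real statistical model; they are placeholders chosen so the mismatch between range and domain is easy to see.

```c
#include <assert.h>

/* g: a hypothetical model whose output can be any real number. */
double g(double x){ return 2*x - 1; }

/* f: a hypothetical model that is only meaningful for inputs in [0, 1]. */
double f(double p){ return p * (1 - p); }

/* f composed with g, plus the check that the pseudocode skipped.
   Returns 0 and writes the result to *out on success;
   returns -1 if g(x) lands outside the domain of f. */
int f_of_g(double x, double *out){
    double mid = g(x);
    if (mid < 0 || mid > 1) return -1;
    *out = f(mid);
    return 0;
}

int compose_demo(void){
    double out = 0;
    assert(f_of_g(0.75, &out) == 0);    /* g gives 0.5, inside [0, 1]   */
    assert(out == 0.25);
    assert(f_of_g(0.25, &out) == -1);   /* g gives -0.5, outside [0, 1] */
    return 1;
}
```

Even in this tiny case, the composition needs an error path and a decision about what the caller should do with it, none of which appears in the three characters of pseudocode.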
Second, if you are doing computationally-intensive work, then technique matters. If your simulation runs in two seconds, then that's not an indication that computing technique is irrelevant and you shouldn't care, but that you should re-run the thing with a hundred times as many elements for a hundred times as long. Your results will be better, and good technique can reduce your run time back down to minutes from hours. Of course, pseudocode assumes away technique.
Third, you can cut and paste real code, but not pseudocode. The first draft of the first chapter had a lot of routines for the reader to cut and paste from the planned online appendix. I'd explain that the GNU Scientific Library, perfect for using in the guts of simulations, doesn't provide anything at the level of a regression, so Figure Eight gives you regression code that you can cut and paste into your project.
And that's where the Apophenia library came from: I eventually got sick of telling the reader to cut and paste snippets, and just bundled the snippets into a package. That freed up the book to include more high-level sample code for cutting and pasting.
I've seen people pick up a copy of my book and immediately feel intimidation that there's code on every page, but I'm still happy with this design decision as well. I take some pride in writing a book that users can directly execute, and that doesn't pretend toward false generality. You know I'm not lying to you or hiding facts from you, because then the code wouldn't run.
Goal 3: Publish the book on paper, bound
There are people who write in their books, and there are people who will prop up short table legs with books, but I ain't one of `em. I don't think of books as newspapers, or as just another prop to be used for whatever practical value it has and then chucked out. Books are a part of the permanent record. If I want my information impermanent and disposable, that's what I've got computer screens for.
Modeling with Data came out in late 2008. It's early 2010 now, and the book, bound and immutable, has remained (almost) entirely in line with the code base. That's taken a lot of work, because the code keeps evolving. Many changes are just additions and new functions or features: means of doing things more easily, which make the code in the book look klunky in comparison but not false, per se.
But you can see how my three goals set me up for failure here:
- Write about novel things that are still being actively explored.
- Be concrete, and help the reader with the details.
- Publish something that will be correct in the details as far into the future as possible.
I think these three goals are, as a practical matter, impossible to reconcile, so every author has to give up on (at least) one. It's an interesting exercise to pick up the typical technical text and see what the author chose. The manuals about a specific language, with a spine measured in decimeters, throw the goal of permanence out the window. These book producers see the book as a newspaper or a parts manual, and if this manual takes off, then there'll be a second edition and you'll be expected to use the first edition for mulch. As above, I feel that this is disrespectful of the reader. Others throw out goal 2 and go straight to pseudocode, because by remaining vague they don't have to worry about the vagaries of coding language grammars. This may not even work: I have seen many pages of pseudocode about how to invert a matrix, but it is effectively obsolete if users have half a line of code in their preferred language that does the whole thing. Finally, there's the option of just playing it safe, and using code that has already been established, tested, and written up in decimeter-thick manuals. Even that eventually fails; maybe have a look at the comments from the author of Dive into Python, where he confesses that his plans for a Book for the Ages didn't mean much as Python 2 evolved into Python 3.
So, there you have my apology, as Apophenia evolves but the pages remain
bound, and Goal 3 slowly gives way to Goals 1 and 2. I've set up an
Updates to Apophenia
page explaining changes to Apophenia that would affect readers of the
book. Their effect is of course centered on the sections of the book that
make heavy use of the details of Apophenia's implementation (Sections 4.3
and 4.7 on shunting data, Sections 8.2-8.4 when reading model output,
Chapter 10 when setting up new models). Much of the book is concrete
but covers the not-novel: C, SQL, the GSL, and mathematics comprise most
of the book's pages, and have not undergone much revision. Gosh, most readers won't
even notice the changes, but this really is the sort of thing that the
typical textbook author (i.e. still me) frets about endlessly.
The statistics style report
It may sound like an oxymoron, but there is such a thing as fashionable statistical analysis. Where did this come from? How is it that our tests for Truth, upon which all of science relies, can vacillate from season to season like hemlines?
Before discussing those questions, let me tap on the brake, and point out that statistics as a whole is not arbitrary. The Central Limit Theorem is a mathematical theorem like any other, and if you believe the basic assumptions of mathematics, you have to believe the CLT. The CLT and developments therefrom were the basis of stats for a century or two there, from Gauss on up to the early 1900s when the whole system of distributions (Binomial, Bernoulli, Gaussian, t, chi-squared, Pareto) was pretty much tied up. Much of this, by the way, counts not as statistics but as probability.
Next, there's the problem of using these objective truths to describe reality. That is, there's the problem of writing models. Models are a human invention to describe nature in a human-friendly manner, and so are at the mercy of human trends. Allow me to share with you my arbitrary, unsupported, citation-free personal observations.
Number crunching
The first thread of trendiness is technology-driven. In every generation, there's a line you've got to draw and say `everything after this is computationally out of reach, so we're assuming it away', and the assume-it-away line drifts into the distance over time. Here's a little something from a 1939 stats textbook on fitting time trends (Arkin and Colton, 1939, p 43):
To fit a trend by the freehand method draw a line through a graph of the data in such a way as to describe what appears to the eye to be the long period movement. ...The drawing of this line need not be strictly freehand but may be accomplished with the aid of transparent straight edge or a “French” curve.
As you can imagine, this advice does not appear in more recent stats texts. In this respect, a stats text can actually become obsolete. But as time passes, approximations like this are replaced by new techniques that were before just written off as impossible. [Now reading: Hastie and Tibshirani (1990), who offer a few hundred pages on computational methods to do what was done by freehand above.]
Computational ability has brought about two revolutions in statistics. The first is the linear projection (aka, regression). Running a regression requires inverting a matrix, with dimension equal to the number of variables in the regression. A two-by-two matrix is easy to invert (¿remember all that about ad - bc?) but it gets significantly more computationally difficult as the number of variables rises. If you want to run a ten-variable regression using a hand calculator, you'll need to set aside a few days to do the matrix inversion. My laptop will do the work in 0.002 seconds. It's still in under a second up to about 500 by 500, but 1,000 by 1,000 took 8.9 seconds. That includes the time it took to generate a million random numbers.
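For reference, here is the easy two-by-two case as a sketch in C; invert_2x2 and close_to are names made up for this example. The ad - bc in the first line of the function is the determinant, and the inverse does not exist when it is zero.

```c
#include <assert.h>

/* Invert the matrix [[a, b], [c, d]] stored in in[][], writing to out[][].
   Returns 0 on success, -1 if the matrix is singular. */
int invert_2x2(double in[2][2], double out[2][2]){
    double det = in[0][0]*in[1][1] - in[0][1]*in[1][0];   /* ad - bc */
    if (det == 0) return -1;
    out[0][0] =  in[1][1]/det;   out[0][1] = -in[0][1]/det;
    out[1][0] = -in[1][0]/det;   out[1][1] =  in[0][0]/det;
    return 0;
}

/* Floating-point comparison with a small tolerance. */
static int close_to(double a, double b){
    double diff = a - b;
    return diff < 1e-12 && diff > -1e-12;
}

int invert_demo(void){
    double m[2][2] = {{4, 7}, {2, 6}};    /* determinant: 24 - 14 = 10 */
    double inv[2][2];
    assert(invert_2x2(m, inv) == 0);
    assert(close_to(inv[0][0],  0.6) && close_to(inv[0][1], -0.7));
    assert(close_to(inv[1][0], -0.2) && close_to(inv[1][1],  0.4));
    double singular[2][2] = {{1, 2}, {2, 4}};
    assert(invert_2x2(singular, inv) == -1);
    return 1;
}
```

Nothing this compact exists for larger matrices, which is where the days with a hand calculator come from.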
So revolution number one, when computers first came out, was a shift from simple correlations and analysis of variance and covariance to linear regression. This was the dominant paradigm from when computers became common until a few years ago.
The second revolution was when computing power became adequate to do searches for optima. Say that you have a simple function to take in inputs and produce an output therefrom. Given your budget for inputs, what mix of inputs maximizes the output? If you have the function in a form that you can solve algebraically, then it's easy, but let us say that it is somehow too complex to solve via Lagrange multipliers or what-have-you, and you need to search for the optimal mix.
You've just walked in on one of the great unsolved problems of modern computing. All your computer can do is sample values from the function--if I try these inputs, then I'll get this output--and if it takes a long time to evaluate one of these samples, then the computer will want to use as few samples as possible. So what is the method of sampling that will find the optimum in as few samples as possible? There are many methods to choose from, and the best depends on enough factors that we can call it an art more than a science.
In the statistical context, the paradigm is to look at the set of input parameters that will maximize the likelihood of the observed outcome. To do this, you need to check the likelihood of every observation, given your chosen parameters. For a linear regression, the dimension of your task was equal to the number of regression parameters, maybe five or ten; for a maximum likelihood calculation, the dimension is related to the number of data points, maybe a thousand or a million. Executive summary: the problem of searching for a likelihood function's optimum is significantly more computationally intensive than running a linear regression.
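A toy sketch of search-the-space estimation: find the success probability p that makes k successes in n coin flips most likely, by sampling the likelihood rather than solving for the answer. Ternary search is just one convenient choice among the many search methods mentioned above, and it assumes the function has a single peak. The function names are made up for this example.

```c
#include <assert.h>

/* Likelihood of k successes in n Bernoulli(p) draws, leaving out the
   binomial coefficient, which is constant in p and so does not move
   the optimum. */
double likelihood(double p, int k, int n){
    double out = 1;
    for (int i = 0; i < k; i++)     out *= p;
    for (int i = 0; i < n - k; i++) out *= 1 - p;
    return out;
}

/* Ternary search for the peak of a single-peaked function on [0, 1].
   *evals reports how many times the likelihood was sampled. */
double most_likely_p(int k, int n, int *evals){
    double lo = 0, hi = 1;
    *evals = 0;
    while (hi - lo > 1e-6){
        double m1 = lo + (hi - lo)/3;
        double m2 = hi - (hi - lo)/3;
        double l1 = likelihood(m1, k, n);
        double l2 = likelihood(m2, k, n);
        *evals += 2;
        if (l1 < l2) lo = m1;   /* the peak is to the right of m1 */
        else         hi = m2;   /* the peak is to the left of m2  */
    }
    return (lo + hi)/2;
}

int search_demo(void){
    int evals;
    double p = most_likely_p(7, 10, &evals);
    /* The closed-form answer is k/n = 0.7; the search should agree. */
    assert(p > 0.6999 && p < 0.7001);
    assert(evals > 0);
    return 1;
}
```

Each sample here touches every observation, so with a million data points and a slow likelihood the count in evals is what you pay for.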
So it is no surprise that in the last twenty years, we've seen the emergence of statistical models built on the process of finding an optimum for some complex function. Most of the stuff below is a variant on the search-the-space method. But why is the most likely parameter favored over all others? There's the Cramer-Rao Lower Bound and the Neyman-Pearson Lemma, but in the end it's just arbitrary. Gauss had no theorems that this framework gives superior models relative to linear projection, but it does make better use of computing technology.
Hemlines
The second thread of statistical fashion is whim-driven like any other sort of fashion. Golly, the population collectively thinks, everybody wore hideously bright clothing for so long that it'd be a nice change to have some understated tones for a change. Or: now that music engineers all have ProTools, everything is a wall of sound; it'd be great to just hear a guy with a guitar for a while. Then, a few years later, we collectively agree that we need more fun colors and big bands. Repeat the cycle until civilization ends.
Statistical modeling sees the same cycles, and the fluctuation here is between the parsimony of having models that have few moving parts and the descriptiveness of models that throw in parameters describing the kitchen sink. In the past, parsimony won out on statistical models because we had the technological constraint.
If you pick up a stats textbook from the 1950s, you'll see a huge number of methods for dissecting covariance. The modern textbook will have a few pages describing a Standard ANOVA (analysis of variance) Table, as if there's only one. This is a full cycle from simplicity to complexity and back again. Everybody was just too overwhelmed by all those methods, and lost interest in them when linear regression became cheap.
Along the linear projection thread, there's a new method introduced every year to handle another variant of the standard model. E.g., last season, all the cool kids were using the Arellano-Bond method on their time series so they could assume away endogeneity problems. The list of variants and tricks has filled many volumes. If somebody used every applicable trick on a data set, the final work would be supremely accurate--and a terrible model. The list of tricks balloons, while the list of tricks used remains small or constant. Maximum likelihood tricks are still legion, but I expect that the working list will soon find itself pared down to a small set as optimum finding becomes standardized.
In the search-for-optima world, the latest trend has been in `non-parametric' models. First, there has never been a term that deserved air-quotes more than this. A `non-parametric' model searches for a probability density that describes a data set. The set of densities is of infinite dimension. If all you've got is a hundred data points, you ain't gonna find a unique element of ℜ∞ with that. So instead, you specify a certain set of densities, like sums of Normal distributions, and then search for that subset that leads to a nice fit to the data. You'll wind up with a set of what we call parameters that describe that derived distribution, such as the weights, means, and variances of the Normal distributions being summed.
But `non-parametric' models allow you to have an arbitrary number of parameters. Your best fit to a 100-point data set is a sum of 100 Normal distributions. If you fit 100 points with 100 parameters, everybody would laugh at you, but it's possible. In that respect, the `non-parametric' setup falls on the descriptive end of the descriptive-to-parsimonious scale. In my opinion.
I don't want to sound mean about `non-parametric' methods, by the way. It's entirely valid to want to closely fit data, and I have used the method myself. But I really think the name is false advertising. How about distribution-fitting methods or methods with open parameter counts?
Bayesian methods are increasingly cool. If you want to assume something more interesting than Normal priors and likelihoods, then you need a computer of a certain power, and we beat that hurdle in the 90s as well, leaving us with the philosophical issues. In the context here, those boil down to parsimony. Your posterior distribution may be even weirder than a multi-humped sum of Normals, and the only way to describe it may just be to draw the darn graph. Thus, Bayesian methods are also a shift to the description-over-parsimony side.
Method of Moments estimators have also been hip lately. I frankly don't know where that's going, because I don't know them very well. Also, this guy really wants multilevel modeling to be the Next Big Thing in the linear model world, and makes a decent argument for that. He likes it because it lets you have a million parameters, but in a structured manner such that we can at least focus on only a few. I like him for being forthright (on the blog) that the computational tools he advocates (in his books) will choke on large data sets or especially computationally difficult problems.
Increasing computational ability invites a shift away from parsimony. Since PCs really hit the world of day-to-day stats recently, we're in the midst of a swing toward description. We can expect an eventual downtick toward simpler models, which will be helped by the people who write stats packages--as opposed to the researchers who caused the drift toward complexity--because they write simple routines that implement these methods in the simplest way possible.
So is your stats textbook obsolete? It's probably less obsolete than people will make it out to be. The basics of probability have not moved since the Central Limit Theorems were solidified. In the end, once you've picked your paradigm, not much changes; most novelties are just about doing detailed work regarding a certain type of data or set of assumptions. Further, those linear projection methods or correlation tables from the 1900s work pretty well for a lot of purposes.
But the fashionable models that are getting buzz shift every year, and last year's model is often considered to be naïve or too parsimonious or too cluttered or otherwise an indication that the author is not down with the cool kids--and this can affect peer review outcomes. A textbook that focuses on the sort of details that were pressing five years ago, instead of just summarizing them in a few pages, will have to pass up on the detailed tricks the cool kids are coming up with this season--which will in turn affect peer reviews for papers written based on the textbook's advice.
A model more than a few years old has had a chance to be critiqued while a new model has not. So using an old technique gives peer reviewers the opportunity to use their favorite phrase, “the author seems to be unaware”: in this case, unaware that somebody has had the time to find flaws in the older technique and propose a new alternative that fixes those flaws--while the new technique is still sufficiently novel that nobody has had time to publish papers on why it has even bigger flaws.
All this is entirely frustrating, because we like to think that our science is searching for some sort of true reflection of constant reality, yet the methods that are acceptable for seeking out constant reality depend on the whim of the crowd.
Please note: full references are given in the PDF version
git status interactive
One of the first things that struck me as nice about Git was the status command, which produces something just shy of a script for revising the status of all the files. It even gives you tips about how to do common tasks.
I got even more excited when I saw git rebase --interactive, which generates a semi-script, opens it for you to edit, and then runs the thing automatically. That was smooth.
So I expected there'd be a similar procedure, something like git status --interactive, which, if it existed, would work like this:
- You type git istatus.
- Your favorite editor opens. There, you see the output from git status, plus instructions for some basic commands: put an a at the head of a line to add a file, an i to ignore it from now on, an ea to edit then add (which you'll do if you're merging), an r to remove the file from the repository, and so on.
- You exit, and your instructions are run.
Git doesn't do that. So I wrote a demo script to make that happen, git-status-interactive.
Click that link to save the script to your hard drive, and make it executable via the usual chmod 755 git-status-interactive. You probably want to alias the script using Git's aliasing system. For example, to allow the git istatus command I'd shown above, try this command from your bash prompt, in a single git repository:
git config --add alias.istatus \!/your/path/to/git-status-interactive
Or, to set the alias for every repository under your user account rather than just this one:
git config --global --add alias.istatus \!/your/path/to/git-status-interactive
Some further notes
The script is a demo--dead simple, with no serious error checking. To some extent it's a feature request: Dear Git team, please implement something like this in Git, but competently. Also, dear readers, please drop me an email if you've improved this thing.
By the way, Git does have git add -i, which behaves very differently from the edit-a-generated-file mechanism of git rebase --interactive. git add -i doesn't let me tick off files to ignore, and doesn't help immensely during merging, though it will give you more control when adding, like committing changes to sections of a file.
Apart from git status and the shell, I use exactly one program to make this happen: Sed. The prep step runs Sed to take in the output of git status and then remove non-comment lines and insert instructions; the post-editor step runs Sed to replace the one-character markers with the full commands. That's all.
Because the modified file just runs as a shell script, you can add other commands as you prefer. For example, replacing the # at the head of the line with an rm turns it into a standard remove command, or you can mv a file that git complains is in the wrong place (probably due to merging issues), et cetera.
In case you missed the link in the text above, download git status interactive here.
The schism, or why C and C++ are different
Those of you who actually read my posts about efficient computing, rather than just going to read the comics at the first sight of the word `computing', may by now have noticed a few patterns.
The most basic is that standards are important. I know this sounds obvious to you, but if it's so obvious, why do people get it wrong so darn often? Why are people constantly modifying and violating standards that work just fine?
I know many of you have suspected this for a while, but let me state it loud and clear: I am conservative. Rabidly conservative. I think that people need to have a really good reason for not conforming to technical standards, and I think most people don't--they just use the shiniest thing available. A large amount of my writing on technical matters is simply pointing out that well-thought-out technical standards tend to work better than the newest and shiniest, and that the value of stability often more than makes up for inevitable flaws in the standards. Even my work on patents is aimed at making sure that open standards remain open and free to implement.
I originally tried to make this into an essay about both computing standards and general customs, but over the course of writing it, I came to realize that the two are fundamentally different. If somebody doesn't quite conform to your human customs--if they use the wrong fork or speak non-native English or wear ratty t-shirts to the office--then the person will be funny or diverse or annoying or just normal. Meanwhile, if computing standards aren't followed--if somebody gets sick of C's array notation, array[i][j], and decides it looks nicer as array[i, j]--then their writing is 100% gibberish and they might as well be speaking Hindi to an English-speaker. Standards-breaking in social settings can be fun; standards-breaking in computing is just breaking things.
So although I usually try to put something in the technical essays that will be interesting to those who couldn't care less about machinery, I don't think any of the below is truly applicable to social norms. Or you can read on and decide for yourself.
Nor is this a comprehensive essay on standards drift and revolution, because that would take a volume or two. Just file this one as assorted notes on one question with an interesting proposed solution: what to do with all those people who keep trying to revise and update and modify the standards?
Schisms
Intuitively, there's the English-teacher approach to retaining a standard, where we force everybody to stay in line with the basic standard. When you go home to write your pals, your English teacher instructed you, be sure to use perfect grammar at all times.
But another approach is to let the whippersnappers fork. On the face of it, it may seem contradictory to think that splitting a standard in half would somehow make it purer, but under the right conditions, giving those who want to experiment room to do so can be the best approach.
For any technological realm, you've got one set of people who just want features--lots and lots of features, enough to wallow in like they're a bed of slightly moist hundred dollar bills--and you've got another team that wants fewer moving parts, and takes care to maintain discipline and stick to the existing norms. We can bind the two teams together, in which case they will constantly be fighting over little modifications to the system and neither team will be happy. That's what happens with English. Or you can have the schism.
Allow me to cut and paste from Amazon:
The C Programming Language
by Brian W. Kernighan, Dennis M. Ritchie
274 pages
Publisher: Prentice Hall PTR; 2nd edition (March 22, 1988)
Amazon.com Sales Rank, paperback: #4,457
Amazon.com Sales Rank, hardcover: #445,546
First edition, 228pp, 1978:
Amazon.com Sales Rank, paperback: #60,113
The C++ Programming Language
by Bjarne Stroustrup
911 pages
Publisher: Addison-Wesley Professional; 3rd edition (February 15, 2000)
Amazon.com Sales Rank, paperback: #11,797
Amazon.com Sales Rank, hardcover: #6,215
First edition, 327pp:
Amazon.com Sales Rank, paperback: #1,243,918
Things we conclude: C++ is much more complex than C--274pp v 911pp. C++ keeps evolving: from 1986 to 2000, the book has had three editions, over which it has almost tripled in size. People are still buying the 1978 edition of K&R C because it's still correct; the first edition of Stroustrup is so incompatible with current C++ that people can't give it away. Finally, Prentice-Hall really needs to lower the price on the hardcover edition of K&R. I mean, my book is selling better than their hardcover, which ain't right.
Meanwhile, C is as stable as can be. Cyndi Lauper has put out seven albums since K&R C came out. The changes from first to 2nd ed. of K&R are pretty small--literally, they're a fine print appendix. And, I contend here, it owes its immense stability to Bjarne Stroustrup. With Bjarne putting out a new version of C++ every few years that frolics along with still more features, Prentice-Hall is free to reprint the same version of the C book without people whinging about how it's missing discussion of mutable virtual object templates. The guys who want simplicity and stability buy K&R and the guys who want niftiness and fun features buy Stroustrup and everybody's happy.
The other technical standard I use heavily is TeX, and I'd been meaning, for the sake of full disclosure, to give a critique of TeX comparable to this here critique of Word. Fortunately, Mr. Nelson Beebe already did it for me, in this (PDF) essay entitled 25 Years of TeX and Metafont. The article alludes to exactly the same sort of schism in typesetting as in general programming: you've got the people who are totally ignorant of standards and just want the shiniest new thing, and the people who built a standard system that has been stable for the better part of 25 years. Since he's on the standards-oriented team, he gives many examples of how such stability has led to large-scale projects that have significantly helped humanity.
His discussion of its limitations is interesting because there really are features that need to be added to TeX--notably, better support for non-European languages and easier extensibility. But “TeX is quite possibly the most stable and reliable software product of any substantial complexity that has ever been written by a human programmer.” (p 15) Changing a code base that hasn't seen a bug in fifteen years is not to be taken lightly, and may never happen. Instead, we can expect to see a schism.
Evolution
In that 1986 edition of the C++ book, Bjarne wrote this: “since [two standards] will be used on the same systems by the same people for years, the differences should be either very large or very small to minimize mistakes and confusion.” I'm going to call this Bjarne's principle.
When you read about the raging debate between Blu-ray and HD DVD (I'm rooting for the one that isn't an acronym), don't think `now I have to worry about all my stuff being obsolete'. Thank those guys for distracting attention from DVD, which is a nice, stable format that hasn't changed in a decade, ensuring that your stuff has not become obsolete. People have made haphazard attempts to revise the CD format, but thanks to distractions like the MiniDisc and even DVD, your copy of Cyndi Lauper's first album is still the cutting-edge CD standard (specified in The Red Book, 1980). Attempts to incrementally tweak the CD standard never took off. Remember CD+G? If so, you're the only one.
So this is how conservatives evolve. Not from clean standards to floundering in pits of features, but revolutionary breaks from old clean standards to new clean standards. The feature pits are just distractions.
The process of evolution via incremental fixes directly breaks Bjarne's principle, because you get a stream of similar standards that are easily confused and comingled. Corporate-sponsored standards often suffer this failing (but not always), because setting standards that last for two decades and selling frequent updates are hard to reconcile. One company spent a while there naming its document standards with a year--standard '98, standard 2000, et cetera--which in my book means none of the formats are actually standard.
The only way to evolve while conforming to Bjarne's principle is to ride a system until it really doesn't do what you need anymore, and then revolt, building a new one that is clearly distinguished from the old, as we saw with DVD's overthrow of CD because CDs truly cannot store movies, or Ω's eventual overthrow of TeX because TeX truly cannot typeset Tamil.
The trick is to know when to revolt. When is a new feature so valuable that the old system should be abandoned? Many a dissertation has been written on this one, and I ain't gonna answer it here. But for well-thought-out technical standards, it's much later than you think, as demonstrated by the active 25-year-old standards above.
Back to C vs C++
I copied Bjarne's principle from the first edition of his C++ book, so it comes as no surprise that in the mid-80s, C++ made an effort to conform to Bjarne's principle. In the present day, it just doesn't, and the confusion lies in thinking that it still does.
Even in the first edition, there are incompatibilities between C and the new C++, but just a page or so in the appendix. The author explicitly states (1st ed., p 5) that he's walking into a world of C programmers and C code everywhere, so retaining compatibility is both sensible marketing and efficient.
But each of those enthusiastically added features, which together puffed the third edition up to nine hundred pages, breaks a little something in raw C. To give a simple example, I use the variable name template a few times, and a user wrote me to tell me that his C++ compiler broke on that, because in C++ template is a reserved keyword. Bjarne's principle dies another little death.
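To make the template problem concrete, here is a minimal sketch of the sort of code that trips a C++ compiler. The function name fill_greeting and the format string are made up for illustration; only the variable name template comes from the story above. A C compiler accepts this without comment, while a C++ compiler stops at the declaration, because it reads template as a keyword rather than as a name.

```c
#include <stdio.h>

/* Write a greeting for `name` into buf (at most len bytes, including the
   terminating NUL) and return the number of characters the full greeting
   needs, as snprintf does.
   The local variable is deliberately named `template`: an ordinary
   identifier in C, but a reserved keyword in C++. */
int fill_greeting(char *buf, size_t len, const char *name)
{
    const char *template = "Hello, %s.";
    return snprintf(buf, len, template, name);
}
```

Calling fill_greeting(buf, sizeof buf, "world") with a big enough buffer leaves the greeting in buf. Renaming the variable is the only change needed to get this past a C++ compiler, which is exactly the kind of small, surprising difference Bjarne's principle warns about.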
On the other side, the ISO added a few features to C a decade ago. The most notable for me is designated initializers; I've written several entries here about how much you can get out of this syntactic tweak. However, as of this writing, C++ has no intention of supporting them. This author feels the rationale paper for not using designated initializers gives “arguments that aren't very convincing”, and I'd agree.
The restrict keyword, also added to C in 1999, does a lot to get code running faster. The authors of C++ have to date rejected the idea of supporting it. But because it's just optimization advice that can be taken or left, here is a valid rule for the parsing of this keyword: replace all instances of restrict with a blank space. With no serious technological reason to exclude restrict, we're left with just social and æsthetic reasons, and in the subjective balancing of issues, C compatibility and Bjarne's principle were clearly low priorities.
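For readers who haven't met designated initializers, here is a short sketch of the C99 syntax. The struct, its fields, and the function names are hypothetical, chosen only to show the form: elements are set by name (or by index, for arrays), can appear in any order, and anything not mentioned is set to zero.

```c
/* A hypothetical struct, for illustration only. */
typedef struct {
    double mean;
    double variance;
    int count;
    const char *label;
} summary;

/* Fields are initialized by name, in any order; variance is not
   mentioned, so it is initialized to zero. */
summary make_summary(void)
{
    summary s = { .label = "heights", .count = 100, .mean = 5.5 };
    return s;
}

/* Array elements can be designated by index; the rest are zero.
   The caller must pass an index from 0 through 4. */
int flag_at(int i)
{
    static const int flags[5] = { [1] = 7, [4] = 9 };
    return flags[i];
}
```

The payoff is that adding a field to the struct, or reordering its fields, doesn't silently break every initializer in the code base, which is not true of the positional form.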
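As a sketch of what restrict looks like in use, consider the function below. The name add_arrays and its signature are my own invention for illustration. The qualifier is the caller's promise that the three arrays don't overlap, which lets the compiler skip reloading a[i] and b[i] after each store to out[i]. Delete every restrict and the function computes the same thing for non-overlapping inputs; only the optimization hint is lost.

```c
#include <stddef.h>

/* Set out[i] = a[i] + b[i] for i in [0, n).
   The restrict qualifiers promise the compiler that out, a, and b refer
   to non-overlapping storage; calling this with overlapping arrays
   breaks that promise and gives undefined behavior. */
void add_arrays(size_t n, double *restrict out,
                const double *restrict a, const double *restrict b)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Whether the hint buys any measurable speed depends on the compiler and the loop, so treat it as advice to the optimizer, not a guarantee.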
On a positive note, the last revision of C took a number of ideas from C++, after they'd been tested in C++'s feature pit for a few years, including the in-line comments with // which I use constantly and the inline keyword which I never use because the compiler will inline functions for you where appropriate. But in all cases, the rationale was that these features seemed useful and well-tested, not that adopting them would reduce the distance between the two languages.
All of these examples are to show you that modern C++ has basically thrown out Bjarne's principle. Many people still write “C/C++”, thinking of them as the same language, comfortably presuming that a C program will compile in a C++ compiler. But that hasn't really been true for maybe fifteen years now. Better would be to just acknowledge the schism. Let them drift further, because things can only get better once the pair are past confusion-maximizing near-similarity, leaving one well-set in its stability and one free to pursue novelty.
On Tuesday, December 1st, Sarah said: What role do the developers of compilers have in all of this? Clearly there is the standard and then the implementation of the standard, which may be more or less correct. Compare this to HTML and web browsers where standards barely exist (http://www.joelonsoftware.com/items/2008/03/17.html). Is this a difference between compiled and interpreted languages? Probably not, because you don't see people fighting (much) over various Lisp interpreters. Perhaps the rule is if you want to have an interpreted language, it had better not be popular. Maybe constantly evolving standards keep certain companies' market shares enormous?
