The statistics style report
It may sound like an oxymoron, but there is such a thing as fashionable statistical analysis. Where did this come from? How is it that our tests for Truth, upon which all of science relies, can vacillate from season to season like hemlines?
Before discussing those questions, let me tap on the brake, and point out that statistics as a whole is not arbitrary. The Central Limit Theorem is a mathematical theorem like any other, and if you believe the basic assumptions of mathematics, you have to believe the CLT. The CLT and developments therefrom were the basis of stats for a century or two there, from Gauss on up to the early 1900s when the whole system of distributions (Binomial, Bernoulli, Gaussian, t, chi-squared, Pareto) was pretty much tied up. Much of this, by the way, counts not as statistics but as probability.
Next, there's the problem of using these objective truths to describe reality. That is, there's the problem of writing models. Models are a human invention to describe nature in a human-friendly manner, and so are at the mercy of human trends. Allow me to share with you my arbitrary, unsupported, citation-free personal observations.
Number crunching
The first thread of trendiness is technology-driven. In every generation, there's a line you've got to draw and say `everything after this is computationally out of reach, so we're assuming it away', and the assume-it-away line drifts into the distance over time. Here's a little something from a 1939 stats textbook on fitting time trends (Arkin and Colton, 1939, p 43):
To fit a trend by the freehand method draw a line through a graph of the data in such a way as to describe what appears to the eye to be the long period movement. ...The drawing of this line need not be strictly freehand but may be accomplished with the aid of transparent straight edge or a “French” curve.
As you can imagine, this advice does not appear in more recent stats texts. In this respect, a stats text can actually become obsolete. But as time passes, approximations like this are replaced by new techniques that were before just written off as impossible. [Now reading: Hastie and Tibshirani (1990), who offer a few hundred pages on computational methods to do what was done by freehand above.]
Computational ability has brought about two revolutions in statistics. The first is the linear projection (aka, regression). Running a regression requires inverting a matrix, with dimension equal to the number of variables in the regression. A two-by-two matrix is easy to invert (¿remember all that about ad - bc?) but it gets significantly more computationally difficult as the number of variables rises. If you want to run a ten-variable regression using a hand calculator, you'll need to set aside a few days to do the matrix inversion. My laptop will do the work in 0.002 seconds. It's still in under a second up to about 500 by 500, but 1,000 by 1,000 took 8.9 seconds. That includes the time it took to generate a million random numbers.
So revolution number one, when computers first came out, was a shift from simple correlations and analysis of variance and covariance to linear regression. This was the dominant paradigm from when computers became common until a few years ago.
The second revolution was when computing power became adequate to do searches for optima. Say that you have a simple function to take in inputs and produce an output therefrom. Given your budget for inputs, what mix of inputs maximizes the output? If you have the function in a form that you can solve algebraically, then it's easy, but let us say that it is somehow too complex to solve via Lagrange multipliers or what-have-you, and you need to search for the optimal mix.
You've just walked in on one of the great unsolved problems of modern computing. All your computer can do is sample values from the function--if I try these inputs, then I'll get this output--and if it takes a long time to evaluate one of these samples, then the computer will want to use as few samples as possible. So what is the method of sampling that will find the optimum in as few samples as possible? There are many methods to choose from, and the best depends on enough factors that we can call it an art more than a science.
In the statistical context, the paradigm is to look at the set of input parameters that will maximize the likelihood of the observed outcome. To do this, you need to check the likelihood of every observation, given your chosen parameters. For a linear regression, the dimension of your task was equal to the number of regression parameters, maybe five or ten; for a maximum likelihood calculation, the dimension is related to the number of data points, maybe a thousand or a million. Executive summary: the problem of searching for a likelihood function's optimum is significantly more computationally intensive than running a linear regression.
So it is no surprise that in the last twenty years, we've seen the emergence of statistical models built on the process of finding an optimum for some complex function. Most of the stuff below is a variant on the search-the-space method. But why is the most likely parameter favored over all others? There's the Cramer-Rao Lower Bound and the Neyman-Pearson Lemma, but in the end it's just arbitrary. Gauss had no theorems that this framework gives superior models relative to linear projection, but it does make better use of computing technology.
Hemlines
The second thread of statistical fashion is whim-driven like any other sort of fashion. Golly, the population collectively thinks, everybody wore hideously bright clothing for so long that it'd be nice to have some understated tones for a change. Or: now that music engineers all have ProTools, everything is a wall of sound; it'd be great to just hear a guy with a guitar for a while. Then, a few years later, we collectively agree that we need more fun colors and big bands. Repeat the cycle until civilization ends.
Statistical modeling sees the same cycles, and the fluctuation here is between the parsimony of having models that have few moving parts and the descriptiveness of models that throw in parameters describing the kitchen sink. In the past, parsimony won out on statistical models because we had the technological constraint.
If you pick up a stats textbook from the 1950s, you'll see a huge number of methods for dissecting covariance. The modern textbook will have a few pages describing a Standard ANOVA (analysis of variance) Table, as if there's only one. This is a full cycle from simplicity to complexity and back again. Everybody was just too overwhelmed by all those methods, and lost interest in them when linear regression became cheap.
Along the linear projection thread, there's a new method introduced every year to handle another variant of the standard model. E.g., last season, all the cool kids were using the Arellano-Bond method on their time series so they could assume away endogeneity problems. The list of variants and tricks has filled many volumes. If somebody used every applicable trick on a data set, the final work would be supremely accurate--and a terrible model. The list of tricks balloons, while the list of tricks used remains small or constant. Maximum likelihood tricks are still legion, but I expect that the working list will soon find itself pared down to a small set as optimum finding becomes standardized.
In the search-for-optima world, the latest trend has been in `non-parametric' models. First, there has never been a term that deserved air-quotes more than this. A `non-parametric' model searches for a probability density that describes a data set. The set of densities is of infinite dimension. If all you've got is a hundred data points, you ain't gonna find a unique element of ℜ^∞ with that. So instead, you specify a certain set of densities, like sums of Normal distributions, and then search for that subset that leads to a nice fit to the data. You'll wind up with a set of what we call parameters that describe that derived distribution, such as the weights, means, and variances of the Normal distributions being summed.
But `non-parametric' models allow you to have an arbitrary number of parameters. Your best fit to a 100-point data set is a sum of 100 Normal distributions. If you fit 100 points with 100 parameters, everybody would laugh at you, but it's possible. In that respect, the `non-parametric' setup falls on the descriptive end of the descriptive-to-parsimonious scale. In my opinion.
I don't want to sound mean about `non-parametric' methods, by the way. It's entirely valid to want to closely fit data, and I have used the method myself. But I really think the name is false advertising. How about distribution-fitting methods or methods with open parameter counts?
Bayesian methods are increasingly cool. If you want to assume something more interesting than Normal priors and likelihoods, then you need a computer of a certain power, and we beat that hurdle in the 90s as well, leaving us with the philosophical issues. In the context here, those boil down to parsimony. Your posterior distribution may be even weirder than a multi-humped sum of Normals, and the only way to describe it may just be to draw the darn graph. Thus, Bayesian methods are also a shift to the description-over-parsimony side.
Method of Moments estimators have also been hip lately. I frankly don't know where that's going, because I don't know them very well.
Also, this guy really wants multilevel modeling to be the Next Big Thing in the linear model world, and makes a decent argument for that. He likes it because it lets you have a million parameters, but in a structured manner such that we can at least focus on only a few. I like him for being forthright (on the blog) that the computational tools he advocates (in his books) will choke on large data sets or especially computationally difficult problems.
Increasing computational ability invites a shift away from parsimony. Since PCs really hit the world of day-to-day stats recently, we're in the midst of a swing toward description. We can expect an eventual downtick toward simpler models, which will be helped by the people who write stats packages--as opposed to the researchers who caused the drift toward complexity--because they write simple routines that implement these methods in the simplest way possible.
So is your stats textbook obsolete? It's probably less obsolete than people will make it out to be. The basics of probability have not moved since the Central Limit Theorems were solidified. In the end, once you've picked your paradigm, not much changes; most novelties are just about doing detailed work regarding a certain type of data or set of assumptions. Further, those linear projection methods or correlation tables from the 1900s work pretty well for a lot of purposes.
But the fashionable models that are getting buzz shift every year, and last year's model is often considered to be naïve or too parsimonious or too cluttered or otherwise an indication that the author is not down with the cool kids--and this can affect peer review outcomes. A textbook that focuses on the sort of details that were pressing five years ago, instead of just summarizing them in a few pages, will have to pass up on the detailed tricks the cool kids are coming up with this season--which will in turn affect peer reviews for papers written based on the textbook's advice.
A model more than a few years old has had a chance to be critiqued while a new model has not. So using an old technique gives peer reviewers the opportunity to use their favorite phrase: the author seems to be unaware, in this case that somebody has had the time to find flaws in the older technique and propose a new alternative that fixes those flaws--while the new technique is still sufficiently novel that nobody has had time to publish papers on why it has even bigger flaws.
All this is entirely frustrating, because we like to think that our science is searching for some sort of true reflection of constant reality, yet the methods that are acceptable for seeking out constant reality depend on the whim of the crowd.
Please note: full references are given in the PDF version
git status interactive
One of the first things that struck me as nice about Git was the status command, which produces something just shy of a script for revising the status of all the files. It even gives you tips about how to do common tasks.
I got even more excited when I saw git rebase --interactive, which generates a semi-script, opens it for you to edit, and then runs the thing automatically. That was smooth.
So I expected there'd be a similar procedure like git status --interactive, which, if it existed, would work like this:
- You type git istatus.
- Your favorite editor opens. There, you see the output from git status, plus instructions for some basic commands: put an a at the head of a line to add a file, an i to ignore it from now on, an ea to edit then add (which you'll do if you're merging), an r to remove the file from the repository, and so on.
- You exit, and your instructions are run.
Git doesn't do that. So I wrote a demo script to make that happen, git-status-interactive.
Click that link to save the script to your hard drive, and make it executable via the usual chmod 755 git-status-interactive. You probably want to alias the script using Git's aliasing system. For example, to allow the git istatus command I'd shown above, try this command from your bash prompt, in a single git repository:
git config --add alias.istatus \!/your/path/to/git-status-interactive
Or if you have the permissions to make global changes to the git config:
git config --global --add alias.istatus \!/your/path/to/git-status-interactive
Some further notes
The script is a demo--dead simple, with no serious error checking. To some extent it's a feature request: Dear Git team, please implement something like this in Git, but competently. Also, dear readers, please drop me an email if you've improved this thing for the better.
By the way, Git does have git add -i, which behaves very differently from the edit-a-generated-file mechanism from git rebase --interactive. git add -i doesn't let me tick off files to ignore, and doesn't help immensely during merging; though it will give you more control when adding, like committing changes to sections of a file.
Apart from git status and the shell, I use exactly one program to make this happen: Sed. The prep step runs Sed to take in the output of git status and then remove non-comment lines and insert instructions; the post-editor step runs Sed to replace the one-character markers with the full commands. That's all.
Because the modified file just runs as a shell script, you can add other commands as you prefer. For example, replacing the # at the head of the line with an rm turns it into a standard remove command, or you can mv a file that git complains is in the wrong place (probably due to merging issues), et cetera.
In case you missed the link in the text above, download git status interactive here.
The schism, or why C and C++ are different
Those of you who actually read my posts about efficient computing, rather than just going to read the comics at the first sight of the word `computing', may by now have noticed a few patterns.
The most basic is that standards are important. I know this sounds obvious to you, but if it's so obvious, why do people get it wrong so darn often? Why are people constantly modifying and violating standards that work just fine?
I know many of you have suspected this for a while, but let me state it loud and clear: I am conservative. Rabidly conservative. I think that people need to have a really good reason for not conforming to technical standards, and I think most people don't--they just use the shiniest thing available. A large amount of my writing on technical matters is simply pointing out that well-thought-out technical standards tend to work better than the newest and shiniest, and that the value of stability often more than makes up for inevitable flaws in the standards. Even my work on patents is aimed at making sure that open standards remain open and free to implement.
I originally tried to make this into an essay about both computing standards and general customs, but over the course of writing it, I came to realize that the two are fundamentally different. If somebody doesn't quite conform to your human customs--if they use the wrong fork or speak non-native English or wear ratty t-shirts to the office--then the person will be funny or diverse or annoying or just normal. Meanwhile, if computing standards aren't followed--if somebody gets sick of C's array notation, array[i][j], and decides it looks nicer as array[i, j]--then their writing is 100% gibberish and they might as well be speaking Hindi to an English-speaker. Standards-breaking in social settings can be fun; standards-breaking in computing is just breaking things.
So although I usually try to put something in the technical essays that will be interesting to those who couldn't care less about machinery, I don't think any of the below is truly applicable to social norms. Or you can read on and decide for yourself.
Nor is this a comprehensive essay on standards drift and revolution, because that would take a volume or two. Just file this one as assorted notes on one question with an interesting proposed solution: what to do with all those people who keep trying to revise and update and modify the standards?
Schisms
Intuitively, there's the English-teacher approach to retaining a standard, where we force everybody to stay in line with the basic standard. When you go home to write your pals, your English teacher instructed you, be sure to use perfect grammar at all times.
But another approach is to let the whippersnappers fork. On the face of it, it may seem contradictory to think that splitting a standard in half would somehow make it purer, but under the right conditions, giving those who want to experiment room to do so can be the best approach.
For any technological realm, you've got one set of people who just want features--lots and lots of features, enough to wallow in like they're a bed of slightly moist hundred dollar bills--and you've got another team that wants fewer moving parts, and takes care to maintain discipline and stick to the existing norms. We can bind the two teams together, in which case they will constantly be fighting over little modifications to the system and neither team will be happy. That's what happens with English. Or you can have the schism.
Allow me to cut and paste from Amazon:
The C Programming Language
by Brian W. Kernighan, Dennis
M. Ritchie
274 pages
Publisher: Prentice Hall PTR; 2nd edition (March 22, 1988)
Amazon.com Sales Rank, paperback: #4,457
Amazon.com Sales Rank, hardcover: #445,546
First edition
228pp, 1978:
Amazon.com Sales Rank, paperback: #60,113
The C++ Programming Language
by Bjarne Stroustrup
911 pages
Publisher: Addison-Wesley Professional; 3rd edition (February 15, 2000)
Amazon.com Sales Rank, paperback: #11,797
Amazon.com Sales Rank, hardcover: #6,215
First edition, 327pp.
Amazon.com
Sales Rank, paperback: #1,243,918
Things we conclude: C++ is much more complex than C--274pp v 911pp. C++ keeps evolving: from 1986 to 2000, the book has had three editions, over which it has almost tripled in size. People are still buying the 1978 edition of K&R C because it's still correct; the first edition of Stroustrup is so incompatible with current C++ that people can't give it away. Finally, Prentice-Hall really needs to lower the price on the hardcover edition of K&R. I mean, my book is selling better than their hardcover, which ain't right.
Meanwhile, C is as stable as can be. Cyndi Lauper has put out seven albums since K&R C came out. The changes from first to 2nd ed. of K&R are pretty small--literally, they're a fine print appendix. And, I contend here, it owes its immense stability to Bjarne Stroustrup. With Bjarne putting out a new version of C++ every few years that frolics along with still more features, Prentice-Hall is free to reprint the same version of the C book without people whinging about how it's missing discussion of mutable virtual object templates. The guys who want simplicity and stability buy K&R and the guys who want niftiness and fun features buy Stroustrup and everybody's happy.
The other technical standard I use heavily is TeX, and I'd been meaning, for the sake of full disclosure, to give a critique of TeX comparable to this here critique of Word. Fortunately, Mr. Nelson Beebe already did it for me, in this (PDF) essay entitled 25 Years of TeX and Metafont. The article alludes to exactly the sort of schism in typesetting as in general programming: you've got the people who are totally ignorant of standards and just want the shiniest new thing, and the people who built a standard system that has been stable for the better part of 25 years. Since he's on the standards-oriented team, he gives many examples of how such stability has led to large-scale projects that have significantly helped humanity.
His discussion of its limitations is interesting because there really are features that need to be added to TeX--notably, better support for non-European languages and easier extensibility. But “TeX is quite possibly the most stable and reliable software product of any substantial complexity that has ever been written by a human programmer.” (p 15) Changing a code base that hasn't seen a bug in fifteen years is not to be taken lightly, and may never happen. Instead, we can expect to see a schism.
Evolution
In that 1986 edition of the C++ book, Bjarne wrote this: “since [two standards] will be used on the same systems by the same people for years, the differences should be either very large or very small to minimize mistakes and confusion.” I'm going to call this Bjarne's principle.
When you read about the raging debate between Blu-ray and HD DVD (I'm rooting for the one that isn't an acronym), don't think `now I have to worry about all my stuff being obsolete'. Thank those guys for distracting attention from DVD, which is a nice, stable format that hasn't changed in a decade, ensuring that your stuff has not become obsolete. People have made haphazard attempts to revise the CD format, but thanks to distractions like the MiniDisc and even DVD, your copy of Cyndi Lauper's first album is still the cutting-edge CD standard (specified in The Red Book, 1980). Attempts to incrementally tweak the CD standard never took off. Remember CD+G? If so, you're the only one.
So this is how conservatives evolve. Not from clean standards to floundering in pits of features, but revolutionary breaks from old clean standards to new clean standards. The feature pits are just distractions.
The process of evolution via incremental fixes directly breaks Bjarne's principle, because you get a stream of similar standards that are easily confused and commingled. Corporate-sponsored standards often suffer this failing (but not always), because setting standards that last for two decades and selling frequent updates are hard to reconcile. One company spent a while there naming its document standards with a year--standard '98, standard 2000, et cetera--which in my book means none of the formats are actually standard.
The only way to evolve while conforming to Bjarne's principle is to ride a system until it really doesn't do what you need anymore, and then revolt, building a new one that is clearly distinguished from the old, as we saw with DVD's overthrow of CD because CDs truly cannot store movies, or Ω's eventual overthrow of TeX because TeX truly cannot typeset Tamil.
The trick is to know when to revolt. When is a new feature so valuable that the old system should be abandoned? Many a dissertation has been written on this one, and I ain't gonna answer it here. But for well-thought-out technical standards, it's much later than you think, as demonstrated by the active 25-year old standards above.
Back to C vs C++
I copied Bjarne's principle from the first edition of his C++ book, so it comes as no surprise that in the mid-80s, C++ made an effort to conform to Bjarne's principle. In the present day, it just doesn't, and the confusion lies in thinking that it still does.
Even in the first edition, there are incompatibilities between C and the new C++, but just a page or so in the appendix. The author explicitly states (1st ed., p 5) that he's walking into a world of C programmers and C code everywhere, so retaining compatibility is sensible marketing and efficient.
But all those enthusiastically added features, that puffed the third edition up to nine hundred pages, each breaks a little something in raw C. To give a simple example, I use the variable name template a few times, and a user wrote me to tell me that his C++ compiler broke on that, because in C++ template is a reserved keyword. Bjarne's principle dies another little death.
On the other side, the ISO added a few features to C a decade ago. The most notable for me is designated initializers; I've written several entries here about how much you can get out of this syntactic tweak. However, C++ has no intention of supporting them. This author feels the rationale paper for not using designated initializers gives “arguments that aren't very convincing”, and I'd agree.
The restrict keyword, also added to C in 1999, does a lot to get code running faster. The authors of C++ have to date rejected the idea of supporting it. But because it's just optimization advice that can be taken or left, here is a valid rule for the parsing of this keyword: replace all instances of restrict with a blank space. With no serious technological reason to exclude restrict, we're left with just social and æsthetic reasons, and in the subjective balancing of issues, C compatibility and Bjarne's principle were clearly a low priority.
On a positive note, the last revision of C took a number of ideas from C++, after they'd been tested in C++'s feature pit for a few years, including the in-line comments with // which I use constantly and the inline keyword which I never use because the compiler will inline functions for you where appropriate. But in all cases, the rationale was because these features seemed useful and well-tested, not that adopting them would reduce the distance between the two languages.
All of these examples are to show you that modern C++ has basically thrown out Bjarne's principle. Many people still write “C/C++”, thinking of them as the same language, comfortably presuming that a C program will compile in a C++ compiler. But that hasn't been really true for maybe fifteen years now. Better would be to just acknowledge the schism. Let them drift further, because things can only get better once the pair are past confusion-maximizing near-similarity, leaving one well-set in its stability and one free to pursue novelty.
On Tuesday, December 1st, Sarah said: What role do the developers of compilers have in all of this? Clearly there is the standard and then the implementation of the standard, which may be more or less correct. Compare this to HTML and web browsers where standards barely exist (http://www.joelonsoftware.com/items/2008/03/17.html). Is this a difference between compiled and interpreted languages? Probably not, because you don't see people fighting (much) over various Lisp interpreters. Perhaps the rule is if you want to have an interpreted language, it had better not be popular. Maybe constantly evolving standards keep certain companies' market shares enormous?
Yet another git tutorial
Git is a revision control system, meaning that it is designed to keep track of the many different versions of a project as it develops, such as the stages in the development of a book, a tortured love letter, or a program.
Here's a typical story: you begin chapter one, and commit it to the repository. Your coauthor is working on chapter two, and does the same. Tomorrow, you pull out the current version of the repository, and both chapters are now on hand for you to revise. Later, your coauthor calls and tells you that she was robbed and used her laptop to block a bullet, and you reassure her that it's no problem, because the draft is safely stored in the repository. She gets a new laptop, checks out the current state of the project, and is back to revising your and her work so far.
I typically put even small solo projects under revision control, because it makes me a better writer/coder: I'm more confident deleting things when I know that they're safely stored should I want them back. Git makes this easy, as you'll see below.
Git's history is relevant to its use. It was written by the guy who originally wrote Linux, Linus Torvalds, with the intent of supporting Linux development. Linus is a communist in the best possible way, and thus pushes Git as a means of easy collaboration among equals. So if you like the idea of collaborative development, and especially if you're a computer geek, then Git will enthuse you.
To me, the most interesting thing about Git is just how many tutorials there are about it. It's a complex system, and people have interesting reactions to it. Some tutorials break through the complexity by just giving `to do this, type this' instructions; others get enthusiastic and effuse about the clever structure of the system more than showing you how to use it. For my purposes, I need something to explain the underlying concepts, and not condescend to the reader, but not confuse the story with all the details that are basically irrelevant to those who don't need to create clean patches (or know what a patch is) and don't see a need to cryptographically sign our book chapters.
I will assume that you are familiar with the basics of POSIX directories, files, text editing, and the usual basic commands like cp, mv, rm, and so on.
The structure
A revision control system does two difficult tasks: it has to organize a pile of slightly different versions of a project, and it has to merge together revisions, say when you and a pal are separately working on the same project and finally come together.
You can't merge in changes from a different version or check out an old version until you know how to find that version in Git's hierarchy, so we will begin there. You will see that this repository-branch-version-file hierarchy will determine much of how everything is done.
- There may be several repositories.
- Each repository holds several branches.
- Each branch holds several versions (aka commit objects).
- At any one time, you are looking at exactly one version of the project, and the several files that comprise that version.
The repository
The repository is a pile of versions, in a binary format that Git can read and you can't. Here is how you would create a new one, in a given directory that will be the repository from then on:
mkdir new_project
cd new_project
git init
You now have a blank repository. If you are putting an existing project under revision control, you will need to add existing files to the index with git add .; see below. Git stores all its files in a directory named .git, where the dot means that all the usual utilities like ls will take it to be hidden; after the init step, you can look for it via, e.g., ls -a.
This is the first embodiment of Linus's egalitarian communism: he wanted to make it easy to create lots of repositories on lots of machines. The key to this is that the .git subdirectory holds all the information, so copying your project directory (including the .git subdirectory) generates a new, entire repository. If you want to back up your project and repository, just recursively copy your project directory:
cp -r new_project project_bkup
Versions
Now create a new file for yourself, using your favorite text editor or what-have-you.1 Git doesn't know about this file yet; you have to tell the machine about it using the index (discussed below). For now, try
git add .
to add to Git's index everything in the current working directory (in UNIX-speak, .).
Now you can save your work to the repository, or in revision control jargon, commit your changes:
git commit -a -m "What I've been doing up until this check-in."
You can make more changes to your text file, and then re-run git commit -a -m "..." as often as you wish, thus creating a history of commits that you'll be able to refer to below. Notice that once you've put a file into the index, you don't need to re-run git add ....2
I've been avoiding the term commit objects as being a little too jargon, but the jargon does get across the idea of a single committed blob, which is treated as a unit for our purposes. You can use git log to get the list of commits in your history. The log shows two relevant pieces of information: the 40-digit SHA1 cryptographic hash, and the human-language message you wrote when you did the commit. The SHA1 hash is a computer-scientist clever means of solving several problems, and is the name you will use for the commit. Fortunately, you need only as much of it as will uniquely identify your commit. So if you want to check out revision #fe9c49cddac5150dc974de1f7248a1c5e3b33e89, you can do so with
git checkout fe9c
With that command, you've gone back in time, and have in your current directory whatever you had back then. Take notes, copy off that paragraph you wish you hadn't deleted, then
git checkout master
to return to the head of the master branch (which is where you started off, being that we haven't discussed creating new branches yet).
I suggested that you take notes and make observations, but not that you change anything. ¿What would happen if you were to build a time machine, go back to before you were born, and kill your parents? If we learned anything from science fiction, it's that if we change history, the present doesn't change, but a new alternate history splinters off. So if you check out an old version, make changes, and check in the changed version, then you've created your first branch off of the master branch.3
Git is designed to make it as easy as possible to bounce back to an alternate version and bounce back to where you were, as often as you need. But this may not be ideal, because you may want to have both versions living side-by-side. The easiest way to do this is to just make yourself a second repository.
cp -R /path/to/maindir new_tempdir
cd new_tempdir
git checkout fe9c
Now you can do side-by-side comparisons between the main version in the main repository and your disposable copy. There's also a git clone command that does about the same as this in a slightly slicker manner.
I've found the ability to quickly jump around in time to be immense fun, and it has made me a more confident editor. When in doubt, I make the cut, and know that the worst punishment for an error is the small bother of checking out an old version.4
The index versus your files
We've looked at prior check-ins; now ¿what will the next check-in look like? Git maintains what is called the index, which is the nascent list of files that will become a commit object when you next call git commit. That index is not identical to the files you see when you do a directory listing, for a few reasons. First, many systems produce annoying files like log files, object code, and other mid-processing cruft, and you don't want those taking up space in the repository. There are also advanced commit strategies wherein you may change several files, but want to save only the changes you'd made in one or two.
Regardless of the rationale, bear in mind that the working directory you are looking at is probably a mix of files Git is tracking and files Git doesn't care about. If you want to add a new file to the repository, remove an old one, or fix the name of a file, you'll need to do one of:
git add newfile
git rm oldfile
git mv flie file
so that the change is evident both in the working directory and in the index. As for files that you are modifying but are not shuffling around in the filesystem, you technically have to add those one by one as well by running git add modified_file with every single commit, but the -a in the command git commit -a tells git to save all modifications on known files (including removal), thus saving you all that tedium. In my own work, I have never encountered a reason to commit without the -a flag, but perhaps you may one day run across something.
Branches
To this point, you have been alternating between doing work and saving it via git commit -a, thus producing a sequence of committed versions of your program. That's a branch.
Perhaps thread would be a better term, being that this represents a single thread of your work conversation. By default, you are on a branch named master. Other threads come up in two manners: your own work may digress, or you may have colleagues who are following their own threads. The typical story in your own work would be that you are trying something speculative, which may or may not work. By creating a new branch, you are ensuring that you have something stable in the master branch at all times; you can merge the experimental branch back into the master thread later if all works out (where merging will be covered below).
¿What branch are you on right now? Find out with
git branch
which will list all branches and put a * by the one that is currently active.
Now create a new branch. There are two ways to do it:
git branch new_leaf
git checkout new_leaf
#or equivalently:
git checkout -b new_leaf
There are really two steps here: establishing a new branch in the repository, then changing your working version to make use of that branch. The two-command version makes that explicit; the single checkout -b form is provided because it's so common to want to immediately use the branch that you are creating.
Having built a new branch, you can switch between branches easily. E.g., to switch back to the master branch:
git checkout master
You can see that the branch checkouts, like git checkout new_leaf or git checkout master, have the same syntax as that infernal SHA1 syntax like git checkout fe9c. The reason is that the name of the branch is really just a synonym for the SHA1 hash that is the last item on that branch (aka the branch head). Use them interchangeably, though I'm guessing you'll lean toward using the branch name where possible.
Merging
To this point, everything has been about creating new versions, and jumping around between versions. Now for the hard part: you have a version, your colleague has a version, and they differ.
The command is simple enough. You have on your screen a current version, and you want to fold in the revisions from version fe9f. Then
git merge fe9f
will do the work. Of course, you can use a branch name as well, like git merge new_leaf.
For some things, the system will have no problem merging together the two threads: if your coauthor was working only on the intro to chapter three and you were working only on chapter three's conclusion, that's easy to merge.
But if you were both wrestling with the same paragraph, then the computer will be confused. It will tell you that there are conflicts, and write both into the file in your current directory. You will then have to open the file(s) in your text editor, and find the place where git wrote both versions for you to compare and choose from.
The other type of conflict, which is just annoying, is when your colleague has renamed a file or moved it from one directory to another. Git typically won't just move the darn file for you, but will instead list it as a conflict for you to deal with. Moving files can create other awkward issues; for example, if you are doing git pull from a subdirectory that your coauthor has deleted, you'll get entirely confused errors.
Here is the procedure for committing merges:
1. git merge a_branch.
2. Get told that there are conflicts you have to resolve.
3. Check the list of unmerged files using git status.
4. Pick a file to manually check on. Open it in a text editor and find the merge-me marks if it is a content conflict; move it into place if it's a file name or file position conflict.
5. git add your_now_fixed_file.
6. Repeat steps 3-5 until all unmerged files are checked in.
7. git commit to finalize the merge.
I have always found merging to be unnerving. There is a computer modifying your files, without even telling you what it is modifying. Unlike the long list of versions and your endless power to shunt branches, the merge algorithm is more-or-less a black box, and you just have to trust it. In that context, it's a somewhat good thing that the machine sometimes refuses to auto-merge and demands human attention. If the computer does go too far and makes a total mess of things, you can take recourse knowing that you have the previous version safely stowed.
The stash
It doesn't take long working with Git to discover that it doesn't like doing anything when there are uncommitted changes in the current working directory. It typically asks you to commit your work, and then do the checkout or such that you had intended.
One thing you can do in this case is a variant of the merge routine above: ask git status which files are tracked but modified; git add those files; then git commit your changes. Once your working tree, the index, and the latest commit are all in harmony, you can go back to your original plan.
Another sometimes-appropriate alternative is git reset --hard, which takes the working directory back to where you had last checked out. If the command sounds severe, it is because you are about to throw away all work you had done since the last checkout.5
The other option is the stash, a quick-to-use branch, with a few special features, like retaining all the junk in your working directory. Here is the typical procedure:
git stash
git checkout new_leaf #or another commit, or what-have-you
#do work here
git checkout master #or the branch you had stashed from
git stash pop
So this is the above procedure of checking out, doing work, and then returning to the current version, but you stash your in-progress working directory beforehand and pop it back into place afterward.
Popping the stash works by merging the stash's semi-branch back to whatever is currently checked out, which is why you have to check out the commit you had been on before going exploring in the history: doing the merge is trivial if you have checked out the commit that you had been on when you started, and could be a mess if you are elsewhere. The ability to pop the stash onto a separate commit allows for creative merging strategies which you may find use for if you are feeling clever.
Remote repositories
So far, I've talked only of checking out versions and branches that are in the repository of the directory you are in right now. But you can copy branches across repositories, which is how sharing happens. As alluded to above, there can be amusing reasons for cloning a directory to another directory, and then merging changes between them.
To do all this, you need to be able to copy a branch from another repository. And to do that, you will need to name the other repository. Do this via
#for an on-disk remote repository:
git remote add my_copy /path/to/copy
#and for a distant remote repository:
git remote add distant_version http://...
That is, you will give a nickname for the repository, then a locator. There are many options for locators; the odds are good that the maintainer of any given repository handed you a locator to use, so I won't bore you with a list of options here.
A plain git remote will give you a list of remote repositories your repository knows about. You will probably just assign one remote and be done with it, but you have the power to live Linus's dream of concurrently passing files among several of your peers' several repositories.
Having established a remote, you also have more branches to choose from: try git branch -r to list remote branches (or git branch -a to list all branches, local and remote).
There are a few ways to get a remote branch:
git checkout -b new_local distant_version/master
#or
git pull distant_version master
The first version uses the plain checkout mechanism using the remote tag you got via git branch -r, and uses the -b flag to create a name for the new branch you are about to create (otherwise you'd be stuck on (no branch)). The pull version merges into whatever you are working on now. You are probably getting things from the repository to bring your own work back up-to-date and in sync, so you probably want to use pull instead of checkout.
The converse is push, which you'll use to update the repository with your last commit (not the state of your index or working directory).
git push distant_version
You may need to pull first, such as when somebody has made another commit to the repository while you were working. In that case, do another git pull, slog through the merging procedure, and then push back the merged result.
git help
That's all I'm giving you, and it should be enough for you to keep versions of your work, confidently delete things, merge in your colleagues' work, and be able to keep your bearings in Git's repository/branch/version/index system. From there, git help and your favorite search engine will teach you a whole lot of ways to do these things more smoothly, and many of the tricks that I didn't cover here.
Footnotes
- ... what-have-you.1
- Git, like every revision control system I know, works best around the POSIX paradigm of relatively short lines of text. Computer code naturally looks like this, as does typical human-written plain text (with linewrap enabled so you don't have one line per paragraph). If you're using a binary format like a word processor document, then Git will have trouble making meaningful merges, and you're basically stuck using whatever revision control the word processor vendor gave you (if any).
- ... ....2
- Much of git's advanced technique is about rewriting the history to produce a smoother course of events. This document makes a point of not worrying about the history, but I will mention one nice feature to keep your log from filling up with small commits. After you commit, you will almost certainly slap your forehead and realize something you forgot. Instead of just doing another commit, you can do git commit --amend -a to add to your last commit.
- ... branch.3
- You haven't named your branch, which can create problems. Getting ahead of the story, if you ever call git branch and find that you are on (no branch) then you can run git checkout -b new_branch_name to name the branch you've just splintered off to.
- ... version.4
- See also, e.g., git show fe9c:oldfile to just display a single file from an old version without doing a full checkout.
- ... checkout.5
- By the way, you can reset individual files by just checking them out. I recommend this syntax: git checkout -- one_file, after which one_file has reverted to its state as of the last checkin, and all changes are lost.
Easy re-typing with designated initializers
This column is about dealing with multiple formats for the same thing. To give the simplest example, consider a plain old list of numbers. The raw representation is an array, where the numbers are a sequence in memory. But then you don't know how long the thing is, so you need to also have a note somewhere as to its size. Maybe you want to name it, or treat it like a 2-D matrix. Next thing you know, you've got a long list of extra data taped to that simple list.
Now you've got design problems. In terms of the systems I work with, you've got several levels of intent and complexity, including the simple double *, the gsl_vector*, the gsl_matrix*, and the apop_data* structures, any of which could be used to represent a few numbers.
These different structures aren't just there for fun: a scalar doesn't necessarily behave like a 1×1 matrix or a one-unit vector.
- A scalar times an N×1 column vector is usually read to produce a scaled N×1 vector.
- A 1×1 matrix ⋅ an N-unit vector is a similar scaling operation, but here we'll have to assume that the vector is a row vector, and the output will be either a 1×N matrix or an N-element vector understood to be a row, depending on custom.
- A vector dot a vector is usually taken to mean a row vector x dot a column vector y, producing x1y1 + x2y2 + ... + xnyn. But a vector of length one doesn't match dimensions with a vector of length N > 1, so in this case we'd just throw an error.
There are already a lot of subtleties, like whether we want to be explicit about whether a vector is a row or a column, or just assume that it'll do whatever is needed to conform, or whether the output wants to be a scalar, vector, or matrix.
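To make the scalar-versus-vector distinction concrete, here is a minimal sketch in plain C, with no GSL or Apophenia involved. The function names scale and dot are mine, invented for illustration: scaling by a scalar always works, while a dot product between vectors of different lengths is reported as an error rather than quietly reinterpreted.

```c
#include <assert.h>
#include <stddef.h>

/* Scale each element of an n-element vector by a scalar, in place. */
void scale(double s, double *v, size_t n){
    assert(v || n == 0);
    for (size_t i = 0; i < n; i++) v[i] *= s;
}

/* Dot product of two vectors. Returns 0 on success and writes the result
   to *out; returns -1 and leaves *out alone if the lengths don't conform. */
int dot(const double *x, size_t nx, const double *y, size_t ny, double *out){
    if (nx != ny) return -1;   /* a 1-vector against an N-vector is an error */
    double total = 0;
    for (size_t i = 0; i < nx; i++) total += x[i] * y[i];
    *out = total;
    return 0;
}
```

So a one-element array handed to dot alongside a three-element array gets refused, but its single element handed to scale as a plain double does the scaling you probably meant.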
Dealing with complexity
These different, sort of overlapping types are necessary, but they inevitably add complexity to the system. There are some methods for dealing with these different types, all of which have their benefits and bugs.
- Just make everything the most inclusive structure. Pros: users don't have to think: everything is a named N-dimensional frame of long long double-precision floating-point numbers, labeled with arbitrary-length strings, and there's no need to worry about sub-types and such. Cons: writing down the number 14 is now a massive production. Now you couldn't distinguish a scalar from a 1×1 matrix if you wanted to.
- Overload functions, so a function can take any representation of a list of numbers as input, and handles the differences internally. This gives surface ease, because the function user usually doesn't have to think too hard about types. But if it's a double* you still need to remember to send in an extra length parameter, and it's hard to encode the above scalar/vector/matrix subtleties into such a system, because you're never quite sure how a function will read your inputs. The bugs produced by subtle differences like these are, in my experience, among the most difficult to debug.
- Have the user do the type-casting between things: pulling smaller elements out of the larger structs, and building purpose-built parent structures to wrap the smaller stuff. Cons: you need to know the structures, and have to do the work of explicitly stating things. Pros: the process of subsetting takes zero computer time, and the process of wrapping is not necessarily annoying, as discussed below.
None of these methods are ideal, and which devil you choose is a matter of local considerations, practical issues, and personal taste. I gravitate toward the third, wherein the user is expected to know the darn underlying hierarchy of types, and deal accordingly. Why? Because I've found that systems that hide that hierarchy from you do a lousy job of it. In case this essay isn't long enough, have a look at this essay on the law of leaky abstractions, which explains that you can get away with not explicitly acknowledging the different types for a while, but eventually you're going to have to confront the differences. With good technique, it's not hard to switch types on the fly.
Figure One: The double* to gsl_matrix/_vector to apop_data hierarchy.
Making it easy
Getting elements out of a structure is pretty easy: just point to it. Because you can have multiple pointers pointing to the same thing, it's easy to rename something that is deeply nested inside the hierarchy. E.g., an apop_model (which would float above the type diagram above) holds parameters in an apop_data set, which holds a gsl_vector, which holds a list of doubles. So:
double *list = my_model->parameters->vector->data;
do_something(list);
do_more(list[3]);
Since the new name is just a spare pointer to the same data, all changes to the data (without moving the pointers themselves) happen as expected, you didn't copy any data, and you don't have to free anything at the end. Clean and simple.
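If you'd like to see the aliasing at work without installing anything, here is a small sketch. The toy_vector, toy_data, and toy_model types are hypothetical stand-ins for the real nested structures, assumed only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the nested types; not the real GSL/Apophenia structs. */
typedef struct { size_t size; double *data; } toy_vector;
typedef struct { toy_vector *vector; } toy_data;
typedef struct { toy_data *parameters; } toy_model;

/* Double every element through an alias; nothing is copied and nothing needs freeing. */
void double_in_place(toy_model *m){
    assert(m && m->parameters && m->parameters->vector);
    double *list = m->parameters->vector->data;   /* a spare name for the same memory */
    for (size_t i = 0; i < m->parameters->vector->size; i++)
        list[i] *= 2;
}
```

After calling double_in_place on a model, the array at the bottom of the hierarchy has changed, because list and the deeply nested data pointer were the same address all along.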
Going up the hierarchy is more difficult, because you need to add all that extra data yourself; one paragraph was enough to cover going down the tree, and the rest of this column will be about going up the tree. I'll start from the most verbose, and work my way toward the easier methods, so don't get discouraged by the part where I use ten lines to take a dot product--I'll have it back down to one in the end. As we C users like to say, there are an infinite number of ways to do it.
There are functions to wrap things. For example, the apop_dot function has a quite clean syntax that takes in two apop_data structs (plus optional parameters indicating transposition), but if your data isn't in that input format, you'll need to wrap it. Here's an example where we know that we're multiplying a 3×2 matrix against a two-element column vector, using a function to copy data to the right structure:
apop_data *a_dot(double *set1, double *set2){
    apop_data *d1 = apop_line_to_data(set1, 0, 3, 2); // copy into a 3x2 matrix
    apop_data *d2 = apop_line_to_data(set2, 2, 0, 0); // copy into a 2-element vector
    apop_data *out = apop_dot(d1, d2);
    apop_data_free(d1);   // the wrappers hold copies, so free them
    apop_data_free(d2);
    return out;
}
If you're just doing this once, the deallocations at the end may be optional, but if you're writing a function to be called a million times, they'll become essential.
For matrices or vectors, you could produce a dummy wrapper and then point to the data. But don't forget to unlink before calling the free function, lest you lose the original data:
apop_data *another_dot(gsl_vector *v, gsl_matrix *m){
    apop_data *dummy1 = apop_data_alloc(0,0,0);
    apop_data *dummy2 = apop_data_alloc(0,0,0);
    dummy1->vector = v;
    dummy2->matrix = m;
    apop_data *out = apop_dot(dummy1, dummy2);
    dummy1->vector = NULL;   // unlink, so the free below leaves v alone
    dummy2->matrix = NULL;   // likewise for m
    apop_data_free(dummy1);
    apop_data_free(dummy2);
    return out;
}
That is unabashedly a lot of work for one dot product.
The first way in which we can save the trouble of deallocating is to use the static keyword to guarantee that a shell will always be on hand to fill. If you're not familiar with static variables, see pp 39-40 of Modeling with Data.
I do this sort of thing so often that I even have a convenience macro to simplify the process.
#define Staticdef(type, name, def) static type name = NULL; \
    if (!(name)) name = def;
apop_data *easier_dot(gsl_vector *v, gsl_matrix *m){
    Staticdef(apop_data*, dummy1, apop_data_alloc(0,0,0));
    Staticdef(apop_data*, dummy2, apop_data_alloc(0,0,0));
    dummy1->vector = v;
    dummy2->matrix = m;
    return apop_dot(dummy1, dummy2);
}
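For those who haven't met static variables, here is a stripped-down sketch of what the macro is doing. The shell type, the get_shell function, and the alloc_count counter are all hypothetical, there only to show that the allocation happens once and the same shell comes back on every later call.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { double *payload; } shell;   /* hypothetical wrapper type */

static int alloc_count = 0;   /* how many times we really allocated */

shell *get_shell(void){
    static shell *s = NULL;   /* keeps its value between calls */
    if (!s){
        s = calloc(1, sizeof(shell));
        assert(s);
        alloc_count++;
    }
    return s;
}
```

The catch, which I take to apply to the macro version as well, is that every caller shares the one shell, so this is not something to lean on in recursive or multithreaded code.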
The next step in the chain is to just produce that dummy structure on the fly, which is where designated initializers come in.
apop_data *easiest_dot(gsl_vector *v, gsl_matrix *m){
    apop_data dummy1 = {.vector = v};
    apop_data dummy2 = {.matrix = m};
    return apop_dot(&dummy1, &dummy2);
}
What just happened: we used designated initializers (p 32 of Modeling with Data) to allocate a structure and fill one element. The elements not explicitly mentioned are zero, so we don't have to worry about them. This works for the apop_data structure because it is designed to be OK with being mostly empty; below we'll see some structures that are a bit more needy.
That trick allocated an apop_data struct, but you'll notice that every library function takes in a pointer: apop_data*. This distinction is why we need to use &dummy1 instead of just dummy1 when making the function call. But this setup means that we don't have to deallocate anything at the end: the structure is cleaned up automatically when the function exits.
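Here is a self-contained sketch of the same trick using a hypothetical toy_data struct (my stand-in, not the real apop_data), so you can check for yourself that the unmentioned elements really are zero.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double *vector;
    double *matrix;
    char  **names;
    int     flags;
} toy_data;   /* hypothetical stand-in for apop_data */

/* Count how many of the pointer members are non-NULL. */
int members_set(const toy_data *d){
    assert(d);
    return (d->vector != NULL) + (d->matrix != NULL) + (d->names != NULL);
}

int wrap_and_count(double *v){
    toy_data dummy = {.vector = v};   /* matrix, names, and flags are zeroed for us */
    return members_set(&dummy);       /* dummy is cleaned up when this function returns */
}
```

Calling wrap_and_count with a real array reports exactly one member set, and there is no free anywhere in sight.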
Some people are lines-of-code averse, and really hate the idea of having those extra lines of code producing extra structures. So, just do it in place:
apop_data *one_line_dot(gsl_vector *v, gsl_matrix *m){
    return apop_dot(&((apop_data) {.vector = v}), &((apop_data) {.matrix = m}));
}
I like the three-line form better, myself, partly because I need the (apop_data) type cast when not on the declaration line. Maybe some macros will clean up the second form:
#define d_from_v(v) &((apop_data) {.vector = (v)})
#define d_from_m(m) &((apop_data) {.matrix = (m)})
apop_data *one_line_dot(gsl_vector *v, gsl_matrix *m){
    return apop_dot(d_from_v(v), d_from_m(m));   // same vector-then-matrix order as above
}
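One thing worth knowing before leaning on macros like these: a compound literal inside a function lives only until the end of the enclosing block. Passing its address down into a function call is fine; returning it is not. Here is a minimal sketch, where toy_data, toy_from_v, sum, and sum_array are hypothetical names of my own.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { const double *vector; size_t size; } toy_data;  /* hypothetical */

/* Wrap an array in a throwaway struct and hand back its address. */
#define toy_from_v(v, n) (&(toy_data){.vector = (v), .size = (n)})

double sum(const toy_data *d){
    assert(d && d->vector);
    double total = 0;
    for (size_t i = 0; i < d->size; i++) total += d->vector[i];
    return total;
}

double sum_array(const double *a, size_t n){
    /* The compound literal lasts until the end of this function body,
       so passing its address down to sum() is safe. */
    return sum(toy_from_v(a, n));
}
```

Had sum_array tried to return toy_from_v(a, n) itself, the caller would be holding a pointer to a structure that no longer exists.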
Going from a raw array to the GSL's vectors and matrices requires a little more care, because you'll need to add some metadata: the number of rows/columns, and the requisite jumps.
apop_data *a_dot_again(double *set1, double *set2){
    gsl_matrix m = {.data = set1, .size1 = 3, .size2 = 2, .tda = 2}; // 3 rows, each 2 doubles long
    gsl_vector v = {.data = set2, .size = 2, .stride = 1};
    return apop_dot(d_from_m(&m), d_from_v(&v));   // the macros expect pointers
}
The tda (trailing dimension of array) and stride elements tell the system how to convert the 1-D layout in memory into the right shape. For subvectors and submatrices, the jumps may take different forms, but for our purposes, the tda is always equal to the length of a row (that is, the number of columns, size2), and the stride is always one. With that in mind, we can wrap these details in macros, and daisy-chain it all together:
#define v_from_a(v, size) &((gsl_vector) {.data = (v),\
        .size =(size), .stride = 1})
#define m_from_a(m, size1, size2) &((gsl_matrix) {.data = (m),\
        .size1 =(size1), .size2=(size2), .tda = (size2)})
apop_data *a_dot_again(double *set1, double *set2){
    return apop_dot(d_from_m(m_from_a(set1, 3, 2)), d_from_v(v_from_a(set2, 2)));
}
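If the stride and tda still feel like magic numbers, here is the arithmetic written out in plain C. The vec_get and mat_get helpers are hypothetical names of mine, not GSL functions, but they follow the indexing rule the GSL documents: element i of a vector sits at data[i*stride], and element (i, j) of a matrix sits at data[i*tda + j].

```c
#include <assert.h>
#include <stddef.h>

/* Element i of a vector laid over a flat array with the given stride. */
double vec_get(const double *data, size_t stride, size_t i){
    return data[i * stride];
}

/* Element (i, j) of a row-major matrix whose rows start tda doubles apart. */
double mat_get(const double *data, size_t tda, size_t i, size_t j){
    assert(j < tda);
    return data[i * tda + j];
}
```

With the six numbers 1 through 6 read as a 3×2 matrix, a tda of 2 puts the 6 at row 2, column 1; reading the same array with a stride of 2 walks down the first column.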
In the end, you're still going to have to climb your way up the hierarchy a few steps for this array-to-data case to work. It's up to you if you want to take that last step and write a few more macros:
#define dv_from_a(a, size) d_from_v(v_from_a(a, size))
#define dm_from_a(a, size1, size2) d_from_m(m_from_a(a, size1, size2))
All of these macros are cheap, in the sense that they allocate short structures and don't copy any of your data. Also, they're a whole lot shorter than the ten-line version.
On the con side, I think there exist people who would call them bad style, because you're not using the formal methods of allocation (e.g., gsl_vector_alloc), and are thus bypassing checks that things are OK. Situations that depend on those ignored structure elements having non-NULL values may surprise you in odd cases.
There's the problem that by skipping the setter functions, you're assuming knowledge of the internal structure of the struct that shouldn't be your problem--which is true: you the user shouldn't really have to care about tdas. At least you can look those details up once and hide them in a macro. This argument usually continues that the underlying structures might change as the designers come up with new ideas, but this is not seriously an issue. Early on, structures change, but at this point, the GSL and even Apophenia have a sufficiently large base of users that arbitrarily screwing around with core structures is a social impossibility. So, frankly, the macros here are not as bad form as the textbooks say they are.
Summary paragraph: There's real benefit to having different types: a scalar is just not a 1×1 matrix or a one-item vector, so we need to be able to specify all these structures. But, as a direct corollary, we need to be able to easily jump between structures as necessary. In this column, I gave you nine examples of how to take a dot product, depending on the inputs. Our pals designated initializers and compound literals saved the day, because they let us set up a quick structure, fill it, and use it without worrying about memory and deallocation. You can apply these tricks in a variety of situations; for those of you who might follow exactly the array-to-matrix-to-data forms above, here are all the macros I used in one cut-and-pasteable block:
#define Staticdef(type, name, def) static type name = NULL; if (!(name)) name = def;
#define d_from_v(v) &((apop_data) {.vector = (v)})
#define d_from_m(m) &((apop_data) {.matrix = (m)})
#define v_from_a(v, size) &((gsl_vector) {.data = (v), .size =(size), .stride = 1})
#define m_from_a(m, size1, size2) &((gsl_matrix) {.data = (m), .size1 =(size1), .size2=(size2), .tda = (size2)})
#define dv_from_a(v, size) d_from_v(v_from_a(v, size))
#define dm_from_a(m, size1, size2) d_from_m(m_from_a(m, size1, size2))
Computing history and its scars
The history of computing is a catalog of scars. Glitches that software ran into when it was around ten years old still leave creaks and aches today.
Punch cards
We'll start with punch cards, because that's what they had when FORTRAN was developed in the mid-1950s.
FORTRAN code, going by the 1977 standard, is based on the format of a punch card: the first five columns are reserved for labels, and a mark in the sixth indicates a continuation of the previous line, which is necessary because anything after the 72nd character in a line is ignored, because that's how wide punch cards were circa the 1950s. FORTRAN later dropped these requirements, but here in 2009, I still deal with active, working code that is bound by the '77 standard for conforming with paper cards first standardized in 1928.
Figure One: I still need to follow this format when I work on our 64-processor server.
Input via cards is divided into decks. You have a declaration deck, listing variables and registers which will be used. Another deck initializes constants. Each subroutine will be another deck. Decks should remain as independent as possible.
You can tell a punch card language by this division into decks. COBOL is the big winner on this, with a number of decks that are basically required for every program: identification division, environment division, input-output section, data division, file section, working-storage section, procedure division, and these section titles alone are already longer than a lot of Perl programs.
As noted, modern Fortran really looks nothing like FORTRAN from the mid-century, although Fortran users do still have an annoying custom of writing their code in all caps. The division into decks, however, is still basically there. SAS, for those of you who have the $$$ to use it, hasn't changed at all in its rhythm, requiring that every thought be broken down into a separate deck: CARD ...END CARD; DATA ...END DATA; PROC ...RUN; PROC ...RUN.
Teletype
Let us step forward to the mid-60s, and the teletype (aka, the line printer). The teletype's unit of analysis, comparable to the punch card, is the line of text. Bell Labs evidently had a run on these things, because in the early 1970s, they solidified a lot of what we use today, notably C and UNIX, with the teletype interface in mind.
C and UNIX are heavily dependent on plain U.S. English text, divided into reasonably short lines. UNIX has a large number of programs that take in a line of text, find some patterns, make some replacements, delete some lines, and so on; C is built to facilitate this. Here in the modern day, it's still something of a pain to use UNIX utilities to find patterns that span multiple lines.
Programming languages, more than anything, are written so that programmers could make their own jobs easier. Once that part is in place, our programmers may then move on to producing something actually productive.
Much of the paradigm can thus be traced back to the process of writing a compiler for yet another programming language. This involves taking in lines of text using a standard U.S. English alphabet, then recognizing certain patterns, and then outputting some other embodiment of the instructions in that text. For a C compiler, the output would be assembly language for the computer hardware; for any other language here in the modern day, it's probably C code, meaning that you've converted lines of input text in one language into lines of output in another language.
So the tools that grew up fastest in this era are those that filter text to produce other text. They find text patterns and replace them with other patterns, or take an action given some pattern.
Me, I still write in C. It's the first language that is still in very common use today (see, e.g., http://www.langpop.com/). You can see that I find the central importance of newlines to be annoying, but without the structural constraints of punch card decks, and the added ability to define new structures (which Fortran '77 didn't have), you have just about the simplest language in which you can get serious work done. Fundamentally, the process of writing code is still the process of writing plain English text, and the people with teletypes were the first ones who could do that comfortably.
TV screens
But technology marches on, and next up are cathode ray tubes, graphical user interfaces, and window systems. In terms of typing program code into a text editor, not much changed, but outputting those windows is a non-trivial problem, which can be a mess if you're not organized. When a user clicks the button on the mouse, the computer has to be waiting for it, and has to know to send that signal to the program being pointed to, which has to know if it should send the signal to a button or a window border or a text box, and behave accordingly.
Thus, object-oriented programming, which in its modern form is built around accommodating the text-in-box-in-window-on-desktop sort of hierarchy, managing the demands of multiple windows all wanting to do something simultaneously, and letting a click mean different things in different places.
My impression of coding in the 1990s is that it dove head-first into an object-oriented paradigm. Not the paradigm, because there are other ways to do OO that are very different. For example, your typical C++ OO code [by no means all] has little interest in message-passing between objects. Nor is OO really an innovation per se, because the ideas existed before. But because of the interest in drawing windows and buttons, objects kinda took over.
Java is the main scar from this period: everything in Java is an object, which still bothers some. After all, the OO setup is perfect for windowing setups, but is not necessarily as applicable for everything else. Is your accounting system or your text editor really best written via exclusive use of a set of objects linked via a strict hierarchy? The answer is up for debate, but the influence of windowing technology is such that these object design patterns showed up in accounting programs anyway. Features were taped on to lots of languages, including Fortran and COBOL and Lisp, to include the 1990s idea of how OO should be done.
Fun fact: the Gnome system, which is the desktop on half the world's Linux boxes (that have desktops), is written in plain C, not the object-centric C++. C was once compilable as C++, but the two are no longer at all mutually intelligible. Using the string C/C++ is a surefire way to get geeks to taunt you. I can't guess as to the motivations, but I take this as a pendulum swing back from complex object structures. It's also proof that the whole OO thing didn't really have to happen, and in an alternate universe, windowing systems didn't lead to a trend where every language bolted on an OO syntax.
Network
The latest physical technology innovation is the Web. The original language of the Web was HTML, which has done an interesting job of imposing itself on coding in general. The first few drafts of HTML were pretty directly oriented toward putting text on the screen, plus some tags to format the text so it'll blink or turn puce when you click on it.
HTML's structure has since been used for the expression of all sorts of attributes for all sorts of data, via extensible markup language, XML. HTML is itself a specific case of a broad precursor class of markup, SGML. XML falls somewhere in between. XML is verbose, and is hard to search, and has lots of little details that make it potentially hard to parse. It requires multiple files to get anything done, because before you get to the part where you describe what's going on, you need another description of the elements in your data classification system, and of course, the hierarchy of objects that those elements fall into.
But it is Web-friendly. The trouble with writing for the `Net is that you can quickly get caught up in duct tape. To write something functional, you've got JavaScript (a basically C-like language) sending requests to a server (probably running a scripting language from the 90s like Perl or Python), which might talk to a database system running its own ad hoc language, and linking all this so data flows correctly is far from trivial. XML lets you store your data, your form templates, and even the final output, in the same format.
So every language out there has an XML parser, which turns XML into a structure that your program can actually use. Yes, I have an accounting program that stores all its data in XML.
As a digression, JavaScript and Ruby, two web-friendly languages, do have one other nice feature: they make it very easy to write functions that make use of little functions to do simple tasks. If you want each element x_i of an array to turn into f(x_i), then it's easy to make that happen in one line of code. Like OO, this is nothing new (you can do it in C), but it is a way of thinking that the language can facilitate. It has an especially strong history in Lisp, which emerged a year or two after FORTRAN. But it still seems to be trendy these days.
That's all I could think of in terms of what the Web has given to the problem of coding, and it's not the biggest deal. But, like OO, it's now a requisite for any language to have legs: how quickly can I use this to output an XML-compliant web page? In 2078, our systems will still have to be able to read XML, for compatibility with legacy systems written with web standards circa 2006 in mind.
