Probability versus likelihood
Here's the question: should we distinguish probabilities from likelihoods?
In case you didn't even know there was a distinction, here are the definitions. First, let the word odds take our intuitive meaning. A probability gives the odds of an event, given any parameters. Given that the mean is zero and the variance one, what are the odds that the draw will be between 1.1 and 1.2? A likelihood gives the odds of parameters given data. We drew a 1.3 from the distribution; what are the odds that the mean is zero?
Now, the probability can be verified. We can make a million draws from the distribution, and then count up what percentage are between 1.1 and 1.2, and call that the odds. Likelihood can't be verified. We have only one distribution to draw from, so we have no story about re-drawing from millions of different distributions and developing a confidence that the data comes from one or another.
Some folks take this as settling the question, concluding that there's a distinction to be made. The odds of data is relatively concrete; the odds of a parameter is at best a metaphor to probability. The most famous person who stopped here is Mr R A Fisher.
Fisher is an interesting character, in that his techniques are indisputably the baseline for modern statistics, but his larger worldview didn't survive. Do you know the definition of a fiducial distribution (without checking Wikipedia)? Well, it was central to Fisher's overall system. Fisher was the head of the Department of Eugenics at University College London, and the reputation of that field hasn't been very good ever since that one holocaust. I do not claim that Fisher endorsed the Holocaust. However, he was clear in his endorsement of selective breeding and other such eugenic principles that we now consider to be politically incorrect and/or evil.
Fisher was vehement in distinguishing between probability and likelihood--in fact, he coined the term likelihood to make that distinction. There's a comment on the matter from him on p 329 of Modeling with Data.
But things are a little more blurred than that. We'll start with the probability side, and the story above about making a million draws of an event. That story is problematic in more cases than it is easy. What are the odds of rain tomorrow? We only have one tomorrow to live. We could look at comparable prior days, but how similar must a situation be before it's properly comparable?
What percentage of cars passing an intersection are SUVs? If we observe over the course of a day, aren't we confounding the rush hour rate with the mid-day rate and the midnight rate? If we take full days, will Monday have the same rate as Sunday, and does a Monday in January have the same rate as a Monday in April?
The frequentist interpretation of probability assumes an infinite stream of data that is identical in all manners but the single variable we care about. This is obviously a fiction, but we don't mind because there are enough cases where it approximately works. Our weatherfolk and surveyors worked out something serviceable and run with it. We can think of regression analysis as an attempt to patch over the “all else equal” assumption.
In the probability v likelihood context, the distinction starts to blur. We only have one distribution, so the likelihood is a human-invented fiction. We only have one tomorrow, so the probability of rain is also a human-invented fiction.
Now let's go the other way, and consider how imaginary or subjective these parameters are. Maybe we've observed a few manufacturers and we know that contaminants per million is Normally distributed with a different, observable-by-history mean for each manufacturer. Now we have a situation where the parameters of the Normal are not taken from an imaginary distribution, but are just as much observed as the data is. The odds that a given data draw is taken from one Normal distribution or another can be calculated from the records on hand.
The joint distribution
Me, I spend a lot of my time writing code. What would a probability function look like? It would take in some data and parameters, and put out a number between zero and one: p(data, params). What would a likelihood function look like? Well, it would take in some data and parameters, and put out a number between zero and one: l(data, params). And in fact, for a given model, like a Normal distribution, the two functions are identical. So how's that for a trip: by our traditional interpretation, a function viewed one way (with the parameters fixed) is an objective and verifiable fact of nature; the very same function viewed another way (with the data fixed) is subjective and human-invented. Let the data be x and the parameter be β, then P(x| β) is objective and P(β| x) is subjective.
In a couple of ways, P(x, β) is a combination of the objective and subjective. A full, unconditional distribution of data, P(x), would certainly count as objective in our classification scheme (assuming away the practical problem of gathering it), and P(β| x) is the subjective likelihood, and we can combine the two to produce the joint, not-conditional likelihood P(β| x)⋅P(x) = P(x, β). Of course, you can also do it the other way, again combining an objective and subjective component to get P(x| β)⋅P(β) = P(x, β).
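Here is a toy version of the two factorizations, assuming the simplest case where the joint distribution is just a table. The four cell values are invented for the illustration, and the function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical joint table P(x, beta) over two data values and two
   parameter values: rows index x, columns index beta, cells sum to one. */
static const double joint[2][2] = {
    {0.10, 0.30},
    {0.40, 0.20},
};

/* P(x): sum the row over both parameter values. */
double p_x(int x) {
    assert(x == 0 || x == 1);
    return joint[x][0] + joint[x][1];
}

/* P(beta): sum the column over both data values. */
double p_beta(int b) {
    assert(b == 0 || b == 1);
    return joint[0][b] + joint[1][b];
}

/* The two conditional slices of the same table. */
double p_x_given_beta(int x, int b) { return joint[x][b] / p_beta(b); }
double p_beta_given_x(int b, int x) { return joint[x][b] / p_x(x); }

/* Returns 1 if both factorizations rebuild the joint cell, up to rounding. */
int factorizations_agree(int x, int b) {
    const double tol = 1e-12;
    double via_x = p_beta_given_x(b, x) * p_x(x);
    double via_beta = p_x_given_beta(x, b) * p_beta(b);
    return via_x - joint[x][b] < tol && joint[x][b] - via_x < tol
        && via_beta - joint[x][b] < tol && joint[x][b] - via_beta < tol;
}
```

Nothing in the table knows which index is the data and which is the parameter; the conditionals are just rows or columns rescaled to sum to one.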
So is P(x, β) objective or subjective? What can we do with such a function?
I have no idea what Fisher would say, though I suppose this is easy enough to research; I invite comments from anybody who has done so. He seemed to be resistant to allowing the two conditional probabilities to be merged, and took pains to distinguish between a function of the data given the parameters (such as a probability) and a function of the parameters given data (of which the fiducial distribution is one); I'm not sure where he'd class the joint distribution from which both sides can be derived.
The post-Bayesian modernists are fine with accepting the entire thing as subjective. We don't know what probability is, and we don't know what likelihood is. It's all subjective from top to bottom.
Here, I'm taking something of a more moderate position, because I have no idea whether the fundamental philosophy questions are even answerable. I can tell you that for any sufficiently well-specified situation, I can give you a function P(x, β), from which we can derive slices that are functions only of the data or of the parameters. Sometimes, this is a mix of the data, the model, and subjective beliefs; sometimes it's just a table of observed data.
So that's why I don't distinguish between probability and likelihood. The philosophy issues are hard to untangle, and it's easy to find equations that are objective, subjective, or a mix of both depending on context and opinion. We may make distinctions between parameters and data, but the probability/likelihood formula P(x, β) doesn't care about the distinction. It's just a function where one input is a Roman character and the other Greek.
Data is typically not a plural
When we learned all those darn grammatical exceptions, we were usually told that they came about in some distant past, due to some arcane relic of old Dutch or something. But here in the new millennium, we have the chance to witness the development of a new grammatical exception.
If this sounds boring, bear with me: by the end of the column, about 360,000 people will die over this corner of grammar.
See, English has the concept of a collective singular, wherein a group of elements is treated as a unit: e.g., “that clump of birds is moving pretty fast”. The new exception is that this concept can apply to any group of anything except data. “The data shows a steep slope” is considered incorrect by some, who prefer “the data show a steep slope”.
If you are one of the people who think that “the data is” is wrong, please stop.
Some examples
First, let us imagine a world where English grammar would require all
groups to remain plural:
1. The agenda are on the table.
2. The trivia in this book are silly.
3. Steely Dan are playing at the pavilion.
4. The NIH owe me $12,000.
5. The U.S.A. are in a recession.
Notes:
1. Agendum/agenda has the same Latin-based form as datum/data.
Yet I have never heard a person who uses “the data are” use “the agenda are”.
2. Sentence #2 is the only one that is actually incorrect, due to the odd history of trivia. Here's the definition of trivium from the OED: in the Middle Ages, the lower division of the seven liberal arts, comprising grammar, rhetoric, and logic. That is, trivium was itself once a collective singular. The meaning evolved, and we can now group together a collective unit of facts about the trivium into bundles that are collectively a unit: trivia. In the present day, trivia is always a singular, because trivium refers not to individual facts but to the above fields of study. The singular of trivia is basically lost. And since I know you're gnawing to know, the other part of the seven liberal arts is the quadrivium: “the four mathematical sciences, arithmetic, geometry, astronomy, and music”.
3. Bands and orchestras are a great example of the whole being more than its parts.
4. The acronym in number 4 expands to National Institutes of Health, and they do continue to “lose” my invoices as quickly as I can send them. Acronyms are a great way to cohere a plural into a singular.
5. The 360,000 casualties mentioned above come from #5: the question of whether “the U.S.A. are” or “the U.S.A. is” is the difference between a Confederacy and a Federation, and was basically resolved by a civil war. People fought and died over the question of whether a set of elements should be taken as separate elements or a unit, just a box of parts or a coherent whole.
More mundane examples still reveal different points of view. Both “the flock of birds are flying” and “the flock of birds is flying” are correct, but one or the other probably sounds off to you. Maybe you flinched when I wrote “agendum/agenda has” at bullet point one above. Here, grammar is a window to the soul. I think that some people generally lean toward seeing the parts and some generally lean toward seeing the whole. Linguist readers are welcome to leave citations regarding my claim in the comments. In one case this difference in thinking led to a war, but in most cases it seems to just lead to people correcting other folks' grammar when the grammar really just reflects a difference in perception.
Oh, and hair is an interesting case: there's a form your hairs for a set of items that is not to be taken as a whole, and your hair referring to the whole mop on your head. It'd be great if we'd evolved more pairs like that, like maybe datums and data.
The math section
Let's get back to data, which is in the mathematical realm. Precision matters in math, and grammar needs to follow along. The sentence “that set of numbers is prime” is incoherent: only the individual numbers can be prime; a set can't be prime. The sentence “that set of numbers are dense” [1] is incoherent: only the set as a whole can be dense; individual numbers are not dense. We need both “the set is” and “the set are” in our grammar.
Similarly with data: sometimes we are looking at the gestalt, such as a statistic like the estimate of a regression parameter; sometimes we are looking at the individual elements, such as when we point out that all the numbers are positive. “The data are a matrix” is incoherent: on the left-hand side of the “are”, we refer to a plural, while on the right-hand side, we're stating a singular; the sentence reduces to a plural = a singular. It's a perfect demonstration that the left-hand side is meant to be taken as a collective singular, as expressed perfectly by “the data is a matrix”.
Efforts have been made to base the entirety of mathematics on sets of objects; in a world where collections are central, we desperately need both “the set of items is” and “the set of items are” to function; “the data is”/“the data are” is just a synonym.
Why the new exception?
Disclaimer to Ms. LDWH of Princeton, PA: the following paragraph does not apply to you. I know you're just following the darn style guide.
So why are “the agenda is” and “the set of elements is” OK, while “the data is” is now considered to be wrong? I can't put this politely, but I get the vibe that the people who correct “the data is” are just trying to indicate smartness--and failing. The process is perfect for the person working too hard at smart: (1) Identify trivia: data is actually a plural, and has a Latin-sounding singular. (2) Payoff: feel smarter for knowing trivia. (3) Find somebody who seems not to know your fact. (4) Big payoff: correct them!
Another of my pet peeves, which I've mentioned before, fits the same form: the use of methodology (the study of methods) as a synonym for method. Look at me! I used a five-syllable word! I think it's a synonym for a two-syllable word, but I chose to use the longer word anyway!
But, as above, there are times when data is a pile of parts, and times when it has meaning only as a whole. In all sorts of situations, our brains are wired to sometimes see the parts and sometimes the whole, and there's no point starting wars with people who see things differently.
Footnotes
[1] Dense: between any two elements of the set, there is another element of the set. E.g., between the real number 1.1 and the real number 1.2, there is 1.15.
[link][2 comments]
|
on Wednesday, June 17th, ajuc said I'm not native speaker, so I'm probably wrong, but - you gave example "that set of numbers is prime" which is of course wrong. But did you mean that "that set of numbers are prime" would be correct in English? It sounds very wrong to me. |
|
on Wednesday, June 17th, Anonymous said
@ajuc |
Better variadic functions for C
I really dislike how C's variadic functions are implemented. I think they create lots of problems and don't fulfil their potential. So this is my effort to improve on things.
A variadic function is one that takes a variable number of inputs. The most famous example is printf, where both printf("Hi.") and printf("%f %f %i \n", first, second, third) are valid, even though the first example has one input and the second has four.
Simply put, C's variadic functions provide exactly enough power to implement printf, and nothing more. You must have an initial fixed argument, and it's more-or-less expected that that first argument provides a catalog to the types of the subsequent elements, or at least a count. In the example above, the first two items are expected to be floating-point variables, and the third an integer.
There is no type safety: if you pass an int like 1 when you thought you were passing a float like 1.0, results are undefined. If you think there are three elements passed in but only two were passed in, you're likely to get a segfault. Because of issues like this, CERT, the software security group, considers variadic functions to be a security risk (Severity: high. Likelihood: probable).
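For readers who haven't written one by hand, here is roughly what the standard mechanism looks like. This is a minimal sketch using the real <stdarg.h> macros; the function sum_doubles is made up for the example, and it follows the usual convention of passing a count as the required fixed first argument.

```c
#include <assert.h>
#include <stdarg.h>

/* Sum `count` doubles passed after the fixed first argument.
   Nothing checks that the caller really passed `count` doubles. */
double sum_doubles(int count, ...) {
    assert(count >= 0);
    va_list args;
    va_start(args, count);
    double total = 0;
    for (int i = 0; i < count; i++)
        total += va_arg(args, double);  /* trusts the caller on the type */
    va_end(args);
    return total;
}
```

A call like sum_doubles(3, 1.0, 2.0, 3.5) behaves, but sum_doubles(2, 1, 2.0) passes an int where a double is read, and sum_doubles(3, 1.0) reads arguments that were never passed. Both compile without complaint, and both are undefined behavior.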
I understand that the designers of the system are reluctant to impose too much magic to make variadic functions work, like magically dropping into place an nargs variable giving an argument count. So today's post is an exercise in how far we can get in implementing decent variadic functions using only ISO C. To give away the ending, I manage some of the things we use variadic functions for in a safe and more convenient manner--optional arguments work very well--but it takes many little tricks, and I'm still short of true printf functionality.
Designated initializers
First, a digression into a pair of nifty tricks that C99 gave us: compound literals and designated initializers. I find that many people aren't aware of these things, because they're learning C from textbooks written before 1999, and using compilers that may not use the 1999 standard by default.
Darn it people, it's been a decade. This is not new.
The idea is simple: if you have a struct type, you can use forms like these to use an anonymous struct wherever it's appropriate:
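    typedef struct {
        int first, second;
        double third;
        gsl_vector *v;
    } stype;

    stype newvar = {3, 5, 2.3, a_vector};           // every element, in order
    stype nextvar = {3, 5};                         // third and v are zeroed
    newvar = (stype) {.third = 3.12, .second=5};    // compound literal, named elements
    function_call( (stype) {.third = 8.3});         // anonymous struct as an argument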
In each case, a full struct is set up, and the compiler is smart enough to know what goes where among those elements you specified, and sets the other elements to zero or NULL.
These sorts of features that we have for initializing a struct are exactly the sort of thing many more recent languages put into their function calls: default values are filled in, and named elements are allowed via designated initializers.
At the end of the example, I put a compound literal inside a function call, so we are technically calling a function using these pleasant variable-input features, but it's not yet looking much like printf.
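If you'd like to try these forms out, here is a self-contained variant that compiles on its own. It is a sketch under a couple of assumptions: a string pointer named label stands in for the gsl_vector pointer so that no outside library is needed, and add_up and check_initializers are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int first, second;
    double third;
    const char *label;  /* stand-in for the gsl_vector pointer */
} stype;

/* A function whose single input is the struct. */
double add_up(stype in) { return in.first + in.second + in.third; }

/* Confirm that unnamed elements come out as zero or NULL. */
void check_initializers(void) {
    stype partial = {3, 5};                       /* positional, first two only */
    stype named = {.third = 3.12, .second = 5};   /* designated initializers */

    assert(partial.third == 0.0 && partial.label == NULL);
    assert(named.first == 0 && named.second == 5);

    /* Compound literal built right inside the function call. */
    assert(add_up((stype){.third = 8.3}) == 8.3);
}
```

Running check_initializers() trips no assertion, which is the compiler filling in the unmentioned elements as described above.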
Cleaner function calls
We can clean up the struct-to-function trick to get a lot closer to variadic functions. Here's the agenda for making this work:
- For each function, set up a struct where the elements of the struct are the inputs to the function.
- Produce a shadow function whose sole input is that struct, which sets the default values and then calls the original function.
- Write a wrapper macro so that instead of the user having to type the full compound literal form f( (ftype) {arg1, arg2}), they can just type the usual f(arg1, arg2).
So, here it is. The first third is a set of general macros, the second third sets up a single function, and the last third actually makes use. This program should compile with any C99-compliant compiler. After the code, I'll have some detailed notes to walk you through it.
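    /* Header of the shadow function, whose one input is the argument struct. */
    #define varad_head(type, name) \
        type variadic_##name(variadic_type_##name x)

    /* Declare the argument struct, then the shadow function's prototype.
       The caller's field list supplies its own semicolons. */
    #define varad_declare(type, name, ...) \
        typedef struct {                   \
            __VA_ARGS__                    \
        } variadic_type_##name;            \
        varad_head(type, name);

    /* Use the caller's value if it is nonzero, else the default. */
    #define varad_var(name, value) name = x.name ? x.name : (value);

    /* Wrap the caller's arguments in a compound literal of the struct type. */
    #define varad_link(name,...) \
        variadic_##name((variadic_type_##name) {__VA_ARGS__})

    ///////////////////// header + code file
    varad_declare(double, sum, int first; double second; int third;)
    #define sum(...) varad_link(sum,__VA_ARGS__)

    varad_head(double, sum) {
        int varad_var(first, 0)
        double varad_var(second, 2.2)
        int varad_var(third, 8)
        return first + second + third;
    }

    ///////////////////// actual calls
    #include <stdio.h>

    int main(){
        // sum() expands to an empty {} initializer, which GCC and Clang
        // accept but which only became standard C in C23.
        printf("%g\n", sum());
        printf("%g\n", sum(4, 2));
        printf("%g\n", sum(.third=2));
        printf("%g\n", sum(2, 3.4, 8));
    }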
• There are three macros in the first section, roughly corresponding to the three steps of the agenda. varad_declare declares a special type and a function to use that type. Notice that the third and later arguments to the macro go into the struct, not a function header, so variables are separated by semicolons. varad_var sets default values for each variable. varad_link is used to clean up the function call.
• The second third sets up a single function. The bulk declares that intermediate function that takes in a struct, sets default values, and calls the real function.
• There is one more macro in this section, which needs to be rewritten for every new function. It'd be great if there were a macro to just churn out this trivial macro for each new function, but you can't write macros that generate macros. Why not? I dunno. Seems like it wouldn't be a big deal for the preprocessor, but them's the rules. The too-simple preprocessor is my second big complaint about C.
• The main part of the intermediate function has a line for each element of the struct, declaring an intermediate variable and setting a default value. The compiler gave missing elements a default value, but we often want the default to be something other than zero. We can also have more intelligent defaults based on variable information, like maybe int varad_var(third, first * 3).
• The third part is a call to the function we've set up, and you can see that it works great: we can give it no arguments, all arguments, named arguments, or whatever else seems convenient, with no regard to the internal guts from prior sections.
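If the macros are hard to read, it may help to see what the preprocessor hands to the compiler. The following is my hand expansion of the sum example, tidied for readability; the helper names expanded_call and check_expansion are invented for this sketch and are not part of the scheme.

```c
#include <assert.h>

/* What varad_declare(double, sum, ...) produces: the argument struct. */
typedef struct {
    int first;
    double second;
    int third;
} variadic_type_sum;

/* What varad_head plus the three varad_var lines produce. */
double variadic_sum(variadic_type_sum x) {
    int first = x.first ? x.first : 0;
    double second = x.second ? x.second : 2.2;
    int third = x.third ? x.third : 8;
    return first + second + third;
}

/* sum(.third=2) is rewritten into this call. */
double expanded_call(void) {
    return variadic_sum((variadic_type_sum){.third = 2});
}

/* Check the result: 0 + 2.2 + 2. */
void check_expansion(void) {
    double r = expanded_call();
    assert(r > 4.19 && r < 4.21);
}
```

The expansion also shows a limit of the nonzero test used for defaults: an explicit zero looks the same as an omitted argument, so with these defaults a call equivalent to sum(.third=0) still gets third set to 8.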
Infinite input
OK, so far, the result looks much more modern relative to C's standard fixed inputs. It allows optional arguments, and named arguments. It checks types, and complains during compilation if you've got mismatched types, meaning that a lot of the security holes of the standard variadic form are gone.
But we want more from our variadics than just optional arguments: we'd like to specify arbitrary-length lists. Can we declare a structure that could take an arbitrary number of inputs, such as a function to sum n inputs?
The short answer is no. The long answer: the last element of a struct can be an array of indeterminate size, to be allocated at compile-time. When the anonymous struct is being generated for the function call, a compiler could count the elements at the end of the list and allocate the variable-size array appropriately.
However, this depends on whether the anonymous struct is dynamic or static. By static, I mean something produced at the initialization, like the constants or global variables; by dynamic, I mean variables that are initialized along the way during the run. For static variables, the variable-length last argument will be stuck in the form set at the first allocation; for dynamic variables, there are more options. So what are the anonymous structs used for the function calls here?
It is my reading that the ISO C standard doesn't demand things one way or the other, so we can't rely on dynamic allocation of the type we'd get elsewhere via a line like x = (structtype) {1, 2, {3, 5, 9, 10}}, where the variable-length allocation would be valid.
Also, for an array of fixed length, you can usually get the size by sizeof(list)/sizeof(list[1]). So if the system allocated the right-sized list, you wouldn't even need a separate nargs element taking up space.
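To illustrate the sizeof idiom, here is a sketch of the one place where the compiler does do the counting: a bare array compound literal, as opposed to an array at the tail of a struct. This is a different trick from the struct-based scheme in this post, it gives up the named and optional arguments, and the names sum_list and sum_of are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the first n entries of list. */
double sum_list(size_t n, const double *list) {
    assert(list != NULL || n == 0);
    double total = 0;
    for (size_t i = 0; i < n; i++)
        total += list[i];
    return total;
}

/* Build an array compound literal from the arguments, and let sizeof
   count its elements so the caller never types a count. The sizeof
   operand is not evaluated, so each argument is evaluated only once. */
#define sum_of(...)                                             \
    sum_list(sizeof((double[]){__VA_ARGS__}) / sizeof(double),  \
             (double[]){__VA_ARGS__})
```

A call like sum_of(1, 2, 3.5) passes a count of three without anybody typing it. It requires at least one argument, and every argument has to convert to double, so it is no substitute for printf either.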
We're instead stuck just making up a size for an element of the struct, like 1,000, and letting the compiler pad it with zeros. Given that we generally use variadic arrays for lists of items hand-typed into the code, and array inputs for lists of truly arbitrary length, we can probably get away with 1,000 inputs max, but it's certainly not ideal.
    // Put the macros from the first third above in "variadic.h".
    #include <stdio.h>
    #include "variadic.h"

    varad_declare(double, sum, int first;
                  double second; int third[1000];)
    #define sum(...) varad_link(sum,__VA_ARGS__)

    varad_head(double, sum) {
        int varad_var(first, 0)
        double varad_var(second, 2.2)
        // x.third is an array, so it is never NULL; unused slots hold zeros.
        int * varad_var(third, NULL)
        double total = first + second;
        for (int i=0; i< 1000; i++)
            total += third[i];
        return total;
    }

    int main(){
        printf("%g\n", sum());
        printf("%g\n", sum(4, 2));
        printf("%g\n", sum(.third={2}));
        printf("%g\n", sum(2, 3.4, {8, 8}));
    }
So the third element can have variable length, as desired. If you're not sure of the type coming in, the last element can be an array of void *, where void * is your signal to the compiler that you're willing to take your chances on types and have a catalog or system on hand to do your own casts.
How're we doing?
So there's the story: we can do a lot better with our variable-length function calls than we do, and it's not even something involving crazy re-writing of everything. Standard C already gives us the tools to go 90% of the way.
However, it's a frickin' pain to set up. C's preprocessor is limited, and we had to write several macros to make this happen. For every function, the namespace has to have another auxiliary function and a type floating around. You won't notice this normally, but it can create quirks in the debugger and other places that expect a little more normalcy out of the code base.
Apophenia uses this setup for a few dozen functions, but with a few more tricks. Notably, everything is wrapped in #ifdefs to let everything degrade to the standard function call if needed. Many things beyond the compiler eat C code, like documentation generators, interface generators, &c. Even though all the above is 100% standard C compliant, some systems like the setup more than others. I also wrote a sed script to generate all this boilerplate from appropriate markers. The script also gets around the problem that we can't use the preprocessor to generate macros.
OK, summary paragraph: we need to fix C's variadic function calling scheme, which is built around printf to the detriment of many other possibilities, and even to the detriment of security. Being that we can already do most of what we want via ISO C99, we can fix them without introducing incompatibilities or changing the character of C. But given the amount of extras and tricks involved, and given that we still don't quite achieve proper variadic functions, there'd need to be some fixes in the language itself to update variadic functions to a modern form.
Alternatives to Word
Part six of six
At this point, I hope I've demonstrated the efficiency gains in having a means of just writing content, a means of just formatting content, and a standard format linking the two.
In every case I can think of, the text writing part is in what you'd call a text editor. As above, there are many that you could choose from. Your OS provides a very basic one with zero learning curve and few features (Windows=Notepad, MacOS=TextEdit, Unices=pico), but you can find others that are more comfortable to live in for large projects, like EMACS, vi, Ultraedit, Notepad++, and a whole lot more.
The rest of the story breaks down as to your preferred output and the closely-related question what standard you're going to lean on.
Plain text
One option is to not use a formatting system at all. Just open up your preferred text editor and go to town.
Hemingway:
- Brief.
- Did not use bullet points.
- Used un-formatted plain text.
But Hemingway was fortunate enough to live before word processors. Today, an unadorned block of text is unacceptable, meaning that you will probably have to move your plain text to some sort of formatted system.
HTML
The Web has a text-based standard that can be successfully read by dozens of web browsers on all types of computer. HTML documents from the birth of the web in the early 1990s can still be read today. Even Word can read HTML.
HTML stands for HyperText Markup Language, and although the HyperText part is probably not too relevant to the discussion here, the Markup Language part indicates that this is exactly the sort of semantic language discussed above. This is especially true with the advent of Cascading Style Sheets (CSS). CSS lets you define a class, and describe how that class is to be formatted on the screen. Then, you mark up your text with class delimiters: this is a header, this is a digression. That is, HTML with CSS is exactly the sort of semantic markup language that we're looking for.
Your colleagues will be able to read these documents with their web browser, and even edit them with software on their computer.
If you don't want to write the HTML markers yourself, there are a few systems that will turn easy-to-write plain text into proper HTML. Txt2tags, markdown, or textile specify easy plain-text markers, like **boldface**, and then they'll filter that into the correct HTML.
LaTeX
If you are in academia, use LaTeX. It was written for academic publishing, and universities are used to LaTeX users. It is designed around semantic markup of articles, books, and letters, and pegs them perfectly. This document is written in it, and as you can see, it looks beautiful. Any journal you want your papers to be seen in accepts (and frequently prefers) LaTeX-formatted documents, and will provide you with a style sheet to apply to your document so that you can conform to their rules. Mathematics in Word looks amateurish, because only 0.02% of Word's buyers have equations in their papers; LaTeX's math typesetting makes you look smarter instantly.
It is not a strictly semantic markup, but a bit of a hybrid. I think it does a good job of combining the two, and if you want stricter semantics, then you are welcome to add \defs to the top of your documents to effect that.
One thing Word is good at, by the way, is deliberate inconsistency. If you want your first page in Helvetica, your second in Times, and your third page to be two-column format, this will be a pain in most semantically-oriented systems. But because Word's literal markup has no mechanism to impose consistency on the document, inconsistent formatting is much easier than in LaTeX. So there's my token compliment to Word.
If you are not in academia, then you have a stronger compatibility-with-Word problem, but consider using LaTeX anyway. Because there are reasonably effective (but imperfect) LaTeX-to-HTML translators, you can think of the language as a document-oriented HTML-producing language, and can then send HTML to your trapped-in-Word colleagues. This method will especially benefit those who want to use bibtex or makeindex to autogenerate the end matter in larger works.
Now, the above methods require work and learning, but I hope by now you agree that spending time learning something that you will use every day for years is worth the effort. But, I'm not going to tell you how to go about learning HTML and CSS markup or which text editor to use. You know how to ask your favorite search engine for “efficient text editor”, “HTML tutorial” or what have you. Many of these open standards and tools are entirely free, so there is at least no financial cost to downloading the tools and playing around. Better than the search engines is to ask your favorite guru for help; many are happy to take time to help a friend work more efficiently.
Also, because standard formats are so open, there is probably somebody who has already fixed every problem you have, but it might be a separate tool. Some text editors include a spell checker, some expect you to choose a full-time external spell checker from the various available options. If you want to see the difference between your version of the document and the one your colleague edited, your editor may include a dedicated diff mode, or you may need a copy of the diff program.
Word Format
You may have to use the Word document format at your workplace, though you can continue to use the structure above: use a plain text editor to write plain text, perhaps using format markers like those above, then, at the last minute, open the document in Word. That is, spend the bulk of your weeks of editing and revising working on content and worry about format and visual appeal only as a final step. Because of Word's fundamentally first-person paradigm, you still need to change your format markers to real formatting yourself, but (1) Word's macro feature can help with this, and (2) you may still save time and effort, because the editing features of text editors can add that much more efficiency.
OpenOffice.org is a word processor initially from Sun Microsystems. Due to trademark issues, they can't just call it OpenOffice. Its key claim to fame is that it can read and write Microsoft's document formats very well, meaning that you can interoperate with your coworkers without their knowing that you aren't one of them.
Its stylist solves many of Word's style editor problems, so you may have better success with using it semantically. It has a built in bibliography database system. Maybe Mr. bdamm will get his wish for a basic vi keymap for efficient editing. Its own format is open, and you can save anything to PDF. So complaints about some details are alleviated, but it tries to imitate Word to the point of imitating Word's paradigmatic failings. The literal markup, intuitive-over-efficient, and one work = one view paradigms remain.
Conclusion
A great many people have spent a great deal of time thinking about how to best edit and format text, and most of them have come up with solutions that look very, very different from Word. Part of the reason for this is that the authors of Word were writing for Aunt Myrtle, while the author of LaTeX was writing a package for his own use; meaning that Word was built around ease of initial use, while LaTeX was built around efficiency. There is no metaphor that one could make between an HTML document with a cascading style sheet and a physical paper with text--but this is liberating and allows for new possibilities and an easier time with formatting.
Perhaps you are stuck with Word, and company policy dictates that you
write and maintain long, complex business documents using the same tool
Aunt Myrtle uses to write her thank-you notes. Hopefully this paper has
given you some ideas for working more efficiently: use the style sheet,
stick to plain text where possible, maybe get a copy of OpenOffice.org
on the sly for saving to PDF. But hopefully you have the liberty to take
the effort and time to learn some of the other paradigms. It will take
you days or even weeks, your first documents will look amateurish, and
over the next several years of your career you will thank yourself over
and over again as you gracefully produce output with truly efficient
tools.
|
on Thursday, June 25th, Guilherme Freitas said Light markup languages go a long way for simple documents. I have used Txt2tags and ReStructured Text, and I found them very nice to use. You can even produce PDFs directly from ReStructured Text without having to have LaTeX (via a library called ReportLab). |
Word and standards
Part five of six
The World Wide Web consortium (the W3C) maintains the standards for what is a valid web page, and they provide a validator for web authors to use to check the validity of our own pages, at http://validator.w3.org.
Most authors couldn't care less about validation. They figure that if it looks OK on the browser they're using, and maybe one other, then they're done. For example, try validating the home page of the World Bank (265 errors).1
Even as esteemed an organization as the Library of Congress (whose front page validates perfectly) has considered building web pages that violate standards to the point of only working in one brand of browser, but at least they were polite enough to float the possibility with a request for comments first. Tim Berners-Lee, the author of the original HTML standard and frequently credited as the inventor of the World Wide Web, submitted a comment that explained the importance of documents written around standards instead of programs:
At the outset, we would like to stress that nothing in this letter should be construed as a criticism of Microsoft's Internet Explorer [...]. We would write the same letter if the choice was to offer support solely for Mozilla Firefox, Safari, or any other product. [...]
While a large proportion of the marketplace uses the Microsoft Internet Explorer to browse the Web, certain classes of users will find it either impossible or extremely inconvenient to do so. [...] Users with disabilities often must augment their browsing software with special assistive software and/or hardware (“assistive technology”). [...] In addition, some individuals with disabilities rely on alternative browsers (for instance, “talking browsers”) that are designed to meet their specific needs. Users with disabilities rely on a standards-based Web to ensure that services they access on the Web will be usable through the variety of mainstream software and specialized assistive technologies that they use.
He also points out that when a security flaw is found in a product, people or institutions will often switch to a competitor until the security flaw is patched. That is, even we of decent eyesight would do well to keep a variety of readers on our hard drives (I use three). This is obviously only possible if a variety of readers can all understand the same document format.
Extending the standards
So standards are good. But despite the obviousness of that statement, folks still insist on not complying. The politics are typically around the more blatant forms of standards-ignoring, such as the LoC's self-conscious proposal or Microsoft's seeming inability to correctly implement a standard written by anybody but themselves (see below). But more subtly, everybody writes extensions. Sure, you can stick to the standard, the ad copy explains, but if you use our product, you can also use this nifty widget that doesn't appear in competitors' products. This means that the vendor can tout its product as both 100% standards-compliant and at the same time better than the standards-compliant competitors. Many users are clearly happy with this, but it is a siren that will surely leave the user stranded, and is a nice way for the vendor to lock users in while claiming full compatibility with competitors. [Dear GCC users: many of the GNU extensions to standard C don't even work in Cygwin. Ignore them.]
Once an author has written a document featuring standards-plus-extensions, recipients have exactly the same onus of trying to get things working as if the author had ignored the standards entirely. For example, a plain browser is fine with HTML, but with the right plugins, you can also view the Macromedia Flash format. You've no doubt seen more than enough web pages like Thaiphoon's, whose Panang tofu is a delight. Their menu, which is plain English text, can only be read if you have Flash installed. I don't know what web designer talked the managers into developing this, but they committed a grievous sin if they promised that this would attract more customers than a plain HTML menu that every browser can instantly load and display.
Surely, the most common reason for ignoring a standard is that it does not allow for some form of expression that the author eagerly wants to use. But the author needs to bear in mind that freer expression bears all the costs of broken standards. My favorite Thai restaurant near work, Thaiphoon,2 has a website that I sometimes check so I can order ahead. When I open with my usual browser, I get a notice that I need to get the Flash plugin to view the site. Since I'm checking from a heavily restricted work computer, I can't install Flash, and often wind up eating at the Chinese place instead. But what happens when I visit the website in a Flash-enabled browser? I get a menu. A plain English text menu.
Or consider the sad state of email. Like a restaurant menu, about 100% of email is also plain text. You tell people things, using words. For about 40 years, there has been a standard (ASCII) that allows different programs to interpret text correctly. Ah, what Nirvana: all the information we need to get across can be gotten across with an easy and supremely well-supported standard. If the UN worked this well, we would have world peace. In fact, now that the computing world is increasingly international, there are more character sets than English-centric ASCII, but nearly every known language is supported by the Unicode standard (yes, Ogham, Ugaritic, Deseret, and Limbu are in there.) Yet people increasingly throw the standard out and encode the text into a word processor document in a proprietary format. If you're lucky, you have a word processor that can read the proprietary-format documents your colleague emailed. For example, if the sender has Word 2000 and the recipient has Word 95, communication won't happen. Putting plain text in a word processor document--even with a bit of extra formatting--is exactly on par with putting a plain old menu in a Flash plugin: yeah, there's a little more glitz, but it comes at the price of potentially excluding, imposing work upon, or alienating the reader.
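To make the plain-text point concrete, here is a minimal Python sketch (the sample message is made up for illustration). It checks two things that the paragraph above leans on: ASCII text comes out as exactly the same bytes when written as UTF-8, the most common Unicode encoding, and a letter that ASCII has no room for, like an Ogham one, still survives a round trip through Unicode.

```python
# Sample text is invented for illustration only.
message = "Lunch at noon? Panang tofu."

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
ascii_bytes = message.encode("ascii")
utf8_bytes = message.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# U+1681 is OGHAM LETTER BEITH; ASCII cannot represent it.
ogham = "\u1681"
try:
    ogham.encode("ascii")
    ogham_fits_ascii = True
except UnicodeEncodeError:
    ogham_fits_ascii = False

# Unicode (here via UTF-8) carries it without loss.
round_trip = ogham.encode("utf-8").decode("utf-8") == ogham
ogham_utf8_length = len(ogham.encode("utf-8"))

print(same_bytes, ogham_fits_ascii, round_trip, ogham_utf8_length)
```

Any program that can read bytes can read the first message, and any program that speaks UTF-8 can read the second; neither needs a particular vendor's word processor.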
Of course, word processor documents are nice because they do provide extensions on top of plain text. They let you control the font and layout that the recipient sees in ways that plain text can only approximate. Flash certainly does things that HTML will never even think of supporting. But there is a trade-off that many people ignore, under the presumption that everybody is just like them. “Well, I have a copy of Word 2000 and an email client that displays web pages, so everybody else must too. My eyesight and dexterity with mouse and keyboard are fine, so my recipient's must be too.” In a social context, the presumption that everybody is like you is the source of a great deal of impoliteness, offense, and general unhappiness, and we teach people from early childhood to understand that others are not like them and that they should maintain standards of decorum until they know that the other party is OK with breaking them. Sure, we can wear the risqué t-shirt to work and maybe make some people smile, but we know that such free expression carries a trade-off in the form of a risk of offending some. We should do the same when writing documents: stick to the basic standards unless we have a reason to do otherwise and we know that the recipient is OK with our new-fangled alternative.
There do exist valid reasons to ignore standards or set out to establish new ones; e.g., the correct response to a spoken “thank you” is “you're welcome”, but it is accepted custom to send a “thank you” email but not a “you're welcome” email, because that sort of thing just clutters up the inbox. But those who ignore the standards for no reason or for lousy reasons (“I don't have to say thanks--he owed me.”) are just rude.
Bringing it back to the subject at hand, Word establishes its own standard, when it doesn't have to. First, users often write a Word document when a simple plain text file will do. An email with no text in the body but a Word attachment with a single paragraph of plain text is a waste in every sense.
Second, there are standards that do approximately everything a Word document does, such as HTML. You can probably think of a few things that you can do in Word that you can't do in HTML. You can also probably live your entire professional life not using them.
Alternative tools
Microsoft goes out of its way to make its DOC format opaque, because users are better locked-in if they can only edit their colleagues' documents with Microsoft tools. But I promised you a paper that does not discuss Microsoft's business strategy, but how Word's design hurts your efficiency. The closed-format design means that, by definition, the only way to edit a Word document is in Word.
There are literally hundreds of editors for a LaTeX or HTML document. You can use anything that can read ASCII-formatted files--even including Word. That means that a market has sprung up that eagerly attempts to appease the needs and skills of different users. As above, EMACS and vi are specialized text editors and therefore have dozens of commands to just edit text, but there are hundreds of other text editors that I didn't mention; pick the one that most fits your lifestyle and run with it. For Word documents, you have no choice but to edit them in Word.
On the output end, there are a wide variety of programs that read LaTeX-formatted documents and display them via formats like HTML, PDF, or plain text. Because the file format is open, many people have implemented programs to process LaTeX-marked text to produce interesting new output. I'll have more such options in the sequel.
Meanwhile, the only thing you can do with a Word document is open it in Word. If Word is not to your liking for any reason, you are stuck. If you need to output something besides Word DOC format, you had better hope that Word allows you to do the conversion.3
XML
The more tech-savvy readers know that the latest version of Word uses the extensible markup language (XML), which is a commonly-accepted standard for semantic markup. However, this is slightly misleading. First, there is not yet a mechanism to write your own style sheets as I described above. Markup like <b>this</b> is valid XML, but it's just an elaborate way to say boldface. That is, Word takes a system designed for semantic markup and uses it for literal markup.
But more importantly, using the XML standard does not yet mean easy interoperability. XML is a format for writing down data in a tree structure using plain text, so that it can be easily parsed by readers in any system. XML parsers are common in most coding languages, your browser, Word, and many telephones. But once you've got the XML tree read in, what can you do with it?
An XML file typically depends on a companion document type definition (DTD) or schema that explains what headings, types, and modifiers are available. There are many, depending on your purpose. An address book will define structures for people and organizations, while XHTML defines headers and tables. Two XML systems that read different DTDs are, in the end, incompatible. If one system marks paragraphs with <p> and another with <par>, then the two won't be able to do anything with each other's data, even though parsing the XML structure will be a non-issue.
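The <p> versus <par> problem is easy to see in a few lines. This is only a sketch, using Python's standard xml.etree.ElementTree parser; the two little documents and the paragraphs() helper are hypothetical, invented for the illustration. Both documents parse without complaint, but a reader written for one vocabulary finds nothing useful in the other.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents: same content, different tag vocabularies.
doc_a = "<doc><p>Hello, world.</p><p>Second paragraph.</p></doc>"
doc_b = "<doc><par>Hello, world.</par><par>Second paragraph.</par></doc>"

def paragraphs(xml_text, tag="p"):
    """Return the text of every element named `tag`.

    Stands in for a reader built around one vocabulary (here, <p>).
    """
    root = ET.fromstring(xml_text)  # parsing itself works for both documents
    return [element.text for element in root.iter(tag)]

found_in_a = paragraphs(doc_a)
found_in_b = paragraphs(doc_b)  # parses fine, but there are no <p> elements

print(found_in_a)
print(found_in_b)
```

The second list comes back empty: the tree was read in just fine, but the reader has no idea that <par> means paragraph. Calling paragraphs(doc_b, tag="par") recovers the text, which is exactly the kind of per-vocabulary knowledge that a shared DTD is supposed to supply.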
Word's XML is Word-specific. Politics: although there exist open DTDs for text documents, including DocBook and OpenDocument, Microsoft is insisting on supporting one and only one XML schema: its own. It has applied for patents on that schema in the U.S. and Europe, and although it has stated that it will allow others to use its soon-to-be-patented technology for free, many are wary of whether the format will remain open.
So Word's XML is a near-miss: it solves the problem of parsing the bits on the hard drive in a standard manner, but it doesn't take advantage of the possibility of semantic markup, or of using any of the myriad existing formats that are well-supported by others.
I've argued for the value of decoupling the interface from the document, so that if you don't like how Word does its thing, you can use another tool to edit your document, and then send your finished product to a Word-using colleague who doesn't care what you used to produce the document. But for Word's format such options are limited. There are many tools that will do certain limited operations; there are a handful of competing word processors that try to look like Word, which get within spitting distance of fully supporting Word documents; and that's about it for handling Word's format. I am not aware of a single non-Microsoft product that claims 100% compatibility with Word's XML format.
So, as long as you're using Word's format, you're more-or-less stuck using
Word.
So next time I'll present some cleaner breaks that use recognized
standards.
Footnotes
- ... errors).1
- Stats are from validation attempts in late 2005. I sincerely hope they do better if you try them today.
- ...http://www.thaiphoon.com/,2
- CT & S, NW DC. Try the Panang tofu.
- ... conversion.3
- Yes,
many people try eagerly to write Word-document compatible extensions,
with varying success. But the market for such extensions is absolutely
minuscule compared to the market around plain text.
OpenOffice.org will save DOC files as PDFs, by the way. Even if you are married to Word, you may want to download OpenOffice.org and keep it around exclusively as a PDF converter.
|
on Friday, May 22nd, spoof said Nice post. I suppose you heard the news that MS2010 is planning to support OpenType, for whatever that's worth. |
The ergonomics of the down arrow
Part four of six
I wrote this in late 2005: Google recently put out an RSS reader. It's pretty cute, and I personally have switched to it.
If you aren't familiar with RSS, then that is no matter here (it's a syndication system for web sites). The interesting feature of the reader for our purposes is that the J key will let you go down in the list of headlines. Yes, J, as in, uh, jo down. K, as in kup goes up in the list. There is absolutely nothing mnemonic about the J and K keys, but they feel wonderful. I assume you know how to type properly, with hands on the home keys; I generally find my hands are on the home keys even when I'm just staring at the screen, and my hand doesn't need any help from my brain to find the little nubbin on the J key.
But that J key. It sits under the index finger of the dominant hand for 90% of the world, and the keyboard is designed so that that index finger knows exactly where to rest. Moving down on the page is the most common operation, both in reading and even editing, so it makes complete ergonomic sense to attach this to the strongest finger of the strongest hand. Even the lefties will have no problem with it.
But it flies in the face of all mnemonics. Maybe you can come up with some word having to do with the process of scrolling down that begins with the letter J, but I've got nothin'. Nor could I think of a more efficient keymap.
I personally think the use of the J key is easy to learn because of its ergonomic delight. But it throws ease of initial use out the window--almost belligerently. You want to use the nifty hotkeys? Then RTFM.
An interface which works against intuition can be destructive, so if U went down and D went up, we'd have to write off the application as hopeless, but J doesn't work against anything. It's just a gesture.
Within a week of Google's RSS rollout, Bloglines, a competing RSS aggregation service, added a little header to its page: “You can now navigate through Bloglines with hotkeys[...]: j - next article k - previous article [...]”
Anybody familiar with the OpenOffice.org internals? bdamm (at) openoffice (dot) org will give you a hundred bucks to write code to have J move the cursor down a line (plus a handful of other keystrokes like K).
The war
Lest you think this J thing is some sort of recent meme, it all comes from vi, a text editor written in 1976. I am using a version of vi (named vim) to write this right now. Let's pause for a second and let that sink in: most programs have a shelf life of about six months, and this guy wrote a program thirty years ago which is still in somewhat common use today. j goes down, k goes up, {jfw will go to the first instance of the letter w in your paragraph, and, since I can't stand seeing that unclosed open-bracket, I have to tell you that }j%d% will delete a parenthetical remark in the first line of the next paragraph. Which is all to show you that Mr. Joy, the author of vi, fell soundly on the efficiency side of the efficiency vs intuition scale--and that is why his text editor has survived for thirty years, and is being imitated by cutting-edge web services.
We sometimes like to write documents that actually have Js in them, and vi thus has modes: in editing mode, j goes down and d$ will delete the rest of the line; in insert mode, the j key puts a j on the screen, and typing d$ puts gibberish on the screen which quickly reminds you you're in the wrong mode.
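If modes sound mysterious, here is a toy sketch of the idea in Python. It is nothing like vi's real implementation: run_keys() is a hypothetical helper that knows only j, k, A (which in vi appends at the end of the line), and the escape key. What the text calls editing mode is labeled command mode here, which is vi's own name for it.

```python
ESC = "\x1b"  # the escape character, which leaves insert mode

def run_keys(lines, keys):
    """Feed keystrokes to a toy modal editor; return (lines, cursor_row).

    Hypothetical simplification of vi: in command mode, j and k move down
    and up and A starts appending to the current line; in insert mode,
    every key except ESC is typed into the text.
    """
    lines = list(lines)
    row, mode = 0, "command"
    for key in keys:
        if mode == "command":
            if key == "j":
                row = min(row + 1, len(lines) - 1)
            elif key == "k":
                row = max(row - 1, 0)
            elif key == "A":
                mode = "insert"
        elif key == ESC:
            mode = "command"
        else:
            lines[row] += key  # in insert mode, j is just a letter
    return lines, row

text = ["first", "second", "third"]
moved = run_keys(text, "jj")             # two moves down, nothing typed
typed = run_keys(text, "jAjj" + ESC + "k")  # move, type two js, move back up

print(moved)
print(typed)
```

The same j keystroke does two different jobs depending on the mode: in the first call it moves the cursor to the third line, and in the second it moves down once, then lands in the text as a literal jj on the second line.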
There are two competitors to J. The first is the ctrl-D school, rooted in EMACS, written by a certain Mr. RM Stallman. EMACS's keymap is sort of like vi's, in that it's not particularly intuitive, but once you've learned it, you're done. However, it's a compromise along the efficiency vs intuition scale, because you don't need to deal with the unintuitive modes but reaching for the ctrl key all the time is not nearly as pleasant as twitching your index finger to hit the j key.1 The EMACS vs vi war is a long-standing one, which is just silly, because they're of basically comparable efficiency. No, there are other schools that are a real drain on the economy, like the down-arrow school.
Let me take a paragraph or two to make this as clear as possible: the down-arrow school is a total failure when it comes to efficiency. On my screen right now, getting to the first w in the last paragraph via arrow keys is 27 keystrokes (using ctrl-arrow to go by word where possible). It's about three or four seconds for a single navigation. Do forty three-second navigations in a day and you're already up to eight or nine hours in a work-year--a full work day a year just hitting the arrow key. You get to multiply by your wage to see what your company is spending per annum to facilitate ease of initial use. Even if it's one tap of the arrow key, your hands are already off the home keys; going off and on again is another half-second. If you do a hundred arrow-key navigations in a day, that's roughly another half work day a year just moving your right hand back and forth between the arrow keys and the home keys; if you're an office worker who does a lot of writing, you probably do closer to a thousand, which makes it more than four work days.
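For those who want to check the arithmetic, here is a back-of-the-envelope sketch in Python. The 250 workdays per year and the eight-hour day are my assumptions, not measurements, and hours_per_year() is just a helper for the illustration; plug in your own numbers.

```python
WORKDAYS_PER_YEAR = 250  # assumption: about fifty five-day weeks
HOURS_PER_WORKDAY = 8    # assumption

def hours_per_year(actions_per_day, seconds_each, workdays=WORKDAYS_PER_YEAR):
    """Hours per work-year spent on a small action repeated every day."""
    return actions_per_day * seconds_each * workdays / 3600

long_navigation = hours_per_year(40, 3)       # forty three-second navigations
hand_travel_100 = hours_per_year(100, 0.5)    # a hundred half-second hand trips
hand_travel_1000 = hours_per_year(1000, 0.5)  # a thousand of them

for label, hours in [("40 x 3s", long_navigation),
                     ("100 x 0.5s", hand_travel_100),
                     ("1000 x 0.5s", hand_travel_1000)]:
    print(label, round(hours, 1), "hours,",
          round(hours / HOURS_PER_WORKDAY, 1), "work days")
```

Under these assumptions, forty three-second navigations a day come to a bit over eight hours a year, and the half-second hand trips come to about three and a half hours at a hundred a day, or nearly thirty-five hours at a thousand a day. Multiply any of those by an hourly wage for the cost.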
There is only one school that fails with such vehemence that it makes the down-arrow school look like Nirvana: the mouse school. In the mouse school, you take one hand--typically your dominant hand--off of the keyboard entirely, reaching to some part of the desk that is ergonomically suboptimal (because your keyboard is already in the optimal location). You position your hand on the mouse, and then move the cursor along the screen. It is an analog device, so aim and precision matter, meaning that some people simply do not have the eyesight and dexterity to use the mouse at all: try getting Aunt Myrtle to highlight the letter i in a font where that letter is one pixel wide. You guide the mouse to the pixels that are by the word you want to change, click, carefully drag, and return your hand to the keyboard. The entire process can easily take more than four or five seconds, just to position the cursor. And if you have to scroll through the document to find the point, that's easily ten seconds as prelude to a single edit.
The rabidness of the aforementioned text editor wars comes from the fact that text editing absorbs a huge amount of one's life. If you're like most office drones, most of your time at the computer is spent writing and editing plain text--and you're just one office drone; there are millions in the U.S.A. who are all operating computers basically identical to yours, using a down-arrow/mouse school text editor of some sort. Sure, there are people doing flashy data-slinging with big servers, but the bulk of computing is the literally billions of person hours per year spent editing text. Now multiply that half-second to move the right hand to the arrow key; at this scale, it adds up to millions of person-days per year spent on making that little twitch. With an entirely straight face, I can say that on the order of a billion dollars per year is spent on paying people to hit arrow keys.
When the programmer guys got together and wrote whatever it is you use to write your documents and navigate your web pages, they had all of the paradigms above at hand. Half of these guys are using EMACS or vi themselves. We get frustrated when we ask Mr. Computer Geek for help and he (always a boy, eh) comes back with over-everyone's-head exposition about just opening up regedt and doing a quick ctrl-f for HKEY {343-f2ea53e}. Less blatant but just as insidious is when Mr. Geek assumes you are an idiot. He knows that he knows more about PCs than you do, therefore you are dumb and wholly incapable of learning the reams of knowledge that he has compiled. I have been at many a workplace with IT departments that are stocked with such people; it's only some vestige of courtesy that keeps them from installing drool-guards on all the company keyboards.
Of course, the IT department is thinking about the worst-case users. But when was it ever efficient to force everybody in a several-hundred person organization to work with exactly the tools that the least-able could work with? You may have a legally blind worker at your workplace, but that doesn't mean that every computer in the building needs to operate exclusively at super-magnified resolution. A reasonable approach would be a system where you could select between the various schools of navigation. Most versions of vi let you do this (and EMACS allows ctrl-D, down-arrow, and a limited j-mode), but few down-arrow school programs include the wealth of editing keystrokes that those programs provide.
And so I take Google's j and k keys as a slight victory in a long battle against the forces of condescension. It's just two keys, a far cry from a word processor with a full vi keymap, but it's a sign that the guys who designed and programmed the system felt that it was more important to make usage efficient than to make it drool-proof. As such, it gives me hope that maybe the software of the future might focus on long-term efficiency over the quick sell.
Formatting and ergonomics
Beyond editing, all this applies to formatting in Word too, because you have to use the mouse or an absurd amount of tabbing and arrowing to navigate the menus and dialog boxes to get to the option you want to change. For almost every step of the way, Word eagerly picks intuition over efficiency.
Of course, the most commonly-used features, like boldface, have their own ctrl-key combination, to at least save the user mouse and arrow-key inefficiency for the dozen most commonly-used operations. Also, you can use alt-F to access the File menu, alt-E to use the Edit menu, et cetera.
But even having the few control-key combinations you do have creates problems, because there are only 26 control-letters to use. If they are taken up with the typesetting features of Word, then they can't be used for the plain old editing of text. EMACS and vi give the user fifty-odd keystrokes that edit text (I'm guessing because I couldn't possibly count them all); Word gives you cut, paste, copy, and that's about it. For every other editing task, you have to make do with the arrow keys. The majority of your time putting together a paper is spent writing and editing, so having so many keystrokes at your fingertips for formatting but almost none for editing is backward.
There is no place in Word's intuitive editing model for a key combination to delete a word at a time, to repeat the last edit, to jump to wherever you were last working, or to switch a lowercase letter to capital. Yet such keystrokes provide immense speed gains to users who have taken the time to learn them. Which do you do more often in a day: skip back to the beginning of a sentence, or switch to boldface?
One reason we have so many formatting commands is--once again--the lack of style sheets. Formatting is produced not by declaring once what you want it to look like but by applying it over and over again, so keystrokes to apply formatting compete with editing keystrokes for frequency of use. It would be nice to have a dedicated editing program plus a separate dedicated formatting program, but Word's DOC format precludes this.
More on this next time.
Footnotes
