Modeling with Data

Testing the model using the model.

10 February 10.

A warning: This entry is filled with fail. The explanation that I pull from another blog has some bugs, but the effect is essentially right, and my criticism goes overboard. My suggestion to you: just ignore this whole thing.

A well-known cartoonist-geek posted the following little conundrum:

Alice secretly picks two different real numbers by an unknown process and puts them in two (abstract) envelopes. Bob chooses one of the two envelopes randomly (with a fair coin toss), and shows you the number in that envelope. You must now guess whether the number in the other, closed envelope is larger or smaller than the one you've seen.

Is there a strategy which gives you a better than 50% chance of guessing correctly, no matter what procedure Alice used to pick her numbers?

His solution is wrong, but in an interesting way. In fact, it gets at a major flaw in scientific research as practiced today. Here's his incorrect solution:

Strategy:

Call the number you saw x. Use the logistic function to calculate p(x) = 1/(1 + e^-x). Choose a random real number between 0 and 1. If it's lower than p(x), guess “lower”. Otherwise, guess “higher”.

In other words, guess “lower” with probability p(x), and “higher” with probability 1 - p(x).

This works because p(x) is a function which is 0 at negative infinity, 1 at positive infinity, and increases monotonically in between. So for any pair of numbers, if you call the smaller one A and the larger one B, you have a (slightly) better chance of guessing “lower” if you saw B than if you saw A. Your overall chances of guessing correctly — given that there is a 50% chance you're seeing A and a 50% chance you're seeing B — are:

N = 0.5*(1 - p(A)) + 0.5*p(B)

(remembering that p(x) is 1/(1 + e^-x))

Since p(A) is always smaller than p(B), N is always greater than 50%. If Alice picked 1.0 and pi, your chances of guessing correctly are 61.37%. If Alice picked 10 and 11, your chances of guessing correctly are 50.00053%. No matter what two numbers Alice picked, the chance of guessing right is always better than 50%. Sometimes it's only better by a tiny, tiny amount, but it's always better.
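
For readers who want to poke at the quoted strategy, here is a minimal Python sketch of it. It evaluates the formula for N above and also plays the game many times with a fixed pair of numbers. The function names (p aside), the trial count, and the seed are illustrative choices, not part of the quoted solution.

```python
import math
import random

def p(x):
    """Logistic function from the quoted strategy: 1/(1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def win_probability(a, b):
    """Closed-form N = 0.5*(1 - p(A)) + 0.5*p(B), with A the smaller number."""
    lo, hi = min(a, b), max(a, b)
    return 0.5 * (1.0 - p(lo)) + 0.5 * p(hi)

def simulate(a, b, trials=100_000, seed=0):
    """Play the game repeatedly with a fixed pair; return the share of correct guesses."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Bob's fair coin toss decides which envelope we get to see.
        seen, hidden = (a, b) if rng.random() < 0.5 else (b, a)
        # Guess "lower" with probability p(seen), "higher" otherwise.
        guess_lower = rng.random() < p(seen)
        if guess_lower == (hidden < seen):
            wins += 1
    return wins / trials

if __name__ == "__main__":
    for pair in [(1.0, math.pi), (10.0, 11.0)]:
        print(pair, win_probability(*pair), simulate(*pair))
```

Running it is an easy way to check the quoted percentages for yourself. Note that the simulation fixes Alice's two numbers in advance; it says nothing about how she chose them.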

Our geek hero's method really boils down to this:

Assume a distribution over all elements of the range over which Alice draws. This might be the Normal (Gaussian) distribution, or something crazy that focuses on even numbers, or who knows what. The author chose a Logistic distribution.
Write down its cumulative distribution function, which is a function CDF(x) that gives the percent of the distribution that is less than x. It has the properties described above: zero at negative infinity and one at positive infinity.
Make your random draw, then check your CDF(x) table to find the odds that the true value is less than x. If the integral of the probability distribution up to x is 92% of the distribution, then you've got 92% odds that the other envelope is smaller.
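
The three steps above can be sketched in a few lines of Python. This is only an illustration: the two candidate models (a standard Logistic and a standard Normal) and the helper name odds_other_is_smaller are assumptions made for the example, and any other distribution with a CDF would slot in the same way.

```python
import math

def logistic_cdf(x):
    """CDF of the standard Logistic distribution, written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a Normal(mu, sigma^2) distribution via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def odds_other_is_smaller(x, cdf):
    """Step 3: under the assumed model, the share of the distribution below x."""
    return cdf(x)

if __name__ == "__main__":
    x = 2.0  # the number we saw in the open envelope
    for name, cdf in [("logistic", logistic_cdf), ("normal", normal_cdf)]:
        print(name, odds_other_is_smaller(x, cdf))
```

The same observed x produces different odds under each assumed model, so the number that comes out is a statement about the model picked in step one.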

Of course, the odds you get at the end are in the context of the model you assumed in step one. That is, the odds calculation that the solution used, N = 0.5*(1 - p(A))..., uses the function p(⋅) that the author picked, not one that is objectively true.

In other words, we assumed a subjective distribution, and then tested our results against the subjective distribution. This says nothing about the true odds of anything Alice may do.

It is very much worth picking apart why this is wrong, and why the statistical illusions are sufficiently seductive that a smart guy like our cartoonist-geek would fall for them. But first, a comic interlude.

Three guys are stranded on a desert island

And all they have to eat is a case of canned pears. The joke is that they're all researchers.

The physicist says: `we can mill down these coconut husks into lenses, then focus the heat of the sun on the cans. When their temperature rises enough, the seams will burst!'

The chemist says: `No, that'll take too long. Instead, we can refine sea water into a corrosive, that will eventually just melt the can open!'

The biologist cuts him off: `I don't want salty pears! But I've found a yeast that is capable of digesting metals. With care and cultivation, we can get them to eat the cans open.'

The economist finally stands up and smiles: `You are all trying too hard, because it's very simple: assume a can opener.'

Pause for laughter.

I've found that this joke is so commonly told among economists that you can just tell an economist `you're assuming a can opener' and they'll know what you mean. It's also a good joke for parties because people always come up with new ways to open the can. What would the lit major do?

An improper distribution

The problem in the envelope problem above is that there is no way to describe a truly neutral belief that any real number is possible. The best we can do is an Improper Uniform, which is a common dummy distribution where P(x) = 1 for all values of x. It integrates to infinity, not one, so it isn't a true probability distribution. But it's all we know.
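
Written out, the failure to integrate to one looks like this, taking the integral over the whole real line as the limit of integrals over the symmetric interval [-a, a]:

```latex
\int_{-\infty}^{\infty} P(x)\,dx
  = \lim_{a \to \infty} \int_{-a}^{a} 1\,dx
  = \lim_{a \to \infty} 2a
  = \infty
```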

I think that people who approach the above problem really do want to apply something like the procedure above, in the sense that there's an intuitive drive to write down some odds of any given value appearing in the other envelope. We're ignorant, but we really want to say something anyway.

The first conundrum is what to do with the infinite domain of the probability distribution. Those who don't do much in the way of statistics, and engineers, will often jump on the practical impossibility of drawing any real number--you'll see this in the comment thread on the linked page. Taking practicalities into account, you can set a uniform distribution over, say, the range of a PC's long double floating-point numbers. That's a well-defined Uniform distribution.

Those who have had a stats class or two know that the infinite domain is not an issue. Look at the Normal distribution: for a Normal with mean zero and variance one, a value of a million is not one to bet on, but still has a supremely small sliver of possibility. The author chose the Logistic distribution, which also has infinite domain.

The nice thing about the Normal, or the distributions over a finite span, is that the CDF is well-defined, so you can follow the procedure above: write down a distribution, make a random draw, find the integral up to your random draw.

For an improper uniform, this breaks. The CDF is either undefined or defined by custom, as you prefer. In the Apophenia library, apop_cdf(your_data, apop_improper_uniform) returns 0.5 no matter what data you input. One could argue that it should return INFINITY every time, and I'd respect that opinion too.

For this case, a half expresses the idea better: once you open envelope A, the odds of B's value being higher are still just fifty-fifty. So the Improper Uniform really gets it right here, but in a manner that makes people queasy, because the probability distribution is ill-defined, as revealed by its failure to have a CDF.
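
One way to see why a half is a natural convention is to take a proper Uniform on [-a, a] and let a grow. The sketch below is only an illustration of that limit; it is not a description of how Apophenia computes its answer, and uniform_cdf is a name made up for the example.

```python
def uniform_cdf(x, a):
    """CDF of a proper Uniform distribution on [-a, a], evaluated at x."""
    if x <= -a:
        return 0.0
    if x >= a:
        return 1.0
    return (x + a) / (2.0 * a)

# Hold the observed value fixed and widen the support.
for a in (1e2, 1e4, 1e8, 1e16):
    print(a, uniform_cdf(42.0, a))
```

For any fixed x, the value (x + a)/(2a) approaches 0.5 as a grows, whatever x happens to be.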

I could go on about the assumptions behind the Logistic distribution, but you can surely spot most yourself. It assumes zero is the halfway point, and that values closer to plus or minus infinity are less likely than values near zero. It assumes that 10 is less likely than nine (i.e. that Alice is trying to not behave like a human). The setup above also implicitly assumes that A and B are independently drawn. So the author made a number of assumptions in the form of the Logistic distribution, and then used those assumptions to assign subjective probabilities. That's a fine human way to approach the problem, but works exactly to the extent that you believe the assumptions.

Cracking open a model with no tools

Now back to the real world. You are running the numbers on a model regarding data you have collected. To keep this simple, let's say that you're running an Ordinary Least Squares regression on a data set of canned pear sales and education levels. You have the data set, then run OLS to produce a set of coefficients, β, and p-values indicating the odds that the βs are different from zero.

Those p-values are generated using exactly the procedure listed above: assume a distribution of the βs, write down its CDF, then measure how much of the CDF you assumed lies between zero and β.
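
To make the mechanics concrete, here is a minimal sketch for a one-regressor OLS, using only the Python standard library. Several things here are assumptions for illustration: the data are synthetic (generated from a seed, not real pear sales), the function names are hypothetical, and the p-value uses a large-sample Normal approximation for the slope where a textbook would use a t distribution.

```python
import math
import random
from statistics import NormalDist

def make_synthetic_data(n=200, seed=1):
    """Hypothetical data: education in years (x) and pear sales (y)."""
    rng = random.Random(seed)
    x = [rng.uniform(8.0, 20.0) for _ in range(n)]
    y = [3.0 + 0.5 * xi + rng.gauss(0.0, 2.0) for xi in x]
    return x, y

def ols_slope_pvalue(x, y):
    """Fit y = alpha + beta*x by OLS; return (beta, std_error, p_value)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    # Residual variance, estimated under the model's own error assumptions.
    ssr = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    std_error = math.sqrt(ssr / (n - 2) / sxx)
    # Assume the estimate is Normal around zero with this standard error,
    # then read the two-sided tail mass off the assumed CDF.
    z = beta / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return beta, std_error, p_value

if __name__ == "__main__":
    x, y = make_synthetic_data()
    print(ols_slope_pvalue(x, y))
```

Every quantity feeding the p-value, from the residual variance to the Normal CDF, comes from the model being tested.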

In this case, we have a well-established distribution, so the mind-benders about a distribution with infinite range and so on are gone. But we're still assuming a can opener. We used the assumptions of the model--that errors are normally distributed with mean zero and a variance that is a function of the data--to state the confidence with which we believe the very same model.

To make this as clear as possible: we used the model assumptions to write down a probability function, then used that probability function to test the model. This is exactly how our blog author turned no information at all into an odds measured to five decimal places. But making an assumption does not add information.

Pick up any empirically-oriented journal, and in every paper, this is how the confidence intervals will be reported, by assuming that the model is true with certainty and can be used to objectively state probabilities about its own veracity.

So ¿why doesn't all of academia fall apart?

First, many of the assumptions of these models are rooted in objective fact: given such-and-such a setup, errors really will be Normally distributed. We could formalize this by writing down tests to test the assumptions of the main model, though for our purposes there's no point--they'll just fall victim to the same eating-your-own-tail problem. Even lacking extensive testing, if the data generation process is within spitting distance of a Central Limit Theorem, we'll give it the benefit of the doubt that there is an objective truth to the distribution.

Second, we can generalize that point to say that the typical competently-written journal article's assumptions are usually pretty plausible, or at least do little harm. When they report that one option is more likely than another, that is often later verified to actually be true, though the authors had used subjective tools to state subjective odds.

Third, we shouldn't believe p-values--or any one research study--anyway. A model with fabulous p-values will increase our subjective confidence that something real is going on. But if you read that a p-value is 99.98%, ¿do you really believe that in exactly two out of 10,000 states of the world, the difference is not significant? Probably not: you just get a sense of greater confidence.

So this works because we treat the process as subjective. The authors made up a model, and used that model to state the odds with which the model is true. But if we agree that the model seems likely, and if we accept that the output odds are just inputs to our own subjective beliefs, then we're doing OK. Problems only arise when we pretend that those p-values are derived from some sort of objective probability distribution rather than the author's beliefs as formalized by the model.


Replies: A comment

on Tuesday, March 2nd, Ted Alper said

Wait, I think you've allowed your focus on models to cause you to misunderstand the intended meaning of the [correct!] solution given in xkcd -- and lose the beauty of the main point, which is a way of systematically favoring larger numbers over smaller ones, even without knowing how the numbers were generated. The model Bob uses to pick his number need not have anything to do with the model Alice uses -- all that's necessary is that it has some positive measure in every subinterval. It's true that his chance of success is BEST if his model matches hers, but he'll do better than 50% because his model shares one important feature with hers:
Both Bob's model and Alice's share the property that their cumulative distribution functions are non-decreasing. [Bob's also needs to be monotonically increasing, but Alice's doesn't]
