Tip 54: Put functions in your structs

17 January 12.

level: your structures are getting diverse
purpose: maintain a standard form over diversity

Part of a series of tips on POSIX and C. Start from the tip intro page, or get 21st Century C, the book based on this series.

Since the last two episodes included some long sample code, this time I'm going to just tell you about Apophenia, the library of stats functions that co-evolved with Modeling with Data.

One of the key data structures is intended to represent a statistical model. Broadly, statistical models are all the same: the box diagram would have parameters on one side (think of the mean and variance of a Normal distribution, μ and σ), then another arrow for the data (a data point, x), and the black box would spit out a probability, P(x| μ, σ).

So our struct will have elements named parameters and data, which are not especially challenging. There should be a probability function to go from parameters and data to P(data | parameters), and here we run into a problem: there is a different calculation for every model. The math for a Normal distribution has nothing to do with the math for a Poisson distribution or a Binomial distribution. So a single function prob(parameters, data, model) with one fixed body won't work.

The cleanest solution is to put a slot for the function inside the struct typedef, and associate a new function with every struct:

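typedef struct apop_model apop_model; //forward declaration, so the next typedef can refer to it

//every method takes in the data and the model, and returns a double.
typedef double (*model_to_double)(apop_data *, apop_model *);

struct apop_model {
    apop_data *parameters, *data;
    model_to_double prob, log_likelihood; //model_to_double is already a pointer type
    //other elements used below, like name and estimate, are cut here.
};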

Simulation models typically use the probability, but people who work with probability distribution-based models tend to work with the log probability (i.e., the log likelihood, where the distinction between probability and likelihood is irrelevant for our purposes), so I threw in a slot for both.

Now, when we define a new model object, we set the functions as needed.

/* This is very cut down from the Apophenia source, and I'm not going
to explain the math here.
Focus on the declaration below the function.
*/
static double apply_me(double x, void *mu){ return gsl_pow_2(x - *(double *)mu); }

static double normal_log_likelihood(apop_data *d, apop_model *params){
    //check that params->parameters is not null here.
    //tsize is the count of data points in d; its declaration is among the parts cut.
    double mu = apop_data_get(params->parameters, 0, -1);
    double sd = apop_data_get(params->parameters, 1, -1);
    return -apop_map_sum(d, .fn_dp = apply_me, .param = &mu)/(2*gsl_pow_2(sd))
           - tsize*((M_LNPI+M_LN2)/2 + log(sd));
}

apop_model apop_normal= {.name="Normal distribution", .log_likelihood=normal_log_likelihood};

Hey, wait--after all that I didn't define the probability function. But it's easy to calculate: the log likelihood is log(probability), so probability = exp(log likelihood). Rather than writing out that conversion in every model, here is a dispatch function prob, which lives outside of the struct, and is used the way all functions were used before we started putting methods inside structs. We usually see the struct as the first input to such functions, but Apophenia's rule for ordering the inputs is that the data always comes first.

double prob(apop_data *d, apop_model *p){
    assert(p->parameters); //I expect users have set this before calling.
    if (p->prob)
        return p->prob(d, p);
    else if (p->log_likelihood)
        return exp(p->log_likelihood(d, p));
    else Apop_error("I need either the prob or log likelihood "
                    "methods in the input model.");
}

If there were a sensible default method for calculating the probability given data and parameters, we'd put it here in the dispatch function. For the estimation of a model (find the most likely parameters given data), there actually is a default, in the form of maximum likelihood methods. So the estimate routine looks like

//again, this is cut.
apop_model * apop_estimate(apop_data *d, apop_model *p){
    if (p->estimate)
        return p->estimate(d, p);
    else
        return apop_maximum_likelihood(d, p);
}

If we put a function inside of the struct, then we need to point the struct to the right function on initialization every time. If we can reasonably expect that the function will be different every time, then that's exactly what we need.

If you have a new_struct function that gathers together the input data and functions and spits out a cleaned-up struct, then you can use that setup function to assign a default function, the way that apop_estimate used the maximum likelihood function when the struct didn't have its own estimate routine.
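A sketch of that setup-function route, with hypothetical names throughout (toy_model_new plays the role of new_struct, and toy_default_estimate is a placeholder for something like a maximum likelihood search):

```c
#include <assert.h>
#include <stddef.h>

typedef struct toy_model toy_model;
typedef void (*toy_estimator)(toy_model *m);

struct toy_model {
    double param;
    toy_estimator estimate;
};

/* Placeholder for a general-purpose search such as maximum likelihood. */
static void toy_default_estimate(toy_model *m){ m->param = 1; }

/* A model-specific routine, for comparison. */
static void toy_custom_estimate(toy_model *m){ m->param = 2; }

/* Setup function: fills in the default when the caller passes NULL. */
toy_model toy_model_new(toy_estimator estimate){
    return (toy_model){.estimate = estimate ? estimate : toy_default_estimate};
}

/* The slot is always set after toy_model_new, so no fallback is needed here. */
void toy_run_estimate(toy_model *m){
    assert(m && m->estimate);
    m->estimate(m);
}
```

The trade-off is that the guarantee holds only for structs built via toy_model_new. A struct set up any other way may still have an empty slot.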

I prefer to set up structures using designated initializers, so that means I need a dispatch function that checks for the method being called and uses a default as needed.

I want this
Further, let me point out a sad fact about methods inside of structs. You pretty much always need to send in the struct itself, which means redundancy. If we didn't have the dispatch function, we'd have redundant calls like normal_dist.p(data, normal_dist).

C++-type languages have a special rule that the first argument of a function inside a struct will be the struct itself. Either you write the method with the appropriate first argument, giving a header like normal_estimate(apop_model this, apop_data *d); or the system just defines a special variable this that points to the parent struct.

C doesn't define magic variables for you, and it is always honest and transparent about what parameters get sent in to a function. Normally, if we want to futz around with the parameters of a function, we do it with the preprocessor, which will gladly rewrite f(anything) to f(anything else). However, a macro's expansion depends only on what is inside the parens; it never sees the s. in front of the name. So there's no way to get the preprocessor to transform the text s.prob(d) to s.prob(s, d). If you don't want to slavishly imitate C++-type syntax, you can write a macro

#define prob(s, ...) s.prob(s, __VA_ARGS__)

But now you've cluttered up the global namespace with this prob symbol. So there's one more reason to use a dispatch function: it can take care of the redundancy for you.
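The two options can be put side by side in a small sketch. The names are hypothetical, and the macro is called toy_prob_macro here, not prob, so that both versions can coexist in one file.

```c
#include <assert.h>

typedef struct toy_model toy_model;
struct toy_model {
    double scale;
    double (*prob)(toy_model m, double x);
};

/* A stand-in method; the math is beside the point. */
static double scaled(toy_model m, double x){ return m.scale * x; }

/* Macro approach: the caller names the struct once, the macro repeats it.
   The argument s is evaluated twice, and the macro name is visible in
   every file that includes this definition. */
#define toy_prob_macro(s, ...) (s).prob((s), __VA_ARGS__)

/* Dispatch function approach: an ordinary function that obeys normal
   scoping rules, evaluates its argument once, and can check its inputs. */
double toy_prob(toy_model *m, double x){
    assert(m && m->prob);
    return m->prob(*m, x);
}
```

Given toy_model m = {.scale = 3, .prob = scaled}, both toy_prob_macro(m, 2.0) and toy_prob(&m, 2.0) call scaled with m and 2.0, but only the function version has somewhere to check for a missing method or to supply a default.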