Mapping functions

21 August 13. [link] PDF version

This is another note on useful things Apophenia does; today's entry is about applying a function to every element of a matrix, vector, or data set.

This form helps in thinking about program flow, and makes for clearer writing. It is also a good hook for threading, and for some situations gives you threaded programs almost for free. This is the map/reduce paradigm: get the log likelihood of every observation separately, perhaps on different cores, then sum up the likelihoods to get the total log likelihood for the data set.

In some interpreted languages there's some sort of nontrivial difference between a simple for loop over all the rows in a data set and a vectorized function, perhaps because the interpreter has to drop down to C on every line of the for loop, but consolidates those calls via the vectorizer. In C, there is exactly zero overhead to calling C functions, so having a function-applying function is thus largely a convenience to help with writing readable code and to make threading easier.

You would first write a callback function, a function whose sole intent is to be sent to the map or apply function; for finding log likelihood of a data set, this would be the one-observation log likelihood. Then the mapping function has the task of applying the function to every element of the data set.

Here's a simple example, which fills a $5\times 5$ matrix with random draws by calling the draw callback for every element, then produces a new matrix that takes the log of those elements by applying the ln callback to every element. The apply function changes the input matrix in place; the map function uses the callback to map the input matrix to a new output matrix.

#include 

gsl_rng *r;

void draw(double *in){*in = gsl_rng_uniform(r);}
double ln(double in){return log(in);}

int main(){
    r = apop_rng_alloc(123);
    apop_data *d = apop_data_alloc(5, 5);
    apop_matrix_apply_all(d->matrix, draw);
    gsl_matrix *log_d = apop_matrix_map_all(d->matrix, ln);
    apop_data_show(d);
    printf("\n");
    apop_matrix_show(log_d);
}

This was my first draft of a mapping setup, with several functions (which you can look at in the appropriate segments of the Apophenia documentation). Notice how the random number generator gets passed in to the draw function: it's just global to the file. This is universally deemed to be a lazy way to do it, but I never removed it from the Apophenia interface because I still use in my scripts all the time. There isn't really a `global scope' in C, just scope from one point to the end of the file, and sometimes that's not fatal to readability or maintainability. However, if the global variable isn't read-only (and the RNG isn't), then you can't use this for threading.

A replication example

The next extended example is taken from the blog of Nathan Lemoine, who used a replication experiment as a timing test between R and Python.

I'm out of the timing-test business, because I've come to realize that the people who use slow platforms are the people who don't care about speed and evidently never face situations where speed matters. The C code here is still faster than the R code a commenter on that page offered using lm.fit; I imagine that it took me about as long to write the sample code here as it would take an R partisan to write the sped-up R version; and it's up to you to decide whether C fits into your workflow.

In the setup, create a fake data set, where the first dependent column is an index and the second is a draw from ${\cal N}(0, 100)$, and the dependent variable is 2*index + the draw. Then run a regression, producing an expected value of the dependent variable and a residual. Then for the 10,000 replications, build a fake dependent variable by adding the shuffled-up residuals to the same expected values, and re-run the regression. Mark down the 10,000 coefficients on the second term. Report the mean of those coefficients. I'll have more notes on the use of the mapping functions below, and as usual, it helps to read the bottom function first.


#include <gsl/gsl_permute_vector.h>
#include <apop.h>

typedef struct {
    gsl_rng *r;
    apop_model *model;
} rowset_s;

double set_a_row(apop_data *in, void *in_struct, int index){
    rowset_s *s = in_struct;
    double draw;
    apop_draw(&draw, s->r, s->model);
    apop_data_fill(in, 2*index + draw, index, draw);
    return 0;
}

typedef struct {
    apop_data *data;
    gsl_vector *resid, *yhat;
} rep_s;

double one_replication(apop_data *ignore, void *in, int index){
    gsl_rng *r = apop_rng_alloc(index);
    rep_s *rs = in;
    gsl_permutation * p = gsl_permutation_alloc (rs->resid->size);
    gsl_permutation_init(p);
    gsl_ran_shuffle(r, p->data, rs->resid->size, sizeof(size_t));
    
    //Can't write to the input, so build a new one.
    //point to the original matrix of independents,
    //but generate a new dependent var in alt_data->vector.
    apop_data *alt_data = apop_data_alloc();
    alt_data->matrix = rs->data->matrix;
    alt_data->vector = apop_vector_copy(rs->resid);
    gsl_permute_vector(p, alt_data->vector);
    gsl_permutation_free(p);
    gsl_vector_add(alt_data->vector, rs->yhat);

    apop_model *mb = apop_estimate(alt_data, apop_ols);
    double out = apop_data_get(mb->parameters, .row=1, .col=-1);

    //Clean up. Even at this scale, this is optional.
    alt_data->matrix = NULL;
    apop_data_free(alt_data);
    apop_model_free(mb);
    gsl_rng_free(r);
    return out;
}

int main(){
    apop_opts.thread_count = 4;
    gsl_rng *r = apop_rng_alloc(123);
    apop_data *d = apop_data_alloc(101, 3);
    apop_map(d, .fn_rpi=set_a_row, 
                .param=&((rowset_s){.model=apop_model_set_parameters(apop_normal, 0, 10), .r=r}));
    apop_model *m = apop_estimate(d, apop_ols);
    apop_model_show(m);

    apop_data *rpage = apop_data_get_page(m->info, "<Predicted>");
    Apop_col_t(rpage, "Predicted", yhat)
    Apop_col_t(rpage, "Residual", resid)

    int runs = 10000;
    Apop_model_add_group(&apop_ols, apop_parts_wanted);
    apop_data *out = apop_data_alloc(runs);
    apop_map(out, .fn_rpi=one_replication, .inplace='y',
                  .param = &((rep_s){.data=d, .yhat=yhat, .resid=resid}));

    printf("mean replicated coefficient = %g\n", apop_mean(out->vector));
}

[Previous entry: "Interacting with C code"]
[Next entry: "Vector or matrix"]