Uncertain ABM settings

24 July 13. [link] PDF version

Part of a series of posts that started here, riffing on this paper.

$\def\Re{{\mathbb R}} \def\datas{{\mathbb D}} \def\params{{\mathbb P}} \def\models{{\mathbb M}} \def\settings{{\mathbb S}} \def\mod#1{M_{#1}} \def\muv{\boldsymbol{\mu}} \def\Sigmav{\boldsymbol{\Sigmav}}$

Last time, I talked about how a good narrative model accommodates uncertainty all the way through, as opposed to a model that posits a deterministic relationship $f(x)$ and then fuzzes it into $f(x) + \epsilon$, where $\epsilon$ is random error.

Bayesian Updating is one means of adding uncertainty earlier in the pipelin, but it has a specific form: we start with a prior model with parameter and data space $\params_p$ and $\datas_p$ and a likelihood model with spaces $\params_L$ and $\datas_L$. If $\datas_p=\params_L$, the it makes sense to draw $RNG_p(p)$, with $p\in\params_p$, and then accept or reject them based on the likelihood $L_L(d, p)$, where $d\in \datas_L$. This formulation is a bit like a stream of water with a valve, or a transistor: a stream of draws from $\datas_p$ comes through, and the likelihood allows some portion of the flow through but blocks other parts. The final product must be a set of observations in $\datas_p=\params_L$, or if you can parameterize it, new parameters in $\params_p$.

But there may be other parts to a model. Here, I make a distinction between parameters and settings, where I define those two classes of variable with a question: would you optimize over the variable with a maximum likelihood search? If yes, then it's a parameter; else it's a setting. For example, our model might put out a histogram, and the number of bins is worth tweaking but typically not optimized. A noncentral $t$ distribution can be defined to have mean and variance parameters $\mu$ and $\sigma$, but one typically does not optimize over the degrees of freedom. Agent-based models invariably have settings like the number of agents or whether they travel on diagonals or the number of other agents an agent can connect with in a period. The distinction is a little subjective, and there are cases where one would optimize over the number of bins and such, but that decision defines the model's parameter space, and a different decision defines a different parameter space and thus a different model.

There are people who are skeptical of any model with settings. There's a certain simplicity to a Normal distribution with its two parameters and zero settings—even though it's really a noncentral $t$ distribution with the degrees of freedom ($df$) setting fixed at the clearly incorrect value of $df=\infty$. It is clear that nobody cheated by picking a convenient value of the settings because there are no settings to set.

This is an æsthetics argument, not really about scientific inquiry. But there is one thing we can do: we can add uncertainty to the settings. It's not a prior, because in today's example the likelihood function (a data-composition between a search microsimulation and a Weibull distribution) has three spaces: $\params_L$, $\datas_L$, and a settings space $\settings_L$. We'll add a `prior', with $\datas_p = \settings_L$, and then observe the distribution of parameters ($\in \params_L$). There is no explicit prior distribution over $\params_L$ but we'll get one as output.

A spatial ABM of search

In today's example, we'll have two types of agents. You can think of them as boys and girls, or employees and employers, or just Type A and Type B.

This is a spatial search model:

We can expect that initially, when there are a lot of agents wandering around, that some agents will find each other quickly. Eventually, we'll be down to only one A agent and one B agent, and they will wander for a long time before pairing up.

In this case, the ratio of agent count to grid size really matters: 90 agents on a $10\times 10$ grid is a very different search from 90 agents on a $1,000 \times 1,000$ grid.

This is the most involved simulation to date, which means that it looks like the most typical C program to date. Also, it's fully 90 lines though I could play golf and cut the line count some more, at the cost of readability. Passing on the C trickery (the elaborate macros, the mechanics of for loops, and all the pointer use without malloc or free), here are some notes that might be interesting to an ABMer:


#include <apop.h>
#include <stdbool.h>

typedef struct{
    gsl_rng *rng;
    int x, y;
    char type;
    bool done;
} agent_s;

#define gridpt(x, y) grid[(x)*grid_size + (y)]

#define xoff_yoff_loop(...)                     \
    for (int xoff=-1; xoff <=1; xoff++)         \
        for (int yoff=-1; yoff <=1; yoff++) {   \
            if (a->x + xoff >= grid_size || a->x + xoff < 0 || \
                a->y + yoff >= grid_size || a->y + yoff < 0 || (!xoff & !yoff) ) \
                continue;                       \
            __VA_ARGS__                         \
        }

agent_s *search_for_mate(agent_s *a, agent_s **grid, int grid_size){
    xoff_yoff_loop (
        agent_s *b = gridpt(a->x+xoff, a->y+yoff);
        if (b && b->type!=a->type) return b;
    )
    return NULL;
}

void step(agent_s *a, agent_s **grid, int grid_size){
    int open_ct = 0;
    xoff_yoff_loop (
        if (!gridpt(a->x+xoff, a->y+yoff)) open_ct++;
    )
    if (!open_ct) return; //landlocked, can't move
    int move = gsl_rng_uniform(a->rng) * open_ct;
    xoff_yoff_loop (
        if (!move--) {
            gridpt(a->x, a->y) = NULL;
            a->x += xoff;
            a->y += yoff;
            gridpt(a->x, a->y) = a;
            return;
        }
    )
}

void generate_agents(agent_s **grid, int grid_size, int pop_size, agent_s *out){
    for(int i=0; i< pop_size; i++){
        agent_s *a = out+i;
        *a = (agent_s){.rng = apop_rng_alloc(apop_opts.rng_seed++),
                       .type = (i % 2) ? 'A' : 'B'};
        do{ //find a vacant spot
            a->x = gsl_rng_uniform(a->rng) * grid_size;
            a->y = gsl_rng_uniform(a->rng) * grid_size;
        } while (gridpt(a->x, a->y));
        gridpt(a->x, a->y) = a;
    }
    out[pop_size] = (agent_s){}; //empty stopper.
}

void run_sim(double *durations, gsl_rng *r, apop_model *m){
    int grid_size = ((double*)m->more)[0];
    int pop_size = ((double*)m->more)[1];
    int done_ctr = 0, period = 1;
    pop_size *=2; //guarantee evenness.
    assert(pop_size <= pow(grid_size,2));

    agent_s alist[pop_size+1];
    agent_s *grid[grid_size * grid_size];
    memset(grid, 0, grid_size*grid_size*sizeof(agent_s*));
    generate_agents(grid, grid_size, pop_size, alist);
    
    do {
        for (agent_s *a=alist; a->rng; a++){
            if (a->done) continue;
            agent_s *b;
            if (a->type=='A' && (b=search_for_mate(a, grid, grid_size))){
                gridpt(a->x, a->y) = gridpt(b->x, b->y) = NULL;
                a->done = b->done = true;
                durations[done_ctr++] =period;
            }
            step(a, grid, grid_size);
        }
        period ++;
    } while (done_ctr < pop_size/2);
}

apop_model search_sim = {"A search on a grid", .draw=run_sim};

At the end of this, I have a list of agent exit times. The rate at which agents exit changes with time. This is also the story of a Weibull distribution, so it'd be interesting to see how well one fits. A Weibull has two parameters: $\lambda$ follows your intuition from Exponential or Poisson distributions: if $k=1$, then $\lambda$ is the mean time to exit. If $k<1$, then the time to exit is slower for those that are still present later in the game: the more attractive (or more centrally-located) get picked up quickly, and the hard-to-match take several $\lambda$s' time to find a match. So for this simulation, we expect that $\lambda$ grows as the grid gets more sparse, and $k$ should be noticeably less than one.

The code below sets up a Weibull model as a first-class model. As with most models that assume iid, the likelihood of one element can be written easily, and then the total likelihood is the sum of the map of that function onto every element in turn. I had to add a constraint to keep the MLE search routine from setting $\lambda$ or $k$ to zero.

The one_run function sets the settings for the simulation, makes a few thousand draws, and estimates the parameters of a Weibull using those draws. It's just warm-up so we know that everything works.

The real fun is in putting a prior on the grid size and population settings, as the fuzz function does. The output is a PMF of 100 Weibull parameters produced using maximum likelihood estimation based on 1,000 draws from a simulation that typically runs for hundreds or thousands of periods—and it still runs fast enough on my cheap laptop.

Given the output draws of $(\lambda, k)$, you can plot and admire them, as I did. I'm sticking to my policy of not displaying output, because it's the last incentive I can give to get you to run the code yourself.


#include <apop.h>
extern apop_model search_sim;

double one_weibull(double d, void *params){
    double lambda = apop_data_get(params, 0);
    double k = apop_data_get(params, 1);
    return logl(k) - logl(lambda)
        + (k-1)*(logl(d) - logl(lambda))
        - powl(d/lambda, k);
}

static long double positive_params(apop_data *data, apop_model *v){
    return apop_linear_constraint(v->parameters->vector);
}

long double weibull_ll(apop_data *d, apop_model *m){
    return apop_map_sum(d, .param = m->parameters, .fn_dp=one_weibull, .part='a');
}

apop_model weibull = {"The Weibull", .vsize=2, .log_likelihood = weibull_ll,
            .constraint=positive_params};

void one_run(int grid_size, int pop_size){
    printf("—— A run with a %i X %i grid and %i agents:\n", grid_size, grid_size, pop_size);
    search_sim.dsize = pop_size;
    double params[2];
    search_sim.more = params;
    params[0] = grid_size;
    params[1] = pop_size;
    apop_model *model_out = apop_estimate(apop_model_draws(&search_sim, 1000), weibull);
    apop_model_show(model_out);
}

apop_model *fuzz(apop_model sim){
    int draws = 100;
    gsl_rng *r = apop_rng_alloc(1);
    apop_model *prior = apop_model_stack(
                            apop_model_set_parameters(apop_normal, 10, 2),
                            apop_model_set_parameters(apop_normal, 10, 2));
    apop_data *outdata = apop_data_alloc(draws, weibull.vsize);
    apop_prep(NULL, &sim);
    double params[2];
    sim.more = params;
    for (int i=0; i< draws; i++){
        do {
        apop_draw(params, r, prior);
        } while (params[1]*2 > pow(params[0], 2));
        sim.dsize=params[1];
        apop_model *est = apop_estimate(apop_model_draws(&sim, 1000), weibull);
        Apop_matrix_row(outdata->matrix, i, onerow);
        gsl_vector_memcpy(onerow, est->parameters->vector);
        apop_model_free(est);
    }
    return apop_estimate(outdata, apop_pmf);
}

int main(){ 
    one_run(10, 10);
    one_run(100, 10);
    one_run(10, 45);

    apop_model *fuzzed = fuzz(search_sim);
    apop_data_print(fuzzed->data,.output_file="outdata");
}

[Previous entry: "Bayesians and ABMs"]
[Next entry: "Let us not forget the basics"]