Modeling with Data

Updates

Modeling with Data is a bound book; Apophenia is an evolving code base. See this blog post for a full discussion of the conflict. Here, I point out specific features and improvements to Apophenia that a reader of Modeling should be aware of. Most are improvements, meaning that they make work easier than the procedures described in the text, but some do break the code in the book.

Optional arguments

Many functions in Apophenia now accept optional, named arguments. For example, apop_data_get_ti, apop_data_get_tt, and so on, are now rolled into apop_data_get. Given the right setup, here are some valid uses of the expanded functions:
apop_data_get(dataset, 3, .colname="population");
apop_data_set(dataset, .colname="population", .row=3, .val=174);
double *group4pop = apop_data_ptr(dataset, .rowname="Group 4", .colname="population");

Yes, this is 100% standards-compliant C.
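To show why this can be standard C, here is a minimal sketch of one way to get optional, named arguments: a variadic macro wraps its arguments in a compound literal with designated initializers. This is an illustration of the general technique, not Apophenia's actual source; toy_data, toy_get_args, toy_get_base, and toy_get are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy data set: a row-major grid of doubles with named columns. */
typedef struct {
    size_t rows, cols;
    const char **colnames;
    double *vals;
} toy_data;

/* Every argument, optional or not, travels in one struct.
   Fields the caller does not mention are zero-initialized. */
typedef struct {
    const toy_data *d;
    size_t row, col;
    const char *colname;
} toy_get_args;

/* The function that does the work, taking the argument struct. */
static double toy_get_base(toy_get_args a){
    assert(a.d != NULL);
    size_t col = a.col;
    if (a.colname){
        /* A column name, if given, overrides the numeric column. */
        size_t j = 0;
        while (j < a.d->cols && strcmp(a.d->colnames[j], a.colname)) j++;
        assert(j < a.d->cols);   /* the named column must exist */
        col = j;
    }
    assert(a.row < a.d->rows && col < a.d->cols);
    return a.d->vals[a.row * a.d->cols + col];
}

/* The macro turns a call's arguments into a compound literal.
   Positional arguments fill the struct in declaration order;
   .name=value arguments fill the field they name. */
#define toy_get(...) toy_get_base((toy_get_args){__VA_ARGS__})
```

With a toy_data d in hand, toy_get(&d, 1, .colname="population"), toy_get(&d, 0, 1), and toy_get(&d, .row=1) are all valid calls under this sketch. Because positional arguments fill fields in declaration order, an argument placed in the wrong position lands in the wrong field, which is the kind of pitfall the apop_dot note describes.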

p 130: the awkward instructions on using apop_dot are now obsolete and, in one case, incorrect. Notably, when d1 is a vector and d2 a matrix, apop_dot(d1,d2,'t') won't work, because of the order in which the arguments are filled. Instead use apop_dot(d1,d2,.form2='t') or apop_dot(d1,d2,0,'t').

Setting settings

I regret the inconvenient and unpleasant methods of initializing settings groups for the models. The optional argument syntax simplifies the settings setup. Instead of this (taken from lrtest.c, p 352):
Apop_settings_add_group(&apop_ols, apop_mle, &apop_ols);
Apop_settings_add(&apop_ols, apop_mle, starting_pt, unconstr->parameters->vector->data);
Apop_settings_add(&apop_ols, apop_mle, use_score, 0);
Apop_settings_add(&apop_ols, apop_mle, step_size, 1e-3);
use:
Apop_settings_add_group(&apop_ols, apop_mle, .parent=&apop_ols, .starting_pt= unconstr->parameters->vector->data, .use_score= 0, .step_size= 1e-3);

Options not specified get set to defaults as before.

Pages of data

apop_data sets now have a ->more pointer that can point to a second page of data. That second page has its own ->more pointer, so you can staple as many pages of data onto your main data set as you wish. The intent is not to produce true 3-D data sets, but to include auxiliary information, like covariances or confidence intervals.
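The stapled pages form a singly linked list, and finding a page amounts to walking the ->more pointers until a name matches. The following is a minimal sketch of that idea only; toy_page, toy_add_page, and toy_get_page are hypothetical stand-ins, and the real apop_data struct and its page functions carry much more than a name and one number.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a data set with pages. */
typedef struct toy_page {
    const char *name;        /* page title, e.g. "info" */
    double value;            /* placeholder for the page's actual data */
    struct toy_page *more;   /* next page, or NULL at the end */
} toy_page;

/* Staple newpage onto the end of the chain that starts at head. */
static toy_page *toy_add_page(toy_page *head, toy_page *newpage){
    assert(head != NULL && newpage != NULL);
    toy_page *p = head;
    while (p->more) p = p->more;
    p->more = newpage;
    newpage->more = NULL;
    return newpage;
}

/* Walk the chain and return the first page with the given name,
   or NULL if no page matches. */
static toy_page *toy_get_page(toy_page *head, const char *name){
    for (toy_page *p = head; p; p = p->more)
        if (p->name && !strcmp(p->name, name)) return p;
    return NULL;
}
```

In this sketch the main data set is simply the first page of the chain, so auxiliary pages such as "info" or "covariance" travel with it wherever it is passed.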

This is where the incompatibilities with the book come in. Many of the elements of the apop_model have been moved to pages of the data or parameters. Let m be an apop_model that has been estimated, maybe using maximum likelihood. Then:

m->llikelihood, the scalar log likelihood given the model, parameters, and data, is now in an info page attached to the parameters. Retrieve using:
apop_data_get(m->parameters, .col=-1, .rowname="log likelihood", .page="info");
m->status, for those routines that produce a status value, has been moved to the info page as well:
apop_data_get(m->parameters, .col=-1, .rowname="status", .page="info");
m->covariance is a property of the parameters more than the model (to the extent that such joint lineage can be disentangled):
apop_data *cov = apop_data_get_page(m->parameters, "covariance");
m->expected_value is now merged with prediction, and is a page stapled to the data set:
apop_data *predict_tab = apop_data_get_page(m->data, "predicted");
gsl_vector *o = Apop_col_t(predict_tab, "observed");
gsl_vector *p = Apop_col_t(predict_tab, "predicted");
gsl_vector *r = Apop_col_t(predict_tab, "residual");

These changes somewhat declutter the apop_model, and leave room for further extension. For example, most routines will also put the AIC and BIC on the info page along with the log likelihood.

Apop_row

Generates a view consisting of a full apop_data set with one row, not just a vector.

Histograms

The book relies on the GSL's histograms for binning, but I now recommend using a simple data set with the weights vector serving as the bins for the histogram. This lets us use existing data set/vector functions (plus a few new PMF-specific functions), accommodates N-dimensional and even text data, and is naturally sparse (if a bin has no data in it, it is simply not in the data set). However, GSL histograms will be much faster than this setup for accumulating a not-too-sparse 1- or 2-D histogram. The major changes are in smoothing.c and goodfit.c; when you download the sample code, you will find that those two scripts have been entirely rewritten.
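To make the sparse representation concrete, here is a minimal sketch in which each stored row is a bin and a parallel weights array holds the count for that bin. It reads "the weights vector serving as the bins" as "weights hold per-bin counts", which is an interpretation; toy_pmf, toy_pmf_add, and toy_pmf_weight are hypothetical names, and the fixed-size arrays and equal-width binning are simplifying assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_BINS 64

/* Hypothetical sparse histogram: only bins that have received data
   are stored, each with a weight counting its observations. */
typedef struct {
    size_t n;                        /* number of bins stored so far */
    long bins[TOY_MAX_BINS];         /* bin index: floor(x / width) */
    double weights[TOY_MAX_BINS];    /* observations in that bin */
} toy_pmf;

/* Add one observation: bump the weight of its bin if the bin is
   already stored, else append a new bin with weight 1. */
static void toy_pmf_add(toy_pmf *h, double x, double width){
    assert(h != NULL && width > 0);
    long bin = (long)(x / width);
    if (x / width < (double)bin) bin--;   /* round toward -infinity */
    for (size_t i = 0; i < h->n; i++)
        if (h->bins[i] == bin){ h->weights[i] += 1; return; }
    assert(h->n < TOY_MAX_BINS);
    h->bins[h->n] = bin;
    h->weights[h->n] = 1;
    h->n++;
}

/* Weight of a bin; bins that were never stored implicitly weigh zero. */
static double toy_pmf_weight(const toy_pmf *h, long bin){
    for (size_t i = 0; i < h->n; i++)
        if (h->bins[i] == bin) return h->weights[i];
    return 0;
}
```

Every insertion and lookup scans the stored bins, which illustrates the trade-off noted above: the layout is flexible and sparse, but a dense array indexed by bin number accumulates faster.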

vtables and the `apop_model` struct

To keep the apop_model from getting too bloated, some elements were removed, although there are still special-case calculations for some models. For these functions, there is a vtable that keeps a list of alternate functions for special cases. See the page on writing new models for details.
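As a rough illustration of the vtable idea, the sketch below keeps a small table that maps a model's log likelihood function to a special-case routine, and falls back to a generic calculation when nothing is registered. It is not Apophenia's implementation: every toy_ name is hypothetical, and keying the table on a function pointer is an assumption made for simplicity.

```c
#include <assert.h>
#include <stddef.h>

typedef long double (*toy_ll_fn)(double x);   /* a model's log likelihood */
typedef double (*toy_score_fn)(double x);     /* a special-case derivative */

typedef struct { toy_ll_fn key; toy_score_fn special; } toy_vtable_entry;

#define TOY_VTABLE_MAX 16
static toy_vtable_entry toy_vtable[TOY_VTABLE_MAX];
static size_t toy_vtable_len;

/* Register a special-case routine for the model identified by key. */
static void toy_vtable_add(toy_ll_fn key, toy_score_fn special){
    assert(key && special && toy_vtable_len < TOY_VTABLE_MAX);
    toy_vtable[toy_vtable_len++] = (toy_vtable_entry){key, special};
}

/* Look up the special-case routine for key; NULL if none is registered. */
static toy_score_fn toy_vtable_get(toy_ll_fn key){
    for (size_t i = 0; i < toy_vtable_len; i++)
        if (toy_vtable[i].key == key) return toy_vtable[i].special;
    return NULL;
}

/* Two toy models: one has a closed-form derivative, one does not. */
static long double toy_normal_ll(double x){ return -0.5L * x * x; }
static double toy_normal_score(double x){ return -x; }
static long double toy_quartic_ll(double x){ return -(long double)x * x * x * x; }

/* Derivative of the log likelihood at x: use the registered special
   case if there is one, else a numeric central difference. */
static double toy_score(toy_ll_fn ll, double x){
    toy_score_fn special = toy_vtable_get(ll);
    if (special) return special(x);
    double h = 1e-5;
    return (double)((ll(x + h) - ll(x - h)) / (2 * h));
}
```

After toy_vtable_add(toy_normal_ll, toy_normal_score), a call to toy_score(toy_normal_ll, 2.0) uses the closed form, while toy_score(toy_quartic_ll, 1.0) falls through to the numeric default. The model struct itself never needs a slot for the special case.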

Also, the p, log_likelihood, and constraint functions now return long double, not double as they did before.

The estimate method returns NULL. It modifies the model sent in as a parameter. However, apop_estimate works as before.

Removed functions

A few functions have turned out to not be very useful. There is a lot of redundancy between matrices and apop_data sets, with little benefit. Functions that do specific Gnuplot tricks have been removed, because it's not Apophenia's place to tell you what plotting package to use; find those functions in the wiki.

apop_array_to_data and apop_array_to_matrix, because double**s are uncommon in present-day code.
apop_db_merge and apop_db_merge_table. Get the code from this page.
apop_matrix_correlation. Use apop_data_correlation.
apop_matrix_fill and apop_line_to_(data|matrix). Use apop_data_fill and apop_data_falloc.
apop_vector_to_array.
apop_vector_grid_distance. Use apop_vector_distance(v1, v2, .metric='M').
