Updates
Modeling with Data is a bound book; Apophenia is an evolving code base. See this blog post for a full discussion of the conflict. Here, I point out specific features and improvements to Apophenia that a reader of Modeling should be aware of. Most are improvements, meaning that they make work easier than the procedures described in the text, but some do break the code in the book.
Optional arguments
Many functions in Apophenia now accept optional, named arguments. For example, apop_data_get_ti, apop_data_get_tt, and so on are now rolled into apop_data_get. Given the right setup, here are some valid uses of the expanded functions:
apop_data_get(dataset, 3, .colname="population");
apop_data_set(dataset, .colname="population", .row=3, .val=174);
double *group4pop = apop_data_ptr(dataset, .rowname="Group 4", .colname = "population");
Yes, this is 100% standards-compliant C.
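For readers wondering how named arguments can be standard C, here is a minimal sketch of one way to build them from two C99 features, designated initializers and variadic macros. It is not Apophenia's source: toy_table, toy_get_args, toy_get_base, and toy_get are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy table with one named column; not an Apophenia type. */
typedef struct {
    const char *colname;
    double vals[4];
    size_t nrows;
} toy_table;

/* All arguments travel in one struct. Fields the caller omits are zero or NULL. */
typedef struct {
    const toy_table *data;
    size_t row;
    const char *colname;
} toy_get_args;

/* The real work happens here, on the filled-in struct. */
double toy_get_base(toy_get_args in) {
    assert(in.data != NULL && in.row < in.data->nrows);
    /* A NULL colname means "the only column"; otherwise it must match. */
    assert(in.colname == NULL || strcmp(in.colname, in.data->colname) == 0);
    return in.data->vals[in.row];
}

/* The variadic macro wraps whatever the caller wrote in a compound literal,
   so positional and .name=value arguments can be mixed freely. */
#define toy_get(...) toy_get_base((toy_get_args){__VA_ARGS__})
```

With a table t in hand, toy_get(&t, 3, .colname="population") and toy_get(.colname="population", .row=3, .data=&t) are the same call, and toy_get(&t) reads row zero because the omitted fields default to zero.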
p 130: the awkward instructions on using apop_dot are now obsolete and incorrect. Notably, when d1 is a vector and d2 a matrix, apop_dot(d1,d2,'t') won't work, because arguments without a name are filled in order and the 't' lands in the first optional slot rather than the one for d2. Instead use apop_dot(d1,d2,.form2='t') or apop_dot(d1,d2,0, 't').
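The filling order can be seen in a toy version of such a function. This is a sketch under assumptions, not apop_dot itself: toy_dot and toy_dot_args are hypothetical, the strings stand in for real data sets, and the field name form1 is a guess made by analogy with the form2 used above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical argument struct; form1 is an assumed name. */
typedef struct {
    const char *d1, *d2;   /* stand-ins for the two data sets */
    char form1, form2;     /* transpose flags for d1 and d2; 0 if unset */
} toy_dot_args;

/* Returns the arguments as received, so the filling order is visible. */
toy_dot_args toy_dot_base(toy_dot_args in) {
    assert(in.d1 != NULL && in.d2 != NULL);
    return in;
}

#define toy_dot(...) toy_dot_base((toy_dot_args){__VA_ARGS__})
```

Here toy_dot("v", "m", 't') sets form1, because unnamed arguments fill the struct in declaration order, while toy_dot("v", "m", .form2='t') and toy_dot("v", "m", 0, 't') both leave form1 at zero and set form2.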
Setting settings
I regret the inconvenient and unpleasant methods of initializing settings groups for the models. The optional argument syntax simplifies the settings setup. Instead of this (taken from lrtest.c, p 352):
Apop_settings_add_group(&apop_ols, apop_mle, &apop_ols);
Apop_settings_add(&apop_ols, apop_mle, starting_pt, unconstr->parameters->vector->data);
Apop_settings_add(&apop_ols, apop_mle, use_score, 0);
Apop_settings_add(&apop_ols, apop_mle, step_size, 1e-3);
use:
Apop_model_add_group(&apop_ols, apop_mle, .parent=&apop_ols,
    .starting_pt=unconstr->parameters->vector->data,
    .use_score=0, .step_size=1e-3);
Options not specified get set to defaults as before.
Pages of data
apop_data sets now have a ->more pointer that can point to a second page of data. That second page has its own ->more pointer, so you can staple as many pages of data onto your main data set as you wish. The intent is not to produce true 3-D data sets, but to include auxiliary information, like covariances or confidence intervals.
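Structurally, a chain of pages is a singly linked list. The sketch below shows the idea with a hypothetical toy_page type and two hypothetical helpers, toy_add_page and toy_get_page. It assumes pages are found by name and is not the apop_data layout itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical page type; a real page would hold a full data set. */
typedef struct toy_page {
    const char *name;        /* page label, e.g. "covariance" */
    double value;            /* stand-in for the page's contents */
    struct toy_page *more;   /* next page, or NULL at the end of the chain */
} toy_page;

/* Staple a page onto the end of the chain that starts at head. */
void toy_add_page(toy_page *head, toy_page *newpage) {
    assert(head != NULL && newpage != NULL);
    while (head->more) head = head->more;
    head->more = newpage;
}

/* Walk the ->more chain; return the first page with a matching name, else NULL. */
toy_page *toy_get_page(toy_page *head, const char *name) {
    assert(name != NULL);
    for (toy_page *p = head; p; p = p->more)
        if (p->name && strcmp(p->name, name) == 0) return p;
    return NULL;
}
```

Because each page carries its own ->more pointer, adding a page never touches the main data, and code that ignores the chain keeps working unchanged.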
This is where the incompatibilities with the book come in. Many of the elements of the apop_model have been moved to pages of the data or parameters. Let m be an apop_model that has been estimated, maybe using maximum likelihood. Then:
- m->llikelihood, the scalar log likelihood given the model, parameters, and data, is now in an info page attached to the parameters. Retrieve using:
apop_data_get(m->parameters, .col=-1, .rowname="log likelihood", .page="info");
- m->status, for those routines that produce a status value, has been moved to the info page as well:
apop_data_get(m->parameters, .col=-1, .rowname="status", .page="info");
- m->covariance is a property of the parameters more than the model (to the extent that such joint lineage can be disentangled):
apop_data *cov = apop_data_get_page(m->parameters, "covariance");
- m->expected_value is now merged with prediction, and is a page stapled to the data set:
apop_data *predict_tab = apop_data_get_page(m->data, "predicted");
gsl_vector *o = Apop_col_t(predict_tab, "observed");
gsl_vector *p = Apop_col_t(predict_tab, "predicted");
gsl_vector *r = Apop_col_t(predict_tab, "residual");
Histograms
I relied on the GSL's histograms for binning, but I now recommend using a simple data set with the weights vector serving as the bins for the histogram. This has the advantages of letting us use existing data set/vector functions (plus a few new PMF-specific functions), accommodating N-dimensional and even text data, and being naturally sparse (because if a bin has no data in it, it's not in the data set). However, GSL histograms will be much faster than this setup for accumulating a not-too-sparse 1- or 2-D histogram. The major changes are in smoothing.c and goodfit.c; when you download the sample code, you will find that those two scripts have been entirely rewritten.
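To make the weights-as-bins idea concrete, here is a minimal sketch of a sparse histogram in plain C. The type toy_pmf and the functions toy_pmf_add and toy_pmf_weight are hypothetical, the fixed capacity is a simplification, and none of this is Apophenia's PMF code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_BINS 64

/* Each row is one distinct observed value; weights[i] counts its occurrences. */
typedef struct {
    int values[TOY_MAX_BINS];
    double weights[TOY_MAX_BINS];
    size_t nbins;
} toy_pmf;

/* Add one observation: bump an existing row's weight, or append a new row. */
void toy_pmf_add(toy_pmf *h, int value) {
    assert(h != NULL);
    for (size_t i = 0; i < h->nbins; i++)
        if (h->values[i] == value) { h->weights[i] += 1; return; }
    assert(h->nbins < TOY_MAX_BINS);
    h->values[h->nbins] = value;
    h->weights[h->nbins] = 1;
    h->nbins++;
}

/* Weight of a value; a value never observed has no row, so its weight is zero. */
double toy_pmf_weight(const toy_pmf *h, int value) {
    assert(h != NULL);
    for (size_t i = 0; i < h->nbins; i++)
        if (h->values[i] == value) return h->weights[i];
    return 0;
}
```

Observing 3, 1000000, and 3 again yields two rows rather than a million bins, which is the sparsity advantage. The linear scan on every insertion also hints at why a fixed-grid histogram accumulates dense 1- or 2-D data faster.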
Removed functions
A few functions have turned out not to be very useful. There is a lot of redundancy between matrices and apop_data sets, with little benefit.
- apop_array_to_data and apop_array_to_matrix, because double**s are uncommon in present-day code.
- apop_db_merge and apop_db_merge_table. Get the code from this page.
- apop_matrix_correlation. Use apop_data_correlation.
- apop_matrix_fill. Use apop_data_fill.
- apop_vector_to_array.
- apop_vector_grid_distance. Use apop_vector_distance(v1, v2, .metric='M').
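As a reminder of what the grid distance computes, here is a sketch on plain arrays, assuming that 'M' selects the Manhattan metric: the sum of absolute coordinate differences. The function toy_manhattan is a hypothetical stand-in, not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Manhattan (grid) distance between two length-n arrays. */
double toy_manhattan(const double *v1, const double *v2, size_t n) {
    assert(v1 != NULL && v2 != NULL);
    double total = 0;
    for (size_t i = 0; i < n; i++) {
        double diff = v1[i] - v2[i];
        total += diff < 0 ? -diff : diff;   /* absolute value without math.h */
    }
    return total;
}
```

For the points (1, 2, 3) and (4, 0, 3), this sums 3 + 2 + 0.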
