### Making a data structure comfortable

09 September 13. [link] PDF version

This is the last entry in a series of posts on the design and use of the
`apop_data` struct, beginning with this entry. Next time, I'll get to
a more statistical application: interrater agreement.

The purpose of this five-entry series was to consider all the little details of what makes a good structure for data, and this final episode covers four additions that make the structure extensible and, well, comfortable.

###### Names

I haven't mentioned column and row names, though they are absolutely
essential for us as humans to understand our data, and are a key difference between
a matrix manipulation library and a data analysis library. But they largely take
care of themselves. When you query from the database, the names get put in place,
and then it's my problem as the author of functions like
`apop_data_transpose`,
`apop_data_cross` and `apop_data_split` to make sure that the names get put in the right place.

The most basic use of names is to use them interchangeably with the numeric index of the row or column:

```
apop_data *d = apop_data_calloc(3, 3); //allocate 3x3 matrix filled with zeros.
apop_data_add_names(d, 'c', "left", "middle", "right");
apop_data_add_names(d, 'r', "top", "middle", "bottom");
apop_data_set(d, 2, 2, 2.2);
apop_data_set(d, .rowname="middle", .colname="right", .val=1.2);
apop_data_set(d, .rowname="top", .col=0, .val=0.0);
//view columns or rows by name:
Apop_data_col_t(d, "right", rightcol);
double rightsum = apop_vector_sum(rightcol);
```

The names are in their own mitochondria-like struct, the `apop_name`, which is rarely
if ever seen outside of `apop_data` structs, but provides its host `apop_data` struct
with much of its power. As with mitochondria, names being in a separate structure was an
evolutionary anomaly, but there isn't much reason to merge them fully into the host structure either.

Also, having a separate struct makes it easier to implement hashes for the text of names, to speed up name lookups. (This is another to-do item for anybody interested in working on Apophenia).

###### Weights

Real-world data sets are often weighted, so our data set will need a weights vector. At my workplace, we have surveys where there are more weights columns than data columns, but those can always be reduced to a single weight column for any given analysis.

So let's add that as another vector to the `apop_data` set.

There used to be separate
`apop_ols` and
`apop_wls` (weighted least squares) models, but there was no need for both: if you
provide weights, `apop_ols` uses them to implement WLS; otherwise it runs as ordinary least squares.

An empirical distribution is one where the probability of an event equals its frequency in
observed data. So we take a data set, add weights, and call it a model. The most common
weighting is where every element has equal weight, so that is what is assumed if
`your_data->weights==NULL`. In either case,
`apop_estimate(your_data, apop_pmf)` outputs a PMF model that appropriately wraps the data. See also
`apop_data_pmf_compress`
which reduces redundant observations to one weighted observation.

###### Pages

At this point, we have rectangular data covered. Apophenia tries not to impose restrictions such as requiring that the vector and matrix have the same number of rows, so you could even have one column of numbers that doesn't quite fit the rectangle.

But some data doesn't fit a single grid at all. Maybe your aggregate model requires some data at the individual level and distinct additional data at the aggregate level. Maybe you have a parameter set that is a vector, a matrix, and a handful of loose scalars.

So the `apop_data` struct includes a pointer to `apop_data`. Suddenly it's a linked list
of data sets. Given two rectangles of data as per the examples above, link one to the
next (via
`apop_data_add_page`) and there you have it: one data set with two distinct
pages that you can send to a custom model written around this data configuration.

The MLE routine will gladly optimize over parameters that span several pages. I can't believe I haven't demo'ed this yet, but this is yet another opportunity to build bigger models by taping together simple models.

Also, there are footnotes. Maybe you have the covariance of the main data set, or a list
of factor conversions, or even text footnotes referenced in the main data (see *NaN
boxing* at this earlier blog post or in
21st Century C). You could
put that in a separate `apop_data` set, but everywhere you send the main data, you have to
send the footnote table; might as well tack it on to the main data set.

I wound up with this rule: if the name of a data set is in `<html-style angle brackets>`, then it's not data *per se* but information about data. Those operations
that act on all pages of a data set (like
`apop_data_pack` and
`apop_data_unpack`,
to de-structure a data set into one long vector) will typically ignore info pages unless
you ask them not to. For example, Apophenia's MLE routine has to use `apop_data_pack`
to de-structure the parameters, because the GSL's optimization routines work on
`gsl_vector`s, so info pages in a parameter set are ignored.

###### The error marker

Oh, and one last thing your math textbook didn't tell you
about: processing errors happen. The `error` element is a single character, and
if it isn't the NUL character, then something went wrong in processing your data set. This makes
for a nice processing flow, because you can directly ask the data set returned
by a function whether the process went OK.

###### Now have some fun

So there's the `apop_data` struct. It started with just a rectangular matrix,
but expanded to be a vector-matrix hybrid, with a text grid, and weights, and names,
and extra pages for additional rectangles of data, and a marker for processing errors.
This isn't Nobel-worthy innovation, but designing a structure with enough hooks so that a
diverse range of data sets can be handled comfortably and reliably is not entirely
trivial, either.

Working with it feels like manipulating building blocks, and it's structured enough
that it's easy (albeit sometimes tedious) to write functions that manipulate those blocks in
predictable ways. I've found working with it to be largely natural.
There are many situations where I'm not doing statistics *per se*, but use Apophenia to
keep tables of data organized.

Scroll down to the 49 functions currently listed in the `apop_data_`… section of
Apophenia's function index to see the sort of
utility functions that I've written around the structure.
Besides the ones that shunt data around, there are also a few, like
`apop_data_summarize`,
`apop_data_correlation`,
or
`apop_data_covariance` that do basic statistical work.

[Previous entry: "Dealing with arrays of text"]

[Next entry: "Interrater agreement"]