The data set: views of a C structure

13 August 13. [link] PDF version

The Apophenia library is built on two structures, one representing data and one representing models. I've talked a lot about the apop_model struct, because it's novel and facilitates novel types of analysis. Conversely, the apop_data struct represents numeric and text observations, which is not especially innovative. Nonetheless, most of what we do is simple data manipulation, not modeling per se, so having a good structure for data matters.

The next several entries offer notes and how-to on the apop_data structure. I will assume that you're interested in the questions about what roadblocks come up in building a structure for data, and may or may not be interested in using Apophenia itself.

Here's the definition of the struct. It has a lot of parts. I'll discuss the rationale and use of all of them over the next several entries.

typedef struct apop_data{
    gsl_vector  *vector;
    gsl_matrix  *matrix;
    apop_name   *names;
    char        ***text;
    size_t      textsize[2];
    gsl_vector  *weights;
    struct apop_data   *more;
    char        error;
} apop_data;

In this episode: the matrix.

Structuring your data

Your data is held in memory somewhere, in some format. You would write it down on paper as a table, with observations along the rows and each column representing some variable (maybe observation ID, height, weight, gender, race…).

A computer doesn't have the luxury of a two-dimensional page. Data addresses are linear, so the machine will have to represent your 2-D structure in a linear form. From this simple problem statement comes a host of little details.

For example, we might take a $3\times 4$ grid and turn it into the sequence

(0,0) (0, 1) (0, 2) (0, 3) (1,0) (1, 1) (1, 2) (1, 3) (2,0) (2, 1) (2, 2) (2, 3)

Already, I've made a few choices. I'm used to C, so I've used offset numbering, where the index represents how far you are from the first element (or row or column). I'm using row-major ordering, meaning that the primary clumps are rows, and any one column of data is split up. Row-major is the C custom, but I find myself applying functions like the log likelihood on a per-observation basis more often than I'm operating on a column in isolation. Not that it really matters much.

A set of elements is regularly spaced if the distance between item $i$ and item $i+1$ is always fixed. Thus, the distance from element (0, 1) to the next in the row, (0, 2), is the same distance between any other row-neighbors, such as (1, 2) and (1, 3). The columns are also regularly spaced: the distance from (0, 1) to (1, 1) is four steps—the size of one row—and this is the same for any pair of column-neighbors. You'll find authors who refer to this jump as the stride of the matrix.

A fixed-width jump is among the fastest operations a computer can do. Your computer hardware is optimized to deal with it: after you fetch addresses 600, 602, and 604, a modern processor will pick up the pattern and preemptively fetch what's at 606 before your program even asks for it. In typical cases, it doesn't matter whether the distance is one unit as when traversing a row or a larger stride as when traversing a column.

Rare is the data processing algorithm that doesn't have to fetch every element of a matrix, so these little speedups on fetching one record scale in proportion to the size of the data set.

Other systems do it differently, throwing out some form of regular spacing in exchange for some other benefit. R data frames bundle by columns. Each column may have a different type, but a single column is (typically) a regularly-spaced vector of data. SAS and databases bundle by observations (rows). Each observation is a sequential blob of data of different types. In C-type languages, we could implement this via a struct representing one observation and then building an array or tree of such structs.

The GNU Scientific Library implements matrices in the form I described above, with a data segment that is a simple row-major list of elements and some metadata indicating the row count, column count, starting point, and stride.


Say that we have a matrix with this metadata:

If you need to find element (2, 3), then this is enough information that we can start at (0, 0), then step forward 2*stride + 3 units. If you try to step more than three rows or four columns, then the system knows to throw an error instead of overreaching. So this metadata is sufficient to turn a pointer to a sequence of elements into a real matrix.

Now we want the column vector that is column 1 in the data. You could build that struct by changing only the metadata:

Notice how the stride is the size of a row in the original data set, even though a `row' in a column vector is one element wide. But the math is the same: the element at position (2, 0) in this column vector is at position (starting point) + 2*stride + 0.

I leave as a simple exercise to the reader how to build any other regular shape, like a two-by-two square starting at (1,1), or only row 3.

So you get views of columns, rows, or other regular shapes by simply annotating the data accordingly. What you don't get are irregularly-spaced elements. If you want columns 1, 2, 4, and 8, then you'll need more metadata than we have here; the typical solution is at this point to give up on the elegance of rewriting the metadata for the same raw data and start copying columns into a new structure. If you've been reading along to this point, you can see how this has shaped the recommendation that users build data sets in SQL and then write them once in the correct form to a data set, because options for subsetting after writing to a regular matrix aren't as nice (but Apophenia does provide several, all of which involve data copying).

By the way, it looks like somebody posted some slides about how this is also how NumPy arrays work.


This is all about vectors and matrices from the GNU Scientific Library, but I'm going to use Apophenia, for exactly the reason I wrote Apophenia: it provides a nicer UI for the GSL.

For example, the raw GSL provides several functions to slice a GSL matrix, to pull a column or row, or subset a GSL vector. Apophenia has some convenient wrapper macros for many of these functions. E.g., to get the covariance of columns 2 and 3:

gsl_matrix *m = [read in data here];
Apop_matrix_col(m, 2, v1);
Apop_matrix_col(m, 3, v2);
apop_vector_cov(v1, v2);

After being declared within the macro, v1 and v2 correctly behave as vectors. They are pointers to structs on the stack, so they disappear at the end of scope, and they are metadata re-wrappings of the same data as the original, so changes in the vector view affect the base data.

Here's a full example. This time, the key interesting output will be a principal component analysis of the Amash vote data which first appeared several episodes ago. Roughly, a PCA reveals in which column the greatest variation in the data lies. If there's a strong correlation between two columns, then that variation may be along a linear combination of those columns. The example uses Apop_col_v (like Apop_matrix_col but takes in an Apop_data set) to get only the first eigenvector from the matrix of eigenvectors.

#include <apop.h>

int main(){
    apop_text_to_db("amash_vote_analysis.csv", .tabname="amash");
    apop_data *d = apop_query_to_data("select "
            "case when vote='Aye' then 0 else 1 end as vote, "
            "ideology, log(contribs+10) from amash");


    apop_data *eigens = apop_matrix_pca(d->matrix);


    printf("\nPrincipal eigenvector:\n");
    Apop_col_v(eigens, 0, principal);

The query describes what the three columns of data are. I set Aye=0 for consistency with prior analyses; vote=1 indicates a pro-NSA wiretap leaning. I took the log of contributions, for reasons that will become evident below. The Logit analysis from a few episodes ago is also very different when we replace contribs with log(contribs+10). Instead of adding 10 to dodge log(0) errors, you could also insert a where contribs > 0 clause to the query.

The program runs apop_data_correlation, then apop_matrix_pca, which puts the eigenvalues in the output data set's vector, and the eigenvectors in the columns of the output data set's matrix.

Even with the where contribs>0 clause, the first eigenvector is the lion's share of the variation in the data, and that eigenvector consists mostly of the log(contribs) column. As you can imagine, without logs contributions represent $\sim$100.00% of the variation in the data.

If you run this, you'll see the immense benefit of fast addressing: even on slow hardware, this runs in a fraction of a second, whereas calculating the PCA by hand can take literally hours.

[Previous entry: "Another C macro trick: parenthesized arguments"]
[Next entry: "De-crosstabbing"]