Raking to complete missing data
08 December 12. [link] PDF version
In this episode, we start with a data set with some complete observations and some observations with certain dimensions missing, and generate a PMF for the values of the complete data that accommodates all of the information in the partially-observed observations.
Further, we're going to do this using the same off-the-shelf raking routine as used in the last few episodes to adjust one data set toward another and to generate synthetic data.
Last timeLast time, when I showed you how we can synthesize data via raking, I presented a little trick to specify the table of margins, by including margins with missing values (represented by the IEEE NaN marker). If I have this data--
X | Y | Z | count |
NaN | 2 | 3 | 10 |
1 | 2 | 3 | 10 |
1 | NaN | 3 | 10 |
--then the (Y, Z)
This time, we'll read the NaNs as missing data. Having NaNs mixed in all over like
this is formally known as a non-monotone missingness pattern, but I prefer the
term Swiss cheese data. Simply summing the observations to margins
produces a set of margins that includes the observed information, including what
information is embodied in partially observed data. But we don't just want margins, we
want the probabilities for specific cells.
We can use the complete data as a starting point, and rake to find the closest data set
consistent with the complete information in the margins. The final result can be proven
to have some desirable properties; see Little & Rubin for details
(Amazon page).
They view the problem as a Bayesian updating problem, using the
information in the partially-observed observations to update the distribution of the
fully-observed cells, and they arrive at the same raking algorithm here. Note,
by the way, that their text uses only individual contrasts (in the notation here,
.contrasts=(char *[]){"X", "Y", "Z"}) but nothing keeps us from using
higher margins as well, like .contrasts=(char *[]){"X|Y", "X|Z", "X|Z"}. Doing so
would allow us to retain more cross-dimension information than their treatment.
Here's the cut-n-pasteable sample code. It's longer than before (and more
Apophenia-specific) because I added a little function to make draws from the
resulting completed table. That function estimates a model from the data set, via
apop_estimate(d, apop_pmf), then draws rows of data from the PMF. The rows
get written to a new data set that the function allocates and fills.
Also, there's a weighting step, currently commented out, which I'll discuss below.
One detail about the setup above: the totals might not be right. Say that our contrasts
were Z
[Previous entry: "Synthesis via raking"]
apop_text_to_db -O -d="|" '-' cheese nans.db <<"----------"
row | col | weight
1 | 1 | 5
1 | 2 | 5
2 | 1 | 5
2 | 2 | 5
1 | nan | 5
2 | nan | 15
nan | 1 | 15
nan | 2 | 5
----------
cat <<"----------" > rake.c
#include <apop.h>
apop_data* make_draws(apop_data *d, int count, gsl_rng *r){
apop_data *draws=apop_data_alloc(count, d->matrix->size2);
apop_model *m = apop_estimate(d, apop_pmf);
for (int i=0; i< count; i++){
Apop_row(draws, i, onerow);
apop_draw(onerow->data, r, m);
}
apop_data_show(draws);
return draws;
}
int main(){
apop_db_open("nans.db");
apop_table_exists("init", 'd');
apop_query("create table init as select * from cheese "
"where row + col is not null");
apop_data *filledin= apop_rake(.margin_table="cheese", .count_col="weight",
.contrasts=(char*[]){"row", "col"}, .contrast_ct=2,
.init_table="init", .init_count_col="weight",);
//gsl_vector_scale(filledin->weights, 1./apop_vector_sum(filledin->weights));
apop_data_show(filledin);
gsl_rng *r = apop_rng_alloc(342134);
make_draws(filledin, 20, r);
}
----------
export CFLAGS="-g -Wall -O3 `pkg-config --cflags apophenia`"
export LDLIBS="`pkg-config --libs apophenia`" CC=c99
make -s rake && ./rake
[Next entry: "Raking with structural zeros"]