Emulating others and fat-tailed distributions
I finally posted my paper on leptokurtic (fat-tailed) outcomes to arXiv. I started writing it as part of my dissertation, so it took over a decade to put out. I'm still not entirely happy with it--it's two subpapers that don't meet as well as I want them to:
- The first subpaper gathers the literature in finance about how returns are leptokurtic and that people in finance have real incentives to emulate each other.
- The second subpaper writes down a simple model where agents get direct benefit from emulating each other--their utility function is increasing in a private valuation of an action and in the percentage of others who are acting--and finds that leptokurtic distributions result.
I've always thought that emulation is such a blatantly clear facet of human behavior, and was struck by how the economics literature largely ignores it. Your typical econ textbook starts off by saying that agents have a utility function which we will take as the given starting point, making as few assumptions as possible about what's in that function. If people have a strong preference for giving away their money, the textbook advises that our only job as economists is to model the implications of that preference. But then we go to the journals, and we find a rather narrow range of utility functions assumed. They have to be out there somewhere, but I never found a paper in a mainstream econ journal that was willing to assume that agents get direct benefit from emulating others.
Nonetheless, emulative behavior is fundamental to being human: mothers emulate their babies, who emulate their mothers; con men and magicians devise tricks that depend on how it can sometimes take effort to not emulate another person; cliques of people who grow to behave similarly are abundant, and not just in high school; people pay good money for fashion magazines that instruct the reader on how to dress like everybody else--even what underwear is fashionable.
We can sometimes explain away certain aspects of emulative behavior as really being
something more direct. For example, the actions of others imparts information, and gathering
information is expensive. So if somebody else buys product A
Depending on the assumptions you want to make, the outcomes from the
emulation-for-information or emulation-for-network story can be equivalent to the
outcomes from the direct emulation story. The only difference is that those stories are
special cases, and so applicable to fewer situations.
The broad intended lesson of my arXiv paper is that when you see a fat-tailed distribution, or a
widening gap between the winners and losers, then a social emulation story should pop
into your mind. Emulation causes a bit more push to the extreme outcomes--if, without
any emulation, private tastes would lead to 65% taking some action, we can expect that if
people have some slight preference to emulate the majority, then the same private tastes
would lead to maybe 67% taking the action. Apply this to every level of action, and
you've transformed your private-tastes distribution to have fatter tails.
There are some obvious cases where the herd-following story already comes up, especially in
the case of blockbusters and flops, including movies (see the paper by de Vaney and
Walls, cited in my arXiv paper) and cities (which are pretty obviously a network). But
there are more subtle cases, and cases where a lot of parties want to pretend that
emulation doesn't happen. The stock market is such a case, because none of the heroes in
any Ayn Rand novels had any interest in emulating others. As I mention in the paper's
conclusion, Congressional roll call votes provide another such case.
Further, here in the real world, equity traders and Congressfolk do have their
own opinions, and we can't expect a lot of their positions to be based on pure
emulation. Nor does the data demonstrate a bifurcated blockbuster/flop dichotomy
in equity returns or vote counts. If we had a simple Central Limit Theorem at play,
and people independently took positions, we'd expect a Normal distribution of returns
or vote counts. Instead, we see distributions that are bell curves with fatter tails
than the Normal. The paper demonstrates that we can explain this spreading-out via an
emulative component to the decision process.
To give another example, I just ran across an article that finds a blockbuster-flop
pattern in citations. It tries to blame it on an
information overload story, but it's also pretty easy to explain via emulative behavior,
either due to information gathering or simple fashion.
I found the paper via social means, from the
blog
of someone who was once a fellow student. He adds a useful point about Gini
coefficients (though applying his analysis to the explosion of publications seems to require
that we assume that as publication became cheaper, the additional papers that appear are
lower-quality and less citation-worthy).
Alexander Schuessler, in his Logic of Expressive Choice
(BUY!),
is my favorite work that models humans as expressing identity by joining groups and
emulating others. His title is an homage to Mancur Olson's Logic of
Collective Action, which bangs its head against the wall for the length of a book
asking why it is that people exert effort to work for politically-oriented groups when
the Prisoner's Dilemma predicts that they'd leave the work to others. Schuessler's
response is that if you are willing to allow the possibility that people have some
desire to be part of a social group, then there's not much of a paradox left.
There can be other explanations, and the fact that I've constructed a model which assumes
emulative behavior and ends in a leptokurtic outcome doesn't prove that (1) emulative
behavior must lead to leptokurtic outcomes or that (2) leptokurtic outcomes are always caused
by emulative behavior. But it does demonstrate that the causal story that emulation
⇒
The other thing I hope readers get out of the paper is that agent-based models can be
used to generate distributions, which we can then do statistics on as with any other
distribution. I'm working on a paper that goes into immense detail on this point,
and I'll try to post at least one example here over the next month.
[link][no comments]
My coworkers
I work with lots of statisticians. Here are a few, all of whom I recommend following:
Jerzy Wieczorek: Jerzy's blog is a pretty generalist statistics blog. He gets around the city, and often writes about meetups and other such events--he's how I found out about the recent data drives at the World Bank. He's also a twitter, whom you can follow @civilstat. I'm @b__k, btw.
Chris Blakely: he does time series, which is a somewhat foreign world to me. The methods are more oriented toward Fourier decompositions than fitting the parameters of bell curves. Chris has put a lot of work into writing new computing tools for the application of time series methods, which he applies to financial series. Even his diagrams are pretty awesome.
Josh Tokle: he is more a mathematician and computer geek than
a statistician. You can tell because he wrote the blog in a blogging platform you've never
heard of, though his first few posts have been on statistical topics. His diagrams are a
awesome in a diametrically different way from Chris's.
[link][no comments]
Frequentist, Bayesian, and why I don't care
Really intelligent people have made attempts to formalize the intuitive definition of odds/chance-of/probability/likelihood for a few centuries now. We have no new data: what real-world evidence on the link between our intuition about odds and the real world do we have that Pascal didn't have in the 1600s? The entropy angle is interesting, because we've only had the concept for, I dunno, 150 years, and information entropy for not even a century.
When there's a major problem that thousands of people possessing full information have written about in great detail, and it's still unresolved, that's when we conclude that it's a question of philosophy. It can be resolved in the sense that philosophical questions can be resolved, wherein distinct parties come up with internally consistent and personally satisfactory storylines and there is no objective test to determine which storyline is objectively correct.
Articles that talk about Bayesians and Frequentists often remind me of those comedians with bits about how Black people, they do it like this, .... But then White people, they do it like this: .... I end up wondering if the people being made fun of really exist here in the real world.
A lot of attacks on Frequentists are really attacks on oversimplified undergrad stats textbooks, or seat-of-the-pants statisticians who do things that seem to work but have no serious theoretical basis. Here's a review of a book by Nate Silver, written by a non-statistician, who got the impression from Silver's writing that Frequentists do nothing but produce inferior research. Using his commonsense prior knowledge, he concludes that Silver is beating up a straw man that is unlikely to be representative of real Frequentists.
Characterizing a class of people based on the subset that is least paying attention is fun for the feelings of superiority, but is otherwise a waste of time. Here's an essay (PDF) where Jayne says that he's done with Frequentists, pointing to certain people who had trouble understanding how likelihood functions relate to states of nature and calling them “handicapped”. Some day, Bayesian methods will be common enough that there are oversimplifying textbooks that provide recipes for users to misapply, and software that advertises itself as being so easy that even an undergrad who knows nothing of statistics can get Bayesianly correct estimates without knowing a lick of statistics. At which point we'll have essays about how Bayesians don't have a deep understanding of statistics.
On models
A model intermediates between data and parameters. We typically express a model in terms
of a probability/likelihood function, which I will write as
P(d, β)
At this point, you might want to go and read this entry on probability
versus likelihood, which explains that the function
P(d, β)
Having gotten the caveats about stereotyping out of the way, my working definition is
that Bayesians see
P(d, β)
Isn't the distinction between Bayesians and everybody else right in the title--that one
team uses Bayes's rule and everybody else doesn't? Bayes's rule has a one-line proof,
and your typical stats/probability textbook covers it by page twenty. To not
believe in Bayes's rule is equivalent to not believing basic standard probability theory.
Models that don't explicitly use Bayes's rule at some point over the course of
their typical description can frequently still be restated to make the underlying rule
more apparent; explicit use of Bayes's rule is often computationally expensive,
so Bayesian models are frequently reshaped to a version with no mention of Bayes's rule.
I found this article by David
Draper
(PDF, 95 slides) to be a nice summary of the descriptive-side situation. The first
25pp give an overview of the efforts to formalize the problem of defining a model
and separating the objective from the subjective, listing the many conditions that
are required to really make for a consistent theory. It points out that Bayes's rule
is essential to consistency, but then turns around and seems to imply that only models that
explicitly use Bayes's rule are going to be consistent, which is clearly not true (and I'm
guessing/hoping that DD would acknowledge this).
Grammar tip: possessives in English end in an apostrophe-s combination. As
an exception, this 's is omitted for plurals ending in s, such as the pedestrians' crossing. The exception about plurals does not apply to words that
are not plural but happen to end in s. See rule #1 in Strunk and White's Elements of
Style
for details.
We all want our models to lean to the objective side; how do we make that happen?
Here we get to the rift between the different schools. The Frequentists lean heavily
on the CLT and other such objectively-proven and subjectively-applied theorems.
Sometimes the CLT is preeminently defensible (as alluded above, typically back in the world of controlled
experiments on non-living things); sometimes it's an irritating refusal to really accept
the ambiguity of the situation.
Central limit theorems, when applicable, work. If elements really have a fixed
probability of leaving the set at any given time, then lifespans really do follow a
Poisson distribution. Inquiries in the physical sciences typically have a derivable
distribution like these built in to the controlled experiment, while in the social
sciences we typically have no idea what we're doing, and can't control anything in
the data. In both the physical and social sciences, the model is a human-imposed
structure on the data, but we can already see that in some situations people will be
more inclined to label the final statements about parameters as objective and in some
will be inclined to call them subjective. The reader will note that Bayesians tend
to hang out in the social sciences.
The Bayesians point out that there is still subjective information to be had, and maybe
we can codify it all into a prior distribution. OK, great, you've concentrated all your
subjectivity into one place. This forces a certain format on the model, and makes you
no more or less objective than you were before, but does make it easy to incorporate
correctly-codified additional information.
The typical Bayesian setup involves using a prior distribution, which is allegedly
subjective auxiliary information, and a likelihood distribution, which is often
described as the real model. This core-plus-auxiliary terminology creates a lot of
trouble, and life is easier when we take the entire pipeline, prior plus likelihood,
as the unit that we call a model. This is how Apophenia does it: the apop_update function returns a single model estimated using the data and setup you
provided. Now we're back where we were with the Frequentists: a single model that
expresses what information we have and what beliefs we have about the structure of the
model, including some portion of its inner workings that is clearly subjective and some
portion that is more clearly objective. Any claim that we could somehow componentize
the objective and subjective is chimerical anyway.
There are people who take the principle of maximum entropy to be an objective rule
about the universe, which has been observed to hold many times over. I have a coworker
who gets really pissed off when anybody mentions the principle of maximum entropy,
and thinks of any work using it as subjective bunk. The maxEntropists are making
another attempt to confront the problem that we have too little information and have
to fill it in with some sort of assumption, and having used a different principle,
wound up with a different final structure to the model.
The descriptive models of the different schools embody subjective information in
different ways. But in the end, they all have the same form of an assumed structure
P(d, β)
To this point, I've been talking about descriptive statistics: the art of combining
objective facts and subjective modeling decisions into a single final model, then
estimating its parameters from the data. Bayesians, Frequentists, maxEntropists
tend toward different models, that each feels is on the better side of the
objective-subjective scale. Then we have the inferential step, where we take the model's
final distribution describing the parameters,
P(β| d )
I do like to make one quibble about language: when an author says something like `given the
data, we fail to reject the null hypothesis with 91% certainty', a key element is left
out; this should read `given the data and the model, we fail to reject the null
hypothesis with 91% certainty'. We developed a model, and every statement we make from
that point on will rely on that model and its subjective and objective foundations.
Careful authors tend to get this right, undergraduate stats textbooks never do.
Quibbling after that point is largely a restatement of the same philosophy of science
question from the descriptive step: the probability distribution of a parameter is
fundamentally unobservable in most of the situations we're dealing with, so how do we
express that we've imposed a model and are stating probabilities using this invention
of ours, and how do those invented probabilities relate to reality?
Summary sentence: people have trouble dealing with things that are not easily
categorized, and
P(d, β)
[link][no comments]
Defining the model
The inferential step
In memory and on-disk databases for SQLite
This is mostly a housekeeping entry on Apophenia, but I'll start with a useful trick for speeding up dealings with SQLite. Next time, some statistical epistemology.
SQLite is exactly as much as you need to produce and store data tables and deal with them using SQL. It stores the database in a single file on disk or in memory. Naturally, the in-memory database will be faster, though as soon as your program exits, your database is wiped.
Here's how I use SQLite, where possible: I open a database in memory, and use that as the primary database. To get to the data on my on-disk database, I use SQL's attach command. So I'm opening the base database using the C-side databse opening command, and then attaching an auxiliary database on the SQL side. This is why Apophenia has only one global database handle, and has never seriously needed more.
There may be scale issues, but they're easily manageable, because we generally have a reasonable handle on how big any given table is going to be, and we can create scratch tables guaranteed to be much less than a gigabyte in memory, and potentially larger tables with a name like diskdb.bigtable, where diskdb is the attached on-disk database.
If you want more exposition, I detail this sort of thing in both Modeling with Data and 21st Century C (one of the few overlaps of the two books).
Some functions for merging
If you need a (not-too-large) table from the disk, you can copy it to the database without much effort; if you have a table in memory that needs to be saved for posterity, you can copy it to the disk. E.g.
create table diskdb.results as select * from results
I pictured the workflow wherein the user opens an in-memory database, copies up the entire on-disk database, and then gets to work, as being much more useful than it turned out to be. So I wrote a few functions to make it happen, and put them into Apophenia in 2005 (whoa--eight years ago). They just sat around, without serious testing or use, to the best of my knowledge.
So I trimmed them from Apophenia. If you still find them to be useful, here they are. The documentation is in Doxygen format. Apophenia has its own means of implementing C-standard variable argument lists, so I removed that part, leaving fixed-argument functions.
/** Merge a single table from a database on the hard drive with the database currently open.
\param db_file The name of a file on disk.
\param tabname The name of the table in that database to be merged in.
\param inout Do we copy data in to the currently-open main db [\c 'i'] or out to the specified auxiliary db[\c 'o']?
If the table exists in the new database but not in the currently open one, then it
is simply copied over. If there is a table with the same name in the currently open
database, then the data from the new table is inserted into the main database's table
with the same name. [The function just calls <tt>insert into main.tab select * from
merge_me.tab</tt>.]
*/
void apop_db_merge_table(char *db_file, char *tabname, char inout){
Apop_assert_n(tabname, "I need a non-NULL tabname");
char maine[] = "main";
char merge_me[] = "merge_me";
char *from = inout == 'i' ? merge_me : maine;
char *to = inout == 'i' ? maine : merge_me ;
if (db_file !=NULL)
apop_query("attach database \"%s\" as merge_me;", db_file);
int d = apop_query_to_float("select count(*) from %s.sqlite_master where name == \"%s\";", to, tabname);
if (!d){ //just import table
Apop_notify(2, "adding in %s", tabname);
apop_query("create table %s.%s as select * from %s.%s;", to, tabname, from, tabname);
}
else { //merge tables.
Apop_notify(2, "merging in %s", tabname);
apop_query("insert into %s.%s select * from %s.%s;", to, tabname, from, tabname);
}
if (db_file !=NULL)
apop_query("detach database merge_me;");
}
/** Merge a database on the hard drive with the database currently open.
\param db_file The name of a file on disk.
\param inout Do we copy data in to the currently-open main db [\c 'i'] or out to the specified auxiliary db[\c 'o']?
If a table exists in the new database but not in the currently open one, then it is
simply copied over. If there are tables with the same name in both databases, then
the data from the new table is inserted into the main database's table with the same
name. [The function just calls <tt>insert into main.tab select * from merge_me.tab</tt>.]
\li This is SQLite-only; I'm not sure if it really makes much sense for mySQL.
*/
void apop_db_merge(char *db_file, char inout){
apop_data *tab_list;
apop_query("attach database \"%s\" as filedb;", db_file);
tab_list= apop_query_to_text("select name from %s.sqlite_master where type==\"table\";"
, inout == 'i' ? "filedb" : "main" );
if(!tab_list) return; //No tables to merge.
for(int i=0; i< tab_list->textsize[0]; i++)
apop_db_merge_table(db_file, (tab_list->text)[i][0], inout);
apop_query("detach database filedb;");
apop_data_free(tab_list);
}
For bonus points, here is a command-line program making use of the above.
/** \file cmd_apop_merge_dbs.c A little command-line utility to merge
two databases. Try <tt>apop_merge_dbs -h</tt> for help.*/
/* Copyright (c) 2005--2007, 2009 by Ben Klemens. Licensed under the modified GNU GPL v2; see COPYING and COPYING2. */
#include <apop.h>
#include <unistd.h>
int main(int argc, char **argv){
int merge_ct = 0;
char c, msg[1000], **merges = NULL;
sprintf(msg, "%s [-v] [-t table_name] [-t next_tabname] main_db.db db_to_merge_into_main.db\n"
" -t\ttable to merge. If none, do all in the source db. Use as many as you'd like.\n"
" -v\tverbose\n", argv[0]);
if(argc<3){
printf("%s", msg);
return 1;
}
while ((c = getopt (argc, argv, "vht:")) != -1){
switch (c){
case 'v':
apop_opts.verbose ++;
break;
case 'h':
printf("%s", msg);
return 1;
case 't':
merges = realloc(merges, sizeof(char*)*merge_ct++);
merges[merge_ct-1] = optarg;
}
}
apop_db_open(argv[optind]);
if (merge_ct)
for (int i=0; i< merge_ct; i++)
apop_db_merge_table(argv[optind +1], merges[i]);
else
apop_db_merge(argv[optind +1]);
apop_db_close('n');
}
[link][no comments]
Raking with structural zeros
It's popular to set certain cells to zero, thus producing a not-quite-rectangular grid of values. The popular example around my workplace is that people under 15 can't be married, so we want our grid of options to exclude those cells for (14, married), (14, divorced), ... (0, married), (0, divorced), but to be otherwise square.
If you have a row in the table of margins that doesn't meet the constraint, then it gets thrown out as bad data, by the way. I'm not sure if there's a better way to deal with such things; feel free to leave your suggestions in the comments.
Here's an example, using almost the same synthesis problem as before, but w/o the NaN trick this time, and with extra data at (3, 1) to keep things interesting.
Apophenia uses a lot of SQL on the back end, so it's natural to express the structural zeros via SQL. In this case, we have only one structural zero, at row=1 and col=1.
apop_text_to_db -O -d="|" '-' margins sample.db <<"----------"
row | col | weight
1 | 1 | 2.5
1 | 2 | 2.5
2 | 1 | 7.5
3 | 1 | 7.5
2 | 2 | 7.5
----------
cat <<"----------" > rake.c
#include <apop.h>
int main(){
apop_db_open("sample.db");
apop_data_show(
apop_rake(.margin_table="margins", .count_col="weight",
.contrasts=(char*[]){"row", "col"}, .contrast_ct=2,
.structural_zeros="row=1 and col=1"
));
}
----------
export CFLAGS="-g -Wall -O3 `pkg-config --cflags apophenia`"
export LDLIBS="`pkg-config --libs apophenia`" CC=c99
make rake
./rake
[link][no comments]
Raking to complete missing data
In this episode, we start with a data set with some complete observations and some observations with certain dimensions missing, and generate a PMF for the values of the complete data that accommodates all of the information in the partially-observed observations.
Further, we're going to do this using the same off-the-shelf raking routine as used in the last few episodes to adjust one data set toward another and to generate synthetic data.
Last timeLast time, when I showed you how we can synthesize data via raking, I presented a little trick to specify the table of margins, by including margins with missing values (represented by the IEEE NaN marker). If I have this data--
| X | Y | Z | count |
| NaN | 2 | 3 | 10 |
| 1 | 2 | 3 | 10 |
| 1 | NaN | 3 | 10 |
--then the (Y, Z)
This time, we'll read the NaNs as missing data. Having NaNs mixed in all over like
this is formally known as a non-monotone missingness pattern, but I prefer the
term Swiss cheese data. Simply summing the observations to margins
produces a set of margins that includes the observed information, including what
information is embodied in partially observed data. But we don't just want margins, we
want the probabilities for specific cells.
We can use the complete data as a starting point, and rake to find the closest data set
consistent with the complete information in the margins. The final result can be proven
to have some desirable properties; see Little & Rubin for details
(Amazon page).
They view the problem as a Bayesian updating problem, using the
information in the partially-observed observations to update the distribution of the
fully-observed cells, and they arrive at the same raking algorithm here. Note,
by the way, that their text uses only individual contrasts (in the notation here,
.contrasts=(char *[]){"X", "Y", "Z"}) but nothing keeps us from using
higher margins as well, like .contrasts=(char *[]){"X|Y", "X|Z", "X|Z"}. Doing so
would allow us to retain more cross-dimension information than their treatment.
Here's the cut-n-pasteable sample code. It's longer than before (and more
Apophenia-specific) because I added a little function to make draws from the
resulting completed table. That function estimates a model from the data set, via
apop_estimate(d, apop_pmf), then draws rows of data from the PMF. The rows
get written to a new data set that the function allocates and fills.
Also, there's a weighting step, currently commented out, which I'll discuss below.
One detail about the setup above: the totals might not be right. Say that our contrasts
were Z
[link][no comments]
apop_text_to_db -O -d="|" '-' cheese nans.db <<"----------"
row | col | weight
1 | 1 | 5
1 | 2 | 5
2 | 1 | 5
2 | 2 | 5
1 | nan | 5
2 | nan | 15
nan | 1 | 15
nan | 2 | 5
----------
cat <<"----------" > rake.c
#include <apop.h>
apop_data* make_draws(apop_data *d, int count, gsl_rng *r){
apop_data *draws=apop_data_alloc(count, d->matrix->size2);
apop_model *m = apop_estimate(d, apop_pmf);
for (int i=0; i< count; i++){
Apop_row(draws, i, onerow);
apop_draw(onerow->data, r, m);
}
apop_data_show(draws);
return draws;
}
int main(){
apop_db_open("nans.db");
apop_table_exists("init", 'd');
apop_query("create table init as select * from cheese "
"where row + col is not null");
apop_data *filledin= apop_rake(.margin_table="cheese", .count_col="weight",
.contrasts=(char*[]){"row", "col"}, .contrast_ct=2,
.init_table="init", .init_count_col="weight",);
//gsl_vector_scale(filledin->weights, 1./apop_vector_sum(filledin->weights));
apop_data_show(filledin);
gsl_rng *r = apop_rng_alloc(342134);
make_draws(filledin, 20, r);
}
----------
export CFLAGS="-g -Wall -O3 `pkg-config --cflags apophenia`"
export LDLIBS="`pkg-config --libs apophenia`" CC=c99
make -s rake && ./rake
Synthesis via raking
I'd promised you some raking for missing data, but I'm putting that off for one more episode.
Last time ILast time showed you a table--only the margins mattered, so I'll just print those:
| - | - | 5 |
| - | - | 15 |
| 10 | 10 |
Apophenia's raking setup doesn't take in pure margins like this; instead it wants a data set giving the value for eack cell in the table, that it will then sum into margins. So what should we fill in for the dashes? We just want something that matches the margins, and the exact values don't matter. If (1, 1, 1, 1) somehow fit, we'd run with it.
Of course, that question is exactly the raking question all over again: what is the closest table to the all-ones table that matches the given margins?
For today's sample code, let's make it happen. But before presenting the code, let me point
out a trick I used to express the margins. Let us say that we have these three cells in the
table:
I need the NaN trick because my premise is that we don't have an already-extant data set that
would produce the margin table. IRL, we typically have such a data set, so synthesis
consists of finding the closest data set to the all-ones set that fits the original
margins as written. At the end of the procedure, you have data that encapsulates the
info from the margins of the margin table, but didn't use the individual cell values
at all--synthetic data.
Here's the sample code for the NaN-using scenario, which should be reasonably
legible. Again, if you have the requisites installed, you can cut/paste it to the
command line and watch it go.
So far, I've been showing you only two-dimensional data, because that's what's easy
to display on a static screen. Everything has obvious extension to margins in three,
four, or twenty dimensions.
There are two intersting consequences to multiple dimensions, though. The first, which you
shouldn't have to care about, is that high-dimensional matrices tend to be very sparse, so
processing requires some care. Last time, I mentioned a table with 1.9 billion cells, of
which 3 million were nonempty; you want a procedure that can work with only the 3 million cells.
The other is that we might want to fix to the margins not only single dimensions
like rows and then columns, but higher-dimensional combinations of elements, such as
specifying that all X, Z
If you are holding fixed a higher-dimensional slices, like X| Z
The people writing regression-type models like this because it makes sense that if you are
regressing on the `interaction' of X
Using Apophenia's syntax for this stuff (which is pretty similar to others, but for the
C's compound literal (char*[]) cruft), we might have a call to the raking function like:
This is going to generate synthetic data with the right tract|
At the extreme, if we have a 3-D table, and hold fixed the contrast X| Y| Z
Next time: structural zeros, and the promised missing data imputation.
[link][no comments]
(1, 1) = 0
(1, 2) = 0
(1, NaN) = 10
Then the sum for row 1
apop_text_to_db -O -d="|" '-' margins na.db <<"----------"
row | col | weight
1 | 1 | 0
1 | 2 | 0
2 | 1 | 0
2 | 2 | 0
1 | nan | 5
2 | nan | 15
nan | 1 | 10
nan | 2 | 10
----------
cat <<"----------" > rake.c
#include <apop.h>
int main(){
apop_db_open("na.db");
apop_query("create table init as select row, col, 1 as val "
"from margins where row is not null and col is not null");
apop_data_show(
apop_rake(.margin_table="margins", .count_col="weight",
.contrasts=(char*[]){"row", "col"}, .contrast_ct=2,
.init_table="init", .init_count_col="val",)
);
}
----------
export CFLAGS="-g -Wall -O3 `pkg-config --cflags apophenia`"
export CC=c99 LDLIBS="`pkg-config --libs apophenia`"
make rake
./rake
Many dimensions
apop_rake(.margin_table="controltab",
.constrasts=(char*[]){"tract|age|sex", "race|ethnicity|tract"},
.contrast_ct=2);
