Patterns in static

Apophenia

An outline of the library

Expand all | Collapse all

pip Overview and Preliminaries

As well as the information in this outline, there is a separate page covering the details of setting up a computing environment and another page with some sample code for your perusal.

pip A quick overview

A typical data analysis project using Apophenia's tools would go something like this:

  • Read the raw data into the database using apop_text_to_db.
  • Use SQL queries handled by apop_query to massage the data as needed.
  • Use apop_query_to_data to pull the data into an in-memory apop_data set.
  • Call a model estimation such as
     apop_estimate (data_set, apop_OLS)
    
    or to fit parameters to the data. This will return an apop_model like the original, but with parameter estimates.
  • Interrogate the returned estimate, by dumping it to the screen with apop_model_show, sending its parameters and variance-covariance matrices to a test, et cetera.

Over the course of this, our researcher had to parse text, use a database, manipulate matrices, fit statistical models, and then test the results by looking up a summary statistic in a table of distributions. Having hooks to all of these things in one library makes the flow much more pleasant.

For a concrete example of folding Apophenia into a project, have a look at this sample program.

Of course, you may not need all this: you might just want a neat interface to SQLite, or an easy-to-use structure that holds your data in a matrix and allows you to label the rows and columns. Have a look at the library outline for the various categories of relatively standalone functions that you can fold into your own work.

If you would like a more in-depth discussion of Apophenia's raison d'etre and logic, have a look at these design notes.

pip The typical stats package workflow v Apophenia's workflow

From the quick overview section above, you saw that (1) the library expects you to keep your data in a database, pulling out the data as needed, and (2) that a great deal of statistical information, and the workflow, is encapsulated in apop_model structures.

Starting with (2), some stats packages have model structures that typically work very differently from those here. The key conceptual difference is in breadth, which implies a practical difference in how data is handled.

If a stats package has something called a model, then it is probably of the form Y = [a function of X], such as $y = x_1 + \log(x_2) + x_3^2$. Trying new models means trying different functional forms for the right-hand side, such as including $x_2$ in some cases and excluding it in others. Conversely, Apophenia works to facilitate trying new models in the sense of switching out a linear model for a hierarchical, or a Bayesian model for a simulation. A formula syntax makes little sense over such a broad range of models.

As a result, Apophenia does not put the form of the left-hand side into the model; the data is assumed to be correctly formatted, scaled, or logged before being passed to the model. This is where part (1), the database, comes in.

Many stats packages were initially designed around data literally being provided in decks of punchchards, and still seem to assume that the data is basically to be taken as read-only. With a database, the GSL, and the various tools provided by Apophenia, you have more than enough tools to make your data look how you want it to. When you do want to try swapping out different formula specifications, the database provides a proxy for the sort of formula specification language above:

apop_data *testme= apop_query_to_data("select y, x1, log(x2), pow(x3,2) from data");
apop_model *est = apop_estimate(testme, apop_ols);

[As an aside, there is a simple principle to the ordering of arguments to functions like apop_estimate : the data always comes first.]

Generating factors and dummies is also considered data prep, not model internals. See apop_data_to_dummies and apop_text_to_factors.

Now that you have an estimated model (which holds the estimated parameters), you can interrogate it.

 //If you have a new data set, you can fill in predicted values:
apop_predict(new_data_set, est);
apop_data_show(new_data_set)

 //Make random draws. The draw function needs an RNG from
 //the GNU Scientific Library, and a pointer-to-\c double.
gsl_rng *r = apop_rng_alloc(218);
for (int i=0; i< matrix->size1; i++){
    Apop_matrix_col(matrix, i, one_col);
    apop_draw(one_col->data, r, est);
}

So that's the intended workflow: first, gather the data, probably via the database, pull the data in the format you want it; then, set up your model, estimate and interrogate it. [See also the model settings section below on optional details of model setup.]

pip Some notes on C and Apophenia's use of C utilities.

pip Learning C

Modeling with Data has a full tutorial for C, oriented at users of standard stats packages. More nuts-and-bolts tutorials are in abundance. Some people find pointers to be especially difficult; fortunately, there's a claymation cartoon which clarifies everything.

Coding often relies on gathering together many libraries; there is a section at the bottom of this outline linking to references for some libraries upon which Apophenia builds.

pip Usage notes

Here are some notes about the technical details of using the Apophenia library in your development environment.

Header aggregation

If you put

#include <apop.h>

at the top of your file, then it will call virtually every header file you could need: gsl_matrix.h, gsl_blas.h, sqlite3.h, stdio.h, string.h, math.h, apophenia_all_of_them.h, et cetera. Of course, if you get `implicit declaration of...' then you will need to manually include something else. Bear in mind that every book on C will tell you this is bad form and you shouldn't do it.

Easier calling syntax

Several functions (but nowhere near all) use a script-like syntax for calling functions. For example, the apop_vector_distance function finds the distance between two vectors using various metrics. The full function call, using an $L_3$ norm for vectors $v_1$ and $v_2$ would be

apop_vector_distance(v1, v2, 'L', 3);

But some metrics don't need a number to be fully described, so you can leave that off. Here's the standard (Euclidean) distance:

apop_vector_distance(v1, v2, 'E');

Because Euclidean distance is so standard, it is assumed as the default if you don't specify a norm, so you can leave out the third input:

If you give only one vector, I'll assume the other is the zero vector, and thus return the length of the first vector. Here is the length of a vector using the Manhattan metric:

apop_vector_distance(v1, .metric='M');

See the Designated initializers page for details of this syntax.

pip Libraries

Your best bet is to write yourself a Makefile. If you don't want to use the sample Makefile, then here are some notes for the command line. When compiling, you will need to tell the compiler to use the Apophenia library. That is, you will need to call GCC with gcc -lapophenia (as well as the other usual flags). For example, gcc sample.c -lapophenia -lsqlite3 -lgsl -lgslcblas -o run_me -g will produce an output file named run_me from the input source code sample.c. It will include symbols for debugging (-g) and will correctly find the functions in the Apophenia, GSL, and SQLite libraries (-lapophenia -lgsl ...). Order matters in the linking list: the files a package depends on should be listed after the package. E.g., since sample.c depends on Apophenia, gcc sample.c -lapophenia will work, while gcc -lapophenia sample.c is likely to give you errors. Similarly, list -lapophenia before -lgsl, which comes before -lgslcblas.

pip Debugging

The global variable apop_opts.verbose turns on some diagnostics, such as printing the query sent to the database engine (which is useful if you are substituting in many %ses). Just set apop_opts.verbose =1 when you want feedback and apop_opts.verbose=0 when you don't.

If you use gdb, you can define macros to use the pretty-printing functions on your data, which can be a significant help. Add these to your .gdbinit:

define pv
    p apop_vector_show($arg0)
end

define pm
    p apop_matrix_show($arg0)
end

define pd
    p apop_data_show($arg0)
end

define pa
    p *($arg0)@$arg1
end

Then just pd mydata to see all components of your data set neatly displayed, or pa myarray 5 to see the first five elements of your array.

pip Syntax highlighting

If your text editor supports syntax highlighting, there are a few types defined in the Apophenia and GSL headers which may be worth coloring. E.g., for vim, add the following two lines to /usr/share/vim/syntax/c.vim:

syn keyword     cType           gsl_matrix gsl_rng gsl_vector apop_data
syn keyword     cType           apop_name apop_model

Other text editors have similar files to which you can add the above types.

pip The Python interface

The distribution includes a Python interface via the SWIG tool for bridging across languages.

Installation: You will need to have the SWIG and Python-development packages installed on your computer when compiling. From there,

./configure --enable-python
make
sudo make install

will produce the Python library and install it in Python's site-packages directory.

Sample script:

On the page on the apop_ols model, you will find a few lines of toy data and a sample program to run an OLS regression. Once you have set up the data file, you can use this Python rewrite of the OLS program:

from apop import *
apop_text_to_db("data", "d", 0, 1, None)
data = apop_query_to_data("select * from d")
est  = apop_estimate(data, avar.apop_ols)
print est

#And one last example of using a method of the apop_data object:

print est.parameters.transpose()
  • SWIG sets all C-side global variables into a category named avar. Thus the apop_ols variable had to be called avar.apop_ols.
  • The verbose package-object-action scheme for naming functions is mostly unnecessary in Python, where objects can more closely be attached to functions. Thus, instead of calling apop_vector_skew(my_v), you would call my_v.skew(). If you want a list of methods, just use help(my_v) or any of the other familiar means of introspection.

Here is another simple example, that copies a Python-side list into a matrix using apop_pylist_to_data, and then runs a Fisher Exact test on the matrix. If the list were one-dimensional (flat), then the data would be copied into the vector element of the returned apop_data structure.

from apop import *
data = [[15, 13], [18, 19]]
adata = apop_pylist_to_data(data) 

print "display the data"
print adata.matrix

result = apop_test_fisher_exact(adata)
print result

print "Or, just one value:"
print result.get("p value", -1)
  • The focus of the work is still in C, so there will likely always be things that you can do in C that can't be done in Python, and strange Python-side errors that will only be explicable if you understand the C-side. That said, you can still access all of the functions from Python (including those that make little sense from Python).

pip SQL, the syntax for querying databases

For a reference, your best bet is the Structured Query Language reference for SQLite. For a tutorial; there is an abundance of tutorials online. The blog of Apophenia's author includes an entry about complementaries between SQL and matrix manipulation packages.

Apophenia currently supports two database engines: SQLite and mySQL. SQLite is the default, because it is simpler and generally more easygoing than mySQL, and supports in-memory databases.

You can switch to mySQL two ways: set apop_opts.db_engine = 'm', or set the environment variable APOP_DB_ENGINE=mysql. Otherwise, the system will use SQLite. Ideally, after you make this switch, you need make no other changes--- apop_query, apop_query_to_data, apop_table_exists, et cetera, will work as before.

Finally, Apophenia provides a few nonstandard SQL functions to facilitate math via database; see Database moments (plus pow()!).

pip The book version

Apophenia co-evolved with Modeling with Data: Tools and Techniques for Statistical Computing. You can read about the book, or download a free PDF copy of the full text, at modelingwithdata.org.

If you are at this site, there is probably something there for you, including a tutorial on C and general computing form, SQL for data-handing, several chapters of statistics from various perspectives, and more details on working Apophenia.

As with many computer programs, the preferred manner of citing Apophenia is to cite its related book. Here is a BibTeX-formatted entry, which should be be easy to re-shape for other environments:

@book{klemens:modeling,
title = "Modeling with Data: Tools and Techniques for Statistical Computing",
author="Ben Klemens",
year=2008,
publisher="Princeton University Press"
}

pip Data sets

The apop_data structure represents a data set. It joins together a gsl_vector, a gsl_matrix, an apop_name, and a table of strings. It tries to be lightweight, so you can use it everywhere you would use a gsl_matrix or a gsl_vector.

For example, let us say that you are running a regression: there is a vector for the one dependent variable, and a matrix for the several independent variables. Think of them as a partitioned matrix, where the vector is column -1, and the first column of the matrix is column zero. Here is some code to print the entire matrix. Notice that the column counter i starts counting at -1.

  for (j = 0; j< data->matrix->size1; j++){
    printf("%s\t", apop_name_get(data->names, j, 'r'));
    for (i = -1; i< data->matrix->size2; i++)
        printf("%g\t", apop_data_get(data, j, i));
    printf("\n");
    }

We're generally assuming that the data vector and data matrix have the same row count: data->vector->size==data->matrix->size1 . This means that the apop_name structure doesn't have separate vector_names and row_names elements: the rownames are assumed to apply for both.

The apop_data set includes a more pointer, which will typically be NULL, but may point to another apop_data set. This is intended for a main data set and a second or third page with auxiliary information: estimated parameters on the front page and their covariance matrix on page two, or predicted data on the front page and a set of prediction intervals on page two. apop_data_copy and apop_data_free will handle all the pages of information. The more pointer is not intended as a linked list for millions of data points---you can probably find a way to restructure your data to use a single table (perhaps via apop_data_pack and apop_data_unpack).

Easy data manipulation is essential for enjoying life as a researcher. Thus, there are a great many functions to collate, copy, merge, sort, prune, and otherwise manipulate the apop_data structure and its components.

Note: Apophenia builds upon the GSL, but it would be inappropriate to redundantly replicate the GSL's documentation here. You will find a link to the full GSL documentation at the end of this outline, and the text of the outline will list the names a few common functions. The GSL's naming scheme is very consistent, so a simple reminder of the function name may be sufficient for your needs.

pip Alloc/free

See also:

  • gsl_vector * gsl_vector_alloc (size_t n)
  • gsl_vector * gsl_vector_calloc (size_t n)
  • void gsl_vector_free (gsl_vector * v)
  • gsl_matrix_memcpy (gsl_matrix * dest, const gsl_matrix * src)

pip Using GSL Views

You may need to pull a row of a matrix as a distinct vector. The GSL makes it easy to do so, and Apophenia provides several convenience functions to save any remaining hassle. The data in the new vector will point to the base data, not copy it. Some examples:

apop_data *d = apop_query_to_data("select obs1, obs2, obs3 from a_table");
gsl_matrix *m = d->matrix;

//Get a column using its name
Apop_col_t(d, "obs1", ov);
double obs1_sum = apop_vector_sum(ov);

//Get a row using its index
Apop_row(d, 0, v);
double first_row_sum = apop_vector_sum(v);

//First column's sum, matrix format
Apop_matrix_col(m, 0, mv);
double first_col_sum = apop_vector_sum(mv);

//Pull a ten by five submatrix, whose first element is the (2,3)rd
//element of the parent data set's matrix
Apop_submatrix(d, 2,3, 10,5, subm);
double first_col_sum = apop_matrix_sum(subm);

These macros make use of a set of GSL matrices that produce output of type gsl_matrix_view and gsl_vector_view. These types point to the source data, and add metadata to turn the data into a coherent matrix/vector. Apophenia's macros generate these views, then pull the matrix/vector from them, so you never have to deal with the _view structs directly.

If the macros don't work for you, you can use the GSL's macros directly. Here's how to get the fifth row of a_matrix into a vector view:

gsl_vector_view v;
v = gsl_matrix_col(a_matrix, 4);

For rows, use gsl_matrix_row(a_matrix, n). The vector view is a data structure which includes an element of type gsl_vector named vector; this is the only element you will be interested in. The expression &(v.vector) is of type gsl_vector *, and therefore can be used as you would any other pointer to a gsl_vector. For example, try apop_mean(&(v.vector));.

The view is intended to be a common variable, not a pointer. If you want to retain the data after the function exits, copy it to another vector:

gsl_vector_view v;
gsl_vector *a_new_vector = gsl_vector_alloc(a_matrix->size1);
v = gsl_matrix_col(a_matrix, 4);
gsl_vector_memcpy(a_new_vector, &(v.vector));

pip Set/get

Let $t$ be text, and $i$ be an index. Then you can get or set an element by row title and column index, row index and column title, et cetera. The functions that have no suffix find an element by row and column index.

The apop_data_ptr_... functions return a pointer to the element.

See also:

  • double gsl_matrix_get (const gsl_matrix * m, size_t i, size_t j)
  • double gsl_vector_get (const gsl_vector * v, size_t i)
  • void gsl_matrix_set (gsl_matrix * m, size_t i, size_t j, double x)
  • void gsl_vector_set (gsl_vector * v, size_t i, double x)
  • double * gsl_matrix_ptr (gsl_matrix * m, size_t i, size_t j)
  • double * gsl_vector_ptr (gsl_vector * v, size_t i)
  • const double * gsl_matrix_const_ptr (const gsl_matrix * m, size_t i, size_t j)
  • const double * gsl_vector_const_ptr (const gsl_vector * v, size_t i)
  • gsl_matrix_get_row (gsl_vector * v, const gsl_matrix * m, size_t i)
  • gsl_matrix_get_col (gsl_vector * v, const gsl_matrix * m, size_t j)
  • gsl_matrix_set_row (gsl_matrix * m, size_t i, const gsl_vector * v)
  • gsl_matrix_set_col (gsl_matrix * m, size_t j, const gsl_vector * v)

pip Map/apply

Map/apply These functions allow you to send each element of a vector or matrix to a function, either producing a new matrix (map) or transforming the original (apply). The ..._sum functions return the sum of the mapped output.

There is an older and a newer set of functions. The older versions, which act on gsl_matrixes or gsl_vectors have more verbose names; the newer versions, which act on the elements of an apop_data set, use the Designated initializers syntax to ad a few options and a more brief syntax.

You can do many things quickly with these functions.

Get the sum of squares of a vector's elements:

double sum_of_squares = apop_map_sum(dataset, gsl_pow_2); //given <tt> apop_data *dataset</tt>

double sum_of_squares = apop_vector_map_sum(v, gsl_pow_2); //given <tt> gsl_vector *v</tt>

Here, we create an index vector [0, 1, 2, $\dots$].

double index(double in, int index){return index;}
apop_data *d = apop_data_alloc(100, 0, 0);
apop_map(d, .fn_di=index, .inplace=1);

Given your log likelihood function and a data set where each row of the matrix is an observation, find the total log likelihood:

static double your_log_likelihood_fn(const gsl_vector * in){[your math goes here]}

double total_log_likelihood = apop_matrix_map_sum(dataset, your_log_likelihood_fn);

How many missing elements are there in your data matrix?

static double nan_check(const double in){ return isnan(in);}

    int nan_count_data = apop_map_sum(in, nan_check, .part='m');

Get the mean of the not-NaN elements of a data set:

static double no_nan_val(const double in){ return isnan(in)? 0 : in;}
static double nan_check(const double in){ return isnan(in);}

static double apop_mean_no_nans(apop_data *in){
    return apop_map_sum(in, no_nan_val)/apop_map_sum(in, nan_check);
}

This program randomly generates a data set where each row is a list of numbers with a different mean. It then finds the $t$ statistic for each row, and the confidence with which we reject the claim that the statistic is less than or equal to zero.

Notice how the older apop_vector_apply uses file-global variables to pass information into the functions, while the apop_map uses a pointer to the constant parameters to input to the functions.

#include <apop.h>

double row_offset;
gsl_rng *r;

void offset_rng(double *v){*v = gsl_rng_uniform(r) + row_offset;}
double find_tstat(gsl_vector *in){ return apop_mean(in)/sqrt(apop_var(in));}
double conf(double in, void *df){ return gsl_cdf_tdist_P(in, *(int *)df);}

int main(){
    apop_data *d = apop_data_alloc(0, 10, 100);
    r = apop_rng_alloc(3242);
    for (int i=0; i< 10; i++){
        row_offset = gsl_rng_uniform(r)*2 -1;
        Apop_row(d, i, onerow);
        apop_vector_apply(onerow, offset_rng);
    }

    size_t df = d->matrix->size2-1;
    apop_data *means = apop_map(d, .fn_v = apop_vector_mean, .part ='r');
    apop_data *tstats = apop_map(d, .fn_v = find_tstat, .part ='r');
    apop_data *confidences = apop_map(tstats, .fn_dp = conf, .param = &df);

    printf("means:\t\t"); apop_data_show(means);
    printf("t stats:\t"); apop_data_show(tstats);
    printf("confidences:\t"); apop_data_show(confidences);
}

pip Basic Math

See also:

  • int gsl_matrix_add (gsl_matrix * a, const gsl_matrix * b)
  • int gsl_matrix_sub (gsl_matrix * a, const gsl_matrix * b)
  • int gsl_matrix_mul_elements (gsl_matrix * a, const gsl_matrix * b)
  • int gsl_matrix_div_elements (gsl_matrix * a, const gsl_matrix * b)
  • int gsl_matrix_scale (gsl_matrix * a, const double x)
  • int gsl_matrix_add_constant (gsl_matrix * a, const double x)
  • gsl_vector_add (gsl_vector * a, const gsl_vector * b)
  • gsl_vector_sub (gsl_vector * a, const gsl_vector * b)
  • gsl_vector_mul (gsl_vector * a, const gsl_vector * b)
  • gsl_vector_div (gsl_vector * a, const gsl_vector * b)
  • gsl_vector_scale (gsl_vector * a, const double x)
  • gsl_vector_add_constant (gsl_vector * a, const double x)

pip Matrix math

pip Summary stats

See also:

  • double gsl_matrix_max (const gsl_matrix * m)
  • double gsl_matrix_min (const gsl_matrix * m)
  • void gsl_matrix_minmax (const gsl_matrix * m, double * min_out, double * max_out)
  • void gsl_matrix_max_index (const gsl_matrix * m, size_t * imax, size_t * jmax)
  • void gsl_matrix_min_index (const gsl_matrix * m, size_t * imin, size_t * jmin)
  • void gsl_matrix_minmax_index (const gsl_matrix * m, size_t * imin, size_t * jmin, size_t * imax, size_t * jmax)
  • gsl_vector_max (const gsl_vector * v)
  • gsl_vector_min (const gsl_vector * v)
  • gsl_vector_minmax (const gsl_vector * v, double * min_out, double * max_out)
  • gsl_vector_max_index (const gsl_vector * v)
  • gsl_vector_min_index (const gsl_vector * v)
  • gsl_vector_minmax_index (const gsl_vector * v, size_t * imin, size_t * imax)

pip Moments

pip Conversion among types

pip Name handling

pip Generating factors

Factor is jargon for a numbered category. Number-crunching programs work best on numbers, so we need a function to produce a one-to-one mapping from text categories into numeric factors.

A dummy is a variable that is either one or zero, depending on membership in a given group. Some methods (typically when the variable is an input or independent variable) prefer dummies; some methods (typically for outcome or dependent variables) prefer factors.

pip Databases

These are convenience functions to handle interaction with SQLite or mySQL. They open one and only one database, and handle most of the interaction therewith for you.

You will probably first use apop_text_to_db to pull data into the database, then apop_query to clean the data in the database, and finally apop_query_to_data to pull some subset of the data out for analysis.

See also the Conversion functions, including apop_text_to_db and apop_matrix_to_db.

P.S.
Apophenia reserves the right to insert temp tables into the opened database. They will all have names beginning with "apop_", so the reader is advised to not use tables with such names, and is free to ignore or delete any such tables that turn up.

pip Out

pip In

See the print functions below. By setting apop_opts.output_type = 'd', apop_data sets (or gsl_matrixes and gsl_vectors) are `printed' to the database.

pip Models

pip Introduction

A model intermediates between data, parameters, and likelihoods. [This definition readily includes "non-parametric" models.] When the parameters are fully specified, the model will give you a likelihood that an observation will occur. Much of statistical analysis consists of writing down a model, estimating its parameters, and running hypothesis tests to determine the confidence with which we can make statements about those parameters.

Given data, the model can estimate the parameters that best fit the data. This is the typical model estimation.

Given parameters, the model can generate the expected value of the data, or randomly generate a full artificial data set.

Apophenia facilitates all this via its apop_model objects. The model includes slots for log_likelihood, estimate, draw, expected_value, and other operations we expect from our models.

For example, a model may be a probability distribution, such as the apop_normal, apop_poisson, or apop_beta models. The data is assumed to have been drawn from a given distribution and the question is only what distributional parameters best fit; e.g., assume the data is Normally distributed and find the mean and variance: apop_estimate(data,apop_normal).

The design of the objects hopes to make it as easy as possible for you, dear reader, to write new models. For the most part, all you need to do is write a log likelihood function, and apop_maximum_likelihood does the rest; see below.

The most commonly-used function in the systems below is the apop_estimate function. It takes in a model and data, and outputs an apop_estimate, that includes the parameter estimates and the various auxiliary data that one may need to test the estimates, such as the variance-covariance matrix. For most users, the estimate function will be all one needs from a model. Just prep the data, select a model, and produce an estimate:

apop_data          *data               = read_in_data();
apop_model     *the_estimate  = apop_estimate(data, apop_probit);
apop_model_show(the_estimate);

pip The internals

model.png

The apop_model struct breaks down into three parts:

  • Info like names, the pointer to the input data, and the parameters, are for the most part self-descriptive.
  • There is a broad class of functions that cover most of what you would do with a model. You can see that there is a bit of a bias toward maximum likelihood estimation. There are helper functions for most of them, like the apop_estimate function above, that meant that you never had to directly handle the model's estimate method. In the list at the top of this page, all of the functions whose last argument is a model are helper functions of this type. The helper functions do some boilerplate error checking, and mean that you don't have to fill in every blank in your model: if you have a log_likelihood method but no p method, then apop_p will use exp(log_likelihood). If you don't give an estimate method, then apop_estimate will use maximum likelihood.
  • I refer to the values output from an estimation as parameters, and the details of how the model's machinery work as settings. The parameters are pretty standardized, and a apop_data set is sufficient to handle the great majority of models. However, the settings of an OLS regression are drastically different from those of a histogram or an MLE. The solution is a list of settings structures. This is probably the least pleasant structural element in the system, to the point that you will not want to handle the settings structures directly. Instead, you should use the various helper functions for model settings.

pip Writing your own

Writing apop_model objects is easy, because (almost) everything has a default of some sort, so you can fill in those segments you know, and let the system use computationally-intensive methods for the rest.

For example, here is how one would set up a model for an MLE:

  • Write a likelihood function. Its header will look like this:
    double apop_new_log_likelihood(apop_data *data, apop_model *m)
    
    where data is the input data, and m is the parametrized model (i.e. your model with a parameters element set by the caller). This function will return the value of the log likelihood function at the given parameters.
  • Is this a constrained optimization? See the Constraints page on how to set them. Otherwise, no constraints will be assumed.
  • Write the object. In your header file, include
    apop_model your_new_model = {"The Me distribution", number_of_parameters, 
                .estimate = new_estimate, .log_likelihood = new_log_likelihood };
    
    If there are constraints, add an element for those too.

number_of_parameters is probably a positive integer like 2, but it is often (the number of columns in your data set) -1, in which case, set number_of_parameters to -1.

You already have enough that something like

apop_model *estimated = apop_mle(your_data, your_new_model);

will work.

For many methods, the first thing you will write is the random number generator. Other elements, such as the gradient---

void apop_new_dlog_likelihood(apop_data *d, gsl_vector *gradient, apop_model *m){
    //some algebra here to find df/dp0, df/dp1, df/dp2....
    gsl_vector_set(gradient,0, d_0);
    gsl_vector_set(gradient,1, d_1);
}

---or the expected value are optional.

pip Models that ship with Apophenia

pip Basic object methods

pip Methods for computation

pip Model settings

[For info on specific settings groups and their contents and use, see the Settings page.]

pip Intro for model users

Describing a statistical, agent-based, social, or physical model in a standardized form is difficult because every model has significantly different settings. E.g., an MLE requires a method of search (conjugate gradient, simplex, simulated annealing), and a histogram needs the number of slots to be filled with data.

So, the apop_model includes a single list, whose name is simply settings, which can hold an arbitrary number of groups of settings. For example, you can have a set of search specifications for finding the maximum likelihood, and a histogram for making random draws.

Settings groups are automatically initialized with default values when needed. If the defaults do no harm, then you don't need to think about these settings groups at all.

If you do need to tweak a setting, you will need its location. Think of each model having a row of baskets, such as the apop_mle_settings and the apop_histogram_settings baskets. To find a single setting, like the MLE's tolerance setting, you would need to give the model, which basket to look in, and the setting. E.g.,

double tol = Apop_settings_get(your_model, apop_mle, tolerance);
Apop_settings_set(your_model, apop_mle, tolerance, 1e-5);

Notice that we don't need the _settings ending to the settings group's name---macros make it happen.

But you are probably going to set all of a model's settings at once when you first use it. Here is a full example:

1 apop_data *data = your_data_here;
2 apop_data *w = your_weights_here;

3 apop_model *m = apop_model_copy(apop_wls);
4 Apop_model_add_group(m, apop_ls, .weights=w, .want_cov='y');
5 apop_model *est = apop_estimate(data, *m);

Line three establishes the baseline form of the model. Line four adds a settings group of type apop_lm_settings to the model, and specifies that we want it initialized with the weights element set to the data set w, and the want_cov element set to 'y'. Unlike the single-setting macros above, Apop_model_add_group follows the Designated initializers syntax of the form .setting=value.

Having set the settings, line 5 does the weighted OLS.

Also, the output to any of Apophenia's estimations will have an appropriate group of settings allocated, so you can chain estimations pretty easily. Continuing the above example, you could re-estimate with an alternative set of weights via:

Apop_settings_set(est, apop_ls, weights, weight_set_two);
apop_model *est2 = apop_estimate(data, *est);
  • The list of elements of the settings structures included in Apophenia are on the Settings page.
  • Notice the use of a single capital to remind you that you are using a macro, so you should beware of the sort of surprising errors associated with macros. Here in the modern day, we read things like APOP_SETTINGS_ADD as yelling, but if you prefer all caps to indicate macros, those work as well.
  • There are two additional macros which are now deprecated: Apop_settings_add_group and Apop_settings_alloc_add. They made sense at the time. That's why the Apop_model_add_group macro doesn't have settings in the name. The Apop_settings_add macro is equivalent to Apop_settings_set.

For just using a model, that's about 100% of what you need to know.

pip Writing new settings groups

To store the settings for your own models, you don't necessarily need any of this. The apop_model structure has a void pointer named more which you can use to store extra information as needed. If more_size is larger than zero (i.e. you set it to your_model.more_size=sizeof(your_struct)), then it will be copied via memcpy by apop_model_copy, and freed by apop_model_free. Apophenia's estimation routines will never impinge on this item, so do what you feel with it.

If you do want to set up a new settings group, then you will need four items. This is the sort of boilerplate that will be familiar to users of object oriented languages in the style of C++ or Java. Let your settings group be named ysg; then you will need

  • The settings struct
    typedef struct {
    ...
    } ysg_settings
    
  • The allocate function
    ysg_settings *ysg_settings_init(ysg_settings in){ 
        ysg_settings *out = malloc(sizeof(ysg_settings));
        apop_varad_setting(in, out, want_cov,  'y');
        return out; }
    
    I included an example of the use of apop_varad_settings, a convenience macro to ease merging the input settings group with defaults. It checks for the given element (here, want_cov), and if that element is not found in the input, sets it to the specified value (here, 'y').
  • The copy function
    void *ysg_settings_copy(ysg_settings *copyme) {
        ysg_settings *out = malloc(sizeof(ysg_settings));
        //copy elements here. If you have no pointers or other trickery, just use:
        *out = *copyme
        return out; }
    
  • The free function, which can be as brief as:
    void ysg_settings_free(ysg_settings *freeme) {
        free(freeme);
    }
    
    but may include a freeing of pointed-to subelements as necessary.

The names are not negotiable: when you call

the macro will look for ysg_settings, ysg_settings_init, et cetera.

The lines-of-code averse will cringe at having to write such boilerplate code (I do), but after spending a year resisting it, I have to concede that it's the least of all evils.

You can retrieve the whole group or individual elements via:

Apop_settings_get_group(m, ysg)
//as well as the one-element macro mentioned above,
Apop_settings_get(m, ysg, an_element)

As you saw above, once the typedef/alloc/copy/free machinery is written, you can declare, get, and set in a reasonably graceful manner.

  • For efficiency and other reasons, you may want your copy routine to copy pointers to large data sets rather than copying the data itself. But because so much copying goes on, you will need to be careful that any allocated data is freed only once. The easiest way to do this is to include an owner element in your data, which is set to one in the init routine but to zero in the copy routine, then free pointers only if owner == 1. Here is an example that either takes in a data set or defaults to initializing a new one; either way, copies of the model will all point to the same data set.
typedef struct {
    apop_data *dataset;
    int owner;
} ysg_settings;

ysg_settings *ysg_settings_init(ysg_settings in){ 
    ysg_settings *out = malloc(sizeof(ysg_settings));
    apop_varad_settings(in, out, dataset, apop_data_alloc(100, 100, 100));
    out.owner = 1;
    return out; }

void *ysg_settings_copy(ysg_settings *copyme) {
    ysg_settings *out = malloc(sizeof(ysg_settings));
    *out = *copyme; //pointer to data was copied, not underlying data.
    out.owner = 0;
    return out; }

void ysg_settings_free(ysg_settings *freeme) {
    if (freeme.owner)
        apop_data_free(freeme->dataset);
    free(freeme);
}

pip Estimation aids

pip Tests & diagnostics

Just about any hypothesis test consists of a few common steps:

If the statistic is from a common form, like the parameters from an OLS regression, then the commonly-associated $t$ test is probably thrown in.

Some tests, like ANOVA, produce a statistic using a specialized procedure, so Apophenia includes some functions, like apop_test_anova_independence and apop_test_kolmogorov, to produce the statistic and look up its significance level.

If you are producing a statistic that you know has a common form, like a central limit theorem tells you that your statistic is Normally distributed, then the convenience function apop_test will do the final lookup step of checking where your statistic lies on your chosen distribution.

See also the in-database tests above.

pip Monte Carlo methods

To give another example of testing, here is a function that used to be a part of Apophenia, but seemed a bit out of place. Here it is as a sample:

// Input: any old vector. Output: 1 - the p-value for a chi-squared
// test to answer the question, "with what confidence can I reject the
// hypothesis that the variance of my data is zero?"

double apop_test_chi_squared_var_not_zero(const gsl_vector *in){
  apop_assert(in, 0, 0, 'c', "input vector is NULL. Doing nothing.\n");
  gsl_vector   *normed;
  double       sum=0;
     apop_vector_normalize((gsl_vector *)in,&normed, 1);
     gsl_vector_mul(normed,normed);
     for(size_t i=0;i< normed->size; 
               sum +=gsl_vector_get(normed,i++));
     gsl_vector_free(normed);
     return gsl_cdf_chisq_P(sum,in->size); 
}

Or, consider the Rao statistic, ${\partial\over \partial\beta}\log L(\beta)'I^{-1}(\beta){\partial\over \partial\beta}\log L(\beta)$ where $L$ is your model's likelihood function and $I$ its information matrix. In code:

apop_data * infoinv = apop_model_numerical_covariance(data, your_model);
apop_data * score;
apop_score(data, score->vector, your_model);
apop_data * stat    = apop_dot(apop_dot(score, infoinv), score);

Given the correct assumptions, this is $\sim \chi^2_m$, where $m$ is the dimension of $\beta$, so the odds of a Type I error given the model is:

    double p_value = apop_test(stat, "chi squared", beta->size);

pip Histograms

The GSL provides a gsl_histogram structure, that produces a PDF by accumulating data into bins. Apophenia wraps this into its apop_model struct, so that the model-family machinery can be applied to Bayesian updating, kernel smoothing, or other methods that output a PDF.

To produce a PMF from your_data, use apop_estimate(your_data, apop_histogram). If you would like to compare this histogram to other data (observed or theoretical),then you'll need to produce a second synced histogram using apop_histogram_vector_reset or apop_histogram_model_reset. Then you can send both histograms to, say, apop_test_kolmogorov.

The second structure from the GSL incrementally sums up the PMF's bins to produce a CMF. The CMF can be used to map from a draw from a Uniform[0,1] to a draw from the PMF. Because it can be used to draw from the PMF, the GSL calls this the gsl_histogram_pdf structure. That's right: the data in the gsl_histogram_pdf structure is a cumulative sum---a CMF.

Anyway, here are some functions to deal with these various histograms and such; see also the GSL documentation, linked at the bottom of this outline.

pip Maximum likelihood methods

If you read the section on writing models, then you already know how to do maximum likelihood on exotic setups. Just write a model that has a p or log_likelihood function, and call apop_estimate(your_data,your_model). The default estimation routine is maximum likelihood.

MLEs have an especially large number of parameter tweaks that could be made; see the section on MLE settings above.

The problem is that the parameters of a function must not take on certain values, either because the function is undefined for those values or because parameters with certain values would not fit the real-world problem.

The solution is to rewrite the function being maximized such that the function is continuous at the constraint boundary but takes a steep downward slope. The unconstrained maximization routines will be able to search a continuous function but will never return a solution that falls beyond the parameter limits.

If you give it a likelihood function with no regard to constraints plus an array of constraints, apop_maximum_likelihood will combine them to a function that fits the above description and search accordingly.

A constraint function must do three things:

The idea is that if the constraint returns zero, the log likelihood function will return the log likelihood as usual, and if not, it will return the log likelihood at the constraint's return vector minus the penalty. To give a concrete example, here is a constraint function that will ensure that both parameters of a two-dimensional input are both greater than zero:

static double beta_zero_greater_than_x_constraint(apop_data *data, apop_model *v){
    //constraint is 0 < beta_2
  static apop_data *constraint = NULL;
    if (!constraint){
        constraint= apop_data_calloc(1,1,1);
        apop_data_set(constraint, 0, 0, 1);
    }
    return apop_linear_constraint(v->parameters->vector, constraint, 1e-3);
}

pip More descriptive methods

pip Missing data

pip Legible output

Most of the printing options are intended to operate in different modes (as appropriate). The idea is that you will probably working in one specific mode for a while---writing to screen, a file, or the database---and so can just set one global option reflecting this:

apop_opts.output_type = 's'; //Stdout
apop_opts.output_type = 'f'; //named file
apop_opts.output_type = 'p'; //a pipe or already-opened file
apop_opts.output_type = 'd'; //the database

C makes minimal distinction between pipes and files, so you can set a pipe or file as output and send all output there until further notice:

apop_opts.output_type = 'p';
apop_opts.output_pipe = popen("gnuplot", "w");
apop_plot_lattice(...);
fclose(apop_opts.output_pipe);
apop_opts.output_pipe = fopen("newfile", "w");
apop_data_print(set1);
fprintf(apop_opts.output_pipe, "\nNow set 2:\n");
apop_data_print(set2);

Continuing the example, you can always override the global data with a specific request:

apop_vector_print(v, "vectorfile"); //put vectors in a separate file
apop_matrix_print(m, "matrix_table", .output_type = 'd'); //write to the db
apop_matrix_print(m, .outpipe = stdout);  //now show the same matrix on screen

I will first look to the input file name, then the input pipe, then the global output_pipe, in that order, to determine to where I should write. Some combinations (like output type = 'd' and only a pipe) don't make sense, and I'll try to warn you about those.

The plot functions produce output for Gnuplot (so output type = 'd' again does not make sense). As above, you can pipe directly to Gnuplot or write to a file.

pip Assorted

pip General utilities

pip Math utilities

pip Deprecated

These functions will probably disappear or be replaced soon.

pip Further references

For your convenience, here are links to some other libraries you are probably using.

SourceForge.net Logo

Autogenerated by doxygen on 14 Mar 2010.