![]() |
|
Expand all | Collapse all
Getting startedIf you are entirely new to Apophenia, have a look at the Gentle Introduction here.
As well as the information in this outline, there is a separate page covering the details of setting up a computing environment and another page with some sample code for your perusal.
For another concrete example of folding Apophenia into a project, have a look at this sample program.
Some notes on C and Apophenia's use of C utilities.
Learning CModeling with Data has a full tutorial for C, oriented at users of standard stats packages. More nuts-and-bolts tutorials are in abundance. Some people find pointers to be especially difficult; fortunately, there's a claymation cartoon which clarifies everything.
Coding often relies on gathering together many libraries; there is a section at the bottom of this outline linking to references for some libraries upon which Apophenia builds.
Usage notesHere are some notes about the technical details of using the Apophenia library in your development environment.
Header aggregation
If you put
#include <apop.h>
at the top of your file, then it will call virtually every header file you could need: gsl_matrix.h, gsl_blas.h, sqlite3.h, stdio.h, string.h, math.h, apophenia_all_of_them.h, et cetera. Of course, if you get `implicit declaration of...' then you will need to manually include something else. Bear in mind that every book on C will tell you this is bad form and you shouldn't do it.
Easier calling syntax
Several functions (but nowhere near all) allow optional named arguments to functions. For example:
apop_vector_distance(v1, v2); //assumes Euclidean distance apop_vector_distance(v1, .metric='M'); //assumes v2=0, uses Manhattan metric.
See the Designated initializers page for details of this syntax.
LibrariesYour best bet is to write yourself a Makefile. If you don't want to use the sample Makefile, then here are some notes for the command line. When compiling, you will need to tell the compiler to use the Apophenia library. That is, you will need to call GCC with gcc -lapophenia (as well as the other usual flags). For example, gcc sample.c -lapophenia -lsqlite3 -lgsl -lgslcblas -o run_me -g will produce an output file named run_me from the input source code sample.c. It will include symbols for debugging (-g) and will correctly find the functions in the Apophenia, GSL, and SQLite libraries (-lapophenia -lgsl ...). Order matters in the linking list: the files a package depends on should be listed after the package. E.g., since sample.c depends on Apophenia, gcc sample.c -lapophenia will work, while gcc -lapophenia sample.c is likely to give you errors. Similarly, list -lapophenia before -lgsl, which comes before -lgslcblas.
DebuggingThe global variable apop_opts.verbose turns on some diagnostics, such as printing the query sent to the database engine (which is useful if you are substituting in many %ses). There are several levels to the verbosity:
-1: silent
0: notify only of failures and clear danger
1: warn of technically correct but odd situations that might indicate, e.g., numeric instability
2: debugging-type information; print queries
3: give me everything
These levels are of course subjective, but should give you some idea of where to place the verbosity level. The default is 1.
The global variable apop_opts.stop_on_warning turns any warning into a halt via abort(). Use this to get your debugger to stop on the warning. If the verbosity level is too low to show the warning, the halt happens anyway.
If you use gdb, you can define macros to use the pretty-printing functions on your data, which can be a significant help. Add these to your .gdbinit:
define pv
p apop_vector_show($arg0)
end
define pm
p apop_matrix_show($arg0)
end
define pd
p apop_data_show($arg0)
end
define pa
p *($arg0)@$arg1
end
Then just pd mydata to see all components of your data set neatly displayed, or pa myarray 5 to see the first five elements of your array.
Syntax highlightingIf your text editor supports syntax highlighting, there are a few types defined in the Apophenia and GSL headers which may be worth coloring. E.g., for vim, add the following two lines to /usr/share/vim/syntax/c.vim:
syn keyword cType gsl_matrix gsl_rng gsl_vector apop_data syn keyword cType apop_name apop_model
Other text editors have similar files to which you can add the above types.
The Python interfaceThe distribution includes a Python interface via the SWIG tool for bridging across languages.
Installation: You will need to have the SWIG and Python-development packages installed on your computer when compiling. From there,
./configure --enable-python make sudo make install
will produce the Python library and install it in Python's site-packages directory.
Sample script:
On the page on the apop_ols model, you will find a few lines of toy data and a sample program to run an OLS regression. Once you have set up the data file, you can use this Python rewrite of the OLS program:
from apop import * apop_text_to_db("data", "d", 0, 1, None) data = apop_query_to_data("select * from d") est = apop_estimate(data, avar.apop_ols) print est #And one last example of using a method of the apop_data object: print est.parameters.transpose()
apop_ols variable had to be called avar.apop_ols.apop_vector_skew(my_v), you would call my_v.skew(). If you want a list of methods, just use help(my_v) or any of the other familiar means of introspection.Here is another simple example, that copies a Python-side list into a matrix using apop_pylist_to_data, and then runs a Fisher Exact test on the matrix. If the list were one-dimensional (flat), then the data would be copied into the vector element of the returned apop_data structure.
from apop import * data = [[15, 13], [18, 19]] adata = apop_pylist_to_data(data) print "display the data" print adata.matrix result = apop_test_fisher_exact(adata) print result print "Or, just one value:" print result.get("p value", -1)
SQL, the syntax for querying databasesFor a reference, your best bet is the Structured Query Language reference for SQLite. For a tutorial; there is an abundance of tutorials online. The blog of Apophenia's author includes an entry about complementaries between SQL and matrix manipulation packages.
Apophenia currently supports two database engines: SQLite and mySQL. SQLite is the default, because it is simpler and generally more easygoing than mySQL, and supports in-memory databases.
You can switch to mySQL two ways: set apop_opts.db_engine = 'm', or set the environment variable APOP_DB_ENGINE=mysql. Otherwise, the system will use SQLite. Ideally, after you make this switch, you need make no other changes--- apop_query, apop_query_to_data, apop_table_exists, et cetera, will work as before.
Finally, Apophenia provides a few nonstandard SQL functions to facilitate math via database; see Database moments (plus pow()!).
The book versionApophenia co-evolved with Modeling with Data: Tools and Techniques for Statistical Computing. You can read about the book, or download a free PDF copy of the full text, at modelingwithdata.org.
If you are at this site, there is probably something there for you, including a tutorial on C and general computing form, SQL for data-handing, several chapters of statistics from various perspectives, and more details on working Apophenia.
As with many computer programs, the preferred manner of citing Apophenia is to cite its related book. Here is a BibTeX-formatted entry, which should be be easy to re-shape for other environments:
@book{klemens:modeling,
title = "Modeling with Data: Tools and Techniques for Statistical Computing",
author="Ben Klemens",
year=2008,
publisher="Princeton University Press"
}
What is the status of the code?[This section last updated 21 February 2011.]
Apophenia was first posted to SourceForge in February 2005, which means that we've had several years to develop and test the code in real-world applications.
It is currently at version 0.99, which is intended to indicate that it is substantially complete. Of course, a library for scientific computing, or even for that small subset that is statistics, will never cover all needs and all methods. But as it stands Apophenia's framework, based on the apop_data and apop_model, is basicaly internally consistent, has enough tools that you can get common work done quickly, and is reasonably fleshed out with a good number of models out of the box.
The apop_data structure is set, and there are enough functions there that you could use it as a subpackage by itself (along with the database functions) for nontrivial dealings with data.
The apop_model structure is much more ambitious, and although its interface is set (pending your comments and feedback), its internal system can still be improved. The promise underlying the structure is that you can provide just one item, such as an RNG or a likelihood function, and the structure will do all of the work to fill in computationally-intensive methods for everything else. Some directions aren't quite there yet (such as RNG -> most other things); the default likelihood -> RNG function only works for models with a domain of unidimensional reals; the PMF model (the essential bridge from data sets to empirical models) needs an internal index for faster lookups.
Which parts of the code are most reliable? There are vigorous tests on the code base, which currently cover about 73% of the lines of code; most of the untested part is in methods of some of the more obscure models. A broad rule of thumb for any code base is that the well-worn parts, in this case functions like apop_data_get and apop_normal's log_likelihood, are likely to be entirely reliable, while the out-of-the-way functions (maybe the RNG for the Yule distribution) are worth a bit of caution. Close to all of the code has been used in production, so all of it was at least initially tested against real-world data.
How do I write extensions?It's not a package, so you don't need an API---write your code and #include it like any other C code. The system is explicity written to not require a registration or initialization step to add a new model or other such parts. You can read the notes below on generating new models, which have to conform to some rules if they are to play well with apop_estimate, apop_draw, and so forth.
You are encouraged to send new models or functions to the maintainer, or send the maintainer a URL if you prefer to host the code yourself; I'm happy to add references in the appropriate places.
Data setsThe apop_data structure represents a data set. It joins together a gsl_vector, a gsl_matrix, an apop_name, and a table of strings. It tries to be lightweight, so you can use it everywhere you would use a gsl_matrix or a gsl_vector.
For example, let us say that you are running a regression: there is a vector for the one dependent variable, and a matrix for the several independent variables. Think of them as a partitioned matrix, where the vector is column -1, and the first column of the matrix is column zero. Here is some code to print the entire matrix. Notice that the column counter i starts counting at -1.
for (j = 0; j< data->matrix->size1; j++){ printf("%s\t", apop_name_get(data->names, j, 'r')); for (i = -1; i< data->matrix->size2; i++) printf("%g\t", apop_data_get(data, j, i)); printf("\n"); }
We're generally assuming that the data vector and data matrix have the same row count: data->vector->size==data->matrix->size1 . This means that the apop_name structure doesn't have separate vector_names and row_names elements: the rownames are assumed to apply for both.
The apop_data set includes a more pointer, which will typically be NULL, but may point to another apop_data set. This is intended for a main data set and a second or third page with auxiliary information: estimated parameters on the front page and their covariance matrix on page two, or predicted data on the front page and a set of prediction intervals on page two. apop_data_copy and apop_data_free will handle all the pages of information. The more pointer is not intended as a linked list for millions of data points---you can probably find a way to restructure your data to use a single table (perhaps via apop_data_pack and apop_data_unpack).
Easy data manipulation is essential for enjoying life as a researcher. Thus, there are a great many functions to collate, copy, merge, sort, prune, and otherwise manipulate the apop_data structure and its components.
Note: Apophenia builds upon the GSL, but it would be inappropriate to redundantly replicate the GSL's documentation here. You will find a link to the full GSL documentation at the end of this outline, and the text of the outline will list the names a few common functions. The GSL's naming scheme is very consistent, so a simple reminder of the function name may be sufficient for your needs.
gsl_matrix_swap_rows (gsl_matrix * m, size_t i, size_t j) gsl_matrix_swap_columns (gsl_matrix * m, size_t i, size_t j) gsl_matrix_swap_rowcol (gsl_matrix * m, size_t i, size_t j) gsl_matrix_transpose_memcpy (gsl_matrix * dest, const gsl_matrix * src) gsl_matrix_transpose (gsl_matrix * m) : square matrices only gsl_matrix_set_all (gsl_matrix * m, double x) gsl_matrix_set_zero (gsl_matrix * m) gsl_matrix_set_identity (gsl_matrix * m) void gsl_vector_set_all (gsl_vector * v, double x) void gsl_vector_set_zero (gsl_vector * v) int gsl_vector_set_basis (gsl_vector * v, size_t i): set all elements to zero, but set item
to one. gsl_vector_reverse (gsl_vector * v): reverse the order of your vector's elements
Alloc/freeSee also:
gsl_vector * gsl_vector_alloc (size_t n) gsl_vector * gsl_vector_calloc (size_t n) void gsl_vector_free (gsl_vector * v) gsl_matrix_memcpy (gsl_matrix * dest, const gsl_matrix * src)
Using viewsYou may need to pull a row of a matrix as a distinct vector. The GSL makes it easy to do so, and Apophenia provides several convenience functions to save any remaining hassle. The data in the new vector will point to the base data, not copy it. Some examples:
apop_data *d = apop_query_to_data("select obs1, obs2, obs3 from a_table"); gsl_matrix *m = d->matrix; //Get a column using its name Apop_col_t(d, "obs1", ov); double obs1_sum = apop_vector_sum(ov); //Get a row using its index Apop_row(d, 0, v); double first_row_sum = apop_vector_sum(v); //First column's sum, matrix format Apop_matrix_col(m, 0, mv); double first_col_sum = apop_vector_sum(mv); //Pull a ten by five submatrix, whose first element is the (2,3)rd //element of the parent data set's matrix Apop_submatrix(d, 2,3, 10,5, subm); double first_col_sum = apop_matrix_sum(subm);
These macros make use of a set of GSL matrices that produce output of type gsl_matrix_view and gsl_vector_view. These types point to the source data, and add metadata to turn the data into a coherent matrix/vector. Apophenia's macros generate these views, then pull the matrix/vector from them, so you never have to deal with the _view structs directly.
The Apop_data_rows and Apop_data_row macros are intended to be a view of one or more rows of an apop_data set.
If the macros don't work for you, you can use the GSL's macros directly. Here's how to get the fifth row of a_matrix into a vector view:
gsl_vector_view v; v = gsl_matrix_col(a_matrix, 4);
For rows, use gsl_matrix_row(a_matrix, n). The vector view is a data structure which includes an element of type gsl_vector named vector; this is the only element you will be interested in. The expression &(v.vector) is of type gsl_vector *, and therefore can be used as you would any other pointer to a gsl_vector. For example, try apop_mean(&(v.vector));.
The view is intended to be a common variable, not a pointer. If you want to retain the data after the function exits, copy it to another vector:
gsl_vector_view v; gsl_vector *a_new_vector = gsl_vector_alloc(a_matrix->size1); v = gsl_matrix_col(a_matrix, 4); gsl_vector_memcpy(a_new_vector, &(v.vector));
What would happen if you return &(v.vector) to a calling function? The calling function now has a pointer to the space where v.vector had been, but because v is an automatically allocated variable, that address may now hold garbage.
One more reminder: curly braces delimit scope. These macros work by generating a number of local variables, which you may be able to see in your debugger. When program evaluation exits a given block, all variables in that block are erased. Here is some sample code that won't work:
if (get_odd){ Apop_data_row(data, 1, outdata); } else { Apop_data_row(data, 0, outdata); } apop_data_show(outdata); //breaks: no outdata in scope.
For this if/then statement, there are two sets of local variables generated: one for the if block, and one for the then block. By the last line, neither exists. You can get around the problem here by making sure to not put the macro declaring new variables in a block. E.g.:
Apop_data_row(data, get_odd ? 1 : 0, outdata); apop_data_show(outdata);
This is a general rule about how variables declared in blocks will behave, but because the macros obscure the variable declarations, it is especially worth watching out for here.
Set/getThe set/get functions can act on both names or indices. Sample usages:
double twothree = apop_data_get(data, 2, 3); //just indices apop_data_set(data, .rowname="A row", .colname="this column", .val=13); double AIC = apop_data_get(data, .rowname="AIC", .col=-1, .page="Info");
If you are new to Apophenia and reading this outline sequentially, you may be surprised to see that these function calls are ISO standard-compliant C. See the notes on Designated initializers for a full explanation.
See also:
double gsl_matrix_get (const gsl_matrix * m, size_t i, size_t j) double gsl_vector_get (const gsl_vector * v, size_t i) void gsl_matrix_set (gsl_matrix * m, size_t i, size_t j, double x) void gsl_vector_set (gsl_vector * v, size_t i, double x) double * gsl_matrix_ptr (gsl_matrix * m, size_t i, size_t j) double * gsl_vector_ptr (gsl_vector * v, size_t i) const double * gsl_matrix_const_ptr (const gsl_matrix * m, size_t i, size_t j) const double * gsl_vector_const_ptr (const gsl_vector * v, size_t i) gsl_matrix_get_row (gsl_vector * v, const gsl_matrix * m, size_t i) gsl_matrix_get_col (gsl_vector * v, const gsl_matrix * m, size_t j) gsl_matrix_set_row (gsl_matrix * m, size_t i, const gsl_vector * v) gsl_matrix_set_col (gsl_matrix * m, size_t j, const gsl_vector * v)
Map/apply These functions allow you to send each element of a vector or matrix to a function, either producing a new matrix (map) or transforming the original (apply). The ..._sum functions return the sum of the mapped output.
There is an older and a newer set of functions. The older versions, which act on gsl_matrixes or gsl_vectors have more verbose names; the newer versions, which act on the elements of an apop_data set, use the Designated initializers syntax to ad a few options and a more brief syntax.
You can do many things quickly with these functions.
Get the sum of squares of a vector's elements:
double sum_of_squares = apop_map_sum(dataset, gsl_pow_2); //given <tt> apop_data *dataset</tt> double sum_of_squares = apop_vector_map_sum(v, gsl_pow_2); //given <tt> gsl_vector *v</tt>
Here, we create an index vector [0, 1, 2,
].
double index(double in, int index){return index;} apop_data *d = apop_data_alloc(100, 0, 0); apop_map(d, .fn_di=index, .inplace='y');
Given your log likelihood function and a data set where each row of the matrix is an observation, find the total log likelihood:
static double your_log_likelihood_fn(const gsl_vector * in){[your math goes here]} double total_log_likelihood = apop_matrix_map_sum(dataset, your_log_likelihood_fn);
How many missing elements are there in your data matrix?
static double nan_check(const double in){ return isnan(in);} int nan_count_data = apop_map_sum(in, nan_check, .part='m');
Get the mean of the not-NaN elements of a data set:
static double no_nan_val(const double in){ return isnan(in)? 0 : in;} static double nan_check(const double in){ return isnan(in);} static double apop_mean_no_nans(apop_data *in){ return apop_map_sum(in, no_nan_val)/apop_map_sum(in, nan_check); }
The following program randomly generates a data set where each row is a list of numbers with a different mean. It then finds the
statistic for each row, and the confidence with which we reject the claim that the statistic is less than or equal to zero.
Notice how the older apop_vector_apply uses file-global variables to pass information into the functions, while the apop_map uses a pointer to the constant parameters to input to the functions.
#include <apop.h> double row_offset; gsl_rng *r; void offset_rng(double *v){*v = gsl_rng_uniform(r) + row_offset;} double find_tstat(gsl_vector *in){ return apop_mean(in)/sqrt(apop_var(in));} double conf(double in, void *df){ return gsl_cdf_tdist_P(in, *(int *)df);} int main(){ apop_data *d = apop_data_alloc(0, 10, 100); r = apop_rng_alloc(3242); for (int i=0; i< 10; i++){ row_offset = gsl_rng_uniform(r)*2 -1; Apop_row(d, i, onerow); apop_vector_apply(onerow, offset_rng); } size_t df = d->matrix->size2-1; apop_data *means = apop_map(d, .fn_v = apop_vector_mean, .part ='r'); apop_data *tstats = apop_map(d, .fn_v = find_tstat, .part ='r'); apop_data *confidences = apop_map(tstats, .fn_dp = conf, .param = &df); printf("means:\t\t"); apop_data_show(means); printf("t stats:\t"); apop_data_show(tstats); printf("confidences:\t"); apop_data_show(confidences); }
One more toy example, demonstrating the use of the Apop_data_row :
/* This sample code sets the elements of a data set's vector to one if the index is even. Then, via the weights vector, it adds up the even indices. There is really no need to use the weights vector; this code snippet is an element of Apophenia's test suite, and goes the long way to test that the weights are correctly handled. */ double set_vector_to_even(apop_data * r, int index){ apop_data_set(r, 0, -1, 1 - (index %2)); return 0; } double set_weight_to_index(apop_data * r, int index){ gsl_vector_set(r->weights, 0, index); return 0; } double weight_given_even(apop_data *r){ return gsl_vector_get(r->vector, 0) ? gsl_vector_get(r->weights, 0) : 0; } void test_apop_map_row(){ apop_data *d = apop_data_alloc(100); d->weights = gsl_vector_alloc(100); apop_map(d, .fn_ri=set_vector_to_even, .inplace='y'); apop_map(d, .fn_ri=set_weight_to_index, .inplace='y'); double sum = apop_map_sum(d, .fn_r = weight_given_even); assert(sum == 49*25*2); }
Basic MathSee also:
int gsl_matrix_add (gsl_matrix * a, const gsl_matrix * b) int gsl_matrix_sub (gsl_matrix * a, const gsl_matrix * b) int gsl_matrix_mul_elements (gsl_matrix * a, const gsl_matrix * b) int gsl_matrix_div_elements (gsl_matrix * a, const gsl_matrix * b) int gsl_matrix_scale (gsl_matrix * a, const double x) int gsl_matrix_add_constant (gsl_matrix * a, const double x) gsl_vector_add (gsl_vector * a, const gsl_vector * b) gsl_vector_sub (gsl_vector * a, const gsl_vector * b) gsl_vector_mul (gsl_vector * a, const gsl_vector * b) gsl_vector_div (gsl_vector * a, const gsl_vector * b) gsl_vector_scale (gsl_vector * a, const double x) gsl_vector_add_constant (gsl_vector * a, const double x)
Matrix math
matrix, matrix
vector, or vector
matrix
Summary statsSee also:
double gsl_matrix_max (const gsl_matrix * m) double gsl_matrix_min (const gsl_matrix * m) void gsl_matrix_minmax (const gsl_matrix * m, double * min_out, double * max_out) void gsl_matrix_max_index (const gsl_matrix * m, size_t * imax, size_t * jmax) void gsl_matrix_min_index (const gsl_matrix * m, size_t * imin, size_t * jmin) void gsl_matrix_minmax_index (const gsl_matrix * m, size_t * imin, size_t * jmin, size_t * imax, size_t * jmax) gsl_vector_max (const gsl_vector * v) gsl_vector_min (const gsl_vector * v) gsl_vector_minmax (const gsl_vector * v, double * min_out, double * max_out) gsl_vector_max_index (const gsl_vector * v) gsl_vector_min_index (const gsl_vector * v) gsl_vector_minmax_index (const gsl_vector * v, size_t * imin, size_t * imax)
Moments
Conversion among types
Name handling
Text dataThe apop_data set includes a grid of strings, text, for holding text data.
#include <apop.h> int main(){ apop_query("create table data (name, city, state);" "insert into data values ('Mike Mills', 'Rockville', 'MD');" "insert into data values ('Bill Berry', 'Athens', 'GA');" "insert into data values ('Michael Stipe', 'Reckoning', 'GA');"); apop_data *tdata = apop_query_to_text("select name, city, state from data"); printf("Customer #1: %s\n\n", *tdata->text[0]); //equivalent to tdata->text[0][0] printf("The data, via apop_data_print:\n"); apop_data_print(tdata); //the text alloc can be used as a text realloc: apop_text_alloc(tdata, 1+tdata->textsize[0], tdata->textsize[1]); apop_text_add(tdata, *tdata->textsize-1, 0, "Peter Buck"); apop_text_add(tdata, *tdata->textsize-1, 1, "Atlanta"); apop_text_add(tdata, *tdata->textsize-1, 2, "GA"); printf("\n\nAugmented data, printed via for loop (for demonstration purposes):\n"); for (int i=0; i< tdata->textsize[0]; i++){ for (int j=0; j< tdata->textsize[1]; j++) printf("%s\t", tdata->text[i][j]); printf("\n"); } apop_data *states = apop_text_unique_elements(tdata, 2); char *states_as_list = apop_text_paste(states, .between=", "); printf("\n States covered: %s\n", states_as_list); }
Generating factorsFactor is jargon for a numbered category. Number-crunching programs work best on numbers, so we need a function to produce a one-to-one mapping from text categories into numeric factors.
A dummy is a variable that is either one or zero, depending on membership in a given group. Some methods (typically when the variable is an input or independent variable) prefer dummies; some methods (typically for outcome or dependent variables) prefer factors. The functions that generate factors and dummies will add an informational page to your apop_data set with a name like <categories for="" your_column>=""> listing the conversion from the artificial numeric factor to the original data.
DatabasesThese are convenience functions to handle interaction with SQLite or mySQL. They open one and only one database, and handle most of the interaction therewith for you.
You will probably first use apop_text_to_db to pull data into the database, then apop_query to clean the data in the database, and finally apop_query_to_data to pull some subset of the data out for analysis.
See also the Conversion functions, including apop_text_to_db and apop_matrix_to_db.
Out
InSee the print functions below. By setting apop_opts.output_type = 'd', apop_data sets (or gsl_matrixes and gsl_vectors) are `printed' to the database.
Models
IntroductionBegin with the most common use: the estimate function will estimate the parameters of your model. Just prep the data, select a model, and produce an estimate:
apop_data *data = read_in_data(); apop_model *the_estimate = apop_estimate(data, apop_probit); apop_model_show(the_estimate);
Along the way to estimating the parameters, most models also find covariance estimates for the parameters, calculate statistics like log likelihood, and so on, which the final print statement will show.
More estimation outputA call to apop_estimate produces more than just the estimated parameters. Most will produce any of a covariance matrix, some hypothesis tests, a list of expected values, log likelihood, AIC, BIC, et cetera.
First, note that if you don't want all that, adding to your model an apop_parts_wanted_settings group with its default values signals to the model that you want only the parameters and to not waste CPU time on covariances, expected values, et cetera. See the apop_parts_wanted_settings documentation for examples and further refinements.
your_model->parameters.info group, and can be retrieved via a form like apop_data_get(your_model->info, .rowname="log likelihood"); //or apop_data_get(your_model->info, .rowname="AIC");
apop_data *cov = apop_data_get_page(your_model->parameters, "<Covariance>");
apop_data *predict = apop_data_get_page(your_model->info, "<Predicted>");
But we expect much more from a model than just estimating parameters from data.
Continuing the above example where we got an estimated Probit model named the_estimate, we can interrogate the estimate in various familiar ways. In each of the following examples, the model object holds enough information that the generic function being called can do its work:
apop_data *expected_value = apop_predict(NULL, the_estimate); double density_under = apop_cdf(expected_value, the_estimate); gsl_rng *rng = apop_rng_alloc(234211); apop_data *draws = apop_data_alloc(100, the_estimate->data->matrix->size1); for(int i=0; i< 100; i++){ Apop_row(draws, i, onerow); apop_draw(&onerow->data, rng, the_estimate); }
Apophenia ships with many well-known models for your immediate use, including probability distributions, such as the apop_normal, apop_poisson, or apop_beta models. The data is assumed to have been drawn from a given distribution and the question is only what distributional parameters best fit; e.g., assume the data is Normally distributed and find the mean and variance: apop_estimate(your_data, apop_normal).
There are also linear models like apop_ols, apop_probit, and apop_logit (where those last two are multinomial or binomial as the input data dictates). As in the example, they are on equal footing with the distributions, so nothing keeps you from making random draws from an estimated linear model.
Simulation models seem to not fit this form, but you will see below that if you can write an objective function for the p method of the model, you can use the above tools. Notably, you can estimate parameters via maximum likelihood and then give confidence intervals around those parameters.
But some models have to break uniformity, like how a histogram has a list of bins that makes no sense for a Normal distribution. These are held in settings groups, which you will occasionally need to tweak to modify how a model is handled or estimated. The most common example would be for maximum likelihood,
//Probit uses mle, so redo the estimation using Newton's Method Apop_model_add_group(the_estimate, apop_mle, .verbose='y', .tolerance=1e-4, .method=APOP_RF_NEWTON); apop_model *re_est = apop_estimate(data, the_estimate);
See below for the details of using settings groups.
Where to from here? You will find a list of the canned models already available below, along with a list of basic functions that make use of them. If your model is there, then you are encouraged to not delve into the internal details of the model structure. [For example, the apop_probit structure has an apop_probit.estimate function, but don't call it directly. Always use apop_estimate and let it call the model-internal method.]
But if you do need to write a new model, then you will need to know more about the internals of the apop_model, so see the sections on writing your own methods and settings groups below.
Models that ship with ApopheniaThis is a partial list---the full list is on the Models page.
Distributions
GLM family
More
Basic object methods
Methods for computation
Writing your ownAs above, users are encouraged to always interrogate models via the helper functions, like apop_estimate or apop_cdf. The helper functions do some boilerplate error checking, and are where the defaults are called: if your model has a log_likelihood method but no p method, then apop_p will use exp(log_likelihood). If you don't give an estimate method, then apop_estimate will call apop_maximum_likelihood.
So the game in writing a new model is to write just enough internal methods to give the helper functions what they need. In the not-uncommon best case, all you need to do is write a log likelihood function, and the model framework described to this point gets filled in via defaults.
Here is how one would set up a model that could be estimated using maximum likelihood:
double apop_new_log_likelihood(apop_data *data, apop_model *m)
data is the input data, and m is the parametrized model (i.e. your model with a parameters element set by the caller). This function will return the value of the log likelihood function at the given parameters.
Setting constraints on how to set them. Otherwise, no constraints will be assumed.apop_model your_new_model = {"The Me distribution", .vbase=n0, .mbase1=n1, .mbase2=n2, .dbase=nd, .estimate = new_estimate, .log_likelihood = new_log_likelihood };
The parameter counts are needed for procedures that auto-allocate the parameters; for the above example, they will call newest->parameters = apop_data_alloc(n0, n1, n2). The dbase is the size of a data element, and is needed to let those methods that make random draws internally know how much space to allocate. It's common to have [(the number of columns in your data set) -1] parameters; this count will be filled in if you specify -1 for vbase or mbase(1|2). If the allocation is exceptional in a different way, then you will need to allocate parameters via a prep method.
You already have enough that something like
apop_model *estimated = apop_mle(your_data, your_new_model);
will work.
void apop_new_dlog_likelihood(apop_data *d, gsl_vector *gradient, apop_model *m){ //some algebra here to find df/dp0, df/dp1, df/dp2.... gsl_vector_set(gradient,0, d_0); gsl_vector_set(gradient,1, d_1); }
The apop_model is mostly functions, with few standard-across-models data elements. If your functions carry around specific data, then you will have to write a companion structure and read the instructions on initializing/copying/freeing settings groups below.
Model settings[For info on specific settings groups and their contents and use, see the Settings page.]
Intro for model usersDescribing a statistical, agent-based, social, or physical model in a standardized form is difficult because every model has significantly different settings. E.g., an MLE requires a method of search (conjugate gradient, simplex, simulated annealing), and a histogram needs the number of slots to be filled with data.
So, the apop_model includes a single list which can hold an arbitrary number of groups of settings, like the search specifications for finding the maximum likelihood, a histogram for making random draws, and options about the model type.
Settings groups are automatically initialized with default values when needed. If the defaults do no harm, then you don't need to think about these settings groups at all.
Here is an example where a settings group is worth tweaking: the apop_parts_wanted_settings group indicates which parts of the auxiliary data you want.
1 apop_model *m = apop_model_copy(apop_ols); 2 Apop_model_add_group(m, apop_parts_wanted, .covariance='y'); 3 apop_model *est = apop_estimate(data, *m);
Line one establishes the baseline form of the model. Line two adds a settings group of type apop_parts_wanted_settings to the model. By default other auxiliary items, like the expected values, are set to 'n' when using this group, so this specifies that we want covariance and only covariance. Having stated our preferences, line three does the estimation we want.
Notice that we don't need the _settings ending to the settings group's name---macros make it happen. Apop_model_add_group follows the Designated initializers syntax of the form .setting=value.
Settings groups are copied with the model, which facilitates chaining estimations. Continuing the above example, you could re-estimate to get the predicted values and covariance via:
Apop_settings_set(est, apop_parts_wanted, predicted, 'y'); apop_model *est2 = apop_estimate(data, *est);
gdb that will help you pull a settings group out of a model for your inspection; it shouldn't be too difficult to modify this macro for other debuggers.define get_group
set $group = ($arg1_settings *) apop_settings_get_grp( $arg0, "$arg1", 0 )
p *$group
end
document get_group
Gets a settings group from a model.
Give the model name and the name of the group, like
get_group my_model apop_mle
and I will set a gdb variable named $group that points to that model, which you can use
like any other pointer. For example, print the contents with
p *$group
The contents of $group are printed to the screen as visible output to this macro.
end
For just using a model, that's all of what you need to know.
Writing new settings groupsFirst, there's a lightweight method of storing sundry settings, so in many cases you can bypass all of the following. The apop_model structure has a void pointer named more which you can use to store extra information as needed. If more_size is larger than zero (i.e. you set it to your_model.more_size=sizeof(your_struct)), then it will be copied via memcpy by apop_model_copy, and freed by apop_model_free. Apophenia's estimation routines will never impinge on this item, so do what you feel with it.
The remainder of this section describes the information you'll have to provide to make use of the conveniences described to this point: initialization of defaults, smarter copying and freeing, and adding to an arbitrariy long list of settings groups attached to a model. You will need four items: the structure itself, plus init, copy, and free functions. This is the sort of boilerplate that will be familiar to users of object oriented languages in the style of C++ or Java, but it's really a list of arbitrarily-typed elements, which makes this feel more like LISP. [And being a reimplementation of an existing feature of LISP, this section will be macro-heavy.]
ysg_settings (Your Settings Group), with a dataset, its two sizes, and an owner-of-data marker: typedef struct { int size1, size2, owner; apop_data *dataset; } ysg_settings; Apop_settings_declarations(ysg)
The structure itself gets the full name, ysg_settings. Everything else is a macro, and so you need only specify ysg, and the _settings part is implied. Because of these macros, your struct name must end in _settings.
If you have an especially simple structure, then you can declare the three functions with these three macros in your code:
Apop_settings_init(ysg, ) Apop_settings_copy(ysg, ) Apop_settings_free(ysg, )
These macros generate appropriate functions to do what you'd expect: allocating the main structure, copying one struct to another, freeing the main structure. The spaces after the commas indicate that no special code gets added to the functions that these macros generate.
Now that initializing/copying/freeing of the structure itself is handled, the remainder of this section will be about how to add instructions for the struture internals, like data that is pointed to by the structure elements.
You'll never call these funtions directly; they are called by Apop_model_add_group, apop_model_free, and other model or settings-group handling functions.
NULL. In most cases, though, you will need a new line declaring a default for every element in your structure. There is a macro to help with this too. These macros will define for your use a structure named in, and an output pointer-to-struct named out. Continuing the above example: Apop_settings_init (ysg, Apop_assert(in.size1, "I need you to give me a value for size1. Stopping."); Apop_varad_set(size2, 10); Apop_varad_set(dataset, apop_data_alloc(out->size1, out->size2)); Apop_varad_set(owner, 1); )
Now, Apop_settings_add(a_model, ysg, .size1=100) would set up a group with a 100-by-10 data set, and set the owner bit to one.
//alternative one: mark that the copy doesn't own the data. //From the defaults, out->dataset is a copy of the in->dataset pointer. Apop_settings_copy (ysg, out->owner = 0; ) //alternative two: just copy the data. //owner will be 1, a value copied from the original. Apop_settings_copy (ysg, out->dataset = apop_data_copy(in->dataset); )
in for your use. Continuing the example (for either copying alternative): Apop_settings_free (ysg, if (in->owner) free(in->dataset); )
With those three macros in place, Apophenia will treat your settings group like any other.
Filtering & updatingIt's easy to generate new models that are variants of prior models. Bayesian updating, for example, takes in one apop_model that we call the prior, one apop_model that we call a likelihood, and outputs an apop_model that we call the posterior.
Tests & diagnosticsJust about any hypothesis test consists of a few common steps:
If the statistic is from a common form, like the parameters from an OLS regression, then the commonly-associated
test is probably thrown in.
Some tests, like ANOVA, produce a statistic using a specialized procedure, so Apophenia includes some functions, like apop_test_anova_independence and apop_test_kolmogorov, to produce the statistic and look up its significance level.
If you are producing a statistic that you know has a common form, like a central limit theorem tells you that your statistic is Normally distributed, then the convenience function apop_test will do the final lookup step of checking where your statistic lies on your chosen distribution.
See also the in-database tests above.
Monte Carlo methodsTo give another example of testing, here is a function that used to be a part of Apophenia, but seemed a bit out of place. Here it is as a sample:
// Input: any old vector. Output: 1 - the p-value for a chi-squared // test to answer the question, "with what confidence can I reject the // hypothesis that the variance of my data is zero?" double apop_test_chi_squared_var_not_zero(const gsl_vector *in){ Apop_assert_c(in, 0, 0, "input vector is NULL. Doing nothing."); gsl_vector *normed; double sum=0; apop_vector_normalize((gsl_vector *)in,&normed, 1); gsl_vector_mul(normed,normed); for(size_t i=0;i< normed->size; sum +=gsl_vector_get(normed,i++)); gsl_vector_free(normed); return gsl_cdf_chisq_P(sum,in->size); }
Or, consider the Rao statistic,
where
is your model's likelihood function and
its information matrix. In code:
apop_data * infoinv = apop_model_numerical_covariance(data, your_model); apop_data * score; apop_score(data, score->vector, your_model); apop_data * stat = apop_dot(apop_dot(score, infoinv), score);
Given the correct assumptions, this is
, where
is the dimension of
, so the odds of a Type I error given the model is:
double p_value = apop_test(stat, "chi squared", beta->size);
A data set can be read as an empirical model, where each row of the data set has some odds of being drawn. The apop_pmf model wraps a apop_data set so that it can be used as a model in this manner. In fact, when you call apop_estimate(your_data, apop_pmf), all that really happens is that the data set is attached to the returned copy of the model (plus some minor details).
You might want to clean up the data before turning it into a PMF. For example...
apop_data_pmf_compress(your_data); apop_data_sort(your_data); apop_vector_normalize(your_data->weights); apop_model *a_pmf = apop_estimate(your_data, apop_pmf);
...would remove all duplicates, and set the weights column to reflect the original number of copies for each observation, then sort the data, then turn it into a PMF. Compression produces a corresponding improvement in efficiency when calculating CDFs or making draws. Sorting is not necessary for making draws or getting a likelihood or log likelihood.
It is the weights vector that holds the density represented by each row; the rest of the row represents the coordinates of that density. If the input data set has no weights segment, then I assume that all rows have equal weight. So, for example, if you find a need to normalize the data into a sums-to-one PMF, then you can use apop_vector_normalize on the weights vector to implement that. [But notice that even the PMF's draw function doesn't need this.]
Using apop_data_pmf_compress puts the data into one bin for each unique value in the data set. You may instead want bins of fixed with, in the style of a histogram. A binspec has as many columns as the data set being binned, and has one or two rows. The first row gives a bin width for the given column, so each column may have a different bin width. The second row, if present, gives the offset from zero for the bins. All bins, both above and below zero, are shifted accordingly. If there is no second row, the offset is zero.
Send this binspec to apop_data_to_bins. If you send a NULL binspec, then the offset is zero and the bin size is big enough to ensure that there are
bins from minimum to maximum. There are other preferred formulae for bin widths that minimize MSE, which might be added as a future extension. The binspec will be added as a page to the data set, named "<binspec>"
There are a few ways of testing the claim that one distribution equals another, typically an empirical PMF versus a smooth theoretical distribution. In both cases, you will need two distributions based on the same binspec.
For example, if you do not have a prior binspec in mind, then you can use the one generaged by the first call to the histogram binning function to make sure that the second data set is in sync:
apop_data_to_bins(first_set, NULL); apop_data_to_bins(second_set, apop_data_get_page(first_set, "<binspec>"));
If you have a model and data, use the apop_histogram_model_reset convenience function to make random draws and bin them appropriately. Once you have two distributions so formed, you can use apop_test_kolmogorov or apop_histograms_test_goodness_of_fit to generate the appropriate statistics from the pairs of bins.
Kernel density estimation will produce a smoothed PDF. See apop_kernel_density for details. Or, use apop_histogram_moving_average for a simpler smoothing method.
GSL histogramsThe GSL provides a gsl_histogram structure, that produces a PDF by accumulating data into bins. Apophenia wraps this into its apop_model struct, so that the model-family machinery can be applied to Bayesian updating, kernel smoothing, or other methods that could take in a PDF. To produce a GSL histogram-based PMF from your_data, use apop_estimate(your_data, apop_histogram).
The GSL histogram is one-dimensional. The GSL also provides a two-dimensional version, but this is still short of the
dimensions of the typical data set. Also, especially for larger dimensions, we need a sparse representation, while the GSL histogram structure choses the speed of a complete representation over the scalability of a sparse representation. For these and a few technical reasons, the apop_histogram model is considered to be deprecated---for most purposes, you are better off using the apop_pmf, which is at this point better supported.
As an aside, the GSL provides a closely-related second structure that incrementally sums up the PMF's bins to produce a CMF. The CMF can be used to map from a draw from a Uniform[0,1] to a draw from the PMF. Because it can be used to draw from the PMF, the GSL calls this the gsl_histogram_pdf structure. That's right: the data in the gsl_histogram_pdf structure is a cumulative sum---a CMF. This is handled internally by the apop_histogram, but you can retrieve it if need be.
As with the apop_pmf, if you want to compare this histogram to other data (observed or theoretical), then you'll need to produce a second synced histogram. You can use apop_histogram_model_reset to make random draws from a model and bin them. Producing a second histogram with matching bins is not as easy as just re-calling apop_data_to_bins with the same binspec, so there is the apop_histogram_vector_reset function to produce a second synced apop_histogram.
Here are some functions to deal with both the apop_pmf and apop_histogram (most handle both). See also the GSL documentation, linked at the bottom of this outline, for other uses of the GSL's histogram structure.
statistic
Maximum likelihood methodsThis section includes some notes on the maximum likelihood routine, but if you read the section on writing models, then you already know how to do maximum likelihood on exotic setups. Just write a model that has a p or log_likelihood function, and call apop_estimate(your_data,your_model). The default estimation routine is maximum likelihood.
If you are a not a statistician, then there are a few things you will need to keep in mind:
NULL data, and will be optimizing the parameters internal to the model.
, where the scaling factor
is fixed ahead of time, say at 100. typedef struct { double scaling; } coeff_struct; double banana (double *params, coeff_struct *in){ return (gsl_pow_2(1-params[0]) + in->scaling*gsl_pow_2(params[1]-gsl_pow_2(params[0]))); }
double ll (apop_data *d, apop_model *in){ return - banana(in->parameters->vector->data, in->more); } int main(){ coeff_struct co = {.scaling=100}; apop_model b = {"Bananas!", .log_likelihood= ll, .vbase=2, .more = &co, .more_size=sizeof(coeff_struct)};
.vbase=2 specified that your parameters are a vector of size two, which means that in->parameters->vector->data is the list of doubles that you should send to banana. The more element of the structure is designed to hold any arbitrary structure; if you use it, you will also need to use the more_size element, as above. Notice that main in this example makes heavy use of standard C designated initializers, because they are wonderful; you may need to tell your compiler to make use of them, such as using the --std=gnu99 flag on the gcc command line..p or the .log_likelihood element of your model, and there is really only one difference: if you give me a .p function, I may take its log somewhere along the way. Thus, the value of a .p function must always be non-negative.Apop_model_add_group(your_model, apop_parts_wanted);
#include <apop.h> typedef struct { double scaling; } coeff_struct; double banana (double *params, coeff_struct *in){ return (gsl_pow_2(1-params[0]) + in->scaling*gsl_pow_2(params[1]-gsl_pow_2(params[0]))); } double ll (apop_data *d, apop_model *in){ return - banana(in->parameters->vector->data, in->more); } int main(){ coeff_struct co = {.scaling=100}; apop_model b = {"¡Bananas!", .log_likelihood= ll, .vbase=2, .more = &co, .more_size=sizeof(coeff_struct)}; Apop_model_add_group(&b, apop_mle, .verbose='y', .method=APOP_SIMPLEX_NM); Apop_model_add_group(&b, apop_parts_wanted); apop_model *e1=apop_estimate(NULL, b); apop_model_print(e1); Apop_settings_set(&b, apop_mle, method, APOP_CG_BFGS); apop_model *e2=apop_estimate(NULL, b); apop_model_print(e2); gsl_vector *one = apop_vector_fill(gsl_vector_alloc(2), 1, 1); assert(apop_vector_distance(e1->parameters->vector, one) < 1e-2); assert(apop_vector_distance(e2->parameters->vector, one) < 1e-2); }
The problem is that the parameters of a function must not take on certain values, either because the function is undefined for those values or because parameters with certain values would not fit the real-world problem.
The solution is to rewrite the function being maximized such that the function is continuous at the constraint boundary but takes a steep downward slope. The unconstrained maximization routines will be able to search a continuous function but will never return a solution that falls beyond the parameter limits.
If you give it a likelihood function with no regard to constraints plus an array of constraints, apop_maximum_likelihood will combine them to a function that fits the above description and search accordingly.
A constraint function must do three things:
The idea is that if the constraint returns zero, the log likelihood function will return the log likelihood as usual, and if not, it will return the log likelihood at the constraint's return vector minus the penalty. To give a concrete example, here is a constraint function that will ensure that both parameters of a two-dimensional input are both greater than zero:
static double beta_zero_greater_than_x_constraint(apop_data *data, apop_model *v){ //constraint is 0 < beta_2 static apop_data *constraint = NULL; if (!constraint){ constraint= apop_data_calloc(1,1,1); apop_data_set(constraint, 0, 0, 1); } return apop_linear_constraint(v->parameters->vector, constraint, 1e-3); }
Missing data
Legible outputMost of the printing options are intended to operate in different modes (as appropriate). The idea is that you will probably working in one specific mode for a while---writing to screen, a file, or the database---and so can just set one global option reflecting this:
apop_opts.output_type = 's'; //Stdout apop_opts.output_type = 'f'; //named file apop_opts.output_type = 'p'; //a pipe or already-opened file apop_opts.output_type = 'd'; //the database
C makes minimal distinction between pipes and files, so you can set a pipe or file as output and send all output there until further notice:
apop_opts.output_type = 'p'; apop_opts.output_pipe = popen("gnuplot", "w"); apop_plot_lattice(...); fclose(apop_opts.output_pipe); apop_opts.output_pipe = fopen("newfile", "w"); apop_data_print(set1); fprintf(apop_opts.output_pipe, "\nNow set 2:\n"); apop_data_print(set2);
Continuing the example, you can always override the global data with a specific request:
apop_vector_print(v, "vectorfile"); //put vectors in a separate file apop_matrix_print(m, "matrix_table", .output_type = 'd'); //write to the db apop_matrix_print(m, .outpipe = stdout); //now show the same matrix on screen
I will first look to the input file name, then the input pipe, then the global output_pipe, in that order, to determine to where I should write. Some combinations (like output type = 'd' and only a pipe) don't make sense, and I'll try to warn you about those.
What if you have too much output and would like to use a pager, like less or more? In C and POSIX terminology, you're asking to pipe your output to a paging program. Here is the form:
FILE *lesspipe = popen("less", "w"); assert(lesspipe); apop_data_print(your_data_set, .output_pipe=lesspipe); pclose(lesspipe);
popen will search your usual program path for less, so you don't have to give a full path.
The plot functions produce output for Gnuplot (so output type = 'd' again does not make sense). As above, you can pipe directly to Gnuplot or write to a file.
AssortedSome more descriptive methods:
General utilities:
Math utilities:
DeprecatedThese functions will probably disappear or be replaced soon.
.output_type='d' Apop_settings_add_group Use Apop_model_add_group apop_settings_init macro
Further referencesFor your convenience, here are links to some other libraries you are probably using.