![]() |
|
Expand all | Collapse all
Overview and PreliminariesAs well as the information in this outline, there is a separate page covering the details of setting up a computing environment and another page with some sample code for your perusal.
A quick overviewA typical data analysis project using Apophenia's tools would go something like this:
apop_estimate (data_set, apop_OLS)
apop_estimate (data_set, apop_probit)
Over the course of this, our researcher had to parse text, use a database, manipulate matrices, fit statistical models, and then test the results by looking up a summary statistic in a table of distributions. Having hooks to all of these things in one library makes the flow much more pleasant.
For a concrete example of folding Apophenia into a project, have a look at this sample program.
Of course, you may not need all this: you might just want a neat interface to SQLite, or an easy-to-use structure that holds your data in a matrix and allows you to label the rows and columns. Have a look at the library outline for the various categories of relatively standalone functions that you can fold into your own work.
If you would like a more in-depth discussion of Apophenia's raison d'etre and logic, have a look at these design notes.
The typical stats package workflow v Apophenia's workflowFrom the quick overview section above, you saw that (1) the library expects you to keep your data in a database, pulling out the data as needed, and (2) that a great deal of statistical information, and the workflow, is encapsulated in apop_model structures.
Starting with (2), some stats packages have model structures that typically work very differently from those here. The key conceptual difference is in breadth, which implies a practical difference in how data is handled.
If a stats package has something called a model, then it is probably of the form Y = [a function of X], such as
. Trying new models means trying different functional forms for the right-hand side, such as including
in some cases and excluding it in others. Conversely, Apophenia works to facilitate trying new models in the sense of switching out a linear model for a hierarchical, or a Bayesian model for a simulation. A formula syntax makes little sense over such a broad range of models.
As a result, Apophenia does not put the form of the left-hand side into the model; the data is assumed to be correctly formatted, scaled, or logged before being passed to the model. This is where part (1), the database, comes in.
Many stats packages were initially designed around data literally being provided in decks of punchchards, and still seem to assume that the data is basically to be taken as read-only. With a database, the GSL, and the various tools provided by Apophenia, you have more than enough tools to make your data look how you want it to. When you do want to try swapping out different formula specifications, the database provides a proxy for the sort of formula specification language above:
apop_data *testme= apop_query_to_data("select y, x1, log(x2), pow(x3,2) from data"); apop_model *est = apop_estimate(testme, apop_ols);
[As an aside, there is a simple principle to the ordering of arguments to functions like apop_estimate : the data always comes first.]
Generating factors and dummies is also considered data prep, not model internals. See apop_data_to_dummies and apop_text_to_factors.
Now that you have an estimated model (which holds the estimated parameters), you can interrogate it.
//If you have a new data set, you can fill in predicted values: apop_predict(new_data_set, est); apop_data_show(new_data_set) //Make random draws. The draw function needs an RNG from //the GNU Scientific Library, and a pointer-to-\c double. gsl_rng *r = apop_rng_alloc(218); for (int i=0; i< matrix->size1; i++){ Apop_matrix_col(matrix, i, one_col); apop_draw(one_col->data, r, est); }
So that's the intended workflow: first, gather the data, probably via the database, pull the data in the format you want it; then, set up your model, estimate and interrogate it. [See also the model settings section below on optional details of model setup.]
Some notes on C and Apophenia's use of C utilities.
Learning CModeling with Data has a full tutorial for C, oriented at users of standard stats packages. More nuts-and-bolts tutorials are in abundance. Some people find pointers to be especially difficult; fortunately, there's a claymation cartoon which clarifies everything.
Coding often relies on gathering together many libraries; there is a section at the bottom of this outline linking to references for some libraries upon which Apophenia builds.
Usage notesHere are some notes about the technical details of using the Apophenia library in your development environment.
Header aggregation
If you put
#include <apop.h>
at the top of your file, then it will call virtually every header file you could need: gsl_matrix.h, gsl_blas.h, sqlite3.h, stdio.h, string.h, math.h, apophenia_all_of_them.h, et cetera. Of course, if you get `implicit declaration of...' then you will need to manually include something else. Bear in mind that every book on C will tell you this is bad form and you shouldn't do it.
Easier calling syntax
Several functions (but nowhere near all) use a script-like syntax for calling functions. For example, the apop_vector_distance function finds the distance between two vectors using various metrics. The full function call, using an
norm for vectors
and
would be
apop_vector_distance(v1, v2, 'L', 3);
But some metrics don't need a number to be fully described, so you can leave that off. Here's the standard (Euclidean) distance:
apop_vector_distance(v1, v2, 'E');
Because Euclidean distance is so standard, it is assumed as the default if you don't specify a norm, so you can leave out the third input:
apop_vector_distance(v1, v2);
If you give only one vector, I'll assume the other is the zero vector, and thus return the length of the first vector. Here is the length of a vector using the Manhattan metric:
apop_vector_distance(v1, .metric='M');
See the Designated initializers page for details of this syntax.
LibrariesYour best bet is to write yourself a Makefile. If you don't want to use the sample Makefile, then here are some notes for the command line. When compiling, you will need to tell the compiler to use the Apophenia library. That is, you will need to call GCC with gcc -lapophenia (as well as the other usual flags). For example, gcc sample.c -lapophenia -lsqlite3 -lgsl -lgslcblas -o run_me -g will produce an output file named run_me from the input source code sample.c. It will include symbols for debugging (-g) and will correctly find the functions in the Apophenia, GSL, and SQLite libraries (-lapophenia -lgsl ...). Order matters in the linking list: the files a package depends on should be listed after the package. E.g., since sample.c depends on Apophenia, gcc sample.c -lapophenia will work, while gcc -lapophenia sample.c is likely to give you errors. Similarly, list -lapophenia before -lgsl, which comes before -lgslcblas.
DebuggingThe global variable apop_opts.verbose turns on some diagnostics, such as printing the query sent to the database engine (which is useful if you are substituting in many %ses). Just set apop_opts.verbose =1 when you want feedback and apop_opts.verbose=0 when you don't.
If you use gdb, you can define macros to use the pretty-printing functions on your data, which can be a significant help. Add these to your .gdbinit:
define pv
p apop_vector_show($arg0)
end
define pm
p apop_matrix_show($arg0)
end
define pd
p apop_data_show($arg0)
end
define pa
p *($arg0)@$arg1
end
Then just pd mydata to see all components of your data set neatly displayed, or pa myarray 5 to see the first five elements of your array.
Syntax highlightingIf your text editor supports syntax highlighting, there are a few types defined in the Apophenia and GSL headers which may be worth coloring. E.g., for vim, add the following two lines to /usr/share/vim/syntax/c.vim:
syn keyword cType gsl_matrix gsl_rng gsl_vector apop_data syn keyword cType apop_name apop_model
Other text editors have similar files to which you can add the above types.
The Python interfaceThe distribution includes a Python interface via the SWIG tool for bridging across languages.
Installation: You will need to have the SWIG and Python-development packages installed on your computer when compiling. From there,
./configure --enable-python make sudo make install
will produce the Python library and install it in Python's site-packages directory.
Sample script:
On the page on the apop_ols model, you will find a few lines of toy data and a sample program to run an OLS regression. Once you have set up the data file, you can use this Python rewrite of the OLS program:
from apop import * apop_text_to_db("data", "d", 0, 1, None) data = apop_query_to_data("select * from d") est = apop_estimate(data, avar.apop_ols) print est #And one last example of using a method of the apop_data object: print est.parameters.transpose()
apop_ols variable had to be called avar.apop_ols.apop_vector_skew(my_v), you would call my_v.skew(). If you want a list of methods, just use help(my_v) or any of the other familiar means of introspection.Here is another simple example, that copies a Python-side list into a matrix using apop_pylist_to_data, and then runs a Fisher Exact test on the matrix. If the list were one-dimensional (flat), then the data would be copied into the vector element of the returned apop_data structure.
from apop import * data = [[15, 13], [18, 19]] adata = apop_pylist_to_data(data) print "display the data" print adata.matrix result = apop_test_fisher_exact(adata) print result print "Or, just one value:" print result.get("p value", -1)
SQL, the syntax for querying databasesFor a reference, your best bet is the Structured Query Language reference for SQLite. For a tutorial; there is an abundance of tutorials online. The blog of Apophenia's author includes an entry about complementaries between SQL and matrix manipulation packages.
Apophenia currently supports two database engines: SQLite and mySQL. SQLite is the default, because it is simpler and generally more easygoing than mySQL, and supports in-memory databases.
You can switch to mySQL two ways: set apop_opts.db_engine = 'm', or set the environment variable APOP_DB_ENGINE=mysql. Otherwise, the system will use SQLite. Ideally, after you make this switch, you need make no other changes--- apop_query, apop_query_to_data, apop_table_exists, et cetera, will work as before.
Finally, Apophenia provides a few nonstandard SQL functions to facilitate math via database; see Database moments (plus pow()!).
The book versionApophenia co-evolved with Modeling with Data: Tools and Techniques for Statistical Computing. You can read about the book, or download a free PDF copy of the full text, at modelingwithdata.org.
If you are at this site, there is probably something there for you, including a tutorial on C and general computing form, SQL for data-handing, several chapters of statistics from various perspectives, and more details on working Apophenia.
As with many computer programs, the preferred manner of citing Apophenia is to cite its related book. Here is a BibTeX-formatted entry, which should be be easy to re-shape for other environments:
@book{klemens:modeling,
title = "Modeling with Data: Tools and Techniques for Statistical Computing",
author="Ben Klemens",
year=2008,
publisher="Princeton University Press"
}
Data setsThe apop_data structure represents a data set. It joins together a gsl_vector, a gsl_matrix, an apop_name, and a table of strings. It tries to be lightweight, so you can use it everywhere you would use a gsl_matrix or a gsl_vector.
For example, let us say that you are running a regression: there is a vector for the one dependent variable, and a matrix for the several independent variables. Think of them as a partitioned matrix, where the vector is column -1, and the first column of the matrix is column zero. Here is some code to print the entire matrix. Notice that the column counter i starts counting at -1.
for (j = 0; j< data->matrix->size1; j++){ printf("%s\t", apop_name_get(data->names, j, 'r')); for (i = -1; i< data->matrix->size2; i++) printf("%g\t", apop_data_get(data, j, i)); printf("\n"); }
We're generally assuming that the data vector and data matrix have the same row count: data->vector->size==data->matrix->size1 . This means that the apop_name structure doesn't have separate vector_names and row_names elements: the rownames are assumed to apply for both.
The apop_data set includes a more pointer, which will typically be NULL, but may point to another apop_data set. This is intended for a main data set and a second or third page with auxiliary information: estimated parameters on the front page and their covariance matrix on page two, or predicted data on the front page and a set of prediction intervals on page two. apop_data_copy and apop_data_free will handle all the pages of information. The more pointer is not intended as a linked list for millions of data points---you can probably find a way to restructure your data to use a single table (perhaps via apop_data_pack and apop_data_unpack).
Easy data manipulation is essential for enjoying life as a researcher. Thus, there are a great many functions to collate, copy, merge, sort, prune, and otherwise manipulate the apop_data structure and its components.
Note: Apophenia builds upon the GSL, but it would be inappropriate to redundantly replicate the GSL's documentation here. You will find a link to the full GSL documentation at the end of this outline, and the text of the outline will list the names a few common functions. The GSL's naming scheme is very consistent, so a simple reminder of the function name may be sufficient for your needs.
gsl_matrix_swap_rows (gsl_matrix * m, size_t i, size_t j) gsl_matrix_swap_columns (gsl_matrix * m, size_t i, size_t j) gsl_matrix_swap_rowcol (gsl_matrix * m, size_t i, size_t j) gsl_matrix_transpose_memcpy (gsl_matrix * dest, const gsl_matrix * src) gsl_matrix_transpose (gsl_matrix * m) : square matrices only gsl_matrix_set_all (gsl_matrix * m, double x) gsl_matrix_set_zero (gsl_matrix * m) gsl_matrix_set_identity (gsl_matrix * m) void gsl_vector_set_all (gsl_vector * v, double x) void gsl_vector_set_zero (gsl_vector * v) int gsl_vector_set_basis (gsl_vector * v, size_t i): set all elements to zero, but set item
to one. gsl_vector_reverse (gsl_vector * v): reverse the order of your vector's elements
Alloc/freeSee also:
gsl_vector * gsl_vector_alloc (size_t n) gsl_vector * gsl_vector_calloc (size_t n) void gsl_vector_free (gsl_vector * v) gsl_matrix_memcpy (gsl_matrix * dest, const gsl_matrix * src)
Using GSL ViewsYou may need to pull a row of a matrix as a distinct vector. The GSL makes it easy to do so, and Apophenia provides several convenience functions to save any remaining hassle. The data in the new vector will point to the base data, not copy it. Some examples:
apop_data *d = apop_query_to_data("select obs1, obs2, obs3 from a_table"); gsl_matrix *m = d->matrix; //Get a column using its name Apop_col_t(d, "obs1", ov); double obs1_sum = apop_vector_sum(ov); //Get a row using its index Apop_row(d, 0, v); double first_row_sum = apop_vector_sum(v); //First column's sum, matrix format Apop_matrix_col(m, 0, mv); double first_col_sum = apop_vector_sum(mv); //Pull a ten by five submatrix, whose first element is the (2,3)rd //element of the parent data set's matrix Apop_submatrix(d, 2,3, 10,5, subm); double first_col_sum = apop_matrix_sum(subm);
These macros make use of a set of GSL matrices that produce output of type gsl_matrix_view and gsl_vector_view. These types point to the source data, and add metadata to turn the data into a coherent matrix/vector. Apophenia's macros generate these views, then pull the matrix/vector from them, so you never have to deal with the _view structs directly.
If the macros don't work for you, you can use the GSL's macros directly. Here's how to get the fifth row of a_matrix into a vector view:
gsl_vector_view v; v = gsl_matrix_col(a_matrix, 4);
For rows, use gsl_matrix_row(a_matrix, n). The vector view is a data structure which includes an element of type gsl_vector named vector; this is the only element you will be interested in. The expression &(v.vector) is of type gsl_vector *, and therefore can be used as you would any other pointer to a gsl_vector. For example, try apop_mean(&(v.vector));.
The view is intended to be a common variable, not a pointer. If you want to retain the data after the function exits, copy it to another vector:
gsl_vector_view v; gsl_vector *a_new_vector = gsl_vector_alloc(a_matrix->size1); v = gsl_matrix_col(a_matrix, 4); gsl_vector_memcpy(a_new_vector, &(v.vector));
Set/getLet
be text, and
be an index. Then you can get or set an element by row title and column index, row index and column title, et cetera. The functions that have no suffix find an element by row and column index.
The apop_data_ptr_... functions return a pointer to the element.
See also:
double gsl_matrix_get (const gsl_matrix * m, size_t i, size_t j) double gsl_vector_get (const gsl_vector * v, size_t i) void gsl_matrix_set (gsl_matrix * m, size_t i, size_t j, double x) void gsl_vector_set (gsl_vector * v, size_t i, double x) double * gsl_matrix_ptr (gsl_matrix * m, size_t i, size_t j) double * gsl_vector_ptr (gsl_vector * v, size_t i) const double * gsl_matrix_const_ptr (const gsl_matrix * m, size_t i, size_t j) const double * gsl_vector_const_ptr (const gsl_vector * v, size_t i) gsl_matrix_get_row (gsl_vector * v, const gsl_matrix * m, size_t i) gsl_matrix_get_col (gsl_vector * v, const gsl_matrix * m, size_t j) gsl_matrix_set_row (gsl_matrix * m, size_t i, const gsl_vector * v) gsl_matrix_set_col (gsl_matrix * m, size_t j, const gsl_vector * v)
Map/apply Map/apply These functions allow you to send each element of a vector or matrix to a function, either producing a new matrix (map) or transforming the original (apply). The ..._sum functions return the sum of the mapped output.
There is an older and a newer set of functions. The older versions, which act on gsl_matrixes or gsl_vectors have more verbose names; the newer versions, which act on the elements of an apop_data set, use the Designated initializers syntax to ad a few options and a more brief syntax.
You can do many things quickly with these functions.
Get the sum of squares of a vector's elements:
double sum_of_squares = apop_map_sum(dataset, gsl_pow_2); //given <tt> apop_data *dataset</tt> double sum_of_squares = apop_vector_map_sum(v, gsl_pow_2); //given <tt> gsl_vector *v</tt>
Here, we create an index vector [0, 1, 2,
].
double index(double in, int index){return index;} apop_data *d = apop_data_alloc(100, 0, 0); apop_map(d, .fn_di=index, .inplace=1);
Given your log likelihood function and a data set where each row of the matrix is an observation, find the total log likelihood:
static double your_log_likelihood_fn(const gsl_vector * in){[your math goes here]} double total_log_likelihood = apop_matrix_map_sum(dataset, your_log_likelihood_fn);
How many missing elements are there in your data matrix?
static double nan_check(const double in){ return isnan(in);} int nan_count_data = apop_map_sum(in, nan_check, .part='m');
Get the mean of the not-NaN elements of a data set:
static double no_nan_val(const double in){ return isnan(in)? 0 : in;} static double nan_check(const double in){ return isnan(in);} static double apop_mean_no_nans(apop_data *in){ return apop_map_sum(in, no_nan_val)/apop_map_sum(in, nan_check); }
This program randomly generates a data set where each row is a list of numbers with a different mean. It then finds the
statistic for each row, and the confidence with which we reject the claim that the statistic is less than or equal to zero.
Notice how the older apop_vector_apply uses file-global variables to pass information into the functions, while the apop_map uses a pointer to the constant parameters to input to the functions.
#include <apop.h> double row_offset; gsl_rng *r; void offset_rng(double *v){*v = gsl_rng_uniform(r) + row_offset;} double find_tstat(gsl_vector *in){ return apop_mean(in)/sqrt(apop_var(in));} double conf(double in, void *df){ return gsl_cdf_tdist_P(in, *(int *)df);} int main(){ apop_data *d = apop_data_alloc(0, 10, 100); r = apop_rng_alloc(3242); for (int i=0; i< 10; i++){ row_offset = gsl_rng_uniform(r)*2 -1; Apop_row(d, i, onerow); apop_vector_apply(onerow, offset_rng); } size_t df = d->matrix->size2-1; apop_data *means = apop_map(d, .fn_v = apop_vector_mean, .part ='r'); apop_data *tstats = apop_map(d, .fn_v = find_tstat, .part ='r'); apop_data *confidences = apop_map(tstats, .fn_dp = conf, .param = &df); printf("means:\t\t"); apop_data_show(means); printf("t stats:\t"); apop_data_show(tstats); printf("confidences:\t"); apop_data_show(confidences); }
Basic MathSee also:
int gsl_matrix_add (gsl_matrix * a, const gsl_matrix * b) int gsl_matrix_sub (gsl_matrix * a, const gsl_matrix * b) int gsl_matrix_mul_elements (gsl_matrix * a, const gsl_matrix * b) int gsl_matrix_div_elements (gsl_matrix * a, const gsl_matrix * b) int gsl_matrix_scale (gsl_matrix * a, const double x) int gsl_matrix_add_constant (gsl_matrix * a, const double x) gsl_vector_add (gsl_vector * a, const gsl_vector * b) gsl_vector_sub (gsl_vector * a, const gsl_vector * b) gsl_vector_mul (gsl_vector * a, const gsl_vector * b) gsl_vector_div (gsl_vector * a, const gsl_vector * b) gsl_vector_scale (gsl_vector * a, const double x) gsl_vector_add_constant (gsl_vector * a, const double x)
Matrix math
matrix, matrix
vector, or vector
matrix
Summary statsSee also:
double gsl_matrix_max (const gsl_matrix * m) double gsl_matrix_min (const gsl_matrix * m) void gsl_matrix_minmax (const gsl_matrix * m, double * min_out, double * max_out) void gsl_matrix_max_index (const gsl_matrix * m, size_t * imax, size_t * jmax) void gsl_matrix_min_index (const gsl_matrix * m, size_t * imin, size_t * jmin) void gsl_matrix_minmax_index (const gsl_matrix * m, size_t * imin, size_t * jmin, size_t * imax, size_t * jmax) gsl_vector_max (const gsl_vector * v) gsl_vector_min (const gsl_vector * v) gsl_vector_minmax (const gsl_vector * v, double * min_out, double * max_out) gsl_vector_max_index (const gsl_vector * v) gsl_vector_min_index (const gsl_vector * v) gsl_vector_minmax_index (const gsl_vector * v, size_t * imin, size_t * imax)
Moments
Conversion among types
Name handling
Generating factorsFactor is jargon for a numbered category. Number-crunching programs work best on numbers, so we need a function to produce a one-to-one mapping from text categories into numeric factors.
A dummy is a variable that is either one or zero, depending on membership in a given group. Some methods (typically when the variable is an input or independent variable) prefer dummies; some methods (typically for outcome or dependent variables) prefer factors.
DatabasesThese are convenience functions to handle interaction with SQLite or mySQL. They open one and only one database, and handle most of the interaction therewith for you.
You will probably first use apop_text_to_db to pull data into the database, then apop_query to clean the data in the database, and finally apop_query_to_data to pull some subset of the data out for analysis.
See also the Conversion functions, including apop_text_to_db and apop_matrix_to_db.
Out
InSee the print functions below. By setting apop_opts.output_type = 'd', apop_data sets (or gsl_matrixes and gsl_vectors) are `printed' to the database.
Models
IntroductionA model intermediates between data, parameters, and likelihoods. [This definition readily includes "non-parametric" models.] When the parameters are fully specified, the model will give you a likelihood that an observation will occur. Much of statistical analysis consists of writing down a model, estimating its parameters, and running hypothesis tests to determine the confidence with which we can make statements about those parameters.
Given data, the model can estimate the parameters that best fit the data. This is the typical model estimation.
Given parameters, the model can generate the expected value of the data, or randomly generate a full artificial data set.
Apophenia facilitates all this via its apop_model objects. The model includes slots for log_likelihood, estimate, draw, expected_value, and other operations we expect from our models.
For example, a model may be a probability distribution, such as the apop_normal, apop_poisson, or apop_beta models. The data is assumed to have been drawn from a given distribution and the question is only what distributional parameters best fit; e.g., assume the data is Normally distributed and find the mean and variance: apop_estimate(data,apop_normal).
The design of the objects hopes to make it as easy as possible for you, dear reader, to write new models. For the most part, all you need to do is write a log likelihood function, and apop_maximum_likelihood does the rest; see below.
The most commonly-used function in the systems below is the apop_estimate function. It takes in a model and data, and outputs an apop_estimate, that includes the parameter estimates and the various auxiliary data that one may need to test the estimates, such as the variance-covariance matrix. For most users, the estimate function will be all one needs from a model. Just prep the data, select a model, and produce an estimate:
apop_data *data = read_in_data(); apop_model *the_estimate = apop_estimate(data, apop_probit); apop_model_show(the_estimate);
The internals
The apop_model struct breaks down into three parts:
estimate method. In the list at the top of this page, all of the functions whose last argument is a model are helper functions of this type. The helper functions do some boilerplate error checking, and mean that you don't have to fill in every blank in your model: if you have a log_likelihood method but no p method, then apop_p will use exp(log_likelihood). If you don't give an estimate method, then apop_estimate will use maximum likelihood.
Writing your ownWriting apop_model objects is easy, because (almost) everything has a default of some sort, so you can fill in those segments you know, and let the system use computationally-intensive methods for the rest.
For example, here is how one would set up a model for an MLE:
double apop_new_log_likelihood(apop_data *data, apop_model *m)
data is the input data, and m is the parametrized model (i.e. your model with a parameters element set by the caller). This function will return the value of the log likelihood function at the given parameters.apop_model your_new_model = {"The Me distribution", number_of_parameters, .estimate = new_estimate, .log_likelihood = new_log_likelihood };
number_of_parameters is probably a positive integer like 2, but it is often (the number of columns in your data set) -1, in which case, set number_of_parameters to -1.
You already have enough that something like
apop_model *estimated = apop_mle(your_data, your_new_model);
will work.
For many methods, the first thing you will write is the random number generator. Other elements, such as the gradient---
void apop_new_dlog_likelihood(apop_data *d, gsl_vector *gradient, apop_model *m){ //some algebra here to find df/dp0, df/dp1, df/dp2.... gsl_vector_set(gradient,0, d_0); gsl_vector_set(gradient,1, d_1); }
---or the expected value are optional.
Models that ship with ApopheniaA partial list---the full list is on the Models page.
Distributions
GLM family
More
Basic object methods
Methods for computation
Model settings[For info on specific settings groups and their contents and use, see the Settings page.]
Intro for model usersDescribing a statistical, agent-based, social, or physical model in a standardized form is difficult because every model has significantly different settings. E.g., an MLE requires a method of search (conjugate gradient, simplex, simulated annealing), and a histogram needs the number of slots to be filled with data.
So, the apop_model includes a single list, whose name is simply settings, which can hold an arbitrary number of groups of settings. For example, you can have a set of search specifications for finding the maximum likelihood, and a histogram for making random draws.
Settings groups are automatically initialized with default values when needed. If the defaults do no harm, then you don't need to think about these settings groups at all.
If you do need to tweak a setting, you will need its location. Think of each model having a row of baskets, such as the apop_mle_settings and the apop_histogram_settings baskets. To find a single setting, like the MLE's tolerance setting, you would need to give the model, which basket to look in, and the setting. E.g.,
double tol = Apop_settings_get(your_model, apop_mle, tolerance); Apop_settings_set(your_model, apop_mle, tolerance, 1e-5);
Notice that we don't need the _settings ending to the settings group's name---macros make it happen.
But you are probably going to set all of a model's settings at once when you first use it. Here is a full example:
1 apop_data *data = your_data_here; 2 apop_data *w = your_weights_here; 3 apop_model *m = apop_model_copy(apop_wls); 4 Apop_model_add_group(m, apop_ls, .weights=w, .want_cov='y'); 5 apop_model *est = apop_estimate(data, *m);
Line three establishes the baseline form of the model. Line four adds a settings group of type apop_lm_settings to the model, and specifies that we want it initialized with the weights element set to the data set w, and the want_cov element set to 'y'. Unlike the single-setting macros above, Apop_model_add_group follows the Designated initializers syntax of the form .setting=value.
Having set the settings, line 5 does the weighted OLS.
Also, the output to any of Apophenia's estimations will have an appropriate group of settings allocated, so you can chain estimations pretty easily. Continuing the above example, you could re-estimate with an alternative set of weights via:
Apop_settings_set(est, apop_ls, weights, weight_set_two); apop_model *est2 = apop_estimate(data, *est);
Apop_settings_add_group and Apop_settings_alloc_add. They made sense at the time. That's why the Apop_model_add_group macro doesn't have settings in the name. The Apop_settings_add macro is equivalent to Apop_settings_set.For just using a model, that's about 100% of what you need to know.
Writing new settings groupsTo store the settings for your own models, you don't necessarily need any of this. The apop_model structure has a void pointer named more which you can use to store extra information as needed. If more_size is larger than zero (i.e. you set it to your_model.more_size=sizeof(your_struct)), then it will be copied via memcpy by apop_model_copy, and freed by apop_model_free. Apophenia's estimation routines will never impinge on this item, so do what you feel with it.
If you do want to set up a new settings group, then you will need four items. This is the sort of boilerplate that will be familiar to users of object oriented languages in the style of C++ or Java. Let your settings group be named ysg; then you will need
typedef struct { ... } ysg_settings
ysg_settings *ysg_settings_init(ysg_settings in){
ysg_settings *out = malloc(sizeof(ysg_settings));
apop_varad_setting(in, out, want_cov, 'y');
return out; }
want_cov), and if that element is not found in the input, sets it to the specified value (here, 'y').void *ysg_settings_copy(ysg_settings *copyme) { ysg_settings *out = malloc(sizeof(ysg_settings)); //copy elements here. If you have no pointers or other trickery, just use: *out = *copyme return out; }
void ysg_settings_free(ysg_settings *freeme) {
free(freeme);
}
The names are not negotiable: when you call
Apop_model_add_group(m, ysg)
the macro will look for ysg_settings, ysg_settings_init, et cetera.
The lines-of-code averse will cringe at having to write such boilerplate code (I do), but after spending a year resisting it, I have to concede that it's the least of all evils.
You can retrieve the whole group or individual elements via:
Apop_settings_get_group(m, ysg) //as well as the one-element macro mentioned above, Apop_settings_get(m, ysg, an_element)
As you saw above, once the typedef/alloc/copy/free machinery is written, you can declare, get, and set in a reasonably graceful manner.
owner element in your data, which is set to one in the init routine but to zero in the copy routine, then free pointers only if owner == 1. Here is an example that either takes in a data set or defaults to initializing a new one; either way, copies of the model will all point to the same data set.typedef struct { apop_data *dataset; int owner; } ysg_settings; ysg_settings *ysg_settings_init(ysg_settings in){ ysg_settings *out = malloc(sizeof(ysg_settings)); apop_varad_settings(in, out, dataset, apop_data_alloc(100, 100, 100)); out.owner = 1; return out; } void *ysg_settings_copy(ysg_settings *copyme) { ysg_settings *out = malloc(sizeof(ysg_settings)); *out = *copyme; //pointer to data was copied, not underlying data. out.owner = 0; return out; } void ysg_settings_free(ysg_settings *freeme) { if (freeme.owner) apop_data_free(freeme->dataset); free(freeme); }
Estimation aids
Tests & diagnosticsJust about any hypothesis test consists of a few common steps:
If the statistic is from a common form, like the parameters from an OLS regression, then the commonly-associated
test is probably thrown in.
Some tests, like ANOVA, produce a statistic using a specialized procedure, so Apophenia includes some functions, like apop_test_anova_independence and apop_test_kolmogorov, to produce the statistic and look up its significance level.
If you are producing a statistic that you know has a common form, like a central limit theorem tells you that your statistic is Normally distributed, then the convenience function apop_test will do the final lookup step of checking where your statistic lies on your chosen distribution.
See also the in-database tests above.
Monte Carlo methodsTo give another example of testing, here is a function that used to be a part of Apophenia, but seemed a bit out of place. Here it is as a sample:
// Input: any old vector. Output: 1 - the p-value for a chi-squared // test to answer the question, "with what confidence can I reject the // hypothesis that the variance of my data is zero?" double apop_test_chi_squared_var_not_zero(const gsl_vector *in){ apop_assert(in, 0, 0, 'c', "input vector is NULL. Doing nothing.\n"); gsl_vector *normed; double sum=0; apop_vector_normalize((gsl_vector *)in,&normed, 1); gsl_vector_mul(normed,normed); for(size_t i=0;i< normed->size; sum +=gsl_vector_get(normed,i++)); gsl_vector_free(normed); return gsl_cdf_chisq_P(sum,in->size); }
Or, consider the Rao statistic,
where
is your model's likelihood function and
its information matrix. In code:
apop_data * infoinv = apop_model_numerical_covariance(data, your_model); apop_data * score; apop_score(data, score->vector, your_model); apop_data * stat = apop_dot(apop_dot(score, infoinv), score);
Given the correct assumptions, this is
, where
is the dimension of
, so the odds of a Type I error given the model is:
double p_value = apop_test(stat, "chi squared", beta->size);
The GSL provides a gsl_histogram structure, that produces a PDF by accumulating data into bins. Apophenia wraps this into its apop_model struct, so that the model-family machinery can be applied to Bayesian updating, kernel smoothing, or other methods that output a PDF.
To produce a PMF from your_data, use apop_estimate(your_data, apop_histogram). If you would like to compare this histogram to other data (observed or theoretical),then you'll need to produce a second synced histogram using apop_histogram_vector_reset or apop_histogram_model_reset. Then you can send both histograms to, say, apop_test_kolmogorov.
The second structure from the GSL incrementally sums up the PMF's bins to produce a CMF. The CMF can be used to map from a draw from a Uniform[0,1] to a draw from the PMF. Because it can be used to draw from the PMF, the GSL calls this the gsl_histogram_pdf structure. That's right: the data in the gsl_histogram_pdf structure is a cumulative sum---a CMF.
Anyway, here are some functions to deal with these various histograms and such; see also the GSL documentation, linked at the bottom of this outline.
Maximum likelihood methodsIf you read the section on writing models, then you already know how to do maximum likelihood on exotic setups. Just write a model that has a p or log_likelihood function, and call apop_estimate(your_data,your_model). The default estimation routine is maximum likelihood.
MLEs have an especially large number of parameter tweaks that could be made; see the section on MLE settings above.
The problem is that the parameters of a function must not take on certain values, either because the function is undefined for those values or because parameters with certain values would not fit the real-world problem.
The solution is to rewrite the function being maximized such that the function is continuous at the constraint boundary but takes a steep downward slope. The unconstrained maximization routines will be able to search a continuous function but will never return a solution that falls beyond the parameter limits.
If you give it a likelihood function with no regard to constraints plus an array of constraints, apop_maximum_likelihood will combine them to a function that fits the above description and search accordingly.
A constraint function must do three things:
The idea is that if the constraint returns zero, the log likelihood function will return the log likelihood as usual, and if not, it will return the log likelihood at the constraint's return vector minus the penalty. To give a concrete example, here is a constraint function that will ensure that both parameters of a two-dimensional input are both greater than zero:
static double beta_zero_greater_than_x_constraint(apop_data *data, apop_model *v){ //constraint is 0 < beta_2 static apop_data *constraint = NULL; if (!constraint){ constraint= apop_data_calloc(1,1,1); apop_data_set(constraint, 0, 0, 1); } return apop_linear_constraint(v->parameters->vector, constraint, 1e-3); }
More descriptive methods
Missing data
Legible outputMost of the printing options are intended to operate in different modes (as appropriate). The idea is that you will probably working in one specific mode for a while---writing to screen, a file, or the database---and so can just set one global option reflecting this:
apop_opts.output_type = 's'; //Stdout apop_opts.output_type = 'f'; //named file apop_opts.output_type = 'p'; //a pipe or already-opened file apop_opts.output_type = 'd'; //the database
C makes minimal distinction between pipes and files, so you can set a pipe or file as output and send all output there until further notice:
apop_opts.output_type = 'p'; apop_opts.output_pipe = popen("gnuplot", "w"); apop_plot_lattice(...); fclose(apop_opts.output_pipe); apop_opts.output_pipe = fopen("newfile", "w"); apop_data_print(set1); fprintf(apop_opts.output_pipe, "\nNow set 2:\n"); apop_data_print(set2);
Continuing the example, you can always override the global data with a specific request:
apop_vector_print(v, "vectorfile"); //put vectors in a separate file apop_matrix_print(m, "matrix_table", .output_type = 'd'); //write to the db apop_matrix_print(m, .outpipe = stdout); //now show the same matrix on screen
I will first look to the input file name, then the input pipe, then the global output_pipe, in that order, to determine to where I should write. Some combinations (like output type = 'd' and only a pipe) don't make sense, and I'll try to warn you about those.
The plot functions produce output for Gnuplot (so output type = 'd' again does not make sense). As above, you can pipe directly to Gnuplot or write to a file.
Assorted
General utilities
Math utilities
DeprecatedThese functions will probably disappear or be replaced soon.
Further referencesFor your convenience, here are links to some other libraries you are probably using.