Apophenia is an open statistical library for working with data sets and statistical models. It provides functions
on the same level as those of the typical stats package (such as OLS, probit, or
singular value decomposition) but gives the user more flexibility to be creative in model-building.
The core functions are written in C, but bindings exist for Python, and it should be easy to write bindings for Perl, Ruby, and other languages.
It is written to scale well. If you have tried to analyze
your gigabyte data set using other open source tools but
found that they weren't up to handling
large data sets or exceptionally computationally intensive
work, Apophenia is the library for you.
The goods
To date, the library has over two hundred functions to facilitate scientific computing, such as:
- OLS and family, discrete choice models like probit and logit, kernel density estimators, and other common models
- database querying and maintenance utilities
- moments, percentiles, and other basic stats utilities
- t-tests, F-tests, et cetera
- several maximum likelihood estimation methods available for your own new models
The library does not re-implement basic matrix operations or build yet another database
engine. Instead, it builds upon the excellent GNU
Scientific and SQLite libraries. MySQL is also supported.
Most users will just want to download the packaged version linked from the header.
Those who would like to work on a cutting-edge copy of the source code
can get the latest version by cutting and pasting the following onto
the command line.
svn co https://apophenia.svn.sourceforge.net/svnroot/apophenia/trunk/apophenia
Apophenia has an online reference. The reader may
also be interested in the textbook
Modeling with Data,
which discusses general methods for doing statistics in C with the GSL
and SQLite, as well as Apophenia itself.
The Frequently Asked Question: Why not use [name of stats package]?
- Matrices and databases. There are things you can
do with a one-line database query that you need a hundred lines of
matrix-manipulation code to do; there are things you can do with matrices
that you simply can't do with a database query. A good stats library
therefore takes both representations of data seriously.
- Models as objects. The apop_model object is unique among stats packages in
providing a consistent interface to linear models, probability distributions, and
exotic models that can only be solved via maximum likelihood. The consistent interface means
that you can compare several models at once, or can construct multilevel models or
creative variant models by using standard models as building blocks. Simply put, having statistical
models as objects is nifty.
- Better MLEs. The package focuses on facilitating maximum
likelihood estimation. The usual OLS and GLS are still there, but since the
world isn't linear, Apophenia focuses on giving you methods of
fitting generally-specified models via MLE.
- Not slow; not limited. First, the software imposes no restrictions on data
size (Stata says: "Your matrix must be less than 4,000
columns"). Second, Apophenia shares C code with certain open source
stats packages, and yet runs that same code over fifty
times faster. Apophenia's speed and effectively unlimited data
handling mean that it is the only open source option for
statistical analysis of very large data sets.
- Open source and portable.
The packages Apophenia uses are ported to almost
any computer you will ever use. You can begin your analysis on the
university/company servers, then send it to a colleague, then copy it to
your laptop for the ride home, and never worry about compatibility or
licensing.
Contribute!
You don't need to eat C code for breakfast to help. Ways you can contribute:
- Report bugs or suggest features.
- Package Apophenia as an RPM, apt, Portage, or Cygwin package.
- Write bindings for your preferred language.
- Contribute your favorite statistical routine.
- Help make the C code base more robust and still faster.
If you're interested, write to the maintainer (Ben Klemens), join the
SourceForge project, or just keep an eye on things via the
mailing list.