Apophenia is an open statistical library for working with data sets and statistical models. It provides functions
on the same level as those of the typical stats package (such as OLS, probit, or
singular value decomposition) but doesn't tie the user to an
ad hoc language or environment. The core functions are written in C, but should be easy to bind to functions in Perl/Python/&c.
It is written to scale well. If you have tried to analyze
your gigabyte data set using other open source tools but
found that they weren't up to handling
large data sets or exceptionally computationally-intensive
work, Apophenia is the library for you.
The goods
To date, the library has over a hundred functions to facilitate statistical computing, such as:
- maximum likelihood estimators for probit, Exponential, Gamma, Waring, Yule, Zipf, &c.
- OLS and GLS
- database querying and maintenance utilities
- moments, percentiles, and other basic stats utilities
- singular value decomposition tools
- t-tests, F-tests, et cetera
Most users will just want to download the autoconf-packaged library here.
Those who would like to work on a cutting-edge copy of the source code
can get the latest version by cutting and pasting the following onto
the command line.
svn co https://apophenia.svn.sourceforge.net/svnroot/apophenia/trunk/apophenia
The online reference for Apophenia is here. The reader may
also be interested in the textbook
Modeling with Data (PDF),
which discusses general methods for doing statistics in C with the GSL
and SQLite, as well as Apophenia itself. Finally, the M.W.D. website has a few notes on Apophenia's raison d’être and logic.
We have the technology
There is no need to
reinvent the wheel in the process of rebuilding our regression
functions. The Apophenia library is based on two lower-level
libraries: the GNU
Scientific Library, which does the number-crunching, and SQLite, which handles the data.
The Frequently Asked Question: Why not use [name of stats package]?
- Matrices and databases. There are things you can
do with a one-line database query that you need a hundred lines of
matrix-manipulation code to do; there are things you can do with matrices
that you simply can't do with a database query. A good stats library
therefore takes both representations of data seriously.
- Models as objects. The apop_model object is unique among stats packages in
providing a consistent interface to linear models, probability distributions, and
exotic models that can only be solved via maximum likelihood. The consistent interface means
that you can compare several models at once, or can construct multilevel models or
creative variant models by using standard models as building blocks. Simply put, having statistical
models as objects is nifty.
- Better MLEs. The package focuses on facilitating maximum
likelihood estimation. The usual OLS and GLS are still there, but since the
world isn't linear, Apophenia focuses on giving you methods of
fitting generally-specified models via MLE.
- Not slow; not limited. First, the software imposes no restrictions on data
size (Stata says: "Your matrix must be less than 4,000
columns"). Second, Apophenia shares C code with certain open source
stats packages, and yet runs that same code over fifty (50)
times faster. Apophenia's speed and effectively unlimited data
handling mean that it is the only open source option for
statistical analysis of very large data sets.
- Open source and portable.
The packages Apophenia uses are ported to almost
any computer you will ever use. You can begin your analysis on the
university/company servers, then send it to a colleague, then copy it to
your laptop for the ride home, and never worry about compatibility or
licensing.
Contribute!
You don't need to eat C code for breakfast to help. Ways you can contribute:
- Report bugs or suggest features.
- Package Apophenia as an RPM, apt, portage, or Cygwin package.
- Write bindings for your preferred language.
- Contribute your favorite statistical routine.
- Help make the C code base more robust and still faster.
If you're interested, write to the maintainer (Ben Klemens), join the
SourceForge project, or just keep an eye on things via the
mailing list.