![]() |
|
Go to the source code of this file.
| #define apop_gaussian |
Alias for the apop_normal distribution, qv.
| #define apop_model_set_parameters | ( | in, | |||
| ... | ) | apop_model_set_parameters_base((in), (double []) {__VA_ARGS__}) |
Take in an unparametrized apop_model and return a new apop_model with the given parameters. This would have been called apop_model_parametrize, but the OED lists four acceptable spellings for parameterise, so it's not a great candidate for a function name.
For example, if you need a N(0,1) quickly:
apop_model *std_normal = apop_model_set_parameters(apop_normal, 0.0, 1.0);
This doesn't take in data, so it won't work with models that take the number of parameters from the data, and it will only set the vector of the model's parameter apop_data set. This is most standard models, so that's not a real problem either. If you have a situation where these options are out, you'll have to do something like apop_model *new = apop_model_copy(in); apop_model_clear(your_data, in); and then set in->parameters using your data.
| in | An unparametrized model, like apop_normal or apop_poisson. | |
| ... | The list of parameters. |
| apop_model* apop_beta_from_mean_var | ( | double | m, | |
| double | v | |||
| ) |
The Beta distribution is useful for modeling because it is bounded between zero and one, and can be either unimodal (if the variance is low) or bimodal (if the variance is high), and can have either a slant toward the bottom or top of the range (depending on the mean).
The distribution has two parameters, typically named
and
, which can be difficult to interpret. However, there is a one-to-one mapping between (alpha, beta) pairs and (mean, variance) pairs. Since we have good intuition about the meaning of means and variances, this function takes in a mean and variance, calculates alpha and beta behind the scenes, and returns a random draw from the appropriate Beta distribution.
| m | The mean the Beta distribution should have. Notice that m is in [0,1]. | |
| v | The variance which the Beta distribution should have. It is in (0, 1/12), where (1/12) is the variance of a Uniform(0,1) distribution. Funny things happen with variance near 1/12 and mean far from 1/2. |
apop_beta model with its parameters appropriately set. | void apop_draw | ( | double * | out, | |
| gsl_rng * | r, | |||
| apop_model * | m | |||
| ) |
draw from a model. If the model has its own RNG, then you're good to go; if not, use apop_arms_draw to generate random draws.
That function has a lot of caveats: most notably, the input data will be univariate, and your likelihood function must be nonnegative and sum to one. If those aren't appropriate, then don't use this default. [A more forgiving default is on the to-do list.]
| apop_model* apop_estimate | ( | apop_data * | d, | |
| apop_model | m | |||
| ) |
estimate the parameters of a model given data.
This is a brief convenience function, which expands to m.estimate(d,&m). If your model has no estimate method, then I assume apop_maximum_likelihood(d, m), with the default MLE params.
| d | The data | |
| m | The model |
parameters element filled in. | apop_data* apop_expected_value | ( | apop_data * | d, | |
| apop_model * | m | |||
| ) |
Return a matrix of the expected value for each observation or for the model as a whole, as the case may be. This may also be set within the model.
| double apop_log_likelihood | ( | apop_data * | d, | |
| apop_model * | m | |||
| ) |
Find the log likelihood of a data/parametrized model pair.
| d | The data | |
| m | The parametrized model, which must have either a log_likelihood or a p method. |
| apop_model* apop_model_clear | ( | apop_data * | data, | |
| apop_model * | model | |||
| ) |
Allocate an apop_model.
This sets up the output elements of the apop_model: the parameters, covarinace, and expected data sets.
At close, the input model has parameters of the correct size, the covariance and expected elements are NULL, and the status element is zero, indicating no estimation has been done yet.
The input model is modified, so you may want to call this after you call apop_model_copy.
| data | If your params vary with the size of the data set, then the function needs a data set to calibrate against. Otherwise, it's OK to set this to NULL | |
| model | The model whose output elements will be modified. |
| apop_model* apop_model_copy | ( | apop_model | in | ) |
Outputs a copy of the apop_model input.
| in | The model to be copied |
| void apop_model_free | ( | apop_model * | free_me | ) |
Free an apop_model structure.
The parameters, expected, and covariance elements are freed. These are all the things that are completely copied, by apop_model_copy, so the parent model is still safe after this is called. data is not freed, because the odds are you still need it.
The system has no idea what the more element contains, so if they point to other things, they need to be freed before calling this function.
If free_me is NULL, this does nothing.
| free_me | A pointer to the model to be freed. |
| void apop_model_prep | ( | apop_data * | d, | |
| apop_model * | m | |||
| ) |
The default prep is to simply call apop_model_clear. If the function has a prep method, then that gets called instead.
| double apop_p | ( | apop_data * | d, | |
| apop_model * | m | |||
| ) |
Find the probability of a data/parametrized model pair.
| d | The data | |
| m | The parametrized model, which must have either a log_likelihood or a p method. |
| void apop_score | ( | apop_data * | d, | |
| gsl_vector * | out, | |||
| apop_model * | m | |||
| ) |
Find the vector of derivatives of the log likelihood of a data/parametrized model pair.
| d | The data | |
| out | The score to be returned. I expect you to have allocated this already. | |
| m | The parametrized model, which must have either a log_likelihood or a p method. |
The Bernoulli model.
Data format: the matrix or vector can have any size, and I just count up zeros and non-zeros. The bernoulli paramter
is the percentage of non-zero values in the matrix. Its variance is
.
The beta distribution.
The binomial model.
The parameters are kept in the vector element of the apop_model parameters element. parameters->vector->data[0]==n; parameters->vector->data[1]==p.
Input data can take two forms:
The default is to take the data to have a binary form, meaning that the system counts zeros as failures and non-zeros as successes.
is the size of the matrix, vector, or both (whichever is not NULL).
In rank-type format, the data is taken to be the two-column miss-hit format: a nonzero value in column zero of the matrix represents a failure and a nonzero value in column one represents successes. Set this using, e.g.,
apop_model *estimate_me = apop_model_copy(apop_binomial); Apop_model_add_group(estimate_me, apop_rank); apop_model *estimated = apop_estimate(your_data, estimate_me);
In both cases,
represents the odds of a success==1; the odds of a zero is
.
See also the apop_multinomial model.
The Dirichlet distribution, a multivariate generalization of the Beta distribution. Each row of your data is a single observation. The estimated parameters are in the output model's parameters->vector.
The Exponential distribution. A one-parameter likelihood fn.
Ignores the matrix structure of the input data, so send in a 1 x N, an N x 1, or an N x M.
Some write the function as:
If you prefer this form, just convert your parameter via
(and convert back from the parameters this function gives you via
.
To specify that you have frequency or ranking data, use
Apop_model_add_group(your_model, apop_rank);
The Gamma distribution gsl_matrix *data = d->matrix;
Location of data in the grid is not relevant; send it a 1 x N, N x 1, or N x M and it will all be the same.
apop_gamma.estimate() is an MLE, so feed it appropriate apop_mle_settings.
To specify that you have frequency or ranking data, use
apop_data *my_copy = apop_model_copy(apop_gamma); Apop_model_add_group(my_copy, apop_rank); apop_model *out = apop_estimate(my_rank_data, my_copy);


(also,
)

The Gamma distribution
Location of data in the grid is not relevant; send it a 1 x N, N x 1, or N x M and it will all be the same.
apop_gamma.estimate() is an MLE, so feed it appropriate apop_mle_settings.
To specify that you have frequency or ranking data, use
Apop_settings_add_group(your_model, apop_rank, NULL);


(also,
)
The histogram model.
This is an empirical distribution. If you have a data set from which you want to make random draws, this is overkill; instead just use something like
gsl_rng *r = apop_rng_alloc(27); gsl_vector *my_data = [gather data here.]; gsl_vector_get(my_data, gsl_rng_uniform(r)*my_data->size);
But this can be used anywhere a model is needed, such as the inputs and outputs to apop_update.
The model is unlike most other models in that there are no parameters of any sort (beyond the data itself), so there is no estimate method; instead all the work of producing the histogram is done in apop_histogram_settings_alloc. [Actually, there is an estimate method, but it is just an alias for the histogram_alloc function.]
The improper uniform returns P(x) = 1 for every value of x, all the time (and thus, log likelihood(x)=0). It has zero parameters. It is useful, for example, as an input to Bayesian updating, to represent a fully neutral prior.
The estimate routine is just a dummy that returns its input.
The draw function makes no sense, and therefore returns an error.
Instrumental variable regression
Operates much like the apop_ols model, but the input parameters also need to have a table of substitutions. The vector element of the table lists the column numbers to be substituted (the dependent var is zero; first independent col is one), and then one column for each item to substitute.
If the vector of your apop_data set is NULL, then I will use the row names to find the columns to substitute. This is generally more robust and/or convenient.
If the instruments data set is somehow NULL or empty, I'll just run OLS.
apop_data *submatrix =apop_data_alloc(0, data->matrix->size1, 2); APOP_COL(submatrix, 0, firstcol); gsl_vector_memcpy(firstcol, your_data_vector); APOP_COL(submatrix, 1, secondcol); gsl_vector_memcpy(firstcol, your_other_data_vector); apop_name_add(submatrix->names, "subme_1", 'r'); apop_name_add(submatrix->names, "subme_2", 'r'); Apop_model_add_group(&apop_iv, apop_ls, .instruments = submatrix); apop_model *est = apop_estimate(data, apop_iv); apop_model_show(est);
The apop_kernel_density model.
A Kernel density is simply a smoothing of a histogram. At each point along the histogram, put a distribution (default: Normal(0,1)) on top of the point. Sum all of these distributions to form the output histogram.
The output is a histogram that behaves exactly like the gsl_histogram, except the histobase and kernelbase elements are set.
The Logit model. The first column of the data matrix this model expects a number indicating the preferred category; the remaining columns are values of the independent variables. Thus, the model will return N-1 columns of parameters, where N is the number of categories chosen.
The lognormal distribution.
The multinomial model.
The parameters are kept in the vector element of the apop_model parameters element. parameters->vector->data[0]==n; parameters->vector->data[1...]==p_1....
Input data can take two forms:
The default is simply a listing of bins, without regard to whether items are in the vector or matrix of the apop_data struct, or the dimensions. Here, data like 0, 1, 2, 1, 1 represents one draw of zero, three draws of 1, and one draw of 2.
In rank-type format, the bins are defined by the columns: a nonzero value in column zero of the matrix represents a draw of zero, a nonzero value in column seven a draw of seven, et cetera. Set this form using, e.g.,
apop_model *estimate_me = apop_model_copy(apop_binomial); Apop_model_add_group(estimate_me, apop_rank); apop_model *estimated = apop_estimate(your_data, estimate_me);
In both cases, the numeraire is zero, meaning that
is not explicitly listed, but is
, where
is the number of bins. Conveniently enough, the zeroth element of the parameters vector holds
, and so a full probability vector can easily be produced by overwriting that first element. Continuing the above example:
int n = apop_data_get(estimated->parameters, 0, -1); apop_data_set(estimated->parameters, 0, 1 - (apop_sum(estimated->parameters)-n));
And now the parameter vector is a proper list of probabilities.
See also the apop_binomial model.
The Multinomial Probit model. The first column of the data matrix this model expects a number indicating the preferred category; the remaining columns are values of the independent variables. Thus, the model will return N-1 columns of parameters, where N is the number of categories chosen.
This is the multivarate generalization of the Normal distribution. The probability/log_likelihood methods take in an apop_data set whose vector element is the vector of means, and whose matrix is the covariances; the estimate method returns parameters of that form.
The RNG fills an input array whose length is based on the input parameters.
You know it, it's your attractor in the limit, it's the Gaussian distribution.
As is custom, the first parameter is the mean, the second is the standard deviation (i.e., the square root of the variance).
The log likelihood function and dlog likelihood don't care about your rows of data; if you have an 8 x 7 data set, it will give you the log likelihood of those 56 observations given the mean and variance you provide.




The apop_ls settings group includes a want_cov element, which refers to the covariance matrix for the mean and variance. You can set that to 'n' if you are estimating millions of these and need to save time (i.e. Apop_model_add_group(your_model, apop_ls, .want_cov = 'n');).
Ordinary least squares.
| inset | The first column is the dependent variable, and the remaining columns the independent. Is destroyed in the process, so make a copy beforehand if you need. | |
| epin | An apop_model object. It may have a apop_ls_settings group attached. I'll look at the destroy_data element; if this is NULL or destroy_data==0, then the entire data set is copied off, and then mangled. If destroy_data==1, then this doesn't copy off the data set, but destroys it in place. I also look at want_cov and want_expected_value, though I'll not produce the covariance matrix only if both are 'n'. |
*. The_result->parameters will hold the coefficients; the first coefficient will be the coefficient on the constant term, and the remaining will correspond to the independent variables. It will therefore be of size (data->size2). Do not pre-allocate.If you asked for them, the covariance matrix, confidence levels, and residuals will also be returned. The confidence intervals give the level of certainty with which we can reject the hypothesis that the given coefficient is zero.
See also the page on Data prep rules.
sample
First, you will need a file named data in comma-separated form. The first column is the dependent variable; the remaining columns are the independent. For example:
Y, X_1, X_2, X_3 2,3,4,5 1,2,9,3 4,7,9,0 2,4,8,16 1,4,2,9 9,8,7,6
The program:
#include <apop.h> int main(void){ apop_data *data; apop_model *est; apop_text_to_db("data","d"); data = apop_query_to_data("select * from d"); est = apop_estimate(data, apop_ols); apop_model_show(est); return 0; }
If you saved this code to sample.c, then you can compile it with
gcc sample.c -lapophenia -lgsl -lgslcblas -lsqlite3 -o run_me
and then run it with ./run_me. Alternatively, you may prefer to compile the program using a Makefile .
Feeling lazy? The program above was good form and demonstrated useful features, but the code below will do the same thing in two lines:
#include <apop.h> int main(){ apop_model_show(apop_estimate(apop_text_to_data("data"), apop_ols)); }
The poisson distribution
Location of data in the grid is not relevant; send it a 1 x N, N x 1, or N x M and it will all be the same.
The Probit model. The first column of the data matrix this model expects is ones and zeros; the remaining columns are values of the independent variables. Thus, the model will return (data columns)-1 parameters.
The uniform model. This is the two-parameter version of the uniform, expressing a uniform distribution over [a, b].
The MLE of this distribution is simply a = min(your data); b = max(your data). Primarily useful for the RNG, such as when you have a Uniform prior model.
The Waring distribution Ignores the matrix structure of the input data, so send in a 1 x N, an N x 1, or an N x M.
apop_waring.estimate() is an MLE, so feed it appropriate apop_mle_settings.
![$W(x,k, b,a) = (b-1) \gamma(b+a) \gamma(k+a) / [\gamma(a+1) \gamma(k+a+b)]$](form_105.png)



To specify that you have frequency or ranking data, use
Apop_settings_add_group(your_model, apop_rank, NULL);
The WLS model
You will need to provide the weights in yourdata->weights. Otherwise, this model will behave just like apop_ols.
The Yule distribution
The special case of Waring where 
apop_yule.estimate() is an MLE, so feed it appropriate apop_mle_settings.



To specify that you have frequency or ranking data, use
Apop_settings_add_group(your_model, apop_rank, NULL);
The Zipf distribution. Wikipedia has notes on the Zipf distribution.
Ignores the matrix structure of the input data, so send in a 1 x N, an N x 1, or an N x M.
apop_zipf.estimate() is an MLE, so feed it appropriate apop_mle_settings. 


To specify that you have frequency or ranking data, use
Apop_settings_add_group(your_model, apop_rank, NULL);