If you are reading into an array or gsl_matrix or apop_data set, all text fields are taken as zeros. You will be warned of such substitutions unless you set
beforehand.
Field delimiters are controlled by apop_opts.input_delimiters. By default, it is set to "| ,\t", meaning that a pipe, comma, space, or tab will delimit separate entries. For example, to make a semicolon the only delimiter:
strcpy(apop_opts.input_delimiters, ";");
Two delimiters often appear in a row, e.g., "23, 32,, 12". Two commas in a row typically mean that a value is missing and the system should insert a NaN; two tabs in a row are typically just a formatting glitch. Thus, when several delimiters appear in a row, Apophenia checks the second and each subsequent one: if it is a space or a tab, it is ignored; if it is any other delimiter (including the end of the line), a NaN is inserted.
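The consecutive-delimiter rule can be sketched in plain C. This is an illustration of the rule as stated here, not Apophenia's own parser: split_with_nans is a hypothetical helper, it handles numeric fields only (no quoted text), and it assumes that a delimiter at the very start of a line is simply skipped, a case the rule does not address.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Apophenia.
   Splits `line` on any character in `delims`, writing up to `max_out`
   doubles into `out`. Within a run of delimiters, the second and later
   ones are ignored if they are a space or tab, and produce a NaN if they
   are any other delimiter or the end of the line.
   Returns the number of values written. */
size_t split_with_nans(const char *line, const char *delims,
                       double *out, size_t max_out)
{
    size_t n = 0;
    const char *p = line;
    while (*p != '\0' && n < max_out) {
        /* Read one field: everything up to the next delimiter. */
        size_t len = strcspn(p, delims);
        if (len > 0) {
            out[n++] = strtod(p, NULL); /* non-numeric text parses as 0 */
            p += len;
        }
        if (*p == '\0') break;          /* line ended right after data */
        p++;                            /* first delimiter of the run */
        for (;;) {                      /* second and later delimiters */
            if (*p == ' ' || *p == '\t') { p++; continue; }
            if (*p == '\0' || strchr(delims, *p) != NULL) {
                if (n < max_out) out[n++] = NAN;
                if (*p == '\0') return n;
                p++;
                continue;
            }
            break;                      /* start of the next field */
        }
    }
    return n;
}
```

With the default delimiters, the line "23, 32,, 12" yields the four values 23, 32, NaN, 12, while "1\t\t2" yields just 1 and 2.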
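If this rule doesn't work for your situation, you can mark missing data points explicitly before reading the file. E.g., try:
perl -pi.bak -e 's/,,/,NaN,/g' data_file
Because the matches do not overlap, a run of three or more commas needs the command run a second time.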
If your data marks missing values with a particular string, you will need to set apop_opts.db_nan to a regular expression that matches that string. Some examples:
//Apophenia's default NaN string, matching NaN, nan, or NAN:
strcpy(apop_opts.db_nan, "\\(NaN\\|nan\\|NAN\\)");
//Literal text:
strcpy(apop_opts.db_nan, "Missing");
//Literal text, capitalized or not:
strcpy(apop_opts.db_nan, "[mM]issing");
//Matches two periods. Periods are special in regexes, so they need backslashes.
strcpy(apop_opts.db_nan, "\\.\\.");
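To try out a candidate pattern before assigning it to apop_opts.db_nan, a small check with the POSIX regex functions may help. This sketch assumes basic regular expression syntax (which the escaped parentheses in the default pattern suggest) and an unanchored match; matches_nan_marker is a hypothetical helper, and how Apophenia itself compiles and applies the pattern is not confirmed here.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not part of Apophenia.
   Returns 1 if `field` contains a match for the POSIX basic regular
   expression `pattern`, 0 if it does not, and -1 if the pattern does
   not compile. */
int matches_nan_marker(const char *pattern, const char *field)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    int found = (regexec(&re, field, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

For instance, matches_nan_marker("\\.\\.", "42") returns 0, but with the backslashes left out, matches_nan_marker("..", "42") returns 1, because an unescaped period matches any character. That is why the two-period example needs the backslashes.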
The maximum line length is 100,000 characters. If you have a line longer than this, you will need to open up apop_conversions.c, modify Text_Line_Limit, and recompile.
The system also uses the standards for C's atof() function for floating-point numbers: INFINITY, -INFINITY, and NaN work as expected. I use some tricks to get SQLite to accept these values, but they work.
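A quick way to see what atof() does with these strings, using only the standard C library. The enum and the classify_field function are hypothetical names for this sketch, and it assumes a C99 or later library, where atof() recognizes INFINITY and NaN in any capitalization.

```c
#include <math.h>
#include <stdlib.h>

/* Hypothetical labels for this sketch; not Apophenia identifiers. */
enum field_kind { FIELD_FINITE, FIELD_POS_INF, FIELD_NEG_INF, FIELD_NAN };

/* Convert a text field with atof() and report what kind of value
   came back. */
enum field_kind classify_field(const char *text)
{
    double x = atof(text);
    if (isnan(x)) return FIELD_NAN;
    if (isinf(x)) return x > 0 ? FIELD_POS_INF : FIELD_NEG_INF;
    return FIELD_FINITE;
}
```

Here classify_field("INFINITY") gives FIELD_POS_INF, classify_field("-INFINITY") gives FIELD_NEG_INF, and classify_field("NaN") gives FIELD_NAN.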
Text is always delimited by quotes. Delimiters inside quotes are perfectly OK, e.g., "Males, 30-40" is an OK column name, as is "Males named \"Joe\"".
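The quoting convention can be sketched as follows. This is a minimal illustration, assuming that a backslash inside quotes escapes the next character, as in the second example; read_quoted_field is a hypothetical helper, not an Apophenia function.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Apophenia.
   `src` must begin at an opening double quote. Copies the field's
   contents into `dst` (dropping the surrounding quotes and any escaping
   backslashes) and returns the number of source characters consumed,
   including both quotes. Returns 0 if `src` is not quoted, the closing
   quote is missing, or `dst` is too small. */
size_t read_quoted_field(const char *src, char *dst, size_t dst_size)
{
    if (src[0] != '"' || dst_size == 0)
        return 0;
    size_t i = 1, j = 0;
    while (src[i] != '\0' && src[i] != '"') {
        /* A backslash escapes the next character: skip the backslash
           and copy that character literally. */
        if (src[i] == '\\' && src[i + 1] != '\0')
            i++;
        if (j + 1 >= dst_size)
            return 0;
        dst[j++] = src[i++];
    }
    if (src[i] != '"')
        return 0;
    dst[j] = '\0';
    return i + 1;
}
```

Given the input "Males, 30-40",12 the helper stores Males, 30-40 and stops after the closing quote, so the comma inside the quotes is never treated as a delimiter.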
Lines beginning with # (i.e. in the first column) are taken to be comments and ignored. Blank lines are also ignored.
If there are row names and column names, then the input will not be perfectly square: the row of column names should not begin with an entry like 'row names' to head the row-name column. That is, for a 100x100 data set with row and column names, there are 100 names in the top row, and 101 entries in each subsequent row (a name plus 100 data points).
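A rough way to check a file against this layout is to compare the entry count of the header line with that of a data line. The sketch below assumes comma-delimited input with no quoted fields; count_entries and rows_carry_names are hypothetical helpers for illustration only.

```c
#include <stddef.h>

/* Hypothetical helper: count the comma-separated entries on one line.
   Quoted fields are not handled. */
size_t count_entries(const char *line)
{
    if (*line == '\0')
        return 0;
    size_t n = 1;
    for (const char *p = line; *p != '\0'; p++)
        if (*p == ',')
            n++;
    return n;
}

/* Hypothetical helper: returns 1 if a data row has exactly one more
   entry than the header, the shape expected when both row names and
   column names are present. */
int rows_carry_names(const char *header, const char *row)
{
    return count_entries(row) == count_entries(header) + 1;
}
```

For a header of height,weight,age and a data row of alice,170,65,30, rows_carry_names returns 1: three column names in the top row, and a row name plus three data points below.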