Case File Format

Structure: Case files (single-case or multi-case) are pure ASCII text files. Somewhere in the first 3 lines they may contain “// ~->[CASE-1]->~” or a time-author stamp, but neither is normally present. Next comes a line of column headings. Each heading corresponds to one variable of the case, and is the name of the node that represents that variable (variables are sometimes called attributes, and the entries in their columns values, hence “attribute-value”). The headings are separated by spaces and/or tabs (it doesn’t matter how many). Node names should not contain spaces.

The case data is next, with one case per line (a single-case file only has one such line). The values of the variables are in the same order as the heading line, and are separated by spaces or tabs (the columns don’t have to “line up” as they do in the examples below).
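The heading line and case lines described above can be read with ordinary whitespace splitting. The following is a minimal Python sketch, not Netica's own reader: the function name parse_case_text and the node names in the sample are hypothetical, and it assumes the text has no optional header line, time-author stamp, or comments.

```python
def parse_case_text(text):
    """Split case-file text into (headings, rows); a simplified sketch."""
    # Ignore blank lines; the first remaining line is taken as the headings.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    headings = lines[0].split()  # any run of spaces and/or tabs separates names
    rows = []
    for ln in lines[1:]:
        values = ln.split()
        if len(values) != len(headings):
            raise ValueError(
                f"expected {len(headings)} values, got {len(values)}: {ln!r}"
            )
        # Values are in the same order as the headings.
        rows.append(dict(zip(headings, values)))
    return headings, rows


# Hypothetical sample; the columns deliberately do not line up.
sample = "Smoking  Cancer\tXRay\nyes no   normal\nno\tno abnormal\n"
headings, rows = parse_case_text(sample)
print(headings)         # ['Smoking', 'Cancer', 'XRay']
print(rows[1]["XRay"])  # abnormal
```

Because names and values are split on whitespace, a node name containing a space would be read as two headings, which is why such names should be avoided.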

Discrete: The value of a discrete variable is given by its state name, state title, state number, or by its state index preceded by a ‘#’ character (the first state is #0). Using the state index is not recommended, since the order of the states may later be changed, and that would render a file with state indexes invalid. When a state index is used, the ‘#’ prefix is recommended, but it may be omitted if the node has no discretization or state numbers defined.

Continuous: The value of a continuous variable is given by a number in integer, decimal, or scientific notation (e.g. -3.21e-7). If it has been discretized, then the value may be given by a state name, title or index as for discrete variables, but the continuous number is preferred if it is available. That way the case file can be used for different discretizations of that variable in the future. It is best if the value has the correct number of significant figures, since future versions of Netica may use this information.
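As an illustration of the forms a single entry can take, here is a rough Python sketch that classifies a token by its spelling alone. The helper interpret_value is hypothetical and not part of Netica. It cannot tell whether a bare number is a continuous value, a state number, or an unprefixed state index, since that depends on the node's definition, so it simply reports such tokens as numbers.

```python
import re

# Integer, decimal, or scientific notation, e.g. 12, -0.5, -3.21e-7
_NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def interpret_value(token):
    """Classify one case-file entry by its spelling (hypothetical helper)."""
    if token.startswith("#"):
        # State index; the first state is #0.
        return ("state_index", int(token[1:]))
    if _NUMBER.fullmatch(token):
        # Could be a continuous value or a state number; the node decides.
        return ("number", float(token))
    # Anything else is treated as a state name or title.
    return ("state_name", token)


print(interpret_value("#0"))        # ('state_index', 0)
print(interpret_value("-3.21e-7"))  # ('number', -3.21e-07)
print(interpret_value("present"))   # ('state_name', 'present')
```

A regular expression is used instead of calling float() directly so that state names such as “inf” or “nan”, which Python would accept as floats, are not mistaken for numbers.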

Missing: If the values of some of the variables are unknown for some of the cases, then an asterisk * is put in the file instead of the value. This is known as “missing data”. When reading case files, Netica can also understand a question mark ? used for missing data.
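A program reading case files would typically map both missing-data symbols to a single "no value" marker. A minimal Python sketch, with the hypothetical names MISSING_MARKERS and to_optional, assuming each line has already been split into tokens:

```python
# Both symbols are accepted on input; * is the one normally written.
MISSING_MARKERS = {"*", "?"}


def to_optional(token):
    """Return None for a missing-data symbol, otherwise the token unchanged."""
    return None if token in MISSING_MARKERS else token


row = ["yes", "*", "?", "3.5"]
cleaned = [to_optional(t) for t in row]
print(cleaned)  # ['yes', None, None, '3.5']
```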

Uncertain Values: Interval values, Gaussians (mean and standard deviation), sets of possibilities, negative findings, etc. can also be entered in a case file using the UVF format.

Comments: There may be as many spaces or tabs at the end of a line as desired, and there may also be C / C++ / Java style comments (e.g. a double slash “//”, followed by any text).
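Trailing whitespace and double-slash comments can be removed before a line is split into values. The sketch below is in Python, uses the hypothetical name strip_line_comment, and handles only the “//” form; block comments written with /* and */ are not covered.

```python
def strip_line_comment(line):
    """Drop a trailing // comment and any trailing spaces or tabs."""
    return line.split("//", 1)[0].rstrip()


print(repr(strip_line_comment("yes no  \t// checked by hand")))  # 'yes no'
print(repr(strip_line_comment("// a comment-only line")))        # ''
print(repr(strip_line_comment("no yes   ")))                     # 'no yes'
```

A line that is empty after stripping carries no case data and can be skipped.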

IDnum: There are two special columns that a file may have which don’t correspond to nodes. One provides an identification number for each case, which must be an integer between 0 and 2 billion. The heading for this column is “IDnum”. Identification numbers do not have to be in order through the file. The missing data symbol * must not appear in this column.

NumCases: The other special column has the heading “NumCases”, and indicates the frequency or multiplicity of the case. A multiplicity of m indicates m cases with the same values (i.e. m identical rows). It is not required to be an integer, so it can be used to represent a frequency of occurrence if desired. The missing data symbol * must not appear in this column either. If there is no NumCases column, each case is assumed to have a multiplicity of 1 (which you can override with the degree; see “degree: while learning” in the index).
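The rules for the two special columns can be checked mechanically. The following Python sketch is only illustrative: the function special_columns and the sample data are hypothetical, the headings are assumed to match “IDnum” and “NumCases” exactly, and the upper limit of 2 billion is assumed to be inclusive.

```python
MAX_ID = 2_000_000_000  # "2 billion"; treated as inclusive (an assumption)


def special_columns(headings, rows):
    """Return an (id, multiplicity) pair for each row of tokens."""
    id_col = headings.index("IDnum") if "IDnum" in headings else None
    n_col = headings.index("NumCases") if "NumCases" in headings else None
    pairs = []
    for row in rows:
        case_id = None
        if id_col is not None:
            # int() raises ValueError for "*" or a non-integer entry.
            case_id = int(row[id_col])
            if not 0 <= case_id <= MAX_ID:
                raise ValueError(f"IDnum out of range: {case_id}")
        # Multiplicity may be fractional; it defaults to 1 without the column.
        # float() raises ValueError for "*", which is not allowed here.
        mult = float(row[n_col]) if n_col is not None else 1.0
        pairs.append((case_id, mult))
    return pairs


headings = ["IDnum", "NumCases", "Smoking"]
rows = [["7", "2", "yes"], ["3", "0.5", "no"]]  # IDs need not be in order
pairs = special_columns(headings, rows)
total = sum(mult for _, mult in pairs)
print(pairs)  # [(7, 2.0), (3, 0.5)]
print(total)  # 2.5
```

Summing the multiplicities gives the effective number of cases in the file, which here is 2.5 rather than the 2 rows actually present.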

Examples: Here is a listing of “Chest Clinic.cases”. It involves only discrete nodes with state names, and has an IDnum column, but no frequency column.

Here is another example of a case file, this time for cars brought into a garage. It has discrete and continuous variables, state indexes and state names, and asterisks for missing entries.

Future: Future versions of Netica will support more advanced operations with cases, including a more efficient file representation, and a way of using Bayes nets as “indexing functions” to do the kind of lookup common in case-based reasoning. However, the file format described above will always be supported as well.