NETICA CASE FILE FORMAT
-----------------------
2002-11-02

Copyright 2002-2004 by Norsys Software Corp.


Structure
---------

Case files (single-case or multi-case) are pure ASCII text files.
They may contain "// ~->[CASE-1]->~" somewhere in the first 3 lines.

Then comes a line consisting of headings for the columns. Each heading
corresponds to one variable of the case, and is the name of the node
used to represent the variable (variables are sometimes called
attributes, and the entries in a column their values, as in
"attribute-value"). The headings are separated by spaces and/or tabs
(it doesn't matter how many).

The case data is next, with one case per line (a single-case file has
only one such line). The values of the variables are in the same order
as the heading line, and are separated by spaces or tabs (the columns
don't have to "line up" as they do in the examples below).


Discrete Variables
------------------

The value of a discrete variable is given by its state name, or by its
state number preceded by a '#' character (the first state is #0).
Using the state names is preferred, since the order of the states may
later be changed, and that would render a file with state numbers
invalid. The '#' symbol is recommended, but may be omitted if the node
has no ranges or values defined.


Continuous Variables
--------------------

The value of a continuous variable is given by a number in integer,
decimal, or scientific notation (e.g. -3.21e-7). If the variable has
been discretized, then the value may be given by a state name or state
number instead, but the continuous number is preferred if it is
available. That way the case file can be used for different
discretizations of that variable in the future. It is best if the
value has the correct number of significant figures, since future
versions of Netica may use this information.
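The value rules for discrete and continuous variables described above
can be sketched in Python. This is only an illustration, not Netica's
own implementation: the function names and the "lights" state list are
hypothetical, state names are assumed to match exactly (including
case), and the sketch does not cover a state number written without
'#' or a discretized continuous variable given by state.

```python
def parse_discrete(token, state_names):
    """Return the 0-based state index for a discrete value field."""
    if token.startswith("#"):
        index = int(token[1:])              # "#0" is the first state
        if not 0 <= index < len(state_names):
            raise ValueError(f"state number out of range: {token}")
        return index
    # Assumption: state names are matched exactly (case-sensitive).
    return state_names.index(token)         # raises ValueError if unknown


def parse_continuous(token):
    """Return a value given in integer, decimal, or scientific notation."""
    return float(token)


lights = ["off", "dim", "bright"]           # hypothetical state names
print(parse_discrete("#2", lights))
print(parse_discrete("dim", lights))
print(parse_continuous("-3.21e-7"))
```

Note that a file using "dim" keeps its meaning if the states are later
reordered, while a file using "#1" does not, which is why state names
are the preferred form.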

Missing Values
--------------

If the values of some of the variables are unknown for some of the
cases, then an asterisk * is put in the file instead of the value.
This is known as "missing data".


Comments
--------

There may be as many spaces or tabs at the end of a line as desired,
and there may also be C / C++ / Java style comments (e.g. a double
slash "//", followed by any text).


IDnum
-----

There are two special columns that a file may have which don't
correspond to nodes. One provides an identification number for each
case, which must be an integer between 0 and 2 billion. The heading
for this column is "IDnum". Identification numbers do not have to be
in order through the file. The missing data symbol, *, must not appear
in this column.


NumCases
--------

The other special column has the heading "NumCases", and indicates the
frequency or multiplicity of the case. A multiplicity of M indicates
M cases with the same variable values. It is not required to be an
integer, so it can be used to represent a frequency of occurrence if
desired. The missing data symbol, *, must not appear in this column
either.


Examples
--------

1. Here is a listing of "asia.cases". It involves only discrete nodes
with state names, and has an IDnum column, but no frequency column.

// ~->[CASE-1]->~
IDnum  VisitAsia  Tuberculosis  Smoking    Cancer   TbOrCa  XRay      Bronchitis  Dyspnea
1      No_Visit   Present       Smoker     Absent   True    Abnormal  Absent      Present
2      No_Visit   Absent        Smoker     Absent   False   Normal    Present     Present
3      No_Visit   Absent        Smoker     Present  True    Abnormal  Present     Present
4      No_Visit   Absent        NonSmoker  Absent   False   Normal    Absent      Absent
5      No_Visit   Absent        Smoker     Present  True    Abnormal  Present     Present
6      No_Visit   Absent        Smoker     Absent   False   Abnormal  Present     Present
...
119    No_Visit   Absent        Smoker     Absent   False   Normal    Present     Present
120    No_Visit   Absent        Smoker     Present  True    Abnormal  Present     Present

2. Here is another example of a case file, this time for cars brought
into a garage.
It has discrete and continuous variables, state numbers and state
names, and asterisks for missing entries.

// ~->[CASE-1]->~
Starts  BatAge  Cranks  Lights  StMotor  SpPlug  MFuse  Alter  BatVolt  Dist  PlugVolt  Timing
False   5.9     False   #0      *        fouled  okay   *      dead     *     *         good
False   1.3     False   #0      *        okay    okay   *      dead     *     none      bad
False   5.2     False   #0      Okay     okay    okay   Okay   dead     Okay  none      good
True    4.1     True    #2      *        okay    okay   *      strong   Okay  strong    *
True    2.7     *       #2      *        wide    okay   *      strong   Okay  *         *
*       *       True    #2      *        fouled  okay   *      *        Okay  strong    good
False   1.7     True    #0      Okay     okay    okay   Okay   dead     *     none      good
True    2.9     True    #2      *        *       *      *      strong   Okay  strong    *


Future
------

Future versions of Netica will support more advanced operations with
cases, including a more efficient file representation, and a way of
using belief networks as "indexing functions" to do the kind of lookup
common in case-based reasoning. However, the file format described
above will always be supported as well.
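
As an illustration of the format described in this document, here is
a minimal Python sketch of a reader for multi-case files. It is not
part of Netica: the function name read_cases is hypothetical, and the
sketch assumes that only "//" comments occur (it does not handle
"/* ... */" comments). Ordinary values are left as text, since
interpreting them requires knowing each node's states.

```python
def read_cases(text):
    """Parse case-file text into a list of dicts, one per case."""
    rows = []
    for line in text.splitlines():
        line = line.split("//", 1)[0]   # drop "//" comments, incl. the CASE-1 marker
        fields = line.split()           # any run of spaces and/or tabs separates fields
        if fields:
            rows.append(fields)
    if not rows:
        return []
    headings, data = rows[0], rows[1:]
    cases = []
    for fields in data:
        if len(fields) != len(headings):
            raise ValueError(f"expected {len(headings)} values, got {len(fields)}")
        case = {}
        for name, token in zip(headings, fields):
            if name == "IDnum":
                case[name] = int(token)     # '*' is not allowed here; int() rejects it
            elif name == "NumCases":
                case[name] = float(token)   # multiplicity need not be an integer
            elif token == "*":
                case[name] = None           # missing value
            else:
                case[name] = token          # left as text for the caller to interpret
        cases.append(case)
    return cases


sample = """// ~->[CASE-1]->~
Starts  BatAge  Cranks  Lights
False   5.9     False   #0      // trailing comment
*       *       True    #2
"""
for case in read_cases(sample):
    print(case)
```

Splitting on whitespace rather than on fixed column positions reflects
the rule that columns do not have to line up.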