Case Files with Uncertain Values – UVF Format

The case files discussed in previous pages have only had values that were completely certain (or completely missing).  But Netica can also create and read case files having values that are known with limited accuracy, or only known to within some likelihood.  In fact, Netica has a very elegant, practical and powerful way of expressing uncertain findings, called the UVF format.

When Netica reads in a case containing uncertain findings (for example, by choosing Cases → Get Case), it will enter them in the Bayes net as likelihood findings, so any probabilistic inference, node absorption, sensitivity analysis, etc. will properly account for them.  Also, the operations on case files, such as learning from cases, test net with cases and process cases, will work properly on case files containing uncertain values.  When learning from such cases, some learning algorithms will work better than others.  For more information on that, and an example of working with case files having uncertain findings, see the learning algorithms page.

Below is a list of the different types of uncertain values, their syntax in the case file, and what they mean.  Each type of uncertain value can appear anywhere in a case file where a regular value normally would.  For example, a case file could be a regular CSV file, or a tab-delimited text file, but with some of the values replaced by entries having the syntax described below.
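
As a rough illustration of a case file that mixes regular and uncertain values, the following Python sketch builds a small tab-delimited file in memory.  The variable names (Temperature, Color, Sex) and the values are hypothetical, and whether a particular Netica version needs any further quoting of these entries has not been checked here.

```python
import csv
import io

# Hypothetical variables and cases; the names are illustrative only.
columns = ["Temperature", "Color", "Sex"]
cases = [
    ["37.2+-0.3", "red", "female"],                # Gaussian value, two certain values
    ["[36,38]", "{red,blue}", "*"],                # interval, set of possibilities, unknown
    [">39", "~{green}", "{female .8, male .3}"],   # unbounded interval, impossibility, likelihood
]

def write_case_file(columns, cases):
    """Return the text of a tab-delimited case file, one case per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(cases)
    return buffer.getvalue()

case_text = write_case_file(columns, cases)
print(case_text)
```

Tabs are used as the delimiter in this sketch so that the commas inside UVF entries are not confused with column separators.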

Gaussian

Syntax:

m+-s           m and s are real numbers

Examples:

5+-2    3.27+-0.03     0+-1e-5

This is for a Gaussian (also known as “normal”) likelihood finding, where m is the mean and s is the standard deviation.  Note that there cannot be any space before or after the +-.  Uncertainties in measurements from lab instruments, or in polling results, are often expressed with a ± notation and indicate a Gaussian distribution, so they can be entered into Netica directly (although sometimes they may mean an interval distribution, as described below).
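
The following Python sketch shows one way to read an m+-s entry and evaluate the shape of the resulting likelihood.  The function names are hypothetical, and the curve is left unnormalized (peak value 1) on the assumption that only relative likelihoods matter; how Netica itself scales or discretizes the curve is not described here.

```python
import math

def parse_gaussian(text):
    """Split an 'm+-s' entry into (mean, standard deviation)."""
    if any(ch.isspace() for ch in text):
        raise ValueError("a Gaussian value may not contain spaces")
    mean_text, separator, sd_text = text.partition("+-")
    if not separator:
        raise ValueError("expected the form m+-s")
    return float(mean_text), float(sd_text)

def gaussian_likelihood(x, mean, sd):
    """Unnormalized Gaussian likelihood of x (an assumption of this sketch)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z)

mean, sd = parse_gaussian("5+-2")
print(gaussian_likelihood(5.0, mean, sd))   # likelihood at the mean
print(gaussian_likelihood(7.0, mean, sd))   # one standard deviation away
```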

Interval

Syntax:

[a, b]    a and b are real numbers, state names or indexes preceded by #

Examples:

[0, 10]    [-3, 2.27]    [lo, med]    [#1, #3]

There may be spaces before or after the comma or brackets. Intervals of states include both endpoints, so [lo, med] includes states lo, med and any states between.  Intervals of numbers include the lower endpoint, but not the upper endpoint, so [0, 10] for variable X means 0 ≤ X < 10.  Likelihood within the interval is one; outside the interval it is zero.
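
The endpoint rules above can be sketched in Python as follows.  The function names are hypothetical, and the state list is assumed to be given in the node's own state order.

```python
def numeric_interval_likelihood(x, low, high):
    """Numeric interval [low, high]: lower endpoint included, upper excluded."""
    return 1.0 if low <= x < high else 0.0

def state_interval_likelihoods(states, first, last):
    """State interval [first, last]: both endpoints and everything between."""
    i, j = states.index(first), states.index(last)
    return [1.0 if i <= k <= j else 0.0 for k in range(len(states))]

print(numeric_interval_likelihood(0, 0, 10))    # lower endpoint is inside
print(numeric_interval_likelihood(10, 0, 10))   # upper endpoint is outside
print(state_interval_likelihoods(["lo", "med", "hi"], "lo", "med"))
```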

Unbounded Interval

Syntax:

>m  or  <m    m is a real number, state name or state index preceded by #

Examples:

>4.75    <-10    <med    >#2

When m is a state, the interval includes the endpoint, and when it is a real number, the interval includes the endpoint only for > intervals (so > is really ≥).  The interval can potentially extend to infinity, but in practice will probably be limited by known maximum or minimum values for the variable.  Likelihood within the interval is one; outside the interval it is zero.
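
A minimal Python sketch of these rules is given below.  The function names are hypothetical, and for states it is assumed that ">" means "this state or any later one" in the node's state order and "<" means "this state or any earlier one".

```python
def unbounded_numeric_likelihood(x, op, m):
    """'>m' includes m itself; '<m' does not."""
    if op == ">":
        return 1.0 if x >= m else 0.0
    if op == "<":
        return 1.0 if x < m else 0.0
    raise ValueError("op must be '>' or '<'")

def unbounded_state_likelihoods(states, op, state):
    """For states the endpoint is included in both directions."""
    pivot = states.index(state)
    if op == ">":
        return [1.0 if k >= pivot else 0.0 for k in range(len(states))]
    if op == "<":
        return [1.0 if k <= pivot else 0.0 for k in range(len(states))]
    raise ValueError("op must be '>' or '<'")

print(unbounded_numeric_likelihood(4.75, ">", 4.75))
print(unbounded_numeric_likelihood(-10, "<", -10))
print(unbounded_state_likelihoods(["lo", "med", "hi"], "<", "med"))
```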

Set of Possibilities

Syntax:

{s1, s2, … sn}   each si is a state name, state index preceded by #, interval, unbounded interval, or Gaussian.

Examples:

{lo, med}    {red, blue, green}    {#1, #5, #7}    {[0,3.5], [4.5,10]}    {[#35,#122], >#500}

There may be spaces before or after the comma or braces.  The value can be considered to be a disjunction of the elements (e.g. X=red or X=blue or X=green).  The likelihood of elements in the set is one; of those not in the set, it is zero.

Set of Impossibilities

Syntax:

~{s1, s2, … sn}  each si is a state name, state index preceded by #, interval or unbounded interval

Examples:

~{lo}    ~{red, blue, green}    ~{#1, #5, #7}    ~{[0, 3.5]}

There may be spaces before or after the comma or braces, but not between the tilde (~) and the brace.  This is the same as "Set of Possibilities" except the "possible" states are those that are not listed, rather than those that are listed.  The likelihood of elements in the set is zero; of those not in the set, it is one.

A negative finding can be represented easily by just listing the state(s) eliminated by the observation.
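
For a discrete variable, a set of possibilities and a set of impossibilities are complements of each other, as the Python sketch below illustrates.  The function names and the color states are hypothetical, and only plain state names are handled (no intervals or # indexes).

```python
def possibility_likelihoods(states, listed):
    """Set of possibilities {…}: listed states get 1, all others get 0."""
    unknown = set(listed) - set(states)
    if unknown:
        raise ValueError(f"unknown states: {sorted(unknown)}")
    return [1.0 if s in listed else 0.0 for s in states]

def impossibility_likelihoods(states, listed):
    """Set of impossibilities ~{…}: listed states get 0, all others get 1."""
    unknown = set(listed) - set(states)
    if unknown:
        raise ValueError(f"unknown states: {sorted(unknown)}")
    return [0.0 if s in listed else 1.0 for s in states]

colors = ["red", "green", "blue"]
print(possibility_likelihoods(colors, {"red", "blue"}))     # {red, blue}
print(impossibility_likelihoods(colors, {"red", "blue"}))   # ~{red, blue}
```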

Likelihood

Syntax:

{s1 p1, s2 p2, … sn pn}  each si is a state name, state index preceded by #, interval, unbounded interval, or Gaussian. Each pi is a number between 0 and 1.  Some pi may be absent.

Examples:

{female .8, male .3}    {3+-1 0.2, 7+-2 0.4}    {[0,1.5] .5, [1.5,5] 0.1, [5,10] 0.02}

This is the same as a set of possibilities, but each possibility is weighted with a likelihood that appears after it (separated by a single space).  The most common kind of likelihood vector is for a discrete variable, where each state is listed, followed by its likelihood.  Any states that appear without a number have a likelihood of 1, and any states that don't appear at all have a likelihood of 0.

Arbitrary likelihood distributions for continuous variables can be formed by a series of adjacent intervals, each with its own likelihood.  Or the elements can overlap, and then their likelihoods are combined.  For example, {[0,10] .1, [2,4] .2} would be the combination of a rect function extending from 0 to 10 with height 0.1, and another rect from 2 to 4 with a height of 0.2.

Another useful distribution that is easy to form is the weighted combination of Gaussians.  For example {3+-1 0.2, 7+-2 0.4} is a bi-modal distribution with peaks at 3 and 7.

It is possible to mix weighted Gaussians, intervals, and discrete states within a single { ... } likelihood vector.
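
For the common discrete case, the defaults described above (1 for a state listed without a number, 0 for a state not listed) can be sketched in Python as follows.  The parser is hypothetical and deliberately limited: it accepts only plain state names, not # indexes, intervals or Gaussians.

```python
def parse_discrete_likelihood(text, states):
    """Turn '{s1 p1, s2 p2, …}' into a likelihood vector ordered like states."""
    body = text.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("expected a value enclosed in braces")
    likelihood = {s: 0.0 for s in states}        # unlisted states default to 0
    for entry in body[1:-1].split(","):
        parts = entry.split()
        if not parts:
            continue
        name = parts[0]
        if name not in likelihood:
            raise ValueError(f"unknown state: {name}")
        # A state listed without a number has likelihood 1.
        likelihood[name] = float(parts[1]) if len(parts) > 1 else 1.0
    return [likelihood[s] for s in states]

print(parse_discrete_likelihood("{female .8, male .3}", ["female", "male"]))
print(parse_discrete_likelihood("{lo, hi .5}", ["lo", "med", "hi"]))
```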

Relative Likelihood

Syntax:

~{s1 p1, s2 p2, … sn pn}  each si is a state name, state index preceded by #, interval, or unbounded interval. Each pi is a positive number.  Some pi may be absent.

Examples:

~{red, green, teal .2, olive .8}    ~{[0,2] .4, [2,6] .2}

This is like the set of impossibilities, but each entry may have a weight, which appears after it. If no number appears after it, its weight is 0. Entries that have numbers above 1 are indicated to be more probable than those not listed, and entries with numbers below 1 are less probable than the unlisted ones, since unlisted entries have a weight of 1.  To convert this to a regular likelihood vector, fill in the missing entries and missing weights, then if the largest weight is greater than one, divide all the weights by the largest weight.  For example, if a variable can take on states a, b, c, d, e or f, and the UVF value is  ~{a 0.5, b, f 2} then that corresponds to a likelihood vector for (a, b, c, d, e, f) of:   (0.25, 0, 0.5, 0.5, 0.5, 1.0).
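
The conversion just described can be sketched in Python for a discrete variable.  The function name and the input representation (a list of (state, weight) pairs, with None standing for an absent weight) are assumptions made for this illustration.

```python
def relative_to_likelihood(entries, states):
    """Convert a relative likelihood ~{…} to a regular likelihood vector.

    entries: list of (state, weight) pairs; weight is None if absent.
    states:  all states of the variable, in order.
    """
    weights = {s: 1.0 for s in states}           # unlisted states have weight 1
    for name, weight in entries:
        if name not in weights:
            raise ValueError(f"unknown state: {name}")
        weights[name] = 0.0 if weight is None else float(weight)
    largest = max(weights.values())
    if largest > 1.0:                            # rescale so the maximum is 1
        weights = {s: w / largest for s, w in weights.items()}
    return [weights[s] for s in states]

# The example from the text: ~{a 0.5, b, f 2} over states a to f.
vector = relative_to_likelihood([("a", 0.5), ("b", None), ("f", 2.0)],
                                ["a", "b", "c", "d", "e", "f"])
print(vector)   # [0.25, 0.0, 0.5, 0.5, 0.5, 1.0]
```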

Complete Uncertainty

Syntax:

*  or  ?    [i.e. the syntax is just an asterisk or a question mark]

If nothing is known regarding the value of this variable (i.e. missing data), then a question mark ? or an asterisk * should be used to indicate that.  It is equivalent to ~{} which is a likelihood of all ones.