CSCI A113
Lecture Notes Four

Fall 2001


Probability. Some Review. The Normal Curve.

1. Probability

A die (plural: dice) is a small cube whose faces are marked with one to six spots. Dice are usually used in pairs in various games and in gambling: they are shaken and thrown so that they come to rest at random on a flat surface. Let's visualize that, if you will:

When we toss a die in an experiment, we get one of the following possible outcomes:
E = { 1, 2, 3, 4, 5, 6 }
To be more general we could denote this set of outcomes as follows:
E = { e1, e2, e3, e4, e5, e6 }
To be even more general we could abstract the number of events as well:
E = { e1, e2, ..., em }
and keep in mind that for a single die m = 6 and ei = i.

Note that the experiment involves throwing just one die.

The collection of all these events is called the event space for the experiment.

Describe the event space when we throw two dice (it's a set of ordered pairs).

Describe the event space when we throw two dice and are interested in the sum of their points.

If the die is unbiased, each of the six events is equally likely every time we perform an experiment.

Assume that we do n > 0 experiments.

Let f(ei) be the frequency of occurrence of event ei in the n experiments.

The experimental probability of event ei in n experiments is defined as

$$ p(e_i) = \frac{f(e_i)}{n} $$

It should be immediate that:

$$ \sum_{i=1}^{m} p(e_i) = 1 $$

Prove it.
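
As an illustration only (it is not part of the original notes), here is a minimal Python sketch that simulates n tosses of a fair die and tabulates the experimental probabilities f(ei) / n. The function name `experimental_probabilities`, the seed, and the number of tosses are assumptions chosen for the example.

```python
import random
from collections import Counter

def experimental_probabilities(outcomes, event_space):
    """Return {e: f(e) / n} for every event e in the event space."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need n > 0 experiments")
    freq = Counter(outcomes)          # f(e): how often each event occurred
    return {e: freq[e] / n for e in event_space}

E = [1, 2, 3, 4, 5, 6]                # event space for a single die
rng = random.Random(2001)             # arbitrary fixed seed so runs repeat
tosses = [rng.choice(E) for _ in range(600)]

p = experimental_probabilities(tosses, E)
for e in E:
    print(e, p[e])
print("sum of probabilities:", sum(p.values()))
```

The printed sum should be 1 up to floating-point rounding, whatever the individual frequencies turn out to be.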

The first part of the lab tomorrow will be probability related.

Next, let's review some of the things we did last time.

2. Central Statistics

Let E be a number-based event space and let

X = (x1, x2, ..., xn)
be a list of events from E that originate from n experiments.

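The mean (also called average or expected value) of X is defined as:

$$ \bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j $$

If we denote by $p(e_i) = f(e_i)/n$ the experimental probability of event ei, where f(ei) is the frequency of ei in the n experiments, then the mean can also be defined as follows:

$$ \bar{x} = \sum_{i=1}^{m} p(e_i) \, e_i $$

Prove it.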

This expression links the theory of probability to the statistical notion of mean.
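
The two ways of computing the mean can be compared numerically. The sketch below is only an illustration: the function names and the sample list are made up, and a numerical check is of course not a proof.

```python
from collections import Counter

def mean(xs):
    """Arithmetic mean: the sum of the observations divided by their number."""
    return sum(xs) / len(xs)

def mean_from_probabilities(xs):
    """Mean as the sum over distinct events e of p(e) * e, with p(e) = f(e) / n."""
    n = len(xs)
    freq = Counter(xs)                # f(e) for every event that occurred
    return sum((f / n) * e for e, f in freq.items())

X = [3, 1, 6, 3, 2, 3, 6, 4]          # made-up results of eight die tosses
print(mean(X))                        # 3.5
print(mean_from_probabilities(X))     # same value, up to rounding
```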

The mean is a statistic of so-called central tendency.

In fact, one can prove that:

$$ \sum_{j=1}^{n} (x_j - \bar{x}) = 0 $$

where $\bar{x}$ denotes the mean of X.

Prove it.

If we take the sum over the differences between each one of the observed events and the mean we obtain a value of 0 (zero). The differences cancel each other out. Hence the mean is the most central value (from this point of view, almost a center of mass) that characterizes the observed events.

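Let's summarize here the properties of the mean:

(a) The sum of the deviations about the mean is zero: $\sum_{j=1}^{n} (x_j - \bar{x}) = 0$, where $\bar{x}$ is the mean of X.

(b) The sum of the squared deviations about the mean is a minimum: $\sum_{j=1}^{n} (x_j - c)^2$ is smallest when $c = \bar{x}$.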

This last property is very important and is used in many areas of statistics, particularly in regression. Elaborated a little more fully, this property states that although the sum of the squared deviations about the mean does not usually equal zero, it is smaller than if squared deviations were taken about any other value. You have tested this property in Lab Three, yesterday.
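
Both properties can be checked on a small example. The sketch below assumes a made-up list of observations and hypothetical helper names; it only illustrates the statements above and does not prove them.

```python
def mean(xs):
    return sum(xs) / len(xs)

def sum_of_deviations(xs, c):
    """Sum of the differences between each observation and the value c."""
    return sum(x - c for x in xs)

def sum_of_squared_deviations(xs, c):
    """Sum of the squared differences between each observation and c."""
    return sum((x - c) ** 2 for x in xs)

X = [2, 4, 4, 5, 10]                  # made-up observations; their mean is 5
m = mean(X)

print(sum_of_deviations(X, m))        # 0.0: the deviations cancel out
for c in (m - 1, m, m + 1):
    # the squared deviations are smallest when c equals the mean
    print(c, sum_of_squared_deviations(X, c))
```

For this list the sum of squared deviations is 36 about the mean and 41 about either neighboring value.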

There are other measures of central tendency, such as the median and the mode.

Let's review them.

They both rely on the idea of sorting.

Let

X = (x1, x2, ..., xn)
be a list of n experiments.

Since the xj's are numbers we can sort the list X in ascending (increasing) order.

Assume that the result is the list

(y1, y2, ..., yn)
The median of X is defined as:

$$ \mathrm{median}(X) = \begin{cases} y_{(n+1)/2} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(y_{n/2} + y_{n/2+1}\right) & \text{if } n \text{ is even} \end{cases} $$

The median has the following property:

The median is also a measure of central tendency. One property of the median worth noting is that it is less sensitive than the mean to extreme scores (values, or events). Because it is usually less stable than the mean from sample to sample, the median is not as useful in inferential statistics.

The third and last measure of central tendency that we shall discuss is the mode.

The mode is defined as the most frequent score in the distribution. (When all scores in the distribution have the same frequency, it is customary to say that the distribution has no mode).

Clearly, this is the easiest of the three measures to determine. The mode is found by inspection of the scores; there isn't any calculation necessary.

Usually distributions are unimodal; that is, they have only one mode. However, it is possible for a distribution to have several modes. When a distribution has two modes, the distribution is called bimodal. In general, with more than two modes a distribution is called multimodal.
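
Finding the mode by inspection amounts to counting how often each score occurs. The following sketch uses a hypothetical function `modes` that returns every most frequent score, and an empty list when all scores are equally frequent. Treating a list with a single distinct score as having that score as its mode is a choice made for this example.

```python
from collections import Counter

def modes(xs):
    """Return the most frequent scores in ascending order ([] if there is no mode)."""
    freq = Counter(xs)
    if not freq:
        return []
    top = max(freq.values())
    # Convention from the text: if all scores are equally frequent, there is no mode.
    if len(freq) > 1 and all(f == top for f in freq.values()):
        return []
    return sorted(e for e, f in freq.items() if f == top)

print(modes([1, 2, 2, 3]))        # [2]     unimodal
print(modes([1, 1, 2, 3, 3]))     # [1, 3]  bimodal
print(modes([1, 2, 3]))           # []      no mode
```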

Measures of Central Tendency and Symmetry.

If the distribution is unimodal and symmetrical, then the mean, median and mode will all be equal. When the distribution is skewed, the mean and median will not be equal. Since the mean is most affected by extreme scores, it will have a value closer to the extreme scores than will the median. Thus, with a negatively skewed distribution, the mean will be lower than the median. With a positively skewed curve, the mean will be larger than the median.
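
To see the effect of skew on a concrete case, here is a small sketch using Python's standard `statistics` module. The two data sets are made up for illustration.

```python
import statistics

positively_skewed = [1, 2, 2, 3, 3, 3, 4, 20]        # long tail to the right
negatively_skewed = [1, 17, 18, 18, 19, 19, 19, 20]  # long tail to the left

for name, data in (("positive skew", positively_skewed),
                   ("negative skew", negatively_skewed)):
    print(name,
          "mean =", statistics.mean(data),
          "median =", statistics.median(data))
```

In the first list the mean (4.75) lies above the median (3.0); in the second the mean (16.375) lies below the median (18.5).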

Here's a picture that illustrates the point:

3. Statistics of Dispersion

Measures of Variability

Variability has to do with how far apart the scores (the values obtained or measured in the experiments) are spread. Whereas measures of central tendency quantify the average value of the distribution, measures of variability quantify the extent of dispersion. There are three measures of variability we will be looking at: the range, the variance, and the standard deviation.

The range is defined as the difference between the highest and lowest score in the distribution.

The range is easy to calculate but gives us only a relatively crude measure of dispersion, because the range really only measures the spread of the two extreme scores and not the spread of any of the scores in between. (By scores we mean, as usual, observed events, or measurements).
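
A short sketch shows why the range is a crude measure. The function name `value_range` and both lists are assumptions made for the example: the two lists have the same range although their inner scores are spread very differently.

```python
def value_range(xs):
    """Range: the highest score minus the lowest score."""
    if not xs:
        raise ValueError("range of an empty list is undefined")
    return max(xs) - min(xs)

tight = [1, 5, 5, 5, 5, 5, 9]     # most scores sit at the center
spread = [1, 2, 3, 5, 7, 8, 9]    # scores are spread evenly

print(value_range(tight), value_range(spread))   # 8 8
```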

Let

X = (x1, ..., xn)
be a list of n experiments.

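The variance of the list X is defined as follows:

$$ \mathrm{var}(X) = \frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x})^2 $$

where $\bar{x}$ denotes the mean of X.

Alternatively, the variance of X is defined as the mean of the list

$$ \left( (x_1 - \bar{x})^2, (x_2 - \bar{x})^2, \ldots, (x_n - \bar{x})^2 \right) $$

that is,

$$ \mathrm{var}(X) = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n} $$

As with the mean, we can define the variance of X in terms of the experimental probabilities $p(e_i) = f(e_i)/n$ of the events:

$$ \mathrm{var}(X) = \sum_{i=1}^{m} p(e_i) \, (e_i - \bar{x})^2 $$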
The variance is not used much in descriptive statistics because it gives us squared units of measurement. It is used, however, quite frequently in inferential statistics.
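
The sketch below computes the variance both as the mean of the squared deviations and from the experimental probabilities of the events. It assumes the divide-by-n definition implied by "the mean of the list" above; the function names and the data are made up for illustration.

```python
from collections import Counter

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Variance as the mean of the squared deviations about the mean (divides by n)."""
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])

def variance_from_probabilities(xs):
    """Variance as the sum over distinct events e of p(e) * (e - mean)^2."""
    n = len(xs)
    m = mean(xs)
    freq = Counter(xs)                # f(e) for every event that occurred
    return sum((f / n) * (e - m) ** 2 for e, f in freq.items())

X = [2, 4, 4, 4, 5, 5, 7, 9]          # made-up observations; their mean is 5
print(variance(X))                    # 4.0
print(variance_from_probabilities(X)) # same value, up to rounding
```

Note that the result is in squared units: if the observations were in centimeters, the variance would be in square centimeters.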

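The standard deviation of a list of experiments X is defined as follows:

$$ \mathrm{sd}(X) = \sqrt{\mathrm{var}(X)} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x})^2} $$

where $\bar{x}$ denotes the mean of X.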

The standard deviation is the most frequently encountered measure of variability.
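
As a sketch, the standard deviation can be computed as the square root of the variance and cross-checked against `statistics.pstdev` from Python's standard library, which also divides by n. The function name `standard_deviation` and the data are assumptions for the example.

```python
import math
import statistics

def standard_deviation(xs):
    """Square root of the mean squared deviation about the mean (divides by n)."""
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return math.sqrt(var)

X = [2, 4, 4, 4, 5, 5, 7, 9]          # made-up observations; variance is 4
print(standard_deviation(X))          # 2.0, in the same units as the scores
print(statistics.pstdev(X))           # library cross-check
```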

The standard deviation has many important characteristics:

The last property is one of the main reasons why the standard deviation is used so much more often than the range for reporting variability. In addition, both the mean and the standard deviation can be manipulated algebraically. This allows mathematics to be done with them for use in inferential statistics.

We now take a look at a particular curve.

4. The Normal Curve

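The normal curve is a theoretical distribution of population scores. It is a bell-shaped curve which, like most other curves, has an equation that describes it:

$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} $$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the distribution.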
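
To get a feel for the shape, here is a minimal sketch that evaluates the standard equation of the normal curve with mean `mu` and standard deviation `sigma` and prints a rough text plot. The function name, the default parameters, and the plotting scale are assumptions for illustration.

```python
import math

def normal_curve(x, mu=0.0, sigma=1.0):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Crude text plot: one row per x, bar length proportional to the curve's height.
for x in (-3, -2, -1, 0, 1, 2, 3):
    height = normal_curve(x)
    print(f"{x:>3} {height:.4f} " + "*" * round(100 * height))
```

The bars are longest at the mean and fall off symmetrically on both sides, which is the bell shape.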

We're going to look at this curve in lab tomorrow, and in lab and lecture all week next week.


Last updated: November 1, 2001 by Adrian German for A113