CSCI A113
We start with a brief review of lectures past, then move on to the new material.
1. Probability
A die (plural: dice) is a small cube marked on each face with from one to six spots and used usually in pairs in various games and in gambling by being shaken and thrown to come to rest at random on a flat surface.
When we toss a die in an experiment, we get one of the following possible outcomes:
E = { 1, 2, 3, 4, 5, 6 }
To be more general we could denote this set of outcomes as follows:
E = { e1, e2, e3, e4, e5, e6 }
To be even more general we could abstract the number of events as well:
E = { e1, e2, ..., em }
and keep in mind that for a single die m = 6 and ei = i.
Note that the experiment involves throwing just one die.
The collection of all these events is called the event space for the experiment.
Describe the event space when we throw two dice (it's a set of ordered pairs).
Describe the event space when we throw two dice and are interested in the sum of their points.
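To check an answer to the two exercises above, the event spaces can be enumerated directly. This is only a sketch; the names `die`, `pairs` and `sums` are our own and do not come from the lecture.

```python
from itertools import product

die = range(1, 7)

# Event space for two dice: all ordered pairs (first die, second die).
pairs = list(product(die, repeat=2))

# Event space when we only care about the sum of the points.
sums = sorted({a + b for a, b in pairs})

print(len(pairs))  # 36
print(sums)        # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```

Note that the 36 ordered pairs collapse to only 11 distinct sums, so the events of the second space are not equally likely even when the dice are unbiased.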
If the die is unbiased, when we perform an experiment, the likelihood that we get one of the six events is the same for each event.
Assume that we do n > 0 experiments.
Let f (ei) be the frequency of occurrence of event ei in the n experiments.
The experimental probability of event ei in n experiments is defined as
$$p(e_i) = \frac{f(e_i)}{n}$$
It should be immediate that:
$$\sum_{i=1}^{m} p(e_i) = 1$$
Prove it.
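A small simulation can make this concrete. The sketch below takes the experimental probability of an event to be its frequency divided by n and checks that the probabilities add up to 1. The function name, the seed and the number of rolls are arbitrary choices of ours, and exact fractions are used so that the check does not depend on floating-point rounding.

```python
import random
from collections import Counter
from fractions import Fraction

def experimental_probabilities(outcomes, event_space):
    """Return {event: f(event) / n} for a list of observed outcomes."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need n > 0 experiments")
    freq = Counter(outcomes)  # f(e_i): how often each event occurred
    return {e: Fraction(freq[e], n) for e in event_space}

E = [1, 2, 3, 4, 5, 6]
rng = random.Random(113)  # fixed (arbitrary) seed so that runs are repeatable
rolls = [rng.choice(E) for _ in range(600)]

p = experimental_probabilities(rolls, E)
for e in E:
    print(e, float(p[e]))
print(sum(p.values()))  # the experimental probabilities add up to 1
```

For an unbiased die each printed probability should be near 1/6, and the approximation tends to improve as n grows.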
2. Central Statistics
Let E be a number-based event space and let
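X = (x1, x2, ..., xn)
be a list of events from E that originate from n experiments.
The mean (also called average or expected value) of X is defined as:
$$\mathrm{Mean}(X) = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{j=1}^{n} x_j$$
If we denote by p(ei) the experimental probability of event ei in the n experiments, then
$$\mathrm{Mean}(X) = \sum_{i=1}^{m} e_i \, p(e_i)$$
Prove it. This expression links the theory of probability to the statistical notion of mean.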
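The link between the two views of the mean can be checked numerically. In the sketch below, `mean` adds the observations and divides by n, while `mean_from_probabilities` weights each distinct event by its experimental probability (frequency divided by n). Both function names and the sample list are our own illustrations.

```python
from collections import Counter
from fractions import Fraction

def mean(xs):
    """Sum of the observations divided by their number."""
    return Fraction(sum(xs), len(xs))

def mean_from_probabilities(xs):
    """Sum over distinct events e of e * p(e), with p(e) = f(e) / n."""
    n = len(xs)
    freq = Counter(xs)
    return sum(e * Fraction(f, n) for e, f in freq.items())

X = [2, 3, 3, 5, 6, 6, 6, 1]  # hypothetical die rolls
print(mean(X))                     # 4
print(mean_from_probabilities(X))  # 4
```

The two results agree because grouping equal observations turns the sum of the xj into the sum of ei times f(ei).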
The mean is a statistic of so-called central tendency.
In fact, one can prove that:
$$\sum_{j=1}^{n} \bigl(x_j - \mathrm{Mean}(X)\bigr) = 0$$
Prove it. If we take the sum over the differences between each one of the observed events and the mean we obtain a value of 0 (zero). The differences cancel each other out. Hence the mean is the most central value (from this point of view, almost a center of mass) that characterizes the observed events.
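Let's summarize here the properties of the mean:
The mean is sensitive to extreme scores.
The sum of the deviations about the mean equals zero.
The sum of the squared deviations about the mean is a minimum.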
This last property is very important and is used in many areas of statistics, particularly in regression. Elaborated a little more fully, this property states that although the sum of the squared deviations about the mean does not usually equal zero, it is smaller than if squared deviations were taken about any other value.
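Both deviation properties can be tried out on a small list. The sketch below is only a numerical spot check, not a proof: it verifies that the deviations about the mean cancel, and that a few other candidate centers (picked arbitrarily by us) give a larger sum of squared deviations.

```python
from fractions import Fraction

def mean(xs):
    return Fraction(sum(xs), len(xs))

def sum_of_deviations(xs, c):
    """Sum of (x - c) over the list."""
    return sum(x - c for x in xs)

def sum_of_squared_deviations(xs, c):
    """Sum of (x - c)^2 over the list."""
    return sum((x - c) ** 2 for x in xs)

X = [1, 2, 2, 3, 7]  # hypothetical observations, mean 3
m = mean(X)

print(sum_of_deviations(X, m))          # 0
print(sum_of_squared_deviations(X, m))  # 22

# Any other center gives a strictly larger sum of squared deviations.
for c in (m - 1, m - Fraction(1, 10), m + Fraction(1, 10), m + 1):
    assert sum_of_squared_deviations(X, c) > sum_of_squared_deviations(X, m)
```

The reason behind the check is the identity: the sum of squared deviations about c equals the sum about the mean plus n times the square of (c minus the mean).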
There are other measures of central tendency, such as the median and the mode.
They both rely on the idea of sorting.
Let
X = (x1, x2, ..., xn)
be a list of n experiments.
Since the xj's are numbers we can sort the list X in ascending (increasing) order.
Assume that the result is the list
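(y1, y2, ..., yn)
The median of X is defined as:
$$\mathrm{Median}(X) = \begin{cases} y_{(n+1)/2} & \text{if } n \text{ is odd} \\ \dfrac{y_{n/2} + y_{n/2+1}}{2} & \text{if } n \text{ is even} \end{cases}$$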
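As a sketch of this computation, the function below sorts the list and then uses the usual convention: the middle element when n is odd, and the average of the two middle elements when n is even. The function name is our own; Python's standard `statistics.median` follows the same convention.

```python
def median(xs):
    """Median of a non-empty list of numbers."""
    ys = sorted(xs)  # the sorted list (y1, ..., yn)
    n = len(ys)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ys[mid]                  # the middle element
    return (ys[mid - 1] + ys[mid]) / 2  # average of the two middle elements

print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```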
The median has the following property:
The third and last measure of central tendency that we shall discuss is the mode.
The mode is defined as the most frequent score in the distribution. (When all scores in the distribution have the same frequency, it is customary to say that the distribution has no mode.)
Clearly, this is the easiest of the three measures to determine. The mode is found by inspection of the scores; there isn't any calculation necessary.
Usually distributions are unimodal; that is, they have only one mode. However, it is possible for a distribution to have more than one mode. When a distribution has two modes, the distribution is called bimodal. In general, with more than two modes a distribution is called multimodal.
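Finding the mode by inspection amounts to counting frequencies. The sketch below returns all modes as a list, so unimodal, bimodal and multimodal cases are handled the same way. It follows the convention that a distribution whose scores all have the same frequency has no mode; treating a list with a single distinct score as having that score as its mode is our own assumption.

```python
from collections import Counter

def modes(xs):
    """Return the list of modes of xs (empty if there is no mode)."""
    if not xs:
        return []
    freq = Counter(xs)
    top = max(freq.values())
    # All scores equally frequent (and more than one distinct score): no mode.
    if len(freq) > 1 and all(f == top for f in freq.values()):
        return []
    return sorted(x for x, f in freq.items() if f == top)

print(modes([1, 2, 2, 3]))     # [2]     unimodal
print(modes([1, 1, 2, 2, 3]))  # [1, 2]  bimodal
print(modes([1, 2, 3]))        # []      no mode
```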
Measures of Central Tendency and Symmetry.
If the distribution is unimodal and symmetrical, then the mean, median and mode will all be equal. When the distribution is skewed, the mean and median will not be equal. Since the mean is most affected by extreme scores, it will have a value closer to the extreme scores than will the median. Thus, with a negatively skewed distribution, the mean will be lower than the median. With a positively skewed curve, the mean will be larger than the median.
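A tiny numerical illustration of the effect of skew, using made-up scores: one list has a single extreme high score, and the other is its mirror image. The `mean` and `median` functions come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical scores with one extreme high value (positive skew).
right_skewed = [1, 2, 2, 3, 3, 4, 20]
# Mirror image of the same scores (negative skew).
left_skewed = [21 - x for x in right_skewed]

for name, scores in (("positive skew", right_skewed),
                     ("negative skew", left_skewed)):
    print(name, "mean =", mean(scores), "median =", median(scores))
```

In the first list the extreme score pulls the mean above the median; in the mirrored list it pulls the mean below the median.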
Here's a picture that illustrates the point:
3. Statistics of Dispersion
Measures of Variability
Variability has to do with how far the scores (or values obtained, measured in the experiments) are spread apart. Whereas measures of central tendency are a quantification of the average value of the distribution, measures of variability quantify the extent of dispersion. There are three measures of variability we will be looking at: the range, the variance, and the standard deviation.
The range is defined as the difference between the highest and lowest score in the distribution. The range is easy to calculate but gives us only a relatively crude measure of dispersion, because the range really only measures the spread of the two extreme scores and not the spread of any of the scores in between. (By scores we mean, as usual, observed events, or measurements).
Let
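X = (x1, ..., xn)
be a list of n experiments.
The variance of the list X is defined as follows:
$$\mathrm{Variance}(X) = \frac{1}{n}\sum_{j=1}^{n} \bigl(x_j - \mathrm{Mean}(X)\bigr)^2$$
where Mean(X) is the mean of X. Alternatively, the variance of X is defined as the mean of the list
$$\bigl((x_1 - \mathrm{Mean}(X))^2, \ldots, (x_n - \mathrm{Mean}(X))^2\bigr)$$
that is,
$$\mathrm{Variance}(X) = \mathrm{Mean}\bigl((X - \mathrm{Mean}(X))^2\bigr)$$
As with the mean, we can define the variance of X in terms of the experimental probabilities p(ei) of the events:
$$\mathrm{Variance}(X) = \sum_{i=1}^{m} \bigl(e_i - \mathrm{Mean}(X)\bigr)^2 \, p(e_i)$$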
The variance is not used much in descriptive statistics because it gives us squared units of measurement. It is used, however, quite frequently in inferential statistics.
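The range and the variance can be computed in a few lines. The sketch below takes the variance to be the mean of the squared deviations about the mean (dividing by n, not n - 1), and also computes it from experimental probabilities (frequency divided by n) to show that the two routes agree. Function names and data are our own.

```python
from collections import Counter
from fractions import Fraction

def mean(xs):
    return Fraction(sum(xs), len(xs))

def variance(xs):
    """Mean of the list of squared deviations about the mean."""
    m = mean(xs)
    return mean([(x - m) ** 2 for x in xs])

def variance_from_probabilities(xs):
    """Sum over distinct events e of (e - mean)^2 * p(e), with p(e) = f(e) / n."""
    n = len(xs)
    m = mean(xs)
    return sum((e - m) ** 2 * Fraction(f, n) for e, f in Counter(xs).items())

X = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores, mean 5

print(max(X) - min(X))                 # range: 7
print(variance(X))                     # 4
print(variance_from_probabilities(X))  # 4
```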
The standard deviation of a list of experiments X is defined as follows:
$$\mathrm{StandardDeviation}(X) = \sqrt{\mathrm{Variance}(X)}$$
The standard deviation is the most frequently encountered measure of variability.
The standard deviation has many important characteristics:
It measures dispersion relative to the mean. This differs from the range, which tells us directly the spread of the two most extreme scores.
It is sensitive to each score in the distribution. If a score is moved closer to the mean, then the standard deviation will become smaller. Conversely, if a score shifts away from the mean, then the standard deviation will increase.
It is stable with regard to sampling fluctuations. If samples were taken repeatedly from populations of the type usually encountered in the behavioral sciences, the standard deviation of the samples would vary much less from sample to sample than the range.
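The sensitivity of the standard deviation to individual scores is easy to see numerically. The sketch below computes the standard deviation as the square root of the variance (with divisor n) and then moves a single score of a made-up list toward and away from the mean.

```python
import math

def standard_deviation(xs):
    """Square root of the mean squared deviation about the mean."""
    n = len(xs)
    m = sum(xs) / n
    variance = sum((x - m) ** 2 for x in xs) / n
    return math.sqrt(variance)

X = [2, 4, 4, 4, 5, 5, 7, 9]         # hypothetical scores, mean 5
closer = [2, 4, 4, 4, 5, 5, 7, 8]    # the score 9 moved toward the mean
farther = [2, 4, 4, 4, 5, 5, 7, 12]  # the score 9 moved away from the mean

print(standard_deviation(X))  # 2.0
print(standard_deviation(closer) < standard_deviation(X))   # True
print(standard_deviation(farther) > standard_deviation(X))  # True
```

The range, by contrast, would not change at all if one of the middle scores were moved.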
4. Correlation
Correlation and regression are very much related. They both involve the relationship between two variables. Regression, however, is primarily concerned with using the relationship for prediction, whereas correlation is concerned primarily with finding out whether a relationship exists, and with determining its magnitude.
Aside from the practical utility of using a relationship for prediction, why would anyone be interested in determining if two variables are related? One important reason is that if the variables are related, it is possible that one of them is the cause of the other.
However, the fact that two variables are related is not sufficient basis for proving causality. Nevertheless, because correlational studies are among the easiest to carry out, showing that a correlation exists between the variables is often the first step toward proving that they are causally related. Conversely, if a correlation does not exist between two variables, a causal relationship can be ruled out.
Another very important use of correlation is in assessing the reliability of testing instruments. Reliability in connection with tests means consistency in scores over repeated administration of the test. Correlational techniques allow us to measure the relationship between the scores derived on the two administrations and, hence, to measure the reliability of the test.
Given two lists, each with n experiments,
X = (x1, ..., xn)
and
Y = (y1, ..., yn)
it is natural to ask whether there exists a functional relationship between the experiments in X and Y.
The simplest functional relationship is a linear relationship.
Mathematically, we want to find out whether there exist two numbers, a and b, such that for each j
yj = a + bxj
In general, the numbers a and b do not exist. However, we can still set out to find the numbers a and b which give the "best" approximation to this linear relationship.
The correlation coefficient of the lists X and Y is a statistic which provides some insight into how good these best numbers a and b are at establishing a linear relationship between the experiments in X and the experiments in Y. Before we define the correlation coefficient of X and Y, we define the covariance of the lists X and Y.
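The covariance of the lists of experiments X and Y is defined as follows:
$$\mathrm{Covariance}(X, Y) = \frac{1}{n}\sum_{j=1}^{n} \bigl(x_j - \mathrm{Mean}(X)\bigr)\bigl(y_j - \mathrm{Mean}(Y)\bigr)$$
Note that if X = Y then
Covariance(X, Y) = Covariance(X, X) = Variance(X)
We have the following property:
$$\mathrm{Covariance}(X, Y) = \mathrm{Mean}(XY) - \mathrm{Mean}(X)\,\mathrm{Mean}(Y)$$
where the list X Y is defined as (x1y1, ..., xnyn)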
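The covariance and its relation to the list of products can be checked on a small example. The sketch below takes the covariance to be the mean of the products of deviations about the two means (divisor n) and compares it with the mean of the products minus the product of the means. Names and data are our own.

```python
from fractions import Fraction

def mean(xs):
    return Fraction(sum(xs), len(xs))

def covariance(xs, ys):
    """Mean of (x_j - mean(X)) * (y_j - mean(Y))."""
    if len(xs) != len(ys):
        raise ValueError("lists must have the same length")
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

X = [1, 2, 3, 4, 5]  # hypothetical paired observations
Y = [2, 1, 4, 3, 5]
XY = [x * y for x, y in zip(X, Y)]  # the list (x1y1, ..., xnyn)

print(covariance(X, Y))               # 8/5
print(mean(XY) - mean(X) * mean(Y))   # 8/5
print(covariance(X, X))               # 2, the variance of X
```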
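The correlation coefficient of the lists of experiments X and Y is defined as follows:
$$\mathrm{CorrelationCoefficient}(X, Y) = \frac{\mathrm{Covariance}(X, Y)}{\mathrm{StandardDeviation}(X)\,\mathrm{StandardDeviation}(Y)}$$
We have the following property: if
X = Y
(and X is not constant) then
CorrelationCoefficient(X, Y) = 1
We also have the following: if there exist a and b such that for each j
yj = a + bxj
that is,
Y = a + bX
(and X is not constant) then
CorrelationCoefficient(X, Y) is not defined if b = 0, because the standard deviation of Y is then zero;
CorrelationCoefficient(X, Y) = 1 if b > 0; and
CorrelationCoefficient(X, Y) = -1 if b < 0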
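The three cases can be tried out numerically. The sketch below takes the correlation coefficient to be the covariance divided by the product of the two standard deviations (all with divisor n), and raises an error when one of the standard deviations is zero. The function names and the particular values of a and b are our own; floating-point results are compared with a tolerance.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def correlation_coefficient(xs, ys):
    sx = math.sqrt(covariance(xs, xs))  # standard deviation of X
    sy = math.sqrt(covariance(ys, ys))  # standard deviation of Y
    if sx == 0 or sy == 0:
        raise ValueError("undefined: a list has zero standard deviation")
    return covariance(xs, ys) / (sx * sy)

X = [1, 2, 3, 4, 5]
rising = [3 + 2 * x for x in X]   # Y = a + bX with b > 0
falling = [3 - 2 * x for x in X]  # Y = a + bX with b < 0
flat = [3 + 0 * x for x in X]     # Y = a + bX with b = 0

print(math.isclose(correlation_coefficient(X, rising), 1.0))    # True
print(math.isclose(correlation_coefficient(X, falling), -1.0))  # True
try:
    correlation_coefficient(X, flat)
except ValueError as err:
    print(err)
```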
5. Linear Regression
Given two lists, each with n experiments,
X = (x1, ..., xn)
and
Y = (y1, ..., yn)
it is natural to ask whether there exists a functional relationship between the experiments in X and Y.
The simplest functional relationship is a linear relationship.
Mathematically, we want to find out whether there exist two numbers, a and b, such that for each j
yj = a + bxj
In general, the numbers a and b do not exist. However, we can still set out to find the numbers a and b which give the "best" approximation to this linear relationship.
Linear regression is a technique to determine the best a and b.
Technically, we want to determine the numbers a and b that minimize the formula:
$$\sum_{j=1}^{n} \bigl(y_j - a - b\,x_j\bigr)^2$$
Much of what we will cover in the chapter about regression will involve finding out these values.
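As a preview, here is a minimal sketch of a least-squares fit. It assumes that "best" means minimizing the sum of squared differences between yj and a + bxj, and it uses the standard closed-form solution, which these notes have not derived: b is the covariance of X and Y divided by the variance of X, and a is the mean of Y minus b times the mean of X. Function names and data are our own.

```python
from fractions import Fraction

def mean(xs):
    return Fraction(sum(xs), len(xs))

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def fit_line(xs, ys):
    """Return (a, b) minimizing the sum of (y_j - a - b * x_j)^2."""
    var_x = covariance(xs, xs)
    if var_x == 0:
        raise ValueError("X is constant: no unique best line")
    b = covariance(xs, ys) / var_x
    a = mean(ys) - b * mean(xs)
    return a, b

def squared_error(xs, ys, a, b):
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

X = [1, 2, 3, 4, 5]  # hypothetical paired observations
Y = [2, 1, 4, 3, 5]
a, b = fit_line(X, Y)
print(a, b)  # 3/5 4/5

# Spot check: nudging a or b away from the fit increases the error.
best = squared_error(X, Y, a, b)
for da, db in ((Fraction(1, 10), 0), (0, Fraction(1, 10)), (Fraction(-1, 10), Fraction(1, 10))):
    assert squared_error(X, Y, a + da, b + db) > best
```

The spot check at the end only tries a few nearby lines; it illustrates the minimum but does not prove it.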