Fall Semester 2002


Lecture Notes Three: Data distributions

At this point I will trust that you have read Chapters 1 and 2 from Kirkup.

This week the reading assignment is Chapter 3 from Kirkup.

And now, let's start bringing in some definitions.

In our discussions this semester we shall be using certain technical terms.

The terms and their definitions will now be given.

Population
A population is the complete set of individuals, objects, or scores that the investigator is interested in studying. In an actual experiment, the population is the larger group of individuals from which the subjects run in the experiment have been taken. (To better understand this and the next definition in the context of random sampling, please see the Abraham Wald example that is presented shortly after these definitions, below.)

Sample
A sample is a subset of the population. In an experiment, for economical reasons, the investigator usually collects data on a smaller group of subjects than the entire population. This smaller group is called a sample. (See Wald example, presented shortly, below).

Variable
A variable is any property or characteristic of some event, object, or person that may have different values at different times depending on the conditions. Height, weight, reaction time, and drug dosage are examples of variables. A variable should be contrasted with a constant, which, of course, does not have different values at different times. An example of a constant is the mathematical constant π ≈ 3.141593, whose value is shown here to the 6th decimal.

Independent variable
The independent variable in an experiment is the variable that is systematically manipulated by the investigator. In most experiments, the investigator is interested in determining the effect that one variable, say variable A, has on one or more other variables. To do so, the investigator manipulates the levels of variable A and measures the effect on the other variables. Variable A is called the independent variable because its levels are controlled by the experimenter, independent of any change in the other variables.

To illustrate, an investigator might be interested in the effect of alcohol on social behaviour. In this example, the experimenter manipulates the amount of alcohol consumed by the subjects and measures its effect on their social behaviour. Alcohol amount is the independent variable.

In another experiment, the effect of sleep deprivation on aggressive behaviour is studied. Subjects are deprived of various amounts of sleep, and the consequences on aggressiveness are observed. Here, the amount of sleep deprivation is manipulated. Hence, it is the independent variable.

Dependent variable
The dependent variable in an experiment is the variable that the investigator measures to determine the effect of the independent variable.

For example, in the experiment studying the effects of alcohol on social behaviour, amount of alcohol is the independent variable. The social behaviour of the subjects is measured to see if it is affected by the amount of alcohol consumed.

In the investigation of sleep deprivation and aggressive behaviour, the amount of sleep deprivation is being manipulated, and the subjects' aggressive behaviour is being measured. Amount of sleep deprivation is the independent variable, and aggressive behaviour is the dependent variable.

Data
The measurements that are made on the subjects of an experiment are called data. Usually, data consist of the measurements of the dependent variable or of other subject characteristics, such as age, sex, number of subjects, etc. The data as originally measured are often referred to as raw or original scores.

Statistic
A statistic is a number calculated on sample data. It quantifies a characteristic of the sample. Thus, the average value of a sample set of scores would be called a statistic.

Parameter
A parameter is a number calculated on population data. It quantifies a characteristic of the population. For example, the average value of a population set of scores is called a parameter. It should be noted that a statistic and a parameter are very similar concepts. The only difference is that a statistic is calculated on a sample and a parameter is calculated on a population. We normally try to estimate parameters based on statistics. (See below).

In research, data is often collected on a sample of subjects rather than on the entire population to which the results are intended to apply. Ideally, of course, the experiment would be performed on the whole population, but usually that is far too costly (time, money, etc.), and so a sample is taken. Note that not just any sample will do (see the Wald example below).

The sample should be a random sample. Random sampling will be discussed later. For now, it is sufficient to know that random sampling allows the laws of probability to apply to the data and at the same time helps achieve a sample that is representative of the population. Thus, the results obtained from the sample should also apply to the population. Once the data is collected, it is statistically analyzed, and the appropriate conclusions are drawn.
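The relationship between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 1,000 exam scores is made up, and the seed value and sample size of 50 are arbitrary choices, not anything taken from Kirkup.

```python
import random
import statistics

# Hypothetical population: 1,000 made-up exam scores (illustrative data only).
rng = random.Random(113)  # fixed seed so the sketch is reproducible
population = [rng.randint(40, 100) for _ in range(1000)]

# Parameter: a number calculated on the whole population.
population_mean = statistics.mean(population)

# Simple random sample: every subset of 50 scores is equally likely to be drawn.
sample = rng.sample(population, 50)

# Statistic: the same calculation carried out on the sample only.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

Running the sketch with different seeds shows the statistic moving around the parameter from sample to sample, which is why we speak of estimating the parameter.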

And now, let's look at the announced example:

During World War II many economists, mathematicians, and statisticians were members of Columbia University's Statistics Research Group, which did high-level consulting work for the armed services.

As part of this group's work, statistician Abraham Wald was asked where to place armor on planes. It seemed obvious to the aircraft engineers that armor was needed at the places most frequently hit, as found in a large sample of battle-proven airplanes. After studying the bullet holes of a sample of returning planes, Wald's conclusion was to place the armor where bullet holes were least frequently found in these planes, and that's what he recommended.

Now the questions:

  1. Was his reasoning justified?
  2. Was there anything wrong with the aircraft engineers' sampling design?
  3. Did they overlook anything?
Well, we'll think about this, and another example.

ABC's 20/20 television broadcast on July 16, 1993 reported on a study in which individuals who had lived to be 100 years of age or more were queried in the hope of finding common characteristics. The implication was drawn that if a younger person worked at acquiring the characteristics shared by these centenarians, then the probability of reaching such an old age increased.

Why was this study design inappropriate for the implication drawn?

Self-selected samples can be misleading.

Data analysis (or statistical analysis) has been divided into two areas: descriptive statistics and inferential statistics.

Both involve analyzing data. If the analysis is done for the purpose of describing or characterizing the data that have been collected, then we are in the area of descriptive statistics. For example, when we record the scores from an exam, such as the one we talked about last time, we hand the tests back and then we want to describe the scores. We might decide to

  1. calculate the average of the distribution so as to describe its central tendency.

  2. determine its range, so as to characterize its variability.

  3. plot the scores on a graph (histogram) so as to show the shape of the distribution.

Since all of these procedures are for the purpose of describing or characterizing the data already collected, they fall within the realm of descriptive statistics. Inferential statistics, on the other hand, is not concerned with just describing the obtained data. Rather, it embraces techniques that allow one to use obtained sample data to infer to or draw conclusions about populations.

Descriptive Statistics
is concerned with techniques that are used to describe or characterize data.

Inferential Statistics
involves techniques that use the obtained sample data to infer to populations.

We now go back to the situation of last time and discuss

FREQUENCY DISTRIBUTIONS

A frequency distribution
presents the score values and their frequency of occurrence. When presented in a table, the score values are listed in rank order, with the lowest score usually at the bottom of the table.
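A minimal sketch of such a table, assuming a small made-up list of scores (the function name `frequency_distribution` is chosen here for illustration only):

```python
from collections import Counter

# Made-up scores, used only to illustrate the idea.
scores = [86, 72, 91, 86, 65, 72, 86, 78, 91, 72]

def frequency_distribution(values):
    """Return (score, frequency) pairs in rank order, highest score first,
    so that the lowest score ends up at the bottom of the table."""
    counts = Counter(values)
    return sorted(counts.items(), reverse=True)

table = frequency_distribution(scores)

print("Score  f")
for score, freq in table:
    print(f"{score:>5}  {freq}")
```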

When there are many scores and the scores range widely, as they do on the fictitious exam that the data presented last time was taken from, listing individual scores results in many values with a frequency of zero and a display from which it is difficult to visualize the shape of the distribution and its central tendency. Under these conditions, the individual scores are usually grouped into class intervals and presented as a frequency distribution of grouped scores.

When grouping data, one of the important issues is how wide each interval should be. Whenever data is grouped, some information is lost. The wider the interval, the more information is lost. In practice one usually determines the interval width by dividing the distribution into 10 to 20 intervals. Alternatively, one could use more complicated formulas and rules of thumb, such as taking the number of intervals to be the integer that just exceeds

$$2N^{0.33}$$

where N is the total number of values in the data set.
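The rule of thumb can be tried out as follows. This is a sketch under assumptions: the scores are made up and taken to be whole numbers, the rule is read as giving the number of intervals, and the helper names `number_of_intervals` and `grouped_frequencies` are hypothetical.

```python
import math

def number_of_intervals(n):
    """The integer that just exceeds 2 * n**0.33."""
    return math.floor(2 * n ** 0.33) + 1

def grouped_frequencies(values, k):
    """Group whole-number scores into k equal-width class intervals.

    Returns a list of (lower limit, upper limit, frequency) tuples.
    """
    low, high = min(values), max(values)
    # Width is rounded up so that k intervals are enough to reach the top score.
    width = math.ceil((high - low + 1) / k)
    counts = [0] * k
    for v in values:
        counts[(v - low) // width] += 1
    return [(low + i * width, low + (i + 1) * width - 1, counts[i])
            for i in range(k)]

# Made-up exam scores (N = 20), for illustration only.
scores = [52, 55, 61, 64, 67, 68, 70, 71, 73, 75,
          76, 78, 80, 82, 83, 86, 88, 90, 93, 97]

k = number_of_intervals(len(scores))
grouped = grouped_frequencies(scores, k)

for lower, upper, freq in grouped:
    print(f"{lower}-{upper}: {freq}")
```

Note how the individual scores can no longer be recovered from the grouped table; that is the information lost by grouping.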

Let's look at some more definitions.

A relative frequency distribution
indicates the proportion of the total number of scores that occurred in each interval.

A cumulative frequency distribution
indicates the number of scores that fell below the upper limit of each interval.

A cumulative percentage distribution
indicates the percentage of scores that fell below the upper limit of each interval.
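These three distributions can all be derived from the interval frequencies. The sketch below assumes a made-up grouped table (the interval labels and frequencies are illustrative, not data from the course exam):

```python
from itertools import accumulate

# Made-up grouped frequency distribution, lowest interval first.
intervals = ["52-59", "60-67", "68-75", "76-83", "84-91", "92-99"]
frequencies = [2, 3, 5, 5, 3, 2]

total = sum(frequencies)

# Proportion of all scores that occurred in each interval.
relative = [f / total for f in frequencies]

# Number of scores below the upper limit of each interval (running total).
cumulative = list(accumulate(frequencies))

# The same running total expressed as a percentage of all scores.
cumulative_pct = [100 * c / total for c in cumulative]

print("Interval    f   rel f   cum f   cum %")
for row in zip(intervals, frequencies, relative, cumulative, cumulative_pct):
    print("{:<9} {:>3} {:>7.2f} {:>7} {:>7.1f}".format(*row))
```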

PERCENTILES

A percentile or percentile point is the value on the measurement scale below which a specified percentage of the scores in the distribution fall. Percentiles are measures of relative standing. Thus, the 60th percentile point is the value on the measurement scale below which 60% of the scores in the distribution fall. Sometimes we are faced with the situation where we want to know the percentile rank of a raw score. For example, since your score on the exam was 86, it would be useful to you to know the percentile rank of 86.

The percentile rank of a score is the percentage of scores lower than the score in question. This situation is just the reverse of the one where we were calculating the percentile point.
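Both directions can be sketched on raw scores. The data are made up, and `percentile_point` uses one simple convention (the smallest observed score whose percentile rank reaches the requested percentage); other conventions, including interpolation within grouped intervals, exist and can give slightly different values.

```python
def percentile_rank(scores, score):
    """Percentage of scores that are lower than the given score."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)

def percentile_point(scores, percent):
    """Smallest observed score below which at least `percent` percent of the
    scores fall (a simple convention, assumed here for illustration).
    Returns None if no observed score qualifies."""
    for candidate in sorted(scores):
        if percentile_rank(scores, candidate) >= percent:
            return candidate
    return None

# Made-up exam scores, for illustration only.
exam = [55, 62, 68, 71, 75, 79, 83, 86, 90, 94]

rank_of_86 = percentile_rank(exam, 86)   # from a raw score to a percentile rank
point_60 = percentile_point(exam, 60)    # from a percentage to a score value

print(f"percentile rank of 86: {rank_of_86}")
print(f"60th percentile point: {point_60}")
```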

GRAPHING FREQUENCY DISTRIBUTIONS

We have several tools:

  1. The Bar Graph.

    Frequency distributions of nominal or ordinal data are customarily plotted as a bar graph (also as a pie chart). Since there is no numerical relationship between the categories in the nominal data, the various groups can be arranged along the horizontal axis in any order. Unlike in the histogram, the bars need not touch each other. This further emphasizes the lack of quantitative relationship between the categories.

  2. The Histogram.

    The histogram is used to represent frequency distributions composed of interval or ratio data. It resembles the bar graph, except that with the histogram a bar is drawn for each class interval (or bin). The class intervals are plotted on the horizontal axis such that each class bar begins and terminates at the real limits of the interval. The height of the bar corresponds to the frequency of the class interval. Since the intervals are continuous, the vertical bars must touch each other, rather than being spaced apart as is done with the bar graph.

  3. The Frequency Polygon.

    The frequency polygon is like a histogram, except that a point is plotted over the midpoint of each interval at a height corresponding to the frequency of the interval. The points are then joined with straight lines.

  4. The Cumulative Percentage Polygon.

    Cumulative frequency and cumulative percentage distributions may also be presented in graphical form; the latter is encountered more often.
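To make the histogram and the frequency polygon concrete, here is a text-only sketch. It assumes made-up grouped data and scores recorded to the nearest whole number, so that the real limits of an interval lie half a unit beyond its stated limits.

```python
# Made-up grouped frequency distribution: (lower limit, upper limit) per interval.
intervals = [(52, 59), (60, 67), (68, 75), (76, 83), (84, 91), (92, 99)]
frequencies = [2, 3, 5, 5, 3, 2]

def histogram_rows(intervals, frequencies):
    """One text row per class interval: real limits, then a bar of '#' marks
    whose length equals the frequency of the interval."""
    rows = []
    for (lower, upper), freq in zip(intervals, frequencies):
        # Real limits, assuming scores are measured to the nearest whole number.
        rows.append(f"{lower - 0.5:5.1f} to {upper + 0.5:5.1f} | {'#' * freq}")
    return rows

# Frequency polygon: one point over the midpoint of each interval,
# at a height equal to the frequency of that interval.
polygon_points = [((lower + upper) / 2, freq)
                  for (lower, upper), freq in zip(intervals, frequencies)]

for row in histogram_rows(intervals, frequencies):
    print(row)
print(polygon_points)
```

The bars are drawn sideways here only because that is easy to print; a plotting library would place the intervals on the horizontal axis as described above.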

SHAPES OF FREQUENCY CURVES

  1. A curve is symmetrical if when folded in half the two sides coincide.

  2. If a curve's not symmetrical, it's skewed (positively or negatively).

Examples:

The curve in the middle is symmetrical. The one on the left is negatively skewed, and the one on the right is positively skewed. This will become clearer after we define the three measures of central tendency: the mean, the median, and the mode.
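As a small illustration of how skew relates to the usual measures of central tendency (mean, median, and mode), the sketch below compares the mean with the median. The data are made up, and the comparison is only a common rule of thumb, not a test that holds for every distribution; the function name `skew_direction` is hypothetical.

```python
import math
import statistics

def skew_direction(values):
    """Rule-of-thumb hint: a long tail pulls the mean toward it, so
    mean > median suggests positive skew and mean < median negative skew."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if math.isclose(mean, median):
        return "none"
    return "positive" if mean > median else "negative"

# Made-up scores with a long tail toward high values.
right_tailed = [2, 3, 3, 3, 4, 4, 5, 6, 9, 15]
# Mirror image of the same scores: long tail toward low values.
left_tailed = [18 - x for x in right_tailed]

for name, data in [("right-tailed", right_tailed), ("left-tailed", left_tailed)]:
    print(name,
          "mean =", statistics.mean(data),
          "median =", statistics.median(data),
          "mode =", statistics.mode(data),
          "skew hint =", skew_direction(data))
```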

Homework One, which was due last week, tried to help you clearly distinguish the relative merits of each of these three measures of central tendency. (Please try to match these notes with your reading assignments from Kirkup.)

Homework Two is based on what will be discussed this week.

(And Homework Three will focus more on the same topic).

Here now are some answers from last year that should help you with the minute papers of today:

Last time we looked at some measures of central tendency.

Let's now take a look at MEASURES OF VARIABILITY

1. The Range.

The range is defined as the difference between the highest and lowest score in the distribution.

2. Deviation Scores.

A deviation score tells how far away the raw score is from the mean of its distribution.
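The range and the deviation scores are quick to compute. The sketch below uses a small made-up set of scores:

```python
# Made-up scores, for illustration only.
scores = [2, 4, 6, 8, 10]

# Range: difference between the highest and the lowest score.
score_range = max(scores) - min(scores)

# Deviation score: how far each raw score lies from the mean.
mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]

print("range:", score_range)
print("mean:", mean)
print("deviation scores:", deviations)
# Deviations above and below the mean cancel, so their sum is zero.
print("sum of deviation scores:", sum(deviations))
```

Because the deviation scores always sum to zero, they cannot simply be averaged to measure variability; squaring them first is what leads to the standard deviation.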

3. The Standard Deviation.

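For a population of scores we have:

$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

where μ is the population mean and N is the number of scores in the population.

For a sample we have:

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$

where X̄ is the sample mean and N is the number of scores in the sample.

Alternative formula for the standard deviation:

$$s = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}}$$

Properties of the standard deviation: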

  1. The standard deviation gives us a measure of dispersion relative to the mean. This differs from the range, which tells us directly the spread of the two most extreme scores.

  2. Like the mean, the standard deviation is sensitive to each score in the distribution. If a score is moved closer to the mean, then the standard deviation will become smaller. If a score shifts away from the mean, then the standard deviation will increase.

  3. Like the mean, the standard deviation is stable with regard to sampling fluctuations.

  4. Both the mean and the standard deviation can be manipulated algebraically. This is an important aspect, as it allows mathematics to be done with them for use in inferential statistics.
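A short sketch of the standard deviation in Python follows. It assumes the usual textbook definitions (divide the sum of squared deviations by N for a population and by N - 1 for a sample) together with the usual computational shortcut; the function names and the scores are made up for illustration.

```python
import math
import statistics

def population_sd(xs):
    """Square root of the mean squared deviation from the population mean."""
    mu = sum(xs) / len(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

def sample_sd(xs):
    """Same idea for a sample, dividing by N - 1 instead of N."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))

def sample_sd_alt(xs):
    """Computational shortcut: sum of squares = sum(X^2) - (sum X)^2 / N."""
    n = len(xs)
    sum_of_squares = sum(x * x for x in xs) - sum(xs) ** 2 / n
    return math.sqrt(sum_of_squares / (n - 1))

scores = [2, 4, 6, 8, 10]   # made-up scores
closer = [2, 4, 6, 8, 8]    # the top score moved closer to the mean

print("population sd:", population_sd(scores))
print("sample sd:", sample_sd(scores))
print("sample sd (alternative formula):", sample_sd_alt(scores))
print("standard library check:", statistics.pstdev(scores), statistics.stdev(scores))
print("sample sd after moving a score closer:", sample_sd(closer))
```

The last line illustrates property 2 for this particular data set: pulling the score of 10 in toward the mean makes the standard deviation smaller.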


Last updated: Nov 3, 2002 by Adrian German for A113