| |
CSCI A113
|
BIVARIATE NUMERICAL DATA
A scatterplot is useful for examining the relationship between two numerical variables. In Excel this kind of chart is called an XY (scatter) chart; other names include scatter diagram, scattergram, and XY plot. Such a graphical display is often the first step before fitting a curve to the data using a regression model.
Example. The data shown below were collected in a study of real estate property valuation.
The 15 properties were sold in a particular calendar year in a particular neighborhood in a city stratified into a number of neighborhoods. Although the data displayed is from a single year, similar data is available for each neighborhood for a number of years.SqFt Price 521 26.0 661 31.0 694 37.4 743 34.8 787 39.2 825 38.0 883 39.6 920 31.2 965 37.2 1011 38.4 1047 43.6 1060 44.8 1079 40.6 1164 41.8 1298 45.2
Because we expect that selling price might depend on square feet of living space, selling price becomes the dependent variable and square feet the explanatory variable. Some call the dependent variable the response variable or the y variable. Similarly, other terms for the explanatory variable are predictor variable, independent variable, or the x variable.
Our initial purpose is to visually examine the relationship between the square feet of living space and the selling price of the parcels. Then we will calculate two summary measures, correlation and covariance, using both the analysis tools and functions. Finally, we will include a third variable, assessed value of the property, and use the analysis tool to compute pairwise correlations. In the second part of this set of notes we will fit straight lines and curves to this same data using regression models.
1.1 XY (Scatter) Charts
The following steps describe how to create and embellish a scatterplot using Excel's Chart Wizard.
The chart is embedded on the worksheet. The property data show a general positive relationship; more living space is associated with a higher selling price, on the average. Follow steps 10 through 12 to obtain an embellished scatterplot.
1.2 Analysis Tool: Correlation
The correlation coefficient is a useful sumary measure for bivariate data, in the same sense that the mean and standard deviation are useful summary measures for univariate data. The possible values for the correlation coefficient range from -1 (exact negative correlation, with all points falling on a downward-sloping straight line) through 0 (no linear relationship) to +1 (exact positive correlation, with all points falling on an upward-sloping straight line). The correlation coefficient measures only the amount of straight-line relationship; a strong curvilinear relationship (a U-shaped pattern, for example) might have a correlation coefficient close to zero. The long name for the correlation coefficient is "Pearson product moment correlation coefficient," which is often shortened to simply "correlation".
The following steps describe how to obtain the correlation coefficient using the analysis tool.
The output is a matrix of pairwise correlations. The diagonal values are 1, indicating that each variable has perfect positive correlation with itself. The value 0.814651 is the correlation of Price and SqFt. The upper-right section is blank, because its values would be the same as those in the lower-left section.
The following steps describe how to use Excel's CORREL function to determine the correlation.
The value of the correlation coefficient appears in cell D7. Alternatively, you could have entered the formula
=CORREL(A2:A16,B2:B16)by typing or by a combination of typing and pointing. Unlike the static text output of the analysis tool, the worksheet function is dynamic. If the data values in A2:B16 are changed, the value of the correlation coefficient in cell D7 will change.
1.3 Analysis Tool: Covariance
The covariance is another measure for summarizing the extent of the linear relationship between two numerical variables. Unfortunately, the covariance is difficult to interpret because its measurement units are the product of the units for the two variables. For the selling price and living space data in the example above, the covariance is expressed in units of square feet times thousands of dollars. It is usually preferable to use the correlation coefficient because it is scale-free. However, the covariance is used in finance theory to describe the relationship of one stock price with another.
The covariance computed by the analysis tool is a population covariance: that is, Excel uses n in the denominator, where n is the number of data points.
The following steps describe how to obtain the covariance using the analysis tool.
The output is a matrix of pairwise sample covariances. The diagonal values are sample variances (the square of the sample standard deviation) for each variable. The value 853.2427 is the population covariance of Price and SqFt. The upper-right section is blank, because its values would be the same as those in the lower-left section.
The following steps describe how to use Excel's COVAR function to double-check the population covariance.
The population covariance value appears in cell D16. Alternatively, you could have entered the formula
=COVAR(A2:A16,B2:B16)by typing or by a combination of typing and pointing. If the data values in A2:B16 are changed, the population covariance value in cell D16 will change. The covariance computed by Excel's COVAR function uses n in the denominator. To find the sample covariance multiply by n/(n - 1)
In this example, n=15, so the sample covariance is 15/14 * 853.2427 = 914.1886
1.4 Correlations for Several Variables
The Correlation analysis tool is most useful for determining pairwise correlations for three or more variables, often as an aid to selecting variables for a multiple regression model. The following steps describe how to obtain correlations for several variables.
SqFt Assessed Price 521 7.8 26.0 661 23.8 31.0 694 28.0 37.4 743 26.2 34.8 787 22.4 39.2 825 28.2 38.0 883 25.8 39.6 920 20.8 31.2 965 14.6 37.2 1011 26.0 38.4 1047 30.0 43.6 1060 29.2 44.8 1079 24.2 40.6 1164 29.4 41.8 1298 23.6 45.2
The output shows three pairwise correlations. The highest correlation, 0.814651, is between SqFt and Price. The correlation between Assessed and Price, 0.67537, is smaller, indicating less of a linear relationship between these two variables. The lowest correlation, 0.424219, is between SqFt and Assessed.
If we must use a single explanatory variable to predict selling price in a linear regression model, these correlations suggest that SqFt is a better candidate than Assessed, because 0.814651 is higher than 0.67537. If we can use two explanatory variables to predict selling price in a multiple regression model, both SqFt and Assessed should be useful, and there shouldn't be a problem with multicollinearity because the correlation between these two explanatory variables is only 0.424219.
SIMPLE LINEAR REGRESSION
Single linear regression can be used to determine a straight-line equation describing the average relationship between two variables. Three methods are described here: the Add Trendline command, the Regression analysis tool, and Excel functions. Before fitting a line, it is important to examine a scatterplot as described in the first part of these notes. If the points on the scatterplot fall approximately on a straight line, the methods described herein are appropriate. If the points fall on a curve or have another pattern one should consider simple nonlinear regression methods (which are not discussed here).
The data analyzed in this chapter consists of selling price and living space for 15 real estate properties described in the example above. Because we expect that selling price might depend on square feet of living space, selling price becomes the dependent variable and square feet the explanatory variable. Some call the dependent variable the response variable or the y variable. Similarly, other terms for the explanatory variable are predictor variable, independent variable, or the x variable.
The first step is to examine the relationship between selling price, in thousands of dollars, and living space, in square feet, by constructing a scatterplot. The general approach is to arrange the data so that the x variable for the horizontal axis is in a column on the left and the y variable for the vertical axis is in a column on the right. Then select the data excluding the labels, click the Chart Wizard tool, and follow the steps for an XY (scatter) chart. Details of these steps with subsequent rescaling and formatting have been described above. The results should match what is presented in the sample spreadsheet that can be downloaded following the link at the bottom of this page, where the chart title is Arial 10 bold and the axes and axis titles are Arial 8 (worksheet 1).
2.1 Inserting a Linear Trendline
The points in the scatterplot follow an approximate straight line, so a linear trendline is appropriate. The method of ordinary least squares determines the intercept and slope for the linear trendline such that the sum of the squared vertical distances between the actual y values and the line is as small as possible. Such a line is often called the line of average relationship. The following steps describe inserting a linear trendline on the scatterplot and formatting the results.
Trendline Interpretation
We can answer the question "What is the average relationship?" by examining the fitted equation y = 0.021x + 18.789, which may be written as
Predicted Price = 18.789 + 0.021 * SqFtThe y-intercept or constant term in the equation is 18.789, measured in the same units as the y variable. Naively, the constant term says that that a property with zero square feet of living space has a selling price of 18.789 thousands of dollars. However, there are no properties with fewer than 521 square feet in our data, so this constant can be considered a starting point that is relevant for properties with living space between 521 and 1,298 square feet.
The slope or regression coefficient, 0.021, indicates the average change in the y variable for a unit change in the x variable. The measurement units in this example are 0.021 thousands of dollars per square foot, or $21 per square foot. If two properties differ by 100 square feet of living space, we expect the selling prices to differ by 0.021 * 100 = 2.1 thousands of dollars, or $2,100.
One popular way to answer the question "How good is the relationship?" is to examine the value for R2, which measures the proportion of variation in the dependent variable, y, that is explained using the x variable and the regression line. Here the R2 value of 0.6637 indicates that approximately 66% of the variation in selling prices can be explained by a linear model using living space. Perhaps the remaining 34% of the variation can be explained using other property characteristics in a multiple regression model.
Trendline Embellishments
If the equation displayed on the chart is used to calculate predicted selling prices, the results may be imprecise because the intercept and slope have only three decimal places. To display more decimal places, double click the chart to activate it and click on the region containing the equation and R2 value to select them for editing. Then click the Increase Decimal tool repeatedly to display more decimal places. (In the model below we used five decimals.) These changes affect both the equation and R2 value, and these changes must be made before any other editing.
With the equation and R2 value selected, you can move the entire text box by clicking and dragging near the edge of the box, and you can use the regular text editing options for rearranging the text. In our model you can see the result of such editing: variable names were substituted for x and y, terms were rearranged, and the last three significant figures of R2 were deleted. Once you begin any such editing, you are unable to use the Increase Decimal or Decrease Decimal tools to change the displayed precision.
2.2 Regression Analysis Tool
The Add Trendline command provides only the fitted line, equation and R2. To obtain additional information for assessing the relationship between the two variables, follow these steps to use the Regression Analysis tool.
In the Regression dialog box, move from box to box using the mouse or the tab key. For a box requiring a range, select the box and then select the appropriate range on the worksheet by pointing. To see cells on the worksheet, move the Regression dialog box by clicking on its title bar and dragging, or click the collapse button on the right side of each range edit box. Click the Help button for additional information.
Regression Interpretation
The intercept and the slope of the fitted regression line are in the lower-left section labeled "Coefficients" of the summary output in the model spreadsheet. The Intercept coefficient 18.7894675 is the constant term in the linear regression equation, and the SqFt coefficient 0.02101025 is the slope. The regression equation is
Predicted Price = 18.7894675 + 0.02101025 * SqFtRefer to the Trendline Interpretation section above for an explanation of the intercept and slope.
In the residual output shown in the model spreadsheet below, the predicted prices, sometimes termed the fitted values, are the result of estimating the selling price of each property using this regression equation. The residuals are the difference between the actual and fitted values. For example, the first property has 521 square feet. On the average, we would expect this property to have a selling price of $29,736, but its actual selling price is $26,000. The residual for this property is $26,000-$29,736 -- that is, -$3,736. Its actual selling price is $3,736 below what is expected. The residuals are also termed deviations or errors.
The four most common measures to answer the question "How good is the relationship?" are the standard error, R2, t statistics, and analysis of variance. The standard error, 3.23777441, shown in cell I17 in the model spreadsheet below, is expressed in the same units as the dependent variable, selling price. As the standard deviation of the residuals, it measures the scatter of the actual selling prices around the regression line. This summary of the residuals is $3,238. The standard error is often called the standard error of the estimate.
R The t statistics, shown in cells G17:G18 in the model spreadsheet are part of the individual hypothesis tests of the regression coefficients. For example, these 15 properties could be treated as a sample from a larger population. Thu null hypothesis is that there is no relationship: the population regression coefficient for living space is zero, implying that differences in living space don't affect selling price. With a sample regression coefficient of 0.02101025 and a standard error of the coefficient (an estimate of the sampling error) of 0.004148397, the coefficient is 5.064667 standard errors from zero. The two-tail p-value, 0.000217, shown in cell H18, is the probability of obtaining these results, or something more extreme, assuming the null hypothesis is true. Therefore we reject the null hypothesis and conclude that there is a significant relationship between selling price and living space.
The analysis of variance table, shown in cells D10:I14 in the model spreadsheet, is a test of the overall fit of the regression equation. Because it summarizes a test of the null hypothesis thhat all regression coefficients are zero, it can only be discussed in a chapter with multiple regression (not here).
Regression Charts
For simple linear regression the analysis tool provides two charts: residual plot and line fit plot. These charts are embedded near the top of the worksheet to the right of the summary output. In the real estate properties example, the charts are originally located in cells M1:S12; after relocating the residuals, the charts are in cells P1:V12.
The line fit plot is shown in the model spreadsheet below. This chart is similar to the scatterplot with inserted trendline, except that the predicted values in this chart are markers without a line. The following steps describe how to format the line fit plot.
"=SERIES("Predicted Price", ...)"appears in the formula bar. Right-click, choose Format Data Series from the shortcut menu, and click the Patterns tab. Select Automatic for Line and select None for Marker. Then click OK.
The residual plot (after resizing to approximately 6 columns by 14 rows) is shown in the model spreadsheet. This type of chart is useful for determining whether the functional form of the fitted line is appropriate. If the residual plot is a random pattern, the linear fitted line is satisfactory; if the residual plot shows a pattern, additional modeling may be needed. When there is only one x variable (simple regression), the residual plot provides a view that is similar to making the fitted line horizontal. When there are several x variables (multiple regression), the residual plot is an even more valuable tool for checking model adequacy, because there is usually no way to view the fitted equation in three or more dimensions.
2.3 Regression Functions
A third method for obtaining regression results is worksheet functions. Five functions described here are appropriate for simple regression (one x variable), and four of these have identical syntax for their arguments. For example, the syntax for the INTERCEPT function is
INTERCEPT(known_y's, known_x's)The same syntax applies to the SLOPE, RSQ (R square), and STEYX (standard error of estimate). These four functions are entered in cells H2:H5 (as seen in the model spreadsheet) and the values returned by these functions are shown in cells F2:F5.
To prepare the model spreadsheet Regression Using Functions output, the function results in column H are copied to the clipboard (Edit | Copy), and the values are pasted into column F (Edit | Paste Special | Values). The formulas are displayed in column H by choosing Options from the Tools menu, clicking the View tab, and checking the Formulas checkbox in the Window Option section.
Cells H9 and H11 show two methods for obtaining a predicted selling price for a property with 1,000 square feet of living space. If the intercept and the slope of the regression equation have already been calculated, the formula
"=intercept + slope * x"
can be entered into a cell (H9) using appropriate cell references. Here the predicted selling price is 39.7997169881321. in thousands of dollars, or approximately $39,800.
Another method for obtaining a predicted value based on simple linear regression is the FORECAST function, with syntax
FORECAST(x, known_y's, known_x's)This method, shown in cell H11, calculates the intercept and slope using least squares and returns the predicted value of y for the specified value of x.
Yet another method for obtaining predicted y values is the TREND function, which has the following syntax:
TREND(known_y's, known_x's, new_x's, const)This function, unlike the FORECAST function, can also be used for multiple regression (two or more x variables). Because the TREND function is an array function, it must be entered in a special way, as descried in the following steps.
A companion function, LINEST, provides regression coefficients, standard errors, and other summary measures. Like TREND, this function can be used for multiple regression (two or more x variables) and must be array-entered. Its syntax is
LINEST(known_y's, known_x's, const, stats)The "const" and "stats" arguments are true-or-false values, where "const" specifies whether the fitted equation has an intercept term and "stats" indicates whether summary statistics are desired.
To obtain the results shown in the model spreadsheet, select D1:E5, type or use the Paste Function tool to enter LINEST, press F2, and finally hold down the Control and Shift keys while you press Enter. Cells D7:E11 show the numerical results that appear in cells D1:E5, and cells D13:E17 describe the contents of those cells. These same values appear with labels in the Regression analysis tool summary output shown in the model spreadsheet below.
These notes are 97% complete (but the last part has not been double-checked for cell by
cell reference accuracy). So expect to see some differences with the model spreadsheet that you should be able to
download following this link.
A113