Relationships Between Two Variables
As we did when considering only one variable, we begin with a graphical display. A scatterplot is the most useful display technique for comparing two quantitative variables. We plot the variable we consider the response variable on the y-axis, and we place the explanatory or predictor variable on the x-axis.
How do we determine which variable is which? In general, the explanatory variable attempts to explain, or predict, the observed outcome. The response variable measures the outcome of a study. One may even consider exploring whether one variable causes the variation in another variable – for example, a popular research question is whether taller people are more likely to receive higher salaries. In this case, Height would be the explanatory variable used to explain the variation in the response variable Salaries.
In summarizing the relationship between two quantitative variables, we need to consider:
- Association/Direction (i.e. positive or negative)
- Form (i.e. linear or non-linear)
- Strength (weak, moderate, strong)
Example
We will refer to the Exam Data set (Final.MTW or Final.XLS), which consists of a random sample of 50 students who took Stat200 last semester. The data consist of their semester average on mastery quizzes and their score on the final exam. We construct a scatterplot showing the relationship between Quiz Average (explanatory or predictor variable) and Final (response variable). Thus, we are studying whether student performance on the mastery quizzes explains the variation in their final exam score. That is, can mastery quiz performance be considered a predictor of final exam score? We can create this graph using either Minitab or SPSS. To create the scatterplot in Minitab:
- Open the Exam Data set.
- From the menu bar select Graph > Scatterplot > Simple
- In the text box under Y Variables enter Final and under X Variables enter Quiz Average
- Click OK
To create a scatterplot in SPSS:
- Import the data set
- From the menu bar select Graphs > Legacy Dialogs > Scatter/Dot
- Select the square Simple Scatter and then click Define.
- Click on variable Final and enter this in the Y Axis box.
- Click the variable Quiz Average and enter this in the X Axis box.
- Click OK
This should result in the following scatterplot:
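The scatterplot image itself is not reproduced here. If you are following along outside Minitab or SPSS, a minimal Python sketch produces an equivalent plot; it assumes the Exam Data have been exported to a CSV file named exam.csv with columns Quiz Average and Final (the file name and column headers are assumptions, not part of the original lesson).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assumes the Exam Data were exported to exam.csv with columns
# "Quiz Average" and "Final" (hypothetical file name and headers).
exam = pd.read_csv("exam.csv")

plt.scatter(exam["Quiz Average"], exam["Final"])
plt.xlabel("Quiz Average (explanatory variable)")
plt.ylabel("Final (response variable)")
plt.title("Final Exam Score vs. Quiz Average")
plt.show()
```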
Association/Direction and Form
We can interpret from either graph that there is a positive association between Quiz Average and Final: low quiz averages are accompanied by lower final scores, and higher quiz averages by higher final scores. If this relationship were reversed – high quizzes with low finals – then the graph would have displayed a negative association. That is, the points in the graph would have decreased going from left to right.
The scatterplot can also be used to provide a description of the form. From this example we can see that the relationship is linear. That is, there does not appear to be a change in the direction of the relationship.
Strength
In order to measure the strength of a linear relationship between two quantitative variables we use correlation. Correlation is the measure of the strength of a linear relationship. To calculate correlation in Minitab (using the Exam Data):
- From the menu bar select Stat > Basic Statistics > Correlation
- In the text box under Variables enter Final and Quiz Average
- Click OK (for now we will disregard the p-value in the output)
The output gives us a Pearson Correlation of 0.609.
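If you are working in Python instead, the same statistic can be computed directly; this sketch reuses the hypothetical exam.csv export described above.

```python
import pandas as pd

exam = pd.read_csv("exam.csv")  # hypothetical export of the Exam Data
r = exam["Quiz Average"].corr(exam["Final"])  # Pearson correlation by default
print(round(r, 3))  # the lesson's Minitab output reports 0.609
```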
Correlation Properties (NOTE: the symbol for correlation is r)
- Correlation is unit free. If we changed the final exam scores from percents to decimals, the correlation would remain the same.
- Correlation, r, is limited to −1 ≤ r ≤ 1.
- For a positive association, r > 0; for a negative association r < 0.
- Correlation, r, measures the linear association between two quantitative variables.
- Correlation measures the strength of a linear relationship only. (A scatterplot can show a correlation of zero even when the two variables are obviously related – for example, points lying on a parabola.)
- The closer r is to zero, the weaker the relationship; the closer to 1 or −1, the stronger the relationship. The sign of the correlation provides direction only.
- Correlation can be affected by outliers (this property and the unit-free property are verified in the sketch below).
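A minimal sketch with made-up data verifies two of these properties, the unit-free property and the sensitivity to outliers; none of these numbers come from the Exam Data.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(60, 100, size=50)          # e.g., quiz averages in percent
y = 10 + 0.8 * x + rng.normal(0, 8, 50)    # e.g., final scores in percent

r = np.corrcoef(x, y)[0, 1]
r_decimal = np.corrcoef(x, y / 100)[0, 1]  # rescale y from percents to decimals
print(r, r_decimal)                        # identical: correlation is unit free

x_out = np.append(x, 60)                   # add a single outlier ...
y_out = np.append(y, 150)                  # ... far above the overall pattern
print(np.corrcoef(x_out, y_out)[0, 1])     # noticeably changed correlation
```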
Equations of Straight Lines: Review
The equation of a straight line is given by y = a + bx. When x = 0, y = a, the intercept of the line; b is the slope of the line: it measures the change in y per unit change in x.
For ‘Data 1’ the equation is y = 3 + 2x; the intercept is 3 and the slope is 2. The line slopes upward, indicating a positive relationship between x and y.
For ‘Data 2’ the equation is y = 13 − 2x; the intercept is 13 and the slope is −2. The line slopes downward, indicating a negative relationship between x and y.
The relationship between x and y is ‘perfect’ for these two examples – the points fall exactly on a straight line, and the value of y is determined exactly by the value of x. Our interest will be concerned with relationships between two variables which are not perfect. The ‘Correlation’ between x and y is r = 1.00 for the values of x and y on the left and r = −1.00 for the values of x and y on the right.
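A quick numerical check of those two correlations (the x values below are made up for illustration; any set of distinct x values gives the same result):

```python
import numpy as np

# Hypothetical x values; the lines themselves come from the examples above.
x = np.array([1, 2, 3, 4, 5])
y1 = 3 + 2 * x    # 'Data 1': intercept 3, slope 2
y2 = 13 - 2 * x   # 'Data 2': intercept 13, slope -2

# np.corrcoef returns the correlation matrix; entry [0, 1] is r(x, y).
print(np.corrcoef(x, y1)[0, 1])  # 1.0  (perfect positive relationship)
print(np.corrcoef(x, y2)[0, 1])  # -1.0 (perfect negative relationship)
```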
Regression analysis is concerned with finding the ‘best’ fitting line for predicting the average value of a response variable y using a predictor variable x.
Here is an applet developed by the folks at Rice University called “Regression by Eye”. The object here is to give you a chance to draw what you think is the ‘best fitting line’.
Click the Begin button and draw your best regression line through the data. You may repeat this procedure several times. As you draw these lines, how do you determine which line is better? Click the Draw Regression line box and the correct regression line is plotted for you. How would you quantify how close your line is to the correct response?
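One natural way to quantify how close a drawn line is – anticipating the least squares idea below – is to add up the squared vertical distances from the points to the line; the line with the smaller sum fits better. A sketch with made-up data:

```python
import numpy as np

def sse(x, y, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    residuals = y - (intercept + slope * x)
    return np.sum(residuals ** 2)

# Illustrative data (not the applet's): noisy points around y = 3 + 2x.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 3 + 2 * x + rng.normal(0, 2, size=x.size)

# Comparing two candidate lines: the one with the smaller SSE fits better.
print(sse(x, y, intercept=2.0, slope=2.3))
print(sse(x, y, intercept=3.0, slope=2.0))
```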
Least Squares Regression
The best description of many relationships between two quantitative variables can be achieved using a straight line. In statistics, this line is referred to as a regression line. Historically, this term is associated with Sir Francis Galton, who in the mid-1800s studied the phenomenon that children of tall parents tended to “regress” toward mediocrity.
Adjusting the algebraic line expression, the regression line is written as:
ŷ = b0 + b1x
Here, b0 is the y-intercept and b1 is the slope of the regression line.
Some questions to consider are:
- Is there only one “best” line?
- If so, how is this line found?
- Assuming we have properly fitted a line to the data, what does this line tell us?
By answering the third question we should gain insight into the first two questions.
We use the regression line to predict a value of ŷ for any given value of x. The “best” line would make the best predictions: the observed y-values should stray as little as possible from the line. The vertical distances from the observed values to their predicted counterparts on the line are called residuals, and these residuals are referred to as the errors in predicting y. As in any prediction or estimation process you want these errors to be as small as possible. To accomplish this objective of minimum error, we select the method of least squares: that is, we minimize the sum of the squared residuals. Mathematically, the residuals and the sum of squared residuals appear as follows:
Residual: ei = yi − ŷi
Sum of squared residuals: SSE = Σ(yi − ŷi)²
A unique solution is provided through calculus (not shown!), assuring us that there is in fact one best line. The calculus solution results in the following calculations for b0 and b1:
b1 = r(Sy/Sx) and b0 = ȳ − b1x̄
Another way of looking at the least squares regression line is that when x takes its mean value, y should also take its mean value. That is, the regression line always passes through the point (x̄, ȳ). As to the other expressions in the slope equation, Sy refers to the square root of the sum of squared deviations between the observed values of y and the mean of y; similarly, Sx refers to the square root of the sum of squared deviations between the observed values of x and the mean of x. The sketch below verifies these formulas numerically.
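As a numerical illustration (with hypothetical data, since the Exam Data values are not reproduced here), the following Python sketch computes b0 and b1 from the formulas above and cross-checks them against a library fit:

```python
import numpy as np

# Hypothetical data standing in for the Exam Data set.
x = np.array([84.44, 76.2, 91.0, 68.5, 79.3])  # quiz averages (made up)
y = np.array([90.0, 71.0, 88.0, 64.0, 75.0])   # final scores (made up)

r = np.corrcoef(x, y)[0, 1]
b1 = r * np.std(y, ddof=1) / np.std(x, ddof=1)  # slope: b1 = r * (Sy / Sx)
b0 = np.mean(y) - b1 * np.mean(x)               # line passes through (x-bar, y-bar)

# Cross-check against NumPy's own least squares fit.
slope, intercept = np.polyfit(x, y, deg=1)
print(b1, slope)      # the two slopes agree
print(b0, intercept)  # the two intercepts agree
```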
To perform a regression on the Exam Data we can use either Minitab or SPSS. In Minitab:
- From the menu bar select Stat > Regression > Regression
- In the text box by Response enter the variable Final
- In the text box by Predictors enter the variable Quiz Average
- Click the Storage button and select Residuals and Fits (you do not have to do this in order to calculate the line in Minitab, but we are doing this here for further explanation)
- Click OK and OK again.
Along with the output, the following are the first five rows of the data in the worksheet, now including the stored fitted values (FITS) and residuals (RESI):
To perform a regression analysis in SPSS:
- Import the data set
- From the menu bar select Analyze > Regression > Linear
- Click on variable Final and enter this in the Dependent box.
- Click the variable Quiz Average and enter this in the Independent box.
- Click OK
This should result in the following regression output:
WOW! This is quite a bit of output. We will take this output apart and you will see that these results are not too complicated. Also, if you hover your mouse over various parts of the output in Minitab, pop-ups will appear with explanations.
The Output
From the output we see:
- The fitted equation is “Final = 12.1 + 0.751 Quiz Average”.
- The value R-square = 37.0% is the coefficient of determination (more on that later); if we take the square root of 0.37 we get 0.608, which is the correlation value that we found previously for this data set.
NOTE: Recall that a positive number has both a positive and a negative square root (think of the square roots of 2). Thus the sign of the correlation is related to the sign of the slope.
For example, if we substitute the first Quiz Average of 84.44 into the regression equation we get: Final = 12.1 + 0.751*(84.44) ≈ 75.51, which matches the first value in the FITS column, 75.5598, up to rounding (Minitab stores fits computed from the unrounded coefficients). Using this stored value, we can compute the first residual under RESI by taking the difference between the observed y and this fitted value ŷ: 90 − 75.5598 = 14.4402. Similar calculations are continued to produce the remaining fitted values and residuals.
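The same arithmetic in a few lines of Python, using only the rounded coefficients and the first student's scores from the output above:

```python
# First fitted value and residual, using the rounded coefficients from the
# output above (Minitab's stored FITS/RESI use unrounded coefficients, so
# they differ slightly: 75.5598 and 14.4402).
b0, b1 = 12.1, 0.751          # estimated intercept and slope
quiz_avg, final = 84.44, 90   # first student's quiz average and final score

fit = b0 + b1 * quiz_avg      # predicted final: about 75.51
residual = final - fit        # observed minus predicted: about 14.49
print(fit, residual)
```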
Coefficient of Determination, R²
The values of the response variable vary in regression problems (think of how not all people of the same height have the same weight), in which we attempt to predict the value of y from the explanatory variable x. The amount of variation in the response variable that can be explained (i.e. accounted for) by the explanatory variable is denoted by R². In our Exam Data example this value is 37%, meaning that 37% of the variation in the Final scores can be explained (now you know why x is also referred to as an explanatory variable) by the Quiz Averages. Since this value appears in the output and is related to the correlation, we mention R² now; we will take a further look at this statistic in a future lesson.
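For simple linear regression, R² can be computed either as 1 minus the ratio of unexplained to total variation, or as the square of the correlation; a sketch with hypothetical data shows the two agree:

```python
import numpy as np

# Hypothetical data, not the Exam Data set.
x = np.array([84.44, 76.2, 91.0, 68.5, 79.3])
y = np.array([90.0, 71.0, 88.0, 64.0, 75.0])

slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x

sse = np.sum((y - fitted) ** 2)       # unexplained variation
sst = np.sum((y - np.mean(y)) ** 2)   # total variation in y
r = np.corrcoef(x, y)[0, 1]

print(1 - sse / sst)  # R-squared: share of variation explained
print(r ** 2)         # equals R-squared for simple linear regression
```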
Residuals or Prediction Error
As with most predictions about anything, you expect there to be some error; that is, you expect the prediction not to be exactly correct (e.g., when predicting a final voting percentage, you would expect the prediction to be accurate but not necessarily match the exact final percentage). Also, in regression, not every x-value is paired with the same y-value, as we mentioned earlier regarding how not every person with the same height (x-variable) has the same weight (y-variable). These errors in regression predictions are called prediction errors or residuals. The residuals are calculated by taking the observed y-value minus its corresponding predicted y-value, or e = y − ŷ. Therefore we have as many residuals as we do y observations. The objective in least squares regression is to select the line that minimizes the sum of the squared residuals: in essence, we create a best fit line that has the least amount of prediction error.
© 2008 The Pennsylvania State University. All rights reserved.
Relationships Inbetween Two Variables
Relationships Inbetween Two Variables
As we did when considering only one variable, we begin with a graphical display. A scatterplot is the most useful display technology for comparing two quantitative variables. We plot on the y-axis the variable we consider the response variable and on the x-axis we place the explanatory or predictor variable.
How do we determine which variable is which? In general, the explanatory variable attempts to explain, or predict, the observed outcome. The response variable measures the outcome of a probe. One may even consider exploring whether one variable causes the variation in another variable – for example, a popular research probe is that taller people are more likely to receive higher salaries. In this case, Height would be the explanatory variable used to explain the variation in the response variable Salaries.
In summarizing the relationship inbetween two quantitative variables, we need to consider:
- Association/Direction (i.e. positive or negative)
- Form (i.e. linear or non-linear)
- Strength (feeble, moderate, strong)
Example
We will refer to the Exam Data set, (Final.MTW or Final.XLS), that consists of random sample of fifty students who took Stat200 last semester. The data consists of their semester average on mastery quizzes and their score on the final exam. We construct a scatterplot demonstrating the relationship inbetween Quiz Average (explanatory or predictor variable) and Final (response variable). Thus, we are studying whether student spectacle on the mastery quizzes explains the variation in their final exam score. That is, can mastery quiz spectacle be considered a predictor of final exam score? We create this graph using either Minitab or SPSS:
- Opening the Exam Data set.
- From the menu bar select Graph > Scatterplot > Elementary
- In the text box under Y Variables come in Final and under X Variables inject Quiz Average
- Click OK
To create a scatterplot in SPSS:
- Import the data set
- From the menu bar select Graphs > Legacy Dialogs > Scatter/Dot
- Select the square Elementary Scatter and then click Define.
- Click on variable Final and come in this in the Y_Axis box.
- Click the variable Quiz Average and come in this in the X_Axis box.
- Click OK
This should result in the following scatterplot:
Association/Direction and Form
We can interpret from either graph that there is a positive association inbetween Quiz Average and Final: low values of quiz average are accompanied by lower final scores and the same for higher quiz and final scores. If this relationship were reversed, high quizzes with low finals, then the graph would have displayed a negative association. That is, the points in the graph would have decreased going from right to left.
The scatterplot can also be used to provide a description of the form. From this example we can see that the relationship is linear. That is, there does not emerge to be a switch in the direction in the relationship.
Strength
In order to measure the strength of a linear relationship inbetween two quantitative variables we use correlation. Correlation is the measure of the strength of a linear relationship. We calculate correlation in Minitab by (using the Exam Data):
- From the menu bar select Stat > Basic Statistics > Correlation
- In the window box under Variables Final and Quiz Average
- Click OK (for now we will disregard the p-value in the output)
The output gives us a Pearson Correlation of 0.609
Correlation Properties (NOTE: the symbol for correlation is r)
- Correlation is unit free. If we switched the final exam scores from percents to decimals the correlation would remain the same.
- Correlation, r, is limited to – one ≤ r ≤ 1.
- For a positive association, r > 0; for a negative association r < 0.
- Correlation, r, measures the linear association inbetween two quantitative variables.
- Correlation measures the strength of a linear relationship only. (See the following Scatterplot for display where the correlation is zero but the two variables are obviously related.)
- The closer r is to zero the weaker the relationship; the closer to one or – one the stronger the relationship. The sign of the correlation provides direction only.
- Correlation can be affected by outliers
Equations of Straight Lines: Review
The equation of a straight line is given by y = a + bx. When x = 0, y = a, the intercept of the line; b is the slope of the line: it measures the switch in y per unit switch in x.
For the ‘Data 1’ the equation is y = three + 2x ; the intercept is three and the slope is Two. The line slopes upward, indicating a positive relationship inbetween x and y.
For the ‘Data Two’ the equation is y = thirteen – 2x ; the intercept is thirteen and the slope is -2. The line slopes downward, indicating a negative relationship inbetween x and y.
The relationship inbetween x and y is ‘ideal’ for these two examples–the points fall exactly on a straight line or the value of y is determined exactly by the value of x. Our interest will be worried with relationships inbetween two variables which are not flawless. The ‘Correlation’ inbetween x and y is r = 1.00 for the values of x and y on the left and r = -1.00 for the values of x and y on the right.
Regression analysis is worried with finding the ‘best’ fitting line for predicting the average value of a response variable y using a predictor variable x.
Here is an applet developed by the folks at Rice University called “Regression by Eye”. The object here is to give you a chance to draw what you this is the ‘best fitting line”.
no applet support
Click the Begin button and draw your best regression line through the data. You may repeat this procedure several times. As you draw these lines, how do you determine which line is better? Click the Draw Regression line box and the correct regression line is plotted for you. How would you quantify how close your line is to the correct reaction?
Least Squares Regression
The best description of many relationships inbetween two quantitative variables can be achieved using a straight line. In statistics, this line is referred to as a regression line. Historically, this term is associated with Master Francis Galton who in the mid 1800`s studied the phenomenon that children of tall parents tended to «regress» toward mediocrity.
Adjusting the algebraic line expression, the regression line is written as:
Here, bo is the y-intercept and b1 is the slope of the regression line.
Some questions to consider are:
- Is there only one «best» line?
- If so, how is this line found?
- Assuming we have decently fitted a line to the data, what does this line tell us?
By answering the third question we should build up insight into the very first two questions.
We use the regression line to predict a value of for any given value of X. The «best» line would make the best predictions: the observed y-values should stray as little as possible from the line. The vertical distances from the observed values to their predicted counterparts on the line are called residuals and these residuals are referred to as the errors in predicting y. As in any prediction or estimation process you want these errors to be as petite as possible. To accomplish this aim of minimum error, we select the method of least squares: that is, we minimize the sum of the squared residuals. Mathematically, the residuals and sum of squared residuals shows up as goes after:
Sum of squared residuals:
A unique solution is provided through calculus (not shown!), assuring us that there is in fact one best line. Calculus solutions result in the following calculations for bo and b1:
Another way of looking at the least squares regression line is that when x takes its mean value then y should also takes its mean value. That is, the regression line should always pass through the point . As to the other expressions in the slope equation, Sy refers to the square root of the sum of squared deviations inbetween the observed values of y and mean of y; similarly, Sx refers to the square root of the sum of squared deviations inbetween the observed values of x and the mean of x.
To perform a regression on the Exam Data we can use either Minitab or SPSS:
- From the menu bar select Stat > Regression > Regression
- In the window box by Response come in the variable Final
- In the window box by Predictors come in the variable Quiz Average
- Click the Storage button and select Residuals and Fits (you do not have to do this in order to calculate the line in Minitab, but we are doing this here for further explanation)
- Click OK and OK again.
Plus the following is the very first five rows of the data in the worksheet:
To perform a regression analysis in SPSS:
- Import the data set
- From the menu bar select Analyze > Regression > Linear
- Click on variable Final and inject this in the Dependent box.
- Click the variable Quiz Average and come in this in the Independent box.
- Click OK
This should result in the following regression output:
WOW! This is fairly a bit of output. We will take this data apart and you will see that these results are not too complicated. Also, if you dangle your mouse over various parts of the output in Minitab pop-ups will show up with explanations.
The Output
From the output we see:
- Fitted equation is «Final = 12.1 + 0.751 Quiz Average».
- A value of R-square = 37.0% which is the coefficient of determination (more on that later) which if we take the square root of 0.37 we get 0.608 which is the correlation value that we found previously for this data set.
NOTE: Reminisce that the square root of a value can be positive or negative (think of the square root of Two). Thus the sign of the correlation is related to the sign of the slope.
For example, if we substitute the very first Quiz Average of 84.44 into the regression equation we get: Final = 12.1 + 0.751*(84.44) = 75.5598 which is the very first value in the FITS column. Using this value, we can compute the very first residual under RESI by taking the difference inbetween the observed y and this fitted : ninety – 75.5598 = 14.4402. Similar calculations are continued to produce the remaining fitted values and residuals.
Coefficient of Determination, R Two
The values of the response variable vary in regression problems (think of how not all people of the same height have the same weight), in which we attempt to predict the value of y from the explanatory variable x. The amount of variation in the response variable that can be explained (i.e. accounted for) by the explanatory variable is denoted by R two . In our Exam Data example this value is 37% meaning that 37% of the variation in the Final averages can be explained (now you know why this is also referred to as an explanatory variable) by the Quiz Averages. Since this value is in the output and is related to the correlation we mention R two now; we will take a further look at this statistic in a future lesson.
Residuals or Prediction Error
As with most predictions about anything you expect there to be some error, that is you expect the prediction to not be exactly correct (e.g. when predicting the final voting percentage you would expect the prediction to be accurate but not necessarily the exact final voting percentage). Also, in regression, usually not every X variable has the same Y variable as we mentioned earlier regarding that not every person with the same height (x-variable) would have the same weight (y-variable). These errors in regression predictions are called prediction error or residuals. The residuals are calculated by taking the observed Y-value minus its corresponding predicted Y-value or . Therefore we would have as many residuals as we do y observations. The objective in least squares regression is to select the line that minimizes these residuals: in essence we create a best fit line that has the least amount of error.
© two thousand eight The Pennsylvania State University. All rights reserved.
Relationships Inbetween Two Variables
Relationships Inbetween Two Variables
As we did when considering only one variable, we begin with a graphical display. A scatterplot is the most useful display technology for comparing two quantitative variables. We plot on the y-axis the variable we consider the response variable and on the x-axis we place the explanatory or predictor variable.
How do we determine which variable is which? In general, the explanatory variable attempts to explain, or predict, the observed outcome. The response variable measures the outcome of a investigate. One may even consider exploring whether one variable causes the variation in another variable – for example, a popular research investigate is that taller people are more likely to receive higher salaries. In this case, Height would be the explanatory variable used to explain the variation in the response variable Salaries.
In summarizing the relationship inbetween two quantitative variables, we need to consider:
- Association/Direction (i.e. positive or negative)
- Form (i.e. linear or non-linear)
- Strength (feeble, moderate, strong)
Example
We will refer to the Exam Data set, (Final.MTW or Final.XLS), that consists of random sample of fifty students who took Stat200 last semester. The data consists of their semester average on mastery quizzes and their score on the final exam. We construct a scatterplot demonstrating the relationship inbetween Quiz Average (explanatory or predictor variable) and Final (response variable). Thus, we are studying whether student spectacle on the mastery quizzes explains the variation in their final exam score. That is, can mastery quiz spectacle be considered a predictor of final exam score? We create this graph using either Minitab or SPSS:
- Opening the Exam Data set.
- From the menu bar select Graph > Scatterplot > Ordinary
- In the text box under Y Variables inject Final and under X Variables inject Quiz Average
- Click OK
To create a scatterplot in SPSS:
- Import the data set
- From the menu bar select Graphs > Legacy Dialogs > Scatter/Dot
- Select the square Ordinary Scatter and then click Define.
- Click on variable Final and inject this in the Y_Axis box.
- Click the variable Quiz Average and come in this in the X_Axis box.
- Click OK
This should result in the following scatterplot:
Association/Direction and Form
We can interpret from either graph that there is a positive association inbetween Quiz Average and Final: low values of quiz average are accompanied by lower final scores and the same for higher quiz and final scores. If this relationship were reversed, high quizzes with low finals, then the graph would have displayed a negative association. That is, the points in the graph would have decreased going from right to left.
The scatterplot can also be used to provide a description of the form. From this example we can see that the relationship is linear. That is, there does not show up to be a switch in the direction in the relationship.
Strength
In order to measure the strength of a linear relationship inbetween two quantitative variables we use correlation. Correlation is the measure of the strength of a linear relationship. We calculate correlation in Minitab by (using the Exam Data):
- From the menu bar select Stat > Basic Statistics > Correlation
- In the window box under Variables Final and Quiz Average
- Click OK (for now we will disregard the p-value in the output)
The output gives us a Pearson Correlation of 0.609
Correlation Properties (NOTE: the symbol for correlation is r)
- Correlation is unit free. If we switched the final exam scores from percents to decimals the correlation would remain the same.
- Correlation, r, is limited to – one ≤ r ≤ 1.
- For a positive association, r > 0; for a negative association r < 0.
- Correlation, r, measures the linear association inbetween two quantitative variables.
- Correlation measures the strength of a linear relationship only. (See the following Scatterplot for display where the correlation is zero but the two variables are obviously related.)
- The closer r is to zero the weaker the relationship; the closer to one or – one the stronger the relationship. The sign of the correlation provides direction only.
- Correlation can be affected by outliers
Equations of Straight Lines: Review
The equation of a straight line is given by y = a + bx. When x = 0, y = a, the intercept of the line; b is the slope of the line: it measures the switch in y per unit switch in x.
For the ‘Data 1’ the equation is y = three + 2x ; the intercept is three and the slope is Two. The line slopes upward, indicating a positive relationship inbetween x and y.
For the ‘Data Two’ the equation is y = thirteen – 2x ; the intercept is thirteen and the slope is -2. The line slopes downward, indicating a negative relationship inbetween x and y.
The relationship inbetween x and y is ‘ideal’ for these two examples–the points fall exactly on a straight line or the value of y is determined exactly by the value of x. Our interest will be worried with relationships inbetween two variables which are not ideal. The ‘Correlation’ inbetween x and y is r = 1.00 for the values of x and y on the left and r = -1.00 for the values of x and y on the right.
Regression analysis is worried with finding the ‘best’ fitting line for predicting the average value of a response variable y using a predictor variable x.
Here is an applet developed by the folks at Rice University called “Regression by Eye”. The object here is to give you a chance to draw what you this is the ‘best fitting line”.
no applet support
Click the Begin button and draw your best regression line through the data. You may repeat this procedure several times. As you draw these lines, how do you determine which line is better? Click the Draw Regression line box and the correct regression line is plotted for you. How would you quantify how close your line is to the correct reaction?
Least Squares Regression
The best description of many relationships inbetween two quantitative variables can be achieved using a straight line. In statistics, this line is referred to as a regression line. Historically, this term is associated with Master Francis Galton who in the mid 1800`s studied the phenomenon that children of tall parents tended to «regress» toward mediocrity.
Adjusting the algebraic line expression, the regression line is written as:
Here, bo is the y-intercept and b1 is the slope of the regression line.
Some questions to consider are:
- Is there only one «best» line?
- If so, how is this line found?
- Assuming we have decently fitted a line to the data, what does this line tell us?
By answering the third question we should build up insight into the very first two questions.
We use the regression line to predict a value of for any given value of X. The «best» line would make the best predictions: the observed y-values should stray as little as possible from the line. The vertical distances from the observed values to their predicted counterparts on the line are called residuals and these residuals are referred to as the errors in predicting y. As in any prediction or estimation process you want these errors to be as petite as possible. To accomplish this objective of minimum error, we select the method of least squares: that is, we minimize the sum of the squared residuals. Mathematically, the residuals and sum of squared residuals emerges as goes after:
Sum of squared residuals:
A unique solution is provided through calculus (not shown!), assuring us that there is in fact one best line. Calculus solutions result in the following calculations for bo and b1:
Another way of looking at the least squares regression line is that when x takes its mean value then y should also takes its mean value. That is, the regression line should always pass through the point . As to the other expressions in the slope equation, Sy refers to the square root of the sum of squared deviations inbetween the observed values of y and mean of y; similarly, Sx refers to the square root of the sum of squared deviations inbetween the observed values of x and the mean of x.
To perform a regression on the Exam Data we can use either Minitab or SPSS:
- From the menu bar select Stat > Regression > Regression
- In the window box by Response inject the variable Final
- In the window box by Predictors come in the variable Quiz Average
- Click the Storage button and select Residuals and Fits (you do not have to do this in order to calculate the line in Minitab, but we are doing this here for further explanation)
- Click OK and OK again.
Plus the following is the very first five rows of the data in the worksheet:
To perform a regression analysis in SPSS:
- Import the data set
- From the menu bar select Analyze > Regression > Linear
- Click on variable Final and inject this in the Dependent box.
- Click the variable Quiz Average and come in this in the Independent box.
- Click OK
This should result in the following regression output:
WOW! This is fairly a bit of output. We will take this data apart and you will see that these results are not too complicated. Also, if you drape your mouse over various parts of the output in Minitab pop-ups will emerge with explanations.
The Output
From the output we see:
- Fitted equation is «Final = 12.1 + 0.751 Quiz Average».
- A value of R-square = 37.0% which is the coefficient of determination (more on that later) which if we take the square root of 0.37 we get 0.608 which is the correlation value that we found previously for this data set.
NOTE: Recall that the square root of a value can be positive or negative (think of the square root of Two). Thus the sign of the correlation is related to the sign of the slope.
For example, if we substitute the very first Quiz Average of 84.44 into the regression equation we get: Final = 12.1 + 0.751*(84.44) = 75.5598 which is the very first value in the FITS column. Using this value, we can compute the very first residual under RESI by taking the difference inbetween the observed y and this fitted : ninety – 75.5598 = 14.4402. Similar calculations are continued to produce the remaining fitted values and residuals.
Coefficient of Determination, R Two
The values of the response variable vary in regression problems (think of how not all people of the same height have the same weight), in which we attempt to predict the value of y from the explanatory variable x. The amount of variation in the response variable that can be explained (i.e. accounted for) by the explanatory variable is denoted by R two . In our Exam Data example this value is 37% meaning that 37% of the variation in the Final averages can be explained (now you know why this is also referred to as an explanatory variable) by the Quiz Averages. Since this value is in the output and is related to the correlation we mention R two now; we will take a further look at this statistic in a future lesson.
Residuals or Prediction Error
As with most predictions about anything you expect there to be some error, that is you expect the prediction to not be exactly correct (e.g. when predicting the final voting percentage you would expect the prediction to be accurate but not necessarily the exact final voting percentage). Also, in regression, usually not every X variable has the same Y variable as we mentioned earlier regarding that not every person with the same height (x-variable) would have the same weight (y-variable). These errors in regression predictions are called prediction error or residuals. The residuals are calculated by taking the observed Y-value minus its corresponding predicted Y-value or . Therefore we would have as many residuals as we do y observations. The aim in least squares regression is to select the line that minimizes these residuals: in essence we create a best fit line that has the least amount of error.
© two thousand eight The Pennsylvania State University. All rights reserved.
Relationships Inbetween Two Variables
Relationships Inbetween Two Variables
As we did when considering only one variable, we begin with a graphical display. A scatterplot is the most useful display mechanism for comparing two quantitative variables. We plot on the y-axis the variable we consider the response variable and on the x-axis we place the explanatory or predictor variable.
How do we determine which variable is which? In general, the explanatory variable attempts to explain, or predict, the observed outcome. The response variable measures the outcome of a investigate. One may even consider exploring whether one variable causes the variation in another variable – for example, a popular research explore is that taller people are more likely to receive higher salaries. In this case, Height would be the explanatory variable used to explain the variation in the response variable Salaries.
In summarizing the relationship inbetween two quantitative variables, we need to consider:
- Association/Direction (i.e. positive or negative)
- Form (i.e. linear or non-linear)
- Strength (powerless, moderate, strong)
Example
We will refer to the Exam Data set, (Final.MTW or Final.XLS), that consists of random sample of fifty students who took Stat200 last semester. The data consists of their semester average on mastery quizzes and their score on the final exam. We construct a scatterplot showcasing the relationship inbetween Quiz Average (explanatory or predictor variable) and Final (response variable). Thus, we are studying whether student spectacle on the mastery quizzes explains the variation in their final exam score. That is, can mastery quiz spectacle be considered a predictor of final exam score? We create this graph using either Minitab or SPSS:
- Opening the Exam Data set.
- From the menu bar select Graph > Scatterplot > Elementary
- In the text box under Y Variables inject Final and under X Variables inject Quiz Average
- Click OK
To create a scatterplot in SPSS:
- Import the data set
- From the menu bar select Graphs > Legacy Dialogs > Scatter/Dot
- Select the square Elementary Scatter and then click Define.
- Click on variable Final and inject this in the Y_Axis box.
- Click the variable Quiz Average and come in this in the X_Axis box.
- Click OK
This should result in the following scatterplot:
Association/Direction and Form
We can interpret from either graph that there is a positive association inbetween Quiz Average and Final: low values of quiz average are accompanied by lower final scores and the same for higher quiz and final scores. If this relationship were reversed, high quizzes with low finals, then the graph would have displayed a negative association. That is, the points in the graph would have decreased going from right to left.
The scatterplot can also be used to provide a description of the form. From this example we can see that the relationship is linear. That is, there does not emerge to be a switch in the direction in the relationship.
Strength
In order to measure the strength of a linear relationship inbetween two quantitative variables we use correlation. Correlation is the measure of the strength of a linear relationship. We calculate correlation in Minitab by (using the Exam Data):
- From the menu bar select Stat > Basic Statistics > Correlation
- In the window box under Variables Final and Quiz Average
- Click OK (for now we will disregard the p-value in the output)
The output gives us a Pearson Correlation of 0.609
Correlation Properties (NOTE: the symbol for correlation is r)
- Correlation is unit free. If we switched the final exam scores from percents to decimals the correlation would remain the same.
- Correlation, r, is limited to – one ≤ r ≤ 1.
- For a positive association, r > 0; for a negative association r < 0.
- Correlation, r, measures the linear association inbetween two quantitative variables.
- Correlation measures the strength of a linear relationship only. (See the following Scatterplot for display where the correlation is zero but the two variables are obviously related.)
- The closer r is to zero the weaker the relationship; the closer to one or – one the stronger the relationship. The sign of the correlation provides direction only.
- Correlation can be affected by outliers
Equations of Straight Lines: Review
The equation of a straight line is given by y = a + bx. When x = 0, y = a, the intercept of the line; b is the slope of the line: it measures the switch in y per unit switch in x.
For the ‘Data 1’ the equation is y = three + 2x ; the intercept is three and the slope is Two. The line slopes upward, indicating a positive relationship inbetween x and y.
For the ‘Data Two’ the equation is y = thirteen – 2x ; the intercept is thirteen and the slope is -2. The line slopes downward, indicating a negative relationship inbetween x and y.
The relationship inbetween x and y is ‘flawless’ for these two examples–the points fall exactly on a straight line or the value of y is determined exactly by the value of x. Our interest will be worried with relationships inbetween two variables which are not ideal. The ‘Correlation’ inbetween x and y is r = 1.00 for the values of x and y on the left and r = -1.00 for the values of x and y on the right.
Regression analysis is worried with finding the ‘best’ fitting line for predicting the average value of a response variable y using a predictor variable x.
Here is an applet developed by the folks at Rice University called “Regression by Eye”. The object here is to give you a chance to draw what you this is the ‘best fitting line”.
no applet support
Click the Begin button and draw your best regression line through the data. You may repeat this procedure several times. As you draw these lines, how do you determine which line is better? Click the Draw Regression line box and the correct regression line is plotted for you. How would you quantify how close your line is to the correct reaction?
Least Squares Regression
The best description of many relationships inbetween two quantitative variables can be achieved using a straight line. In statistics, this line is referred to as a regression line. Historically, this term is associated with Tormentor Francis Galton who in the mid 1800`s studied the phenomenon that children of tall parents tended to «regress» toward mediocrity.
Adjusting the algebraic line expression, the regression line is written as:
Here, bo is the y-intercept and b1 is the slope of the regression line.
Some questions to consider are:
- Is there only one «best» line?
- If so, how is this line found?
- Assuming we have decently fitted a line to the data, what does this line tell us?
By answering the third question we should build up insight into the very first two questions.
We use the regression line to predict a value of for any given value of X. The «best» line would make the best predictions: the observed y-values should stray as little as possible from the line. The vertical distances from the observed values to their predicted counterparts on the line are called residuals and these residuals are referred to as the errors in predicting y. As in any prediction or estimation process you want these errors to be as puny as possible. To accomplish this aim of minimum error, we select the method of least squares: that is, we minimize the sum of the squared residuals. Mathematically, the residuals and sum of squared residuals emerges as goes after:
Sum of squared residuals:
A unique solution is provided through calculus (not shown!), assuring us that there is in fact one best line. Calculus solutions result in the following calculations for bo and b1:
Another way of looking at the least squares regression line is that when x takes its mean value then y should also takes its mean value. That is, the regression line should always pass through the point . As to the other expressions in the slope equation, Sy refers to the square root of the sum of squared deviations inbetween the observed values of y and mean of y; similarly, Sx refers to the square root of the sum of squared deviations inbetween the observed values of x and the mean of x.
To perform a regression on the Exam Data we can use either Minitab or SPSS:
- From the menu bar select Stat > Regression > Regression
- In the window box by Response come in the variable Final
- In the window box by Predictors come in the variable Quiz Average
- Click the Storage button and select Residuals and Fits (you do not have to do this in order to calculate the line in Minitab, but we are doing this here for further explanation)
- Click OK and OK again.
Plus the following is the very first five rows of the data in the worksheet:
To perform a regression analysis in SPSS:
- Import the data set
- From the menu bar select Analyze > Regression > Linear
- Click on variable Final and come in this in the Dependent box.
- Click the variable Quiz Average and come in this in the Independent box.
- Click OK
This should result in the following regression output:
WOW! This is fairly a bit of output. We will take this data apart and you will see that these results are not too complicated. Also, if you drape your mouse over various parts of the output in Minitab pop-ups will emerge with explanations.
The Output
From the output we see:
- Fitted equation is «Final = 12.1 + 0.751 Quiz Average».
- A value of R-square = 37.0% which is the coefficient of determination (more on that later) which if we take the square root of 0.37 we get 0.608 which is the correlation value that we found previously for this data set.
NOTE: Reminisce that the square root of a value can be positive or negative (think of the square root of Two). Thus the sign of the correlation is related to the sign of the slope.
For example, if we substitute the very first Quiz Average of 84.44 into the regression equation we get: Final = 12.1 + 0.751*(84.44) = 75.5598 which is the very first value in the FITS column. Using this value, we can compute the very first residual under RESI by taking the difference inbetween the observed y and this fitted : ninety – 75.5598 = 14.4402. Similar calculations are continued to produce the remaining fitted values and residuals.
Coefficient of Determination, R Two
The values of the response variable vary in regression problems (think of how not all people of the same height have the same weight), in which we attempt to predict the value of y from the explanatory variable x. The amount of variation in the response variable that can be explained (i.e. accounted for) by the explanatory variable is denoted by R two . In our Exam Data example this value is 37% meaning that 37% of the variation in the Final averages can be explained (now you know why this is also referred to as an explanatory variable) by the Quiz Averages. Since this value is in the output and is related to the correlation we mention R two now; we will take a further look at this statistic in a future lesson.
Residuals or Prediction Error
As with most predictions about anything you expect there to be some error, that is you expect the prediction to not be exactly correct (e.g. when predicting the final voting percentage you would expect the prediction to be accurate but not necessarily the exact final voting percentage). Also, in regression, usually not every X variable has the same Y variable as we mentioned earlier regarding that not every person with the same height (x-variable) would have the same weight (y-variable). These errors in regression predictions are called prediction error or residuals. The residuals are calculated by taking the observed Y-value minus its corresponding predicted Y-value or . Therefore we would have as many residuals as we do y observations. The objective in least squares regression is to select the line that minimizes these residuals: in essence we create a best fit line that has the least amount of error.
© two thousand eight The Pennsylvania State University. All rights reserved.
Relationships Inbetween Two Variables
Relationships Inbetween Two Variables
As we did when considering only one variable, we begin with a graphical display. A scatterplot is the most useful display mechanism for comparing two quantitative variables. We plot on the y-axis the variable we consider the response variable and on the x-axis we place the explanatory or predictor variable.
How do we determine which variable is which? In general, the explanatory variable attempts to explain, or predict, the observed outcome. The response variable measures the outcome of a explore. One may even consider exploring whether one variable causes the variation in another variable – for example, a popular research explore is that taller people are more likely to receive higher salaries. In this case, Height would be the explanatory variable used to explain the variation in the response variable Salaries.
In summarizing the relationship inbetween two quantitative variables, we need to consider:
- Association/Direction (i.e. positive or negative)
- Form (i.e. linear or non-linear)
- Strength (feeble, moderate, strong)
Example
We will refer to the Exam Data set, (Final.MTW or Final.XLS), that consists of random sample of fifty students who took Stat200 last semester. The data consists of their semester average on mastery quizzes and their score on the final exam. We construct a scatterplot demonstrating the relationship inbetween Quiz Average (explanatory or predictor variable) and Final (response variable). Thus, we are studying whether student spectacle on the mastery quizzes explains the variation in their final exam score. That is, can mastery quiz spectacle be considered a predictor of final exam score? We create this graph using either Minitab or SPSS:
- Opening the Exam Data set.
- From the menu bar select Graph > Scatterplot > Elementary
- In the text box under Y Variables come in Final and under X Variables come in Quiz Average
- Click OK
To create a scatterplot in SPSS:
- Import the data set
- From the menu bar select Graphs > Legacy Dialogs > Scatter/Dot
- Select the square Elementary Scatter and then click Define.
- Click on variable Final and come in this in the Y_Axis box.
- Click the variable Quiz Average and inject this in the X_Axis box.
- Click OK
This should result in the following scatterplot:
Association/Direction and Form
We can interpret from either graph that there is a positive association inbetween Quiz Average and Final: low values of quiz average are accompanied by lower final scores and the same for higher quiz and final scores. If this relationship were reversed, high quizzes with low finals, then the graph would have displayed a negative association. That is, the points in the graph would have decreased going from right to left.
The scatterplot can also be used to provide a description of the form. From this example we can see that the relationship is linear. That is, there does not show up to be a switch in the direction in the relationship.
Strength
In order to measure the strength of a linear relationship inbetween two quantitative variables we use correlation. Correlation is the measure of the strength of a linear relationship. We calculate correlation in Minitab by (using the Exam Data):
- From the menu bar select Stat > Basic Statistics > Correlation
- In the window box under Variables Final and Quiz Average
- Click OK (for now we will disregard the p-value in the output)
The output gives us a Pearson Correlation of 0.609
Correlation Properties (NOTE: the symbol for correlation is r)
- Correlation is unit free. If we switched the final exam scores from percents to decimals the correlation would remain the same.
- Correlation, r, is limited to – one ≤ r ≤ 1.
- For a positive association, r > 0; for a negative association r < 0.
- Correlation, r, measures the linear association inbetween two quantitative variables.
- Correlation measures the strength of a linear relationship only. (See the following Scatterplot for display where the correlation is zero but the two variables are obviously related.)
- The closer r is to zero the weaker the relationship; the closer to one or – one the stronger the relationship. The sign of the correlation provides direction only.
- Correlation can be affected by outliers
Equations of Straight Lines: Review
The equation of a straight line is given by y = a + bx. When x = 0, y = a, the intercept of the line; b is the slope of the line: it measures the switch in y per unit switch in x.
For the ‘Data 1’ the equation is y = three + 2x ; the intercept is three and the slope is Two. The line slopes upward, indicating a positive relationship inbetween x and y.
For the ‘Data Two’ the equation is y = thirteen – 2x ; the intercept is thirteen and the slope is -2. The line slopes downward, indicating a negative relationship inbetween x and y.
The relationship inbetween x and y is ‘flawless’ for these two examples–the points fall exactly on a straight line or the value of y is determined exactly by the value of x. Our interest will be worried with relationships inbetween two variables which are not ideal. The ‘Correlation’ inbetween x and y is r = 1.00 for the values of x and y on the left and r = -1.00 for the values of x and y on the right.
Regression analysis is worried with finding the ‘best’ fitting line for predicting the average value of a response variable y using a predictor variable x.
Here is an applet developed by the folks at Rice University called “Regression by Eye”. The object here is to give you a chance to draw what you this is the ‘best fitting line”.
no applet support
Click the Begin button and draw your best regression line through the data. You may repeat this procedure several times. As you draw these lines, how do you determine which line is better? Click the Draw Regression line box and the correct regression line is plotted for you. How would you quantify how close your line is to the correct reaction?
Least Squares Regression
The best description of many relationships inbetween two quantitative variables can be achieved using a straight line. In statistics, this line is referred to as a regression line. Historically, this term is associated with Master Francis Galton who in the mid 1800`s studied the phenomenon that children of tall parents tended to «regress» toward mediocrity.
Adjusting the algebraic line expression, the regression line is written as:
Here, bo is the y-intercept and b1 is the slope of the regression line.
Some questions to consider are:
- Is there only one «best» line?
- If so, how is this line found?
- Assuming we have decently fitted a line to the data, what does this line tell us?
By answering the third question we should build up insight into the very first two questions.
We use the regression line to predict a value of for any given value of X. The «best» line would make the best predictions: the observed y-values should stray as little as possible from the line. The vertical distances from the observed values to their predicted counterparts on the line are called residuals and these residuals are referred to as the errors in predicting y. As in any prediction or estimation process you want these errors to be as petite as possible. To accomplish this aim of minimum error, we select the method of least squares: that is, we minimize the sum of the squared residuals. Mathematically, the residuals and sum of squared residuals emerges as goes after:
Sum of squared residuals:
A unique solution is provided through calculus (not shown!), assuring us that there is in fact one best line. Calculus solutions result in the following calculations for bo and b1:
Another way of looking at the least squares regression line is that when x takes its mean value then y should also takes its mean value. That is, the regression line should always pass through the point . As to the other expressions in the slope equation, Sy refers to the square root of the sum of squared deviations inbetween the observed values of y and mean of y; similarly, Sx refers to the square root of the sum of squared deviations inbetween the observed values of x and the mean of x.
To perform a regression on the Exam Data we can use either Minitab or SPSS:
- From the menu bar select Stat > Regression > Regression
- In the window box by Response inject the variable Final
- In the window box by Predictors come in the variable Quiz Average
- Click the Storage button and select Residuals and Fits (you do not have to do this in order to calculate the line in Minitab, but we are doing this here for further explanation)
- Click OK and OK again.
Plus the following is the very first five rows of the data in the worksheet:
To perform a regression analysis in SPSS:
- Import the data set
- From the menu bar select Analyze > Regression > Linear
- Click on the variable Final and enter it in the Dependent box.
- Click the variable Quiz Average and enter it in the Independent box.
- Click OK
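For readers working outside Minitab or SPSS, the same fit can be reproduced in a few lines of Python with scipy.stats.linregress. The arrays below are hypothetical stand-ins for the Quiz Average and Final columns of the Exam Data set:

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for the two columns of Final.MTW / Final.XLS.
quiz = np.array([84.44, 92.0, 70.5, 88.2, 76.9, 81.3])
final = np.array([90.0, 86.0, 65.0, 84.0, 71.0, 80.0])

result = stats.linregress(quiz, final)

# The pieces of the Minitab/SPSS output discussed below.
print(f"Final = {result.intercept:.1f} + {result.slope:.3f} Quiz Average")
print(f"R-square = {result.rvalue ** 2:.1%}")

# Fitted values and residuals, as stored by Minitab's Storage option.
fits = result.intercept + result.slope * quiz
resi = final - fits
```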
Running the regression in Minitab or SPSS should produce the following output:
WOW! This is quite a bit of output. We will take this output apart and you will see that these results are not too complicated. Also, if you hover your mouse over various parts of the output in Minitab, pop-ups will appear with explanations.
The Output
From the output we see:
- Fitted equation is “Final = 12.1 + 0.751 Quiz Average”.
- A value of R-square = 37.0%, which is the coefficient of determination (more on that later). Taking the square root of 0.370 gives 0.608, which matches, up to rounding, the correlation of 0.609 we found previously for this data set.
NOTE: Remember that a square root can be positive or negative (both 0.608 and −0.608 square to 0.370). Thus the sign of the correlation matches the sign of the slope.
For example, if we substitute the first Quiz Average of 84.44 into the regression equation we get Final = 12.1 + 0.751 × 84.44 ≈ 75.51; the first value in the FITS column, 75.5598, is the same fit computed with unrounded coefficients. Using this stored value, we can compute the first residual under RESI by taking the difference between the observed y and the fitted ŷ: 90 − 75.5598 = 14.4402. Similar calculations produce the remaining fitted values and residuals.
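The first-row arithmetic can be checked directly. The small sketch below uses the numbers from the text; note that the rounded coefficients of the printed equation give a slightly different fit than the full-precision value Minitab stores:

```python
# Rounded coefficients from the printed equation.
b0, b1 = 12.1, 0.751

fit_rounded = b0 + b1 * 84.44    # 75.51444, from the rounded equation
stored_fit = 75.5598             # FITS value, from unrounded coefficients

residual = 90 - stored_fit       # observed y minus fitted y
print(fit_rounded, residual)     # 75.51444 14.4402
```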
Coefficient of Determination, R²
The values of the response variable vary in regression problems (think of how not all people of the same height have the same weight), in which we attempt to predict the value of y from the explanatory variable x. The proportion of variation in the response variable that can be explained (i.e., accounted for) by the explanatory variable is denoted by R². In our Exam Data example this value is 37%, meaning that 37% of the variation in the Final scores can be explained (now you know why x is also referred to as an explanatory variable) by the Quiz Averages. Since this value appears in the output and is related to the correlation, we mention R² now; we will take a further look at this statistic in a future lesson.
Residuals or Prediction Error
As with most predictions, you expect there to be some error; that is, you expect the prediction not to be exactly correct (e.g., when predicting the final voting percentage you would expect the prediction to be accurate, but not necessarily the exact final percentage). Also, in regression, not every x-value is paired with the same y-value, as we mentioned earlier: not every person of the same height (x-variable) has the same weight (y-variable). These errors in regression predictions are called prediction errors or residuals. A residual is calculated by taking the observed y-value minus its corresponding predicted y-value, e = y − ŷ. Therefore we have as many residuals as we have y observations. The goal in least squares regression is to select the line that minimizes the sum of these squared residuals: in essence, we create a best-fit line that has the least amount of error.
© 2008 The Pennsylvania State University. All rights reserved.