# Applied Statistics

## Introduction

Welcome to the Applied Statistics Wiki Page!

This page is meant to be edited by the students of Applied Business Statistics, mainly Dr. Yu's classes, but everyone else is invited as well. Hopefully this page will help those in the class and those who come after us. The more people who add information, the better this will be as an aid for studying for exams and understanding the material.

Good luck to everyone!

## Notes

### Chapter 7 - Multiple Regression

- IV = Independent Variable; DV = Dependent Variable

#### Regression Analysis

- Definition of Regression Analysis - a statistical technique that can be used to describe, explain, and predict the relationship between one variable and a set of other variables.

- Types/Classifications of Regression Analysis - there are many classifications, but we will focus on the following ones.

- Simple Regression - 1 DV and 1 IV.

- Multiple Regression - 1 DV and 2 or more IVs.

##### Simple Regression

- Simple Regression examples:
- Y = a + bx. This is a Simple Regression example.
- Real World Example - none given.

- Multiple Regression examples:

- Y = a + b1x1 + b2x2. This is a Multiple Regression example.
- Real World Example - Odds of winning (Y) are determined by two team stats: wins in the previous season (x1) and number of injured players (x2).

- Application of Regression Analysis Issues:

- 1. The Goodness of Fit - the quality of the model. If the model is no good, then you will get poor results which are unreliable. If the model is good, then you will get good results which are reliable. How do we find out if we built a good model? After the information is in, then:

- 1a. Look at the T-Test to test every coefficient. If a coefficient passes the T-Test, the coefficient is good, which means that particular variable is good, so you can use that variable to test what you are looking for. If it fails the test, the coefficient is no good, which implies that the variable itself is not valid/suitable. You must look at this test for each explanatory variable. If all of them fail, then you don't go any further. If not all of them fail, then you continue on to the F-Test.

- 1b. Look at the F-Test. This is performed once for the entire model. If you pass the F-Test, then the model is good. If you fail the F-Test, then you have a problem with the model design.

- 1c. R^2 = SSR/SST. This measures the goodness of fit. SSR = Sum of Squares Regression, SST = Sum of Squares Total. It is the amount of variation/behavior in what you are trying to measure that is explained by the model, relative to the total variation. R^2 shows that the model can be used to explain XXX% of the variation in Y (see equation above). The goal is to be at 1: the closer to 1, the better the model. However, what is acceptable is up to whoever is creating the model and what they are willing to accept as a "good" measure.

- SSE = Sum of Squares Error; this is where the variation not explained by the model (the percentage not counted in R^2) is located.
- In psychology, an R^2 of 30% is sometimes acceptable; in another field, the same R^2 may be unacceptable.
- R^2 alone does not determine the quality of the model.

- 2. Auto-Correlation - we will not be tested on this, but we do need to understand the concept. When you are using time-series data (data arranged over a period of time), some data is correlated whether you want it to be or not. Something that happened in an earlier period contributes to what happens in a later period.

- 3. Multicollinearity - a problem that occurs when moderate to high intercorrelations exist among predictor variables (IVs) to be used in a regression analysis.

- Example - If a very simple economy only produces Markers, then:
- GDP of 2009 = Price $1.00 per marker, Quantity of Markers Produced = 100. So GDP = 1 * 100 = $100.
- GDP of 2010 = Price $2.00 per marker, Quantity of Markers Produced = 100. So GDP = 2 * 100 = $200.
- If we were looking at the difference in GDP, we would think that GDP went up; however, it didn't really. Inflation was simply affecting the price of the markers.
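
Going back to item 1c, the pieces of R^2 fit together as shown below. This is the standard decomposition for a least-squares regression that includes a constant term; the numbers in the second line are made up purely for illustration and do not come from the class handouts.

$$
SST = SSR + SSE, \qquad R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
$$

$$
SST = 200,\; SSR = 180 \;\Rightarrow\; SSE = 20,\quad R^2 = \frac{180}{200} = 0.90
$$

In words: in this made-up case the model explains 90% of the variation in Y, and the remaining 10% sits in SSE.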

- Steps to Building a Regression Analysis
- Step 1 - Identify the set of variables that will affect the DV; these are called **explanatory variables**.

##### Multiple Regression

Multiple Regression is basically the same as Simple Regression except that it has more IVs. Again, the equation for a Multiple Regression is: Y = a + b1x1 + b2x2 + ... + bkxk, where k is the number of IVs in your test.

- Looking at the first example of the handout from today, we find the following:

- Step 1 - Identifying the explanatory variables (which means identifying the DV and IV(s) and the relationship between them).
- Sales = a + b1TP + b2PCI + Et
- TP = Target Population, PCI = Per Capita Income, Et = error term (the part of Sales that TP and PCI do not explain)

- This equation means that Sales is our DV, TP is our first IV, and PCI is our second IV.
- We would expect the relationship between TP and Sales to be positive, i.e. when the size of the population goes up, sales will go up. The same can be said for PCI and Sales: as PCI goes up, we expect Sales to go up, so we expect this to be a positive relationship as well.

- Ceteris Paribus = "all else being equal". This is relevant because we are only looking at one variable at a time and we expect everything else to stay the same.
- So in this equation, we are saying that Sales will go up when TP goes up ceteris paribus (in other words, assuming that PCI stays the same). We would say the same for PCI: we would expect Sales to go up when PCI goes up ceteris paribus (assuming that TP stays the same).

- Step 2 - Collect the data and input it.
- Step 3 - Run Regression analysis.
- Step 4 - Then look at the data and:
- 1. State the hypothesis.
- 1a. In this example: for TP, Ho: b1 = 0 (which means that there is no relationship between Sales and TP) and Ha: b1 is not equal to 0. For PCI, Ho: b2 = 0 (which means that there is no relationship between Sales and PCI) and Ha: b2 is not equal to 0.

- 2. Set alpha (the significance level).
- 2a. In this example, alpha = 0.05.

- 3. Setup rules.
- 3a. In this example: If t-cal > t-table OR t-cal < -t-table then reject Ho. Since we are not using the t-table, we can use the alternative approach, which is the p-value approach.
- 3b. In this example: If p-value < alpha then reject Ho. If you reject Ho, then you are saying that b1 is not equal to 0 and that there is a relationship between Sales and TP. If you fail to reject Ho, then you are saying that b1 could be equal to 0 and there is no evidence of a relationship between Sales and TP. The p-value tells us how confident we can be in this conclusion.

- 4. Once you have the data results, then you would perform the T-Test, F-Test, and R^2.

- If, in this example, TP fails the T-Test (we fail to reject Ho for TP), then we are saying that TP is a bad choice for an IV. So we could:
- 1. Drop TP and use PCI as our sole variable.
- 2. Replace TP with another variable. For example, we can replace TP with the overall standing of the economy or something of that nature.
- 3. (We don't need to know this one - just bonus info) Transform the data for TP into a different form, which would be a manipulation of the data.
- 4. (We don't need to know this one - just bonus info) Add more observations, which may reveal a correlation.
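
The two decision rules from Step 4 (the t-table rule in 3a and the p-value rule in 3b) can be sketched in a few lines of Python. This is only an illustration: the function names and the sample inputs are made up, and the t-table value of 2.0 is a placeholder rather than a value looked up for any particular sample size.

```python
def reject_by_p_value(p_value, alpha=0.05):
    """P-value approach: reject Ho when the p-value is below alpha."""
    return p_value < alpha


def reject_by_t_value(t_cal, t_table):
    """Two-tailed t rule: reject Ho when t-cal is outside [-t_table, +t_table]."""
    return t_cal > t_table or t_cal < -t_table


# Hypothetical inputs, only to show both rules in use.
print(reject_by_p_value(0.000))      # True  -> reject Ho
print(reject_by_p_value(0.20))       # False -> fail to reject Ho
print(reject_by_t_value(-2.9, 2.0))  # True  -> reject Ho (far enough below zero)
print(reject_by_t_value(1.1, 2.0))   # False -> fail to reject Ho
```

Note that a negative t-cal only leads to rejection when it is below -t-table; being negative is not enough by itself.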

### Chapter 8

- Path Analysis - a special form of regression analysis. The way you specify the model differs from traditional regression analysis and allows you to estimate/identify causality ("a" causes "b" to occur). With traditional regression analysis, all we can say is that there is some sort of correlation; we cannot really say that "a" causes "b" to occur.

- Types of Path Analysis:

- 1. Path Analysis - this is our focus. There are two types:

- a. Direct Causal Effect Example:

- Male Life Expectancy (MLE) = f(location, economy, # of doctors)
- Z4 = P41Z1 + P42Z2 + P43Z3 + e4
- Z4 = MLE, Z1 = Location, Z2 = Economy, Z3 = # of doctors, P41-P43 = Coefficients, e4 = error.

- Path Diagram - see page 193 and use the information in this example. Basically this is just putting the equation(s) in diagram form.

- b. Indirect Causal Effect Example:

- From the Direct Causal Effect Example, we want to add the Indirect Causal Effects. So it changes to:
- Z1 = Location, Z2 = Economy, Z3 = Death rate, Z4 = # of Doctors, Z5 = MLE, which changes the equations to:

- Z3 = P31Z1 + P32Z2 + e3
- This is stating that Location and Economy directly affect Death Rate and indirectly affect MLE.

- Z4 = P42Z2 + e4
- This is stating that the Economy directly affects the # of Doctors and indirectly affects MLE.

- Z5 = P53Z3 + P54Z4 + e5
- This is stating that the Death Rate and # of Doctors will have a direct impact on MLE.

- Path Diagram - see page 194 and use the information in this example. Basically this is just putting the equation in diagram form.

- 2. Structural Equation Modeling (SEM) - we will not be talking about this one very much. It will provide a better fit and it can be used to measure latent variables. A disadvantage is that the standard SPSS package cannot be used to generate the computer result (a specialized package must be used).

- Latent Variables - descriptive variables that cannot be directly measured. Example: Intelligence, beauty, taste, etc.

- Assumptions for Path Analysis (pg 196-197):

- 1. The model must accurately reflect the causal sequence. This means that we must know what we are doing; for example, in the previous example, we know that the MLE is affected by the death rate and the # of doctors.
- 2. The structural equation for each endogenous variable includes all variables that are direct causes of that particular endogenous variable. This means that we use all variables that affect the variable that we are looking at.
- 3. There is a one-way causal flow in the model. This means, from our example, that MLE is caused by the # of Doctors, but the # of Doctors is not caused by MLE.
- 4. The relationships among variables are assumed to be linear, additive, and causal in nature.
- 5. All exogenous variables are measured without error. This means that we collected all of the data in the correct way. Examples of things that go wrong: incorrect sample size, outliers, etc.

- Limitations for Path Analysis (pg 197)

- Misspecification - a model is not consistent with the empirical data. This is measured in degrees and is subjective and must be evaluated by the researcher.

- Path Tracing and Legitimate Path (198)

- Path Tracing - the process of determining the reproduced correlation between two variables by identifying all legitimate paths between those variables in the model. The process results in a coefficient for each path, which is equal to the product of all coefficients along the path. A key point is that one may only use legitimate paths, which are those paths that do not violate any of the following 3 rules:

- 1. No path may pass through the same variable more than once.
- 2. No path may go backward on an arrow after going forward on another arrow.
- 3. No path may include more than one double-headed curved arrow.

- Labels (pg 200):
- D = "Direct". A causal path consisting of only one link.
- I = "Indirect". Consisting of two or more links.
- S = "Spurious Effects". Any path components resulting from paths that have reversed causal direction at some point. This indicates that the relationship is caused by a common third factor and may or may not include a double-headed curved arrow. Any path between two endogenous variables which includes a curved arrow will always represent a spurious effect.
- U = "Unanalyzed Portion". In any model that contains 3 or more exogenous variables, the associated unexplained correlations among them will result in a degree of undeterminability with respect to the resolution of the direct and indirect effects of exogenous variables on endogenous variables. This may represent some degree of causal effect that has not been included in the model.
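
The "D" and "I" labels can be illustrated with the indirect-effect model from earlier (Z3 = P31Z1 + P32Z2, Z4 = P42Z2, Z5 = P53Z3 + P54Z4). The sketch below is a minimal Python illustration, not the book's procedure: the coefficient values are invented, the function names are hypothetical, and it only follows arrows forward, so it covers direct and indirect effects but not the spurious (S) or unanalyzed (U) components.

```python
# Hypothetical path coefficients, keyed by (cause, effect). Values are invented.
paths = {
    ("Z1", "Z3"): -0.40,  # P31: Location -> Death rate
    ("Z2", "Z3"): -0.30,  # P32: Economy -> Death rate
    ("Z2", "Z4"): 0.50,   # P42: Economy -> # of Doctors
    ("Z3", "Z5"): -0.60,  # P53: Death rate -> MLE
    ("Z4", "Z5"): 0.20,   # P54: # of Doctors -> MLE
}


def path_products(paths, source, target, product=1.0, links=0):
    """Yield (number of links, product of coefficients) for each forward path."""
    for (cause, effect), coef in paths.items():
        if cause != source:
            continue
        if effect == target:
            yield links + 1, product * coef
        else:
            # Keep following arrows forward; the model is one-way, so this ends.
            yield from path_products(paths, effect, target, product * coef, links + 1)


def causal_effects(paths, source, target):
    """Return (direct, indirect): one-link paths versus paths of two or more links."""
    found = list(path_products(paths, source, target))
    direct = sum(p for n, p in found if n == 1)
    indirect = sum(p for n, p in found if n >= 2)
    return direct, indirect


for var in ("Z1", "Z2", "Z3", "Z4"):
    d, i = causal_effects(paths, var, "Z5")
    print(var, "-> Z5: direct", round(d, 3), "indirect", round(i, 3), "total", round(d + i, 3))
```

With these invented numbers, Economy (Z2) has no direct effect on MLE but reaches it through two indirect paths, one via Death rate and one via # of Doctors, and the two products are added together.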

- Writing up the Results (pg 212-213)

The paragraph at the bottom of page 212, ending on page 213, is a written version of the steps for writing up the results.

- 1. Present Initial Model - variables and flow. Summarize initial model in path diagram.
- 2. Describe any data elimination and/or transformation.
- 3. Discuss significance of path coefficients. Present path coefficients in path diagram.
- 4. Describe how reproduced correlations were not consistent with empirical correlations. Create table that compares empirical correlations to reproduced correlations for the initial model.
- 5. Describe process of revising model.
- 6. Present revised model: variables, flow, and significant path coefficients. Summarize revised model in path diagram (including path coefficients).
- 7. Describe how reproduced correlations were consistent with empirical correlations. Create table that compares empirical correlations to reproduced correlations for the revised model.
- 8. Discuss causal effects for each endogenous variable: total causal effects and R^2. Create table of causal effects (direct, indirect, and total) for each endogenous variable.

- Decomposition of the Model (pg 201) - look at Table 1, we'll need to know this.

After you run the model, you'll have your reproduced data. You will then compare the path of the reproduced and the initial path (Initial Path - Reproduced Path). If the difference is greater than .05 then the model will need to be revised.
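
The .05 comparison described above can be sketched as a small check. This is an illustration only: the correlation values are invented, the function name is hypothetical, and it assumes the comparison is made on the absolute difference between the empirical (initial) and reproduced values for each pair of variables.

```python
def flag_discrepancies(empirical, reproduced, tolerance=0.05):
    """Return the pairs whose |empirical - reproduced| is greater than tolerance."""
    flagged = {}
    for pair, r_empirical in empirical.items():
        diff = r_empirical - reproduced[pair]
        if abs(diff) > tolerance:
            flagged[pair] = round(diff, 3)
    return flagged


# Hypothetical correlations for three pairs of variables.
empirical = {("Z1", "Z5"): 0.30, ("Z2", "Z5"): 0.45, ("Z3", "Z5"): -0.62}
reproduced = {("Z1", "Z5"): 0.24, ("Z2", "Z5"): 0.44, ("Z3", "Z5"): -0.60}

flagged = flag_discrepancies(empirical, reproduced)
print(flagged)                   # pairs that differ by more than .05
print("revise model:", bool(flagged))
```

Here only the Z1-Z5 pair differs by more than .05, which under the rule above would be enough to say the model needs revising.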

### Chapter 9

- Factor Analysis - Statistical technique that can be used to determine the underlying structure.

- Example - College Enrollment

- Factors that affect college enrollment: unemployment (higher unemployment, higher enrollment), income (income higher, enrollment higher), type $ (higher cost, lower enrollment), class schedule, gender, race/ethnicity, family, major, technology, etc.

- All of these things affect college enrollment in one way or another. However, this is too much to look at, so we need to narrow it down to the underlying structure. We narrow it down to key factors:

- Factor 1 would be Financial Inputs - So things like: Unemployment, Income, Type $, Marriage, High School Graduation, Personal Development, Housing.

- Factor 2 would be Exogenous Inputs - So things like: Age, Gender, Race/Ethnicity, Technology, Major, Class Schedule.

- By grouping the factors, it will help to identify the underlying structure and make it easier to analyze.

- Concepts

- Factor Loading (pg 233) - the main set of results obtained from factor analysis. A factor loading is interpreted as the Pearson correlation coefficient of an original variable with a factor. Ranges from -1 to +1. -1 represents a perfect negative association with factor while +1 represents a perfect positive association with factor and 0 means that there is no association. Variables typically will have loading on all factors but will usually have high loadings on only one factor.

- Dr. Yu's Additional Information - 0 = no association. Each Factor will have a relationship with each variable, but some variables will have a stronger relationship with the factor than others.

- Communalities (pg 233) - is an index that provides the results of a factor analysis in a list for each variable. Communalities represent the proportion of variability for a given variable that is explained by the factors and allows the researcher to examine how individual variables reflect the sources of variability. Communalities may also be interpreted as the squared multiple correlation of the variable as predicted from the combination of factors, or as the sum of squared loadings across all factors for that variable.

- Dr. Yu's Additional Information - a percentage measure of the relationship between the factor and the variable.

- Using our example above for Factor 2, we would assign percentages to variables within a factor from the results in SPSS. For example, we may find that Age affects Factor 2 60%, Gender 20%, Major 10%, etc. This does not have to add up to 100% if there are other factors; however, it cannot add up to more than 100%.

- Extraction & Methods of Extraction (pg 233 & 234)

- Extraction - the process by which the factors are determined from a larger set of variables. In other words, this is the process we go through to group together the variables into factors to identify the underlying structure.
**RULE: 70% OF THE TOTAL VARIANCE MUST BE MAINTAINED**. This means that the factors we keep must together explain at least 70% of the total variance in the results.

- There are two methods of extraction:

- 1. Factor Analysis - Statistical technique that can be used to determine the underlying structure from a large set of variables. The main purpose is to achieve data/variable reduction. This is used for a more generalized answer.

- 2. Principal Components Analysis - all sources of variability are analyzed for each observed variable. The main purpose is to extract as much variance as possible or to analyze certain behaviors. This is used for a more specific/in-depth answer.

- Eigenvalue or Eigen Factor (pg 234) - the amount of total variance explained by each factor, with the total amount of variability in the analysis equal to the number of original variables in the analysis.

- Or in English from our example: Factor 2 results includes: Age 60%, Race 15%, and Gender 15%. We then add that up and the Eigenvalue would be 90%.

- Scree Plot (pg 234) - a graph of the magnitude of each eigenvalue (vertical) plotted against their ordinal numbers or number of components (horizontal). Basically, we're looking for a fracture point where the curve becomes "flat". By looking at this graph, we can say that after a certain point the curve becomes fairly flat. This shows that each additional component has a smaller eigenvalue and adds less to the results. The components from where the graph levels out onward are the ones that would be dropped, as their information is not significant.
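
To tie together factor loadings, communalities, and eigenvalues, here is a minimal Python sketch. The loading matrix is invented for illustration (it is not SPSS output from class), and the sketch assumes an unrotated principal components solution with standardized variables, so that the total variance equals the number of variables, as stated in the eigenvalue definition above.

```python
# Hypothetical loadings: each variable maps to (loading on Factor 1, loading on Factor 2).
loadings = {
    "Income":       (0.90, 0.10),
    "Unemployment": (0.80, 0.20),
    "Age":          (0.10, 0.70),
    "Gender":       (0.20, 0.60),
}
n_factors = 2

# Communality of a variable = sum of its squared loadings across all factors.
communalities = {var: sum(l ** 2 for l in row) for var, row in loadings.items()}

# Eigenvalue of a factor = sum of the squared loadings of all variables on that factor.
eigenvalues = [sum(row[j] ** 2 for row in loadings.values()) for j in range(n_factors)]

# Total variance = number of variables, so % of variance = eigenvalue / number of variables.
pct_variance = [100 * ev / len(loadings) for ev in eigenvalues]

for var, h2 in communalities.items():
    print(f"{var}: communality = {h2:.2f}")
for j, (ev, pct) in enumerate(zip(eigenvalues, pct_variance), start=1):
    print(f"Factor {j}: eigenvalue = {ev:.2f}, % of variance = {pct:.1f}")
print(f"Total % of variance kept = {sum(pct_variance):.1f}")
```

With these invented loadings the two factors together keep 60% of the variance, which would fall short of the 70% rule, so more factors (or different variables) would be needed.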

## Handouts

### Simple Regression Handout

#### SPSS Steps

- Input data.

- Click "Analyze"

- Click "Regression"

- Click "Linear"
- Select DV(s).
- Select IV(s).
- Choose the desired method (choose the default of Enter unless requested otherwise).

- Click "Statistics"
- Select everything except Covariance Matrix.
- For Casewise go with the default of Outliers outside 3 standard deviations.
- Click "Continue"

- Click "Plots"
- Choose ZRESID for Y-axis
- Choose ZPRED for X-axis
- Select Normal Probability Plot
- Select Produce All Partial Plots
- Click "Continue"

- Click "Save"
- Select Predicted Values, Unstandardized
- Select Residuals, Unstandardized
- Select Residuals, Standardized
- Click "Continue"

- Click "OK"

#### Reading the SPSS Results

1. Look at the "Coefficient" results.

- In the following example:
- a (Constant) = -231.777 (located under Unstandardized Coefficients, B).
- b (GDP) = 0.719 (located under Unstandardized Coefficients, B).
- Fill these into the Estimated Regression Equation (PCE = a + bGDP): PCE = -231.777 + 0.719GDP.
- Nothing needs to be solved for: the constant in the output is a and the GDP coefficient is b.

2. Look at the "t" column of the "Coefficient" box for the constant. This is where to find the calculated t value (t-cal) for the t-test.

- In the following example:
- Calculated t = -2.451.
- Since we don't know the t-table, then we go to the p-value approach.

3. Look at the "sig" column of the "Coefficient" box for the IV.

- In the following example:
- sig = .000
- p-value (.000) < .05, due to this we Reject Ho. By rejecting Ho, we are stating that there is a correlation between the IV and DV.

4. Look at the "ANOVA" box, look at the "sig" column for "Regression". This is where to find the sig for the F-Test.

- In the following example:
- sig = .000
- p-value (.000) < .05, due to this we reject Ho, again stating that there is a correlation between the IV and DV.

5. Look at the "Model Summary" box, use "Adjusted R Square"

- In the following example:
- Adjusted R^2 = .990. This means that GDP can be used to explain about 99% of the PCE behavior.
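
For reference, "Adjusted R Square" is R^2 with a penalty for the number of IVs. The formula below is the standard definition rather than something taken from the handout; n (number of observations) and k (number of IVs) are the symbols assumed here.

$$
R^2_{adj} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}
$$

For the simple regression handout, n = 12 years and k = 1 IV, so the factor is (12 - 1)/(12 - 1 - 1) = 1.1. The unexplained share (1 - R^2) is inflated by 10%, which is why the adjusted value is always a little lower than the plain R^2.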

#### Simple Regression Example

| Year | GDP | PCE | PRE_1 |
|---|---|---|---|
| 1980 | 3776.3 | 2447.1 | 2485.01108 |
| 1981 | 3843.1 | 2476.9 | 2533.06908 |
| 1982 | 3760.3 | 2503.7 | 2473.50018 |
| 1983 | 3906.6 | 2619.4 | 2578.75297 |
| 1984 | 4148.5 | 2746.1 | 2752.78339 |
| 1985 | 4279.8 | 2865.9 | 2847.24471 |
| 1986 | 4404.5 | 2969.1 | 2936.95779 |
| 1987 | 4539.9 | 3052.2 | 3034.36878 |
| 1988 | 4718.6 | 3162.4 | 3162.93114 |
| 1989 | 4838.0 | 3223.3 | 3248.83123 |
| 1990 | 4877.5 | 3260.4 | 3277.24876 |
| 1991 | 4821.0 | 3240.8 | 3236.60090 |
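
For anyone who wants to check the SPSS output by hand, the following Python sketch fits PCE = a + bGDP to the GDP and PCE columns above using the usual least-squares formulas. It is a rough cross-check, not a replacement for SPSS; the variable names are made up, and the results should land close to, but not necessarily digit-for-digit on, the values reported in "Reading the SPSS Results".

```python
# (GDP, PCE) pairs for 1980-1991, copied from the handout data above.
data = [
    (3776.3, 2447.1), (3843.1, 2476.9), (3760.3, 2503.7), (3906.6, 2619.4),
    (4148.5, 2746.1), (4279.8, 2865.9), (4404.5, 2969.1), (4539.9, 3052.2),
    (4718.6, 3162.4), (4838.0, 3223.3), (4877.5, 3260.4), (4821.0, 3240.8),
]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Least-squares slope (b) and intercept (a) for PCE = a + b * GDP.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
sxx = sum((x - mean_x) ** 2 for x, _ in data)
b = sxy / sxx
a = mean_y - b * mean_x

# Goodness of fit: SST = total variation, SSE = variation left unexplained.
sst = sum((y - mean_y) ** 2 for _, y in data)
sse = sum((y - (a + b * x)) ** 2 for x, y in data)
r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 1 - 1)  # one IV, so k = 1

print(f"a = {a:.3f}, b = {b:.3f}")
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

Plugging a year's GDP into a + b * GDP should also come out close to that year's PRE_1 value, since PRE_1 is the predicted value saved by SPSS.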

#### Simple Regression Steps

1. Know the Estimated Regression Equation. In this example the Estimated Regression Equation is PCE = a + bGDP.

- In this example, based on the results, we would write: Based on our results, we have determined that there is a positive relationship between GDP and PCE. For each dollar that the GDP goes up, the PCE will go up by approximately $0.719.
- We need to be able to give the relationship. Positive Relationship - they move in the same direction. Negative Relationship - they move in opposite directions. In this example, there is a positive relationship.
- We need to be able to explain by how much. This is the "b" of the equation.

2. Setup your hypotheses:

- In this example Ho: b = 0 and Ha: b does not = 0.

3. Identify alpha. In this case alpha = 0.05.

4. Determine if t-cal > t-table or t-cal < -t-table.

- If either one is true (t-cal falls outside the range from -t-table to +t-table), then you Reject Ho. By rejecting Ho you are stating that there is a relationship between the IV and DV.

5. Then look at the p-value. If p-value < alpha, then Reject Ho.

### Multiple Regression Handout

#### SPSS Steps

- Input data.

- Click "Analyze"

- Click "Regression"

- Click "Linear"
- Select DV(s).
- Select IV(s).
- Choose the desired method (choose the default of Enter unless requested otherwise).

- Click "Statistics"
- Select everything except Covariance Matrix.
- For Casewise go with the default of Outliers outside 3 standard deviations.
- Click "Continue"

- Click "Plots"
- Choose ZRESID for Y-axis
- Choose ZPRED for X-axis
- Select Normal Probability Plot
- Select Produce All Partial Plots
- Click "Continue"

- Click "Save"
- Select Predicted Values, Unstandardized
- Select Residuals, Unstandardized
- Select Residuals, Standardized
- Click "Continue"

- Click "OK"

#### Reading the SPSS Results

Multiple Regression Example 1 Results

Multiple Regression Example 2 Results

##### Example 1

- 1. Estimated Regression Equation - Look at the "Coefficient" results.

- In this example:

- a (Constant) = 3.453 (located under Unstandardized Coefficients, B).
- b1 (TP) = 0.496 (located under Unstandardized Coefficients, B).
- b2 (PCI) = 0.009 (located under Unstandardized Coefficients, B).
- Fill these into the Estimated Regression Equation (Sales = a + b1TP + b2PCI): Sales = 3.453 + 0.496TP + 0.009PCI.

- 2. T-Test - Look at the "t" column of the "Coefficient" box for each IV. This is where to find the calculated t value (t-cal) for the t-test.

- In the following example:

- Calculated t-TP = 81.924.
- Calculated t-PCI = 9.502.

- Since we don't know the t-table, then we go to the p-value approach.

- 3. P-Value Approach - Look at the "sig" column of the "Coefficient" box for each IV.

- In the following example:

- sig-TP = .000
- p-value (.000) < .05, due to this we Reject Ho. By rejecting Ho, we are stating that there is a relationship between TP and Sales.

- sig-PCI = .000
- p-value (.000) < .05, due to this we Reject Ho. By rejecting Ho, we are stating that there is a relationship between PCI and Sales.

- By rejecting Ho for both TP and PCI, we are stating that neither coefficient (b1 for TP or b2 for PCI) is equal to 0, which means that they are both good variables for Sales.

- 4. F-Test - Look at the "ANOVA" box, look at the "sig" column for "Regression". This is where to find the sig for the F-Test.

- In the following example:

- F-Calc = 5679.466
- sig = .000
- p-value (.000) < .05, due to this we reject Ho, again stating that there is a relationship between the IVs and the DV. This means that at least one of the coefficients (for TP or PCI) is not equal to 0.

- 5. R^2 - Look at the "Model Summary" box, use "Adjusted R Square"

- In the following example:

- Adjusted R^2 = .999. This means that the model can be used to explain 99.9% of the Sales behavior. R^2 is a descriptive measure, so no confidence level attaches to it.

- 6. Then look at the coefficients in the Estimated Regression Equation.

- In the following example:

- TP - for every thousand people increase of TP, Sales will increase by .496 gross.
- PCI - for every dollar increase of PCI, Sales will increase by .009 gross.

- 7. Forecasting - If we predict that in 5 years the mean for TP will be 300 and for PCI will be 5,000, then we fill these values into the estimated regression equation.

- Sales = 3.453 + 0.496(300) + 0.009(5000)
- Sales = 197.253
- So we think in 5 years sales will be 197.253 gross. The error term is left out of the forecast because its expected value is 0.

##### Example 2

- 1. Estimated Regression Equation - Look at the "Coefficient" results.

- In this example:

- a (Constant) = -113.405 (located under Unstandardized Coefficients, B).
- b1 (AAI) = 4.843 (located under Unstandardized Coefficients, B).
- b2 (RAS) = 3.715 (located under Unstandardized Coefficients, B).
- Fill these into the Estimated Regression Equation (ALIC = a + b1AAI + b2RAS): ALIC = -113.405 + 4.843AAI + 3.715RAS.

- 2. T-Test - Look at the "t" column of the "Coefficient" box for each IV. This is where to find the calculated t value (t-cal) for the t-test.

- In the following example:

- Calculated t-AAI = 32.472.
- Calculated t-RAS = 5.166.

- Since we don't know the t-table, then we go to the p-value approach.

- 3. P-Value Approach - Look at the "sig" column of the "Coefficient" box for each IV.

- In the following example:

- sig-AAI = .000
- sig-RAS = .000
- Both p-values (.000) < .05, due to this we Reject Ho for each IV.

- By rejecting Ho for both AAI and RAS, we are stating that neither coefficient (b1 for AAI or b2 for RAS) is equal to 0, which means that they are both good variables for ALIC.

- 4. F-Test - Look at the "ANOVA" box, look at the "sig" column for "Regression". This is where to find the sig for the F-Test.

- In the following example:

- F-Calc = 623.641
- sig = .000
- p-value (.000) < .05, due to this we reject Ho, again stating that there is a relationship between the IVs and the DV. This means that at least one of the coefficients (for AAI or RAS) is not equal to 0.

- 5. R^2 - Look at the "Model Summary" box, use "Adjusted R Square"

- In the following example:

- Adjusted R^2 = .987. This means that the model can be used to explain 98.7% of the ALIC behavior. R^2 is a descriptive measure, so no confidence level attaches to it.

- 6. Then look at the coefficients in the Estimated Regression Equation.

- In the following example:

- AAI - for every thousand dollars increase of AAI, ALIC will increase by 4.843 thousand dollars.
- RAS - for every unit increase of RAS, ALIC will increase by 3.715 thousand dollars.

- 7. Forecasting - If we predict the mean in 5 years for AAI will be 50 and for RAS will be 10, then we would fill in the estimated regression equation.

- ALIC = -113.405 + 4.843(50) + 3.715(10)
- ALIC = 165.895 thousand dollars
- So we think in 5 years ALIC will be 165.895 thousand dollars.
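
The forecasting step in both examples is just the estimated regression equation with predicted IV values plugged in. A small Python sketch of that arithmetic follows; the function name `forecast` is made up for illustration, and the coefficients and predicted values are the ones quoted in Example 1 and Example 2.

```python
def forecast(intercept, coefficients, values):
    """Point forecast: intercept plus the sum of (coefficient * predicted IV value)."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))


# Example 1: Sales = 3.453 + 0.496*TP + 0.009*PCI with TP = 300, PCI = 5000.
sales = forecast(3.453, [0.496, 0.009], [300, 5000])

# Example 2: ALIC = -113.405 + 4.843*AAI + 3.715*RAS with AAI = 50, RAS = 10.
alic = forecast(-113.405, [4.843, 3.715], [50, 10])

print(round(sales, 3))  # 197.253 (gross)
print(round(alic, 3))   # 165.895 (thousand dollars)
```

The same function works for any number of IVs, as long as the coefficients and the predicted values are listed in the same order.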

#### Multiple Regression Example

##### Example 1

The Zarthan Company sells a special skin cream exclusively through drugstores. It operates in 15 marketing districts and is interested in predicting district sales from data on target population and per capita income. Sales are to be treated as the dependent variable Y, and target population and per capita income as independent variables X1 and X2 respectively, in an exploration of the feasibility of predicting district sales from target population and per capita income. The first-order model Y = a + b1X1 + b2X2 with normal error terms is expected to be appropriate.

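| District | Sales (gross) | Target Population (thousands of persons) | Per Capita Income (dollars) |
|---|---|---|---|
| 1 | 162 | 274 | 2450 |
| 2 | 120 | 180 | 3254 |
| 3 | 223 | 375 | 3802 |
| 4 | 131 | 205 | 2838 |
| 5 | 67 | 86 | 2347 |
| 6 | 169 | 265 | 3782 |
| 7 | 81 | 98 | 3008 |
| 8 | 192 | 330 | 2450 |
| 9 | 116 | 195 | 2137 |
| 10 | 55 | 53 | 2560 |
| 11 | 252 | 430 | 4020 |
| 12 | 232 | 372 | 4427 |
| 13 | 144 | 236 | 2660 |
| 14 | 103 | 157 | 2088 |
| 15 | 212 | 370 | 2605 |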
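
As a cross-check on the SPSS coefficients for Example 1, the sketch below fits Sales = a + b1TP + b2PCI to the 15 districts above by solving the least-squares normal equations directly. It is only an illustration in plain Python: the function name `fit_two_ivs` is made up, and the approach shown (working with deviations from the means and a 2x2 solve) is one of several equivalent ways to get the same estimates.

```python
# (Sales, TP, PCI) for the 15 districts, copied from the table above.
rows = [
    (162, 274, 2450), (120, 180, 3254), (223, 375, 3802), (131, 205, 2838),
    (67, 86, 2347), (169, 265, 3782), (81, 98, 3008), (192, 330, 2450),
    (116, 195, 2137), (55, 53, 2560), (252, 430, 4020), (232, 372, 4427),
    (144, 236, 2660), (103, 157, 2088), (212, 370, 2605),
]


def fit_two_ivs(rows):
    """Least-squares fit of y = a + b1*x1 + b2*x2 from (y, x1, x2) rows."""
    n = len(rows)
    mean_y = sum(r[0] for r in rows) / n
    mean_1 = sum(r[1] for r in rows) / n
    mean_2 = sum(r[2] for r in rows) / n
    # Sums of squares and cross-products of deviations from the means.
    s11 = sum((r[1] - mean_1) ** 2 for r in rows)
    s22 = sum((r[2] - mean_2) ** 2 for r in rows)
    s12 = sum((r[1] - mean_1) * (r[2] - mean_2) for r in rows)
    s1y = sum((r[1] - mean_1) * (r[0] - mean_y) for r in rows)
    s2y = sum((r[2] - mean_2) * (r[0] - mean_y) for r in rows)
    # Solve the 2x2 normal equations for b1 and b2, then back out a.
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    a = mean_y - b1 * mean_1 - b2 * mean_2
    return a, b1, b2


a, b1, b2 = fit_two_ivs(rows)
print(f"Sales = {a:.3f} + {b1:.3f}*TP + {b2:.4f}*PCI")
```

The estimates should come out close to the 3.453, 0.496, and 0.009 read from the "Coefficient" box in Example 1.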

##### Example 2

The following data is a sample of 18 physicians in the 35 to 39 age group holding policies with a certain life insurance company. For each physician it lists average annual income during the past 5 years (X1), risk aversion score (X2), and amount of life insurance carried (Y). Risk aversion was measured by a standard questionnaire administered to each physician in the sample; the higher the score, the greater the degree of risk aversion.

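| Physician | AAI (thousand dollars) | RAS | ALIC (thousand dollars) |
|---|---|---|---|
| 1 | 47.35 | 7 | 140 |
| 2 | 29.26 | 5 | 45 |
| 3 | 52.14 | 10 | 180 |
| 4 | 32.15 | 6 | 60 |
| 5 | 40.86 | 4 | 90 |
| 6 | 19.18 | 5 | 10 |
| 7 | 27.23 | 4 | 35 |
| 8 | 25.60 | 6 | 35 |
| 9 | 54.14 | 9 | 190 |
| 10 | 26.72 | 5 | 35 |
| 11 | 38.84 | 2 | 75 |
| 12 | 32.99 | 7 | 70 |
| 13 | 32.95 | 4 | 55 |
| 14 | 21.69 | 3 | 10 |
| 15 | 27.90 | 5 | 40 |
| 16 | 56.70 | 1 | 175 |
| 17 | 37.69 | 8 | 95 |
| 18 | 39.94 | 6 | 95 |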

#### Multiple Regression Steps

1. Know the Estimated Regression Equation. In Example 1 the Estimated Regression Equation is Sales = a + b1TP + b2PCI.

- In this example, based on the results, we would write: Based on our results, we have determined that there is a positive relationship between TP and Sales and between PCI and Sales. For every thousand people that TP goes up, Sales will go up by approximately 0.496 gross, assuming PCI stays the same. For every dollar that PCI goes up, Sales will go up by approximately 0.009 gross, assuming TP stays the same.
- We need to be able to give the relationship. Positive Relationship - they move in the same direction. Negative Relationship - they move in opposite directions. In this example, both relationships are positive.
- We need to be able to explain by how much. This is the "b" of each IV in the equation.

2. Setup your hypotheses:

- In this example, for each IV, Ho: b = 0 and Ha: b does not = 0.

3. Identify alpha. In this case alpha = 0.05.

4. Determine if t-cal > t-table or t-cal < -t-table.

- If either one is true (t-cal falls outside the range from -t-table to +t-table), then you Reject Ho. By rejecting Ho you are stating that there is a relationship between that IV and the DV.

5. Then look at the p-value. If p-value < alpha, then Reject Ho.

### Factor Analysis

#### SPSS Steps

- Input data.

- Click "Analyze".

- Click "Dimension Reduction".

- Click "Factor".

- Put all variables into the Independent Variable Box.

- Click "Descriptives":
- Under Statistics select: "Univariate Descriptives" and "Initial Solution".
- Under Correlation Matrix select: "Coefficients" and "Reproduced".
- Click "Continue"

- Click "Extraction"

- Stay with the default method of "Principal Components".
- Under Analyze stay with the default of "Correlations Matrix".
- Under Display stay with the default of "Unrotated Factor Solution".
- Under Display select: "Scree Plot".
- Under Extract select "Fixed Number of Factors".
- In the box for "Factors to Extract" type in the number of factors.

- Click "Continue".

- Click "Rotation".

- Choose "Varimax".
- Select "Rotated Solution".
- Click "Continue"

- Click "OK".

#### Reading the SPSS Results

- Unrotated Solution - What percentage of the total variance in the 7 standardized variables is explained by the first common factor and the second common factor?:

- Look at the box "Total Variance Explained".
- Look at the column "% of Variance".
- Add together the "% of Variance" for your factors, this has to be 70% or greater.
- This gives you the Unrotated Solution. The components are the factors, in our example we would only look at Component 1 (Factor 1) and Component 2 (Factor 2). If we wanted to reach 100% variability we would have to have 7 Factors/Components. This is where we have to add up the percentages of the factors that we are using. In this case, Component/Factor 1 is 72% and Component/Factor 2 is 13% so we have a total of 85% which meets the rule that it must account for at least 70% of variance so we can continue.

- In the Rotated Solution - What percentage of the total variance in the 7 standardized variables is explained by the first common factor and the second common factor?

- Look at the box "Total Variance Explained"
- Look at the "Rotation Sums of Squared Loadings" section.
- Look at the "% of Variance"
- Add up the "% of Variance" for your factors.
- This gives you the Rotated Solution.
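The "add up the % of Variance" step, for either the unrotated or the rotated solution, can be sketched as below. The function names are hypothetical. The 72% and 13% figures are the ones quoted in the notes for the unrotated solution, and the 70% cutoff is the rule of thumb stated there. The eigenvalue helper assumes standardized variables, so each of the 7 variables contributes a variance of 1.

```python
def cumulative_variance(pct_of_variance, n_factors):
    """Sum the "% of Variance" entries for the first n_factors components."""
    return sum(pct_of_variance[:n_factors])


def pct_from_eigenvalue(eigenvalue, n_variables):
    """Convert an eigenvalue to a "% of Variance" figure.

    Assumes standardized variables, so the total variance equals n_variables.
    """
    return 100.0 * eigenvalue / n_variables


# Percentages for Component 1 and Component 2 from the notes' example.
pct = [72.0, 13.0]
total = cumulative_variance(pct, 2)
print(total)          # 85.0
print(total >= 70.0)  # True, so the two-factor solution meets the 70% rule
```

Under the same assumption, a component that explains 72% of the variance in 7 standardized variables would have an eigenvalue of about 5.04, which is how the "Total" and "% of Variance" columns in SPSS relate to each other.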

- Communalities - What percentage of the variance in zNewsales is explained by the two common factors combined? What percentage of the variance in zNewsales is due to the factor(s) specific to Newsales?

- Look at the box "Communalities"
- Look at the "Extraction" column.
- This is the percent for the first question. 89.3% is the answer for this example.
- To answer the second question subtract the first answer from 100%. In this example: 100-89.3 = 10.7%.
- The "Extraction" value (the communality) tells you how much of each variable's variance is explained by all of the extracted Factors combined.
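The communality arithmetic can be sketched as follows. The 89.3% figure is the one from the notes. The loadings passed to `communality` are made-up numbers for illustration, and the function names are hypothetical. The sum-of-squared-loadings formula assumes uncorrelated factors, which is the case for principal components with a Varimax rotation.

```python
def communality(loadings):
    """Sum of squared loadings of one variable across the extracted factors."""
    return sum(loading ** 2 for loading in loadings)


def unique_variance_pct(communality_pct):
    """Percentage of a variable's variance NOT explained by the common factors."""
    return round(100.0 - communality_pct, 1)


# Figure from the notes: 89.3% of zNewsales is explained by the two factors.
print(unique_variance_pct(89.3))  # 10.7

# Hypothetical loadings on Factor 1 and Factor 2, for illustration only.
print(round(communality([0.5, 0.8]), 2))  # 0.89
```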

- Assuming there are two common factors and based on the rotated solution, do you have a suggestion as to what they might be?

- Look at the "Rotated Component Matrix".
- By looking at the loadings of the variables, we can see that on Factor 1, Abstract, Math, and Growth are the highest, so we could call Factor 1 "Right Brain". On Factor 2, Creativity, Mechanical, and New Sales are the highest, so we could call Factor 2 "Left Brain".

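Reading the "Rotated Component Matrix" amounts to asking, for each variable, which factor it loads on most strongly. A minimal sketch of that grouping step is below. The loading values are hypothetical and are not taken from the SPSS output; only the variable names come from the example. Naming the factors is still a judgment call made after the grouping.

```python
def assign_to_factors(loadings):
    """Group variables by the factor on which their absolute loading is largest.

    loadings maps a variable name to a tuple of loadings, one per factor.
    Returns a dict mapping factor number (starting at 1) to variable names.
    """
    groups = {}
    for variable, row in loadings.items():
        best = max(range(len(row)), key=lambda j: abs(row[j]))
        groups.setdefault(best + 1, []).append(variable)
    return groups


# Hypothetical rotated loadings (Factor 1, Factor 2), for illustration only.
hypothetical_loadings = {
    "Abstract": (0.90, 0.20),
    "Math": (0.85, 0.30),
    "Creativity": (0.25, 0.88),
    "Mechanical": (0.15, 0.80),
}
print(assign_to_factors(hypothetical_loadings))
# {1: ['Abstract', 'Math'], 2: ['Creativity', 'Mechanical']}
```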

#### Factor Analysis Example

A firm is attempting to evaluate the quality of its sales staff and is trying to find an examination or series of tests that may reveal the potential for good performance in sales. The firm has selected a random sample of 50 salespeople and has evaluated each on 3 measures of performance: growth of sales, profitability of sales and new-account sales. These measures have been converted to a scale, on which 100 indicates "average" performance. Each of the 50 individuals took each of 4 tests, which purported to measure creativity, mechanical reasoning, abstract reasoning, and mathematical ability respectively. The resulting data is contained in the file below named "Factor Analysis Data Set".

- Factor Analysis Data Set
- Factor Analysis Output

## Exam Reviews and Answers

### Fall 2010 - BUSA 3000

#### Exam 1 Answers

- 1. Orthogonality means that there is no overlap: the variables are perfectly unrelated to one another.

- 2. Standard Analysis means that overlap is counted, but not assigned to any particular variable.

- 3. Sequential Analysis - means that overlap is counted and assigned based on the priority of the variable.

- 4. Quantitative Variable - can be counted or measured using a continuous numbering system.
- Ex. Age

- 5. Categorical Variable - cannot be counted and typically has numbers assigned to its categories.
- Ex. Rural, Suburban, Urban

- 6. Dichotomous Variable - has only two possible categories.
- Ex. Gender (male or female).

- 7. Power of a statistical test is defined as the probability of rejecting Ho when Ho is actually false.

- 8. Power is determined as: Power = 1 - β, where β is the probability of a Type II error.

- 9. The four types of research are:
- 1. Relationship - correlation (Pearson/Spearman). Ex. If SAT scores go up, then do ACT scores go up?
- 2. Differences - t-test, ANOVA, ANCOVA, MANOVA. Ex. Male & female SAT scores.
- 3. Membership - discriminant analysis. Ex. Which risk taking behaviors among alcohol, drugs, and sex distinguish between suicide attempters and non-attempters?
- 4. Structure - factor analysis. Ex. He did not say.

- 10. What are the four main purposes for screening data prior to conducting a multivariate analysis?
- 1. Check accuracy.
- 2. Identify outliers.
- 3. Find missing data.
- 4. Screen for the assumptions:
- 1. Linearity
- 2. Normality
- 3. Homoscedasticity

- 11. Match each research question to the appropriate test:

| Question | Test # |
| --- | --- |
| To what degree do SAT scores predict freshman college GPA? | 3 |
| What are the causal effects (direct and indirect) among number of school absences due to illness, reading ability, semester GPA, and total score on Iowa Test of Basic Skills among 8th grade students? | 7 |
| Do males and females have significantly different SAT scores? | 1 |
| What is the relationship between SAT scores and freshman college GPA? | 9 |
| Which risk taking behaviors distinguish non-suicide attempters from suicide attempters? | 6 |
| Do preschoolers of low, middle, and high socioeconomic status have different literacy test scores after adjusting for family type? | 2 |
| Which combination of risk taking behaviors best predicts the amount of suicide behavior among adolescents? | 10 |
| Do preschoolers of low, middle, and high socioeconomic status have different literacy test scores? | 8 |
| To what extent do certain risk taking behaviors increase the odds of suicide attempt occurring? | 5 |
| What underlying structure exists among the following variables: amount of alcohol use, drug use, sexual activity, school misconduct, cumulative GPA, reading ability and family income? | 4 |

| # | Test |
| --- | --- |
| 1 | t-test |
| 2 | ANCOVA |
| 3 | Simple Regression |
| 4 | Factor Analysis |
| 5 | Logistic Regression |
| 6 | Discriminant Analysis |
| 7 | Path Analysis |
| 8 | ANOVA |
| 9 | Correlation |
| 10 | Multiple Regression |

- 12. Ho: u1 = u2 and Ha: u1 ≠ u2

- 13. If p < alpha then reject the Ho hypothesis.

- 14. Failed to reject the Ho hypothesis because p > alpha.

- 15. Write down the estimated equation: y = a+bx

- 1. sales - dependent variable "y"

- 2. year - independent variable "x"

- 3. y = 115.982 + 1.382(x)

- 16. Sales 2012 = 133.8 and Sales 2015 = 138

- 17. Is the relationship between year and sales linear? Yes

- 18. Pearson Coefficient = .585 and Spearman Coefficient = .543

- 19. Based on the SPSS result, the relationship between GPA and SAT is statistically: Insignificant.

- 20. because: p value < alpha

- 21. Are the two variables positively related or negatively related? Positively
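For questions 15 and 16, the estimated equation can be evaluated with a short sketch like the one below. The intercept and slope come from answer 15. How the year is coded as x is not stated in these notes, so no particular x is assumed for 2012 or 2015. The function names are hypothetical. What can be checked without knowing the coding is the change over time: the slope says predicted sales rise by 1.382 per year.

```python
INTERCEPT = 115.982  # "a" from answer 15
SLOPE = 1.382        # "b" from answer 15


def predict_sales(x):
    """Predicted sales for a coded year value x, using y = a + b*x."""
    return INTERCEPT + SLOPE * x


def predicted_change(years):
    """Predicted change in sales over a span of years (does not depend on how x is coded)."""
    return SLOPE * years


# Change predicted between 2012 and 2015 (3 years apart).
print(round(predicted_change(3), 3))  # 4.146
```

A predicted increase of about 4.1 over three years is in line with the gap between the two answers listed for question 16 (138 - 133.8 = 4.2), allowing for rounding.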

#### Exam 2 Review

- In general, the students agreed that this test was easier than the first exam.

The exam will have two sections: textbook questions and computer (SPSS) questions. You can bring in a sheet of paper with just the steps. At the end of the test, we will have to sign the paper and he will audit it.

- Textbook Questions

- Chapter 4
- One-Way ANOVA (Cause and Manufacturers Handout): need to know this.
- Two-Way ANOVA – Factor A and Factor B: pages 70-71.
- Different Types of Interaction (charts): page 73.
- Two-Way Analysis in terms of the variances: 72-76. Figure 4.5 (know this).

- Chapter 5
- Concomitant, Partialed Out, Covariate (notes and page 93).
- If given the hypotheses be able to give the research question. If given the research question, then be able to give the hypotheses.
- 5-Steps to Research

- Chapter 6
- MANOVA – need to be able to write the summary report: know the steps and the factors that go into it.
- Writing Up the Results: pg. 127. THIS WILL BE ON THE TEST!

- Computer SPSS Based
- Two-Way Factorial Analysis (Ch 4) – we will be given a (short) data set and need to know how to set up the data sheet. Then we need to be able to find the results (pg 88-89). We will be asked to provide the variables and the sources of variation (pg 89, table 2 – be able to completely understand the table).
- Ch 5 – Develop research questions, main effects, and the interaction effects.
- P-Value – be able to identify for interaction and main effects.

- Ch 6 – same thing. But he may ask a question about what we covered in Ch 3. May give a computer result for homoscedasticity. Will be given a computer result, M-Test.

#### Final Exam

- Chapter 7:
- No questions from Chapter 7.

- Chapter 8:
- Path Analysis - will get diagrams and will be asked questions.
- Definition of Concepts
- Two tables with numbers and we will be asked if we will need to revise the model or not revise the model.

- Chapter 9:
- Define - Factor Analysis, Factor Loading, Communalities, Extraction, Eigenvalues, and Scree Plot.

- SPSS
- Chapter 7: Multiple Regression - will be given variables and will have to perform a regression analysis. We will be keying in the data. Know how to define the variables and key in the data.
- Chapter 8: No SPSS.
- Chapter 9: Factor Analysis - we will download the data set from the website and then perform the analysis and we will be asked questions.