# How Do I Perform Logistic Regression in Excel Using SigmaXL?

## Binary Logistic Regression

Watch the video below to see our Co-Founder John Noguera present a Webinar on Logistic RegressionBinary Logistic Regression is used to analyze the relationship between one binary dependent variable (Y) and multiple independent numeric and/or discrete variables (X's). It is used to discover the relationship between the variables and create an empirical equation of the form:

Ln(Py/(1-Py)) = b0 + b1*X1 + b2*X2 + ... + bn*Xn

This equation can be used to predict an event probability Y value for a given set of input X values. SigmaXL uses the method of maximum likelihood to solve for the model coefficients and constant term. Statistical tests of hypothesis and odds-ratios are provided for the model coefficients. The odds-ratios identify change in likelihood of the event for one unit change in X.

An example application from medical research would be Y=Disease (Yes/No) and X’s = Age, Smoker (Yes/No), Number Years of Smoking and Weight. The model coefficient p-values would indicate which X’s are significant and the odds-ratios would provide the relative change in risk for each unit change in X.

We will analyze the familiar Customer Satisfaction data using Y=Discrete Satisfaction where the values have been coded such that an Overall Satisfaction score >= 3.5 is considered a 1, and scores < 3.5 are considered a 0. Please note, we are not advising that continuous data be converted to discrete data in actual practice, but simply using the Discrete Satisfaction score for continuity with the previous analysis.

- Open
**Customer Data.xls**. Click**Sheet 1**Tab (or press**F4**to activate last worksheet). Click**SigmaXL > Statistical Tools > Regression > Binary Logistic Regression**. If necessary, click**Use Entire Data Table**, click**Next**. - Select
*Sat-Discrete*, click**Binary Response (Y) >>**; select*Responsive to Calls*and*Ease of Communications*, click**Continuous Predictors (X) >>**; select*Customer Type*, click**Categorical Predictors (X) >>**.

Note that

**Response Count (Y)/Sample Size (Trials)**should be used when each record contains both the number of occurrences along with associated sample size. This is common when tracking daily quality data or performing design of experiments where each run contains a response of the number of defects and sample size.

- Click
**OK**. The resulting Binary Logistic Regression report is shown:

- The Likelihood Ratio p-value < .05 tells us that the model is significant. The p-values for the coefficients in the Parameter Estimates table confirm that Responsive to Calls and Ease of Communications are significant.
- The p-value in Wald Estimates for Categorical (Discrete) Predictors table tells us that Customer Type is not significant here.

**Tip:**Significance for categorical predictors should be based on the Wald Estimates not the p-values given in the Parameter Estimates table. - Note that Customer Type 1 is not displayed in the Parameter Estimates table. This is the "hidden" reference value for Customer Type. Categorical predictors must have one level selected as a reference value. SigmaXL sorts the levels alphanumerically and selects the first level as the reference value.
- Now we will rerun the binary logistic regression but remove
*Customer Type*as a predictor. Press**F3**or click**Recall SigmaXL Dialog**to recall last dialog. Remove*Customer Type*by highlighting*Customer Type*and double-clicking (or press the**Remove**button).

- Click
**OK**. The resulting Binary Logistic report is shown: - The Odds Ratios in the Parameter Estimates table tell us that for every unit increase in Responsive to Calls we are 11.7 times more likely to obtain a satisfied customer. For every unit increase in Ease of Communications we are 13.3 times more likely to obtain a satisfied customer.
- McFadden’s Pseudo R-Square mimics the R-square found in linear regression. This value varies between 0 and 1 but is typically much lower than the traditional R-squared value. A value less than 0.2 indicates a weak relationship; 0.2 to 0.4 indicates a moderate relationship; greater than 0.4 indicates a strong relationship. Here we have an R-square value of 0.66 indicating a strong relationship. This is also confirmed with the Percent Correctly Predicted value of 96%.
- The Pearson, Deviance and Hosmer-Lemeshow Goodness of Fit tests are used to confirm if the binary logistic model fits the data well. P-values < .05 for any of these tests indicate a significant lack of fit. Here the Hosmer-Lemeshow test is indicating lack of fit. Residuals analysis will help us to see where the model does not fit the data well.
- The measures of association are used to indicate the relationship between the observed responses and the predicted probabilities. Larger values for Goodman-Kruskal Gamma, Somers’ D and Kendall’s Tau-a indicate that the model has better predictive ability.
- The residuals report is given on the Sheet
**Binary Logistic Residuals**. Three types of residuals are provided: Pearson, Standardized Pearson and Deviance. The Standardized Pearson Residual is most commonly used and is shown here plotted on a Run Chart. To create, click**SigmaXL > Graphical Tools > Run Chart**, check**Use Entire Data Table**to select the*Residuals*data, click**Next**, select*Std. Pearson Residual*as the**Numeric Data Variable (Y)**. Click**OK**. - Reselect the
**Binary Logistic Report**sheet. Scroll over to display the Event Probability calculator: - Note that if the calculator includes predictors that are categorical (discrete), enter a 0 or 1 to denote the selected level as shown below (using the original analysis which included Customer Type):

Any Standardized Pearson Residual value that is less than -3 or greater than +3 is considered extreme and should be investigated. There are 3 such outliers here: rows 24, 67, and 77 in the residuals table. The +3.4 value indicates that the predicted event probability was low (.08) but the actual result was a 1. The -6.6 value indicates that the predicted event probability was high (.98) but the actual result was a 0. The large negative residuals have high Responsive to Calls and Ease of Communications but dissatisfied customers. The reasons for these discrepancies should be explored further but we will not do so here.

This calculator provides a predicted Event Probability for given values of X (in this case the

If Responsive to Calls and Ease of Communications are both equal to 3, the probability of a satisfied customer is only .07 (7%); if Responsive to Calls and Ease of Communications are both equal to 5, the probability of a satisfied customer is .9995 (99.95%)

If we wanted to select Customer Type 1, enter a 0 for both Customer Types 2 and 3. Customer Type 1 is the hidden reference value.

### Ordinal Logistic Regression

Ordinal Logistic Regression is used to analyze the relationship between one ordinal dependent variable (Y) and multiple independent continuous and/or discrete variables (X's).We will analyze the Customer Satisfaction data using Y=Loyalty –Likely to Recommend score which contains ordinal integer values from 1 to 5, where 5 indicates that the customer is very likely to recommend us and 1 indicates that they are very likely to not recommend us.

- Open
**Customer Data.xls**. Click**Sheet 1**Tab (or press**F4**to activate last worksheet). Click**SigmaXL > Statistical Tools > Regression > Ordinal Logistic Regression**. If necessary, click**Use Entire Data Table**, click**Next**. - Select
*Loyalty – Likely to Recommend*, click**Numeric Ordinal Response (Y) >>**; select*Responsive to Calls*and*Ease of Communications*, click**Continuous Predictors (X) >>**.

- Click
**OK**. The resulting Ordinal Logistic report is shown: - The Likelihood Ratio p-value < .05 tells us that the model is significant. The low p-values for the coefficients confirm that Responsive to Calls and Ease of Communications are significant.
- The Odds Ratios tell us that for every one-unit increase in Responsive to Calls the chance of a Loyalty score of 1 versus 2 (or 2 versus 3, etc.) is reduced by a multiple of 0.36. This is not very intuitive but will be easy to see when we use the Response Outcome Probability calculator.
- McFadden’s Pseudo R-Square value is 0.185 indicating that this is a weak (but close to moderate) degree of association. This is also confirmed with the Percent Correctly Predicted value of 48%.
- The Pearson and Deviance Goodness of Fit (GOF) tests are used to confirm if the ordinal logistic model fits the data well. P-values < .05 would indicate a significant lack of fit. Given that the GOF p-values are greater than .05, we conclude that there is no significant lack of fit.
- Scroll across to the Response Outcome Probability calculator. This calculator provides predicted outcome (event) probabilities for given values of X (in this case the probability of a satisfied customer). Enter the values 3,3; 4,4; 5,5 as shown:

- Referring to the Predicted Probability for each Level, if Responsive to Calls and Ease of Communications are both equal to 3, we would expect to see typical loyalty scores of 3 (57%) with some at 2 (20%) and 4 (20%); if Responsive to Calls and Ease of Communications are both equal to 5, we would expect typical loyalty scores of 4 (53 %) with some at 3 (34%) and 5 (9%).

**Click here to download a Free 30-Day Trial.**