How Do P-Values Get Calculated for Linear and Logistic Regression?

Ira Seidman
9 min read · Dec 5, 2022


By Ira Seidman — freelance healthcare Python data analyst and high school math teacher

In machine learning we are often concerned only with the accuracy of our models: as long as a model predicts better than the majority class for classification, or with a better RMSE than the mean for regression, on unseen data, it has some kind of value. Whether you are predicting who will default on a loan, flagging at-risk students, or estimating how much homes will sell for in Ames, Iowa, data science is primarily interested in accuracy. However, if your goal is to answer a question, you are entering the realm of inference, which leans more on p-values than accuracy to establish statistical significance. While p-values alone are not enough to demonstrate causation, they are critical for demonstrating a non-zero correlation, which is still valuable for understanding how different data connect. This post will discuss how to solve for p-values with linear and logistic regression using Excel and Statsmodels.

What is a P-value?

A p-value is a probability, a number between 0 and 1, that reflects how likely it would be to see a result at least as extreme as ours if the null hypothesis were true. The null hypothesis says that there is no correlation between the independent and dependent variables. This means that the lower the p-value, the harder it is to reconcile our data with the null hypothesis, and the more confidently we can reject it and conclude that there is a correlation between the two variables. In general, a p-value below .05 is considered statistically significant (this cut-off is known as alpha), but the threshold can be higher or lower depending on industry standards (such as .1 or .01).

For example, a linear regression with a Pearson correlation of .9 is considered a strong correlation but not necessarily a significant one; it may be significant for one sample but not for another because sample size matters: the larger the sample, the less likely it is that a high correlation is a statistical anomaly. If the p-value associated with this correlation is below .05, we can conclude that the correlation is significant and further investigate the regression coefficients to better understand the relationship between the two variables. All of these values can be found in a regression summary table, which is discussed below, and the same principle applies for logistic regression.

Linear Regression:

When data is used to predict a continuous target, there will be a p-value reflecting how likely the observed relationship would be if the true relationship were zero. The p-value, which tells us both whether the r value is significant and whether the confidence interval for the regression coefficient excludes 0 (with 95% confidence for an alpha of .05), comes from the Pearson correlation r and the sample size n. These two values give us a t-statistic (see the equation below) which can then be used to solve for the p-value using a table or Excel functions (outlined further below):
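t = r√(n − 2) / √(1 − r²)

Here r is the Pearson correlation and n is the sample size, so n − 2 is the degrees of freedom for a univariate regression.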

The t-statistic can also be derived as the coefficient divided by its standard error, or from a comparison between sample and population data. Like a z-score, the t-statistic really just represents how many standard errors our estimate sits from the value assumed under the null hypothesis (here, a coefficient of 0), and the further out it sits the less likely the result is due to chance.
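As a minimal sketch of this calculation in Python (the r and n values here are purely illustrative):

import numpy as np
from scipy import stats

r = 0.9  # hypothetical Pearson correlation
n = 30   # hypothetical sample size

# t-statistic from the correlation and sample size
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

# two-tailed p-value from the t-distribution with n - 2 degrees of freedom
p = 2 * stats.t.sf(abs(t), df=n - 2)

print(f"t = {t:.3f}, p = {p:.3g}")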

As r and n go up, so does our t-statistic, and the higher the t-statistic the lower our p-value will be, per the table below. The t-statistic values in the table represent critical points: any t-statistic above the listed value (or below its negative, when dealing with negative t-statistics) for our given alpha and degrees of freedom is considered statistically significant:

P-Value Table

This video shows how to solve for a correlation’s p-value in Excel by using the equation above to solve for t given n and r, and then plugging that result into the function T.DIST.2T. See below for how to get a p-value in Excel using formulas with the sklearn diabetes dataset (which is on the first tab, called data, of the Excel file in this post’s GitHub repository):

Excel Manual P-Value Calculations

Here’s the same tab with the formulas shown so you can see how I got each value:

Excel Manual P-Value Calculations Formulas

These same results are confirmed using the Excel regression tool, which can be found on the far right of the Data tab by clicking “Data Analysis”; for detailed instructions on how to fill out the regression set-up, check out this video:

Excel Univariate Regression Summary

Below is the output for the multivariate regression summary in Excel. Note that the t-statistic and p-value for the age predictor do not match what they were in the univariate model, but this is to be expected: the multivariate regression estimates the age coefficient while holding the other predictors constant, so both the coefficient and its standard error change.

Excel Multivariate Regression Summary

In addition to Excel you can use libraries like Statsmodels in Python (or any other statistical package) to return regression summary tables. For the same data, see the summaries below and note how closely they match the t-statistics and p-values from Excel:

Statsmodels Univariate Linear Regression Summary

The very low p-value for age that appears in both the Excel and Statsmodels regression summaries gives me confidence that age is a strong predictor of disease progression. It would lead me to look for an age where disease progression starts to worsen and recommend that doctors start to pay closer attention to diabetes around this age. Ideally, I would make this hypothesis before collecting the data and use the data to confirm or reject it, rather than creating the finding from the data. There are similar results in the multivariate regression:

Statsmodels Multivariate Linear Regression Summary

Categorical predictors will also be included in the regression summary so long as they are dummified first; the regression coefficients will then reflect the change in the target given the presence of that category. For example, in the multivariate regression above, sex was a column reflecting whether subjects were female (0, or -0.0446 after the dataset’s standardization) or male (1, or 0.0507). Its coefficient was -239.82, which, given that coding, means men have lower predicted disease progression than women within the sample, holding the other predictors constant.
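As a minimal sketch of the dummifying step (the DataFrame and column names here are hypothetical):

import pandas as pd

# hypothetical data with one categorical predictor
df = pd.DataFrame({"region": ["north", "south", "south", "east"],
                   "sales": [10, 12, 11, 9]})

# one 0/1 column per category; drop the first to avoid perfect collinearity
X = pd.get_dummies(df[["region"]], drop_first=True)
print(X)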

Finally, when using Statsmodels make sure to add the constant (intercept) column with sm.add_constant (shown in the sketch below) before setting up the model, and be careful to use the statsmodels OLS (Ordinary Least Squares) method when modeling linear regression. This analysis will work for predicting any continuous target; for predicting categorical targets you will want to use logistic regression, so read on.
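A minimal sketch of that full setup, using the same sklearn diabetes data as above:

import statsmodels.api as sm
from sklearn.datasets import load_diabetes

# the same diabetes data used in the Excel examples
X, y = load_diabetes(return_X_y=True, as_frame=True)

# add the constant (intercept) column before fitting
X = sm.add_constant(X)

# ordinary least squares; .summary() prints coefficients, t-stats, and p-values
model = sm.OLS(y, X).fit()
print(model.summary())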

Logistic Regression:

The coefficients for logistic regression do not represent the same kind of change in y per unit change in x that the coefficients for linear regression do; for logistic regression these values show the change in the log odds of the dependent variable for a one unit increase in the independent variable. For example, a coefficient of 0.7 means a one unit increase in the predictor multiplies the odds of the positive class by e^0.7, or roughly 2. For more background on this distinction check out a previous post of mine, Which Function Does Logistic Regression Use?

If the sigmoid for our logistic regression model really does reflect a real and statistically significant relationship between the two variables being compared, this should show up as a low p-value. As with linear regression, the higher the t-statistic (or z-score, as it appears in the Statsmodels regression summary), the lower the p-value will be. Again, the z-score is the coefficient over its standard error. Below is the regression summary in Statsmodels for the sklearn breast cancer dataset, which has a binary classification target: 0 for malignant and 1 for benign tumors:

Statsmodels Multivariate Binomial Logistic Regression Summary
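A minimal sketch of the setup behind a summary like this, using a single predictor to keep it small:

import statsmodels.api as sm
from sklearn.datasets import load_breast_cancer

# binary target: 0 = malignant, 1 = benign in sklearn's encoding
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# one predictor plus a constant keeps the example simple
X = sm.add_constant(X[["mean radius"]])

# binomial logistic regression; the summary reports z-scores and p-values
model = sm.Logit(y, X).fit()
print(model.summary())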

The alternative to Statsmodels is sklearn’s model.intercept_, model.coef_, and model.score(), but these results are limited compared to what Statsmodels returns. There is a way to get the coefficients in Excel using the Solver tool to optimize (this is the tutorial I worked off of), but I warn you it is manual, and the results for the two logistic regression datasets I am working with did not converge for me. There are other statistical packages and add-ins for Excel that include logistic regression and should work better.
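For comparison, a minimal sketch of what sklearn exposes (fitted values only, with no standard errors or p-values):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# scaling helps the solver converge; sklearn reports no p-values
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

clf = model.named_steps["logisticregression"]
print(clf.intercept_)     # fitted intercept
print(clf.coef_)          # fitted coefficients
print(model.score(X, y))  # mean accuracy on the training data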

For multi-class classification I used the iris dataset which has three targets: 0 — setosa, 1 — versicolor, and 2 — virginica. The Statsmodels setup is almost the same as for binary logistic regression except we are calling sm.MNLogit instead of sm.Logit:

Statsmodels Multivariate Multinomial Logistic Regression Summary
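A minimal sketch of that setup, with a single predictor and the ‘bfgs’ solver mentioned below:

import statsmodels.api as sm
from sklearn.datasets import load_iris

# three classes: 0 = setosa, 1 = versicolor, 2 = virginica
X, y = load_iris(return_X_y=True, as_frame=True)

# a single predictor plus a constant, per the convergence note below
X = sm.add_constant(X[["sepal length (cm)"]])

# multinomial logistic regression; 'bfgs' avoids the NaN issue described below
model = sm.MNLogit(y, X).fit(method="bfgs")
print(model.summary())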

This output includes n-1 classes because, just as binomial logistic regression reports output only for class 1 and not class 0, the first class serves as the reference category. I used method = ‘bfgs’ to avoid a NaN error I was getting, and when I used more than one predictor the solver ran through too many iterations and failed to converge. Hopefully this is enough to get you started interpreting, validating, and most importantly understanding your logistic regression results.

Conclusion:

The p-values for linear and logistic regression come from comparing the variance explained by our models to the variance of a baseline that just guesses the mean (or the majority class) every time. The more of this variance the model can explain, the more likely we can reject the null hypothesis that there is no relationship between the two variables being compared. If your goal is to demonstrate causation, you would also have to show directionality, meaning there is an element of timing where one variable follows the other, in addition to ruling out alternative hypotheses. Here is an interesting study that tries to establish causation between student self-control and GPA; its section “Beyond Prediction: Establishing the Causal Role of Self-Control in Academic Achievement” outlines John Stuart Mill’s initial three criteria for causation. This is another good article about causation, which discusses Bradford Hill’s nine criteria for epidemiological causation and how they have changed since they were first proposed in 1965. Finally, if you are looking for help interpreting all of the values found in these regression summaries, this video is an excellent, albeit lengthy, introduction.

Hopefully this post has given you some good tools for calculating and verifying the p-values for your analysis and can get you one step closer to not just predicting, but drawing real and actionable conclusions from your data. Thanks for reading, for source code see here, and feel free to comment or reach out to me on LinkedIn if you have questions.

Source Code and References:

Determining Statistical Significance For The Pearson Correlation Coefficient

2016 Excel Regression Analysis

Which Function Does Logistic Regression Use?

Establishing Causality Using Longitudinal Hierarchical Linear Modeling

Applying the Bradford Hill criteria in the 21st century

Regression Output Explained

Source Code

Statology — Here is How to Find the P-Value from the t-Distribution Table

Statology — How to Get Regression Model Summary from SciKit-Learn

Statology — How to Perform Logistic Regression in Excel
