p-value in Machine Learning

In this tutorial, we will look at one of the most frequently used concepts in statistics: the p-value. When we perform a statistical test, how do we know whether the result is significant? For that we need an important statistical tool, the p-value.

Before we go further, we need to know about the Null Hypothesis and the Alternative Hypothesis, since the p-value is closely tied to hypothesis testing.

We can think of the Null Hypothesis as the default assumption that there is no effect: whatever result we observe, we treat it as not providing enough evidence to reject this assumption. In the context of regression, the null hypothesis says there is no relationship between the dependent and independent variables.

What actually is a p-value?

Before defining the p-value, let us build some more intuition about the null and alternative hypotheses with an example. Say we have two car engines: the null hypothesis states that there is no significant difference between them, while the alternative hypothesis states that there is a significant difference.

Suppose we formulate a null hypothesis. We first check whether it can be tested with the given data; if it can, we test it with the best-fitting statistical test. After testing, we make a decision based on a factor called the p-value. This factor tells us whether the data support the null hypothesis or provide evidence against it.
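The two-engine example above can be sketched with a two-sample t-test from SciPy. The mileage numbers below are made up purely for illustration:

```python
from scipy import stats

# Hypothetical mileage measurements for two engines (illustrative data only)
engine_a = [28.1, 27.9, 28.4, 28.0, 27.8, 28.2]
engine_b = [27.5, 27.7, 27.6, 27.9, 27.4, 27.8]

# Null hypothesis: the two engines have the same mean mileage
t_stat, p_value = stats.ttest_ind(engine_a, engine_b)

if p_value < 0.05:
    print("Reject the null hypothesis: the engines differ significantly.")
else:
    print("Fail to reject the null hypothesis.")
```

The test returns a p-value; comparing it against the 0.05 threshold gives us the decision described above.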

Some idea of the significance threshold for the p-value

In general, 0.05 is used as the cutoff or threshold for significance.

This means a p-value greater than the significance level indicates that there is insufficient evidence in your sample to conclude that a non-zero correlation exists.

The smaller the p-value, the stronger the evidence to reject the null hypothesis (H0).
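As a quick sketch of the correlation case mentioned above, `scipy.stats.pearsonr` returns both the correlation coefficient and a p-value for the null hypothesis that the true correlation is zero. The data below are made up for illustration:

```python
from scipy import stats

# Illustrative data: y is roughly 2 * x plus a little noise
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

# Null hypothesis: no (zero) correlation between x and y
r, p = stats.pearsonr(x, y)
print(f"correlation = {r:.3f}, p-value = {p:.2e}")
```

Because the relationship here is nearly perfectly linear, the p-value is far below 0.05, so we reject the null hypothesis of zero correlation.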


p-value in Regression

What we are actually looking for are the statistically significant features, i.e. the ones that help us make good decisions and get good results from our exploration.

If p-value < 0.05, then that feature is significant for our exploration.
If p-value > 0.05, then that feature may not be significant for us.

For better intuition of the p-value in regression, let's take the Boston housing dataset, which has the following features:

from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2; requires an older version
import pandas as pd

boston_dataset = load_boston()
data = pd.DataFrame(data=boston_dataset.data, columns=boston_dataset.feature_names)
data['PRICE'] = boston_dataset.target  # add the target as the 'PRICE' column used below




Just for understanding: we want to estimate the price from the given features as accurately as possible. What we want are the features that contribute significantly to good results. For that, we first find the p-value of each feature.

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# If p-value < 0.05 --> Significant
# If p-value > 0.05 --> Not Significant

prices = data['PRICE']
features = data.drop('PRICE', axis=1)

X_train, X_test, Y_train, Y_test = train_test_split(features, prices, test_size=0.2, random_state=10)

x_incl_cons = sm.add_constant(X_train)  # add an intercept term
model = sm.OLS(Y_train, x_incl_cons)    # ordinary least squares
results = model.fit()                   # regression results

# results.params   --> coefficients
# results.pvalues  --> p-values

pd.DataFrame({'coef': results.params, 'pvalue': round(results.pvalues, 3)})




As we have seen above, a feature with p-value > 0.05 is insignificant for decision making, so we may not consider it for further processing, as it will not have much explanatory power.

If we look at the result, "INDUS" and "AGE" have p-values > 0.05, so we may not consider them for further processing, as these two features are not statistically significant.

That’s how we can use p-value in Machine Learning.


Thanks for Reading!!
