p-value in Machine Learning

In this tutorial, we are going to look at one of the most frequently used concepts in statistics: the p-value. When we perform a statistical test, how do we know whether the result we get is statistically significant? For that we need an important tool in statistics, the p-value.

Before we go further, we need to know about the null hypothesis and the alternative hypothesis, since the p-value is closely tied to hypothesis testing.

We can think of the null hypothesis as the default assumption that nothing has changed: whatever new results we get, unless there is enough evidence to reject it, we assume there is no effect compared to the original state. In the context of regression, it says there is no relationship between the dependent variable and an independent variable.

What actually is a p-value?

Before defining the p-value, let's build some more intuition about the null hypothesis and the alternative hypothesis with an example. Say we have two car engines. Here the null hypothesis says there is no significant difference between them, and the alternative hypothesis says there is a significant difference between them.

Say we have formulated a null hypothesis. We first check whether it can be tested with the given data; if it can, we test it with the most appropriate statistical test. After testing, we make a decision based on one number, the p-value: the probability of seeing a result at least as extreme as the one observed if the null hypothesis were true. We use it to decide whether the data are consistent with the null hypothesis or count as evidence against it.
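To make the two-engine example concrete, here is a minimal sketch that estimates a p-value with a permutation test, using only the Python standard library. The readings, the function name permutation_p_value and the choice of a permutation test are assumptions made for illustration; the tutorial itself does not prescribe a particular test.

```python
import random
from statistics import mean


def permutation_p_value(sample_a, sample_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for a difference in means.

    Under the null hypothesis the group labels are interchangeable, so we
    shuffle the pooled data and count how often a random split gives a
    difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical fuel-efficiency readings (km per litre) for the two engines.
engine_a = [15.1, 14.8, 15.6, 15.3, 14.9, 15.4, 15.2, 15.0]
engine_b = [16.0, 15.8, 16.3, 15.9, 16.1, 15.7, 16.2, 16.4]

p_value = permutation_p_value(engine_a, engine_b)
print(f"p-value: {p_value:.4f}")
```

Because the two made-up samples barely overlap, almost no random split reproduces the observed gap, so the estimated p-value is small and the data count as evidence against the null hypothesis.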

Significance level for the p-value

In general, 0.05 is used as the cutoff or threshold for significance.

This means a p-value greater than the significance level indicates that there is insufficient evidence in your sample to conclude that a non-zero correlation exists.

The smaller the p-value, the stronger the evidence to reject the null hypothesis (H0).
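The decision rule above can be written as a small helper. This is only a sketch: the function name decide and the returned strings are illustrative, and 0.05 is the conventional default threshold rather than a fixed rule.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with the significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"


# Illustrative p-values, not taken from any real test.
for p in (0.01, 0.20):
    print(p, "->", decide(p))
```

Note that the second outcome is worded as "fail to reject H0": a large p-value means the evidence is insufficient, not that the null hypothesis has been proven true.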

 

p-value in Regression

What we are really looking for is the statistically significant features, the ones that help us make good decisions and get good results after exploration.

If p-value < 0.05, the feature is statistically significant for our exploration.
If p-value > 0.05, the feature may not be significant for us.

For better intuition of the p-value in regression, let’s take the Boston housing dataset as an example. It has the following features:

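# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston
import pandas as pd

boston_dataset = load_boston()

# Put the features into a DataFrame and add the target as the PRICE column
data = pd.DataFrame(data=boston_dataset.data, columns=boston_dataset.feature_names)
data['PRICE'] = boston_dataset.target
data.head()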

 

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  PRICE
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98   24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14   21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03   34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94   33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33   36.2

 

Just for understanding, suppose we want to estimate the price from the given features as accurately as possible. What we want are the features that contribute significantly to a good result. For that, we first find the p-value of each feature.

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# If p-value < 0.05 --> Significant
# If p-value > 0.05 --> Not Significant

prices = data['PRICE']
features = data.drop('PRICE', axis=1)

X_train, X_test, Y_train, Y_test = train_test_split(features, prices, test_size=0.2, random_state=10)

x_incl_cons = sm.add_constant(X_train)  # add the intercept column
model = sm.OLS(Y_train, x_incl_cons)  # ordinary least squares
results = model.fit()  # regression results

# results.params
# results.pvalues

pd.DataFrame({'coef': results.params, 'pvalue': round(results.pvalues, 3)})

 

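Result (the coefficients shown are on a scale that suggests the model was fit on a log-transformed PRICE rather than on the raw prices):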
              coef  pvalue
const     4.059944   0.000
CRIM     -0.010672   0.000
ZN        0.001579   0.009
INDUS     0.002030   0.445
CHAS      0.080331   0.038
NOX      -0.704068   0.000
RM        0.073404   0.000
AGE       0.000763   0.209
DIS      -0.047633   0.000
RAD       0.014565   0.000
TAX      -0.000645   0.000
PTRATIO  -0.034795   0.000
B         0.000516   0.000
LSTAT    -0.031390   0.000
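For intuition about where such per-feature p-values come from: each one tests the null hypothesis that the feature's coefficient is zero, using the ratio of the coefficient to its standard error. statsmodels evaluates that ratio against a t distribution; the sketch below substitutes the standard normal, which is only a large-sample approximation, so it can stay within the standard library. The function name and the input numbers are hypothetical.

```python
import math


def two_sided_p_value(coef, std_err):
    """Approximate two-sided p-value for H0: coefficient == 0.

    Assumption: the standard normal is used in place of the t distribution,
    which is reasonable only when the sample is large.
    """
    z = coef / std_err
    # P(|Z| > |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical coefficient and standard error, for illustration only.
print(two_sided_p_value(0.5, 0.25))
```

A coefficient that is large relative to its standard error gives a small p-value, while a coefficient that is small relative to its standard error gives a p-value close to 1.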

 

As we saw above, a feature with p-value > 0.05 is not statistically significant, so we may choose not to consider it for further processing, since it adds little explanatory power.

In the result, “INDUS” and “AGE” have p-values > 0.05, so we may drop them from further processing, as these two are not statistically significant.
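This selection can be sketched in plain Python. The p-values are copied from the result table; the names p_values, ALPHA, insignificant and significant are illustrative, and the intercept (const) is skipped because it is not a feature.

```python
# p-values copied from the regression result table
p_values = {
    "const": 0.000, "CRIM": 0.000, "ZN": 0.009, "INDUS": 0.445,
    "CHAS": 0.038, "NOX": 0.000, "RM": 0.000, "AGE": 0.209,
    "DIS": 0.000, "RAD": 0.000, "TAX": 0.000, "PTRATIO": 0.000,
    "B": 0.000, "LSTAT": 0.000,
}

ALPHA = 0.05  # significance threshold used throughout this tutorial

# Split the features by comparing each p-value with the threshold.
insignificant = [name for name, p in p_values.items()
                 if name != "const" and p > ALPHA]
significant = [name for name, p in p_values.items()
               if name != "const" and p <= ALPHA]

print(insignificant)  # ['INDUS', 'AGE']
```

With the DataFrame from the regression step, the same filter could be applied to results.pvalues directly instead of a hand-copied dictionary.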

That’s how we can use the p-value in Machine Learning.

 

Thanks for Reading!!
