p-value in Machine Learning

In this tutorial, we will look at one of the most frequently used concepts in statistics: the p-value. When we run a statistical test, we need a way to judge whether the result is statistically significant, and the p-value is the tool for that.

Before we go further, we need to know about the null hypothesis and the alternative hypothesis, because the p-value comes from hypothesis testing.

The null hypothesis is the default assumption that there is no effect or no difference. We keep it unless the data give enough evidence to reject it. In the context of regression, it says there is no relationship between the dependent variable and an independent variable.

What actually is a p-value?

Before defining the p-value, let us build more intuition about the null hypothesis and the alternative hypothesis with an example. Say we have two car engines. The null hypothesis says there is no significant difference between them, and the alternative hypothesis says there is a significant difference between them.

Say we have formulated a null hypothesis. We first check whether it can be tested with the given data; if it can, we run the most suitable statistical test. After testing, we make a decision based on the p-value: the probability of seeing a result at least as extreme as the one observed, assuming the null hypothesis is true. We use it to decide whether the data are consistent with the null hypothesis or give evidence against it.
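To make the two-engine example concrete, here is a minimal sketch of a permutation test written with the standard library only. The fuel-efficiency readings and the function name permutation_p_value are made up for illustration and do not come from any real data. The idea: if the null hypothesis is true, the engine labels do not matter, so we shuffle them many times and count how often the shuffled difference in means is at least as large as the observed one.

```python
import random
import statistics

# Hypothetical fuel-efficiency readings (km per litre) for two engines.
engine_a = [14.8, 15.1, 15.3, 14.9, 15.0, 15.2, 14.7, 15.4]
engine_b = [17.9, 18.2, 18.0, 18.3, 17.8, 18.1, 18.4, 18.0]


def permutation_p_value(sample_a, sample_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(sample_a) - statistics.mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under H0 the labels are interchangeable, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


p_value = permutation_p_value(engine_a, engine_b)
print(f"p-value: {p_value:.4f}")
```

Because the two made-up samples do not overlap at all, very few shuffles reproduce a difference this large, so the estimated p-value is small and we would reject the null hypothesis of no difference.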

Significance level for the p-value

In general, 0.05 is used as the cutoff or threshold for significance.

This means a p-value greater than the significance level indicates that there is insufficient evidence in your sample to conclude that a non-zero correlation exists.
The smaller the p-value, the stronger the evidence to reject the null hypothesis (H0).
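The decision rule above can be sketched as a small helper. The function name interpret_p_value and the default alpha of 0.05 are illustrative choices, not part of any library. Note that a large p-value only means we fail to reject H0; it does not prove H0 is true.

```python
def interpret_p_value(p_value, alpha=0.05):
    """Compare a p-value with the significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"


print(interpret_p_value(0.009))  # below the 0.05 cutoff
print(interpret_p_value(0.445))  # above the 0.05 cutoff
```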

p-value in Regression

What we are looking for are the statistically significant features: those that help us make good decisions and get good results after exploration.

If p-value < 0.05, the feature is statistically significant for our exploration.
If p-value > 0.05, the feature may not be significant for us.
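Where do these per-feature p-values come from? In ordinary least squares, each coefficient gets a t-test of the null hypothesis that its true value is zero. The sketch below reproduces that calculation with NumPy and SciPy on synthetic data. The variable names and the data are assumptions made for illustration: x_signal really drives y, while x_noise is unrelated to it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
x_signal = rng.normal(size=n)  # truly related to y
x_noise = rng.normal(size=n)   # unrelated to y
y = 3.0 + 2.0 * x_signal + rng.normal(size=n)

# Design matrix with an intercept column, like sm.add_constant would add.
X = np.column_stack([np.ones(n), x_signal, x_noise])

# Least-squares coefficients.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard error of each coefficient from the residual variance.
residuals = y - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-statistic and two-sided p-value for H0: coefficient == 0.
t_stats = beta / std_err
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)

for name, p in zip(["const", "x_signal", "x_noise"], p_values):
    print(f"{name}: p-value = {p:.3f}")
```

The coefficient of x_signal gets a p-value far below 0.05. The p-value for x_noise is usually above 0.05, although by construction about 5% of such noise features fall below the cutoff purely by chance.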

For better intuition of the p-value in regression, let’s take the Boston housing dataset as an example, which has the following features:

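# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version.
from sklearn.datasets import load_boston
boston_dataset = load_boston()

import pandas as pd
# Build a DataFrame of the features and add the target as a PRICE column
data = pd.DataFrame(data=boston_dataset.data, columns=boston_dataset.feature_names)
data['PRICE'] = boston_dataset.target
data.head()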

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	PRICE
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2

Just for understanding, suppose we want to estimate the price from the given features as accurately as possible. We want the features that contribute significantly to a good result. For that, we first find the p-value of each feature.

import statsmodels.api as sm
from sklearn.model_selection import train_test_split
# If p-value < 0.05 --> Significant
# If p-value > 0.05 --> Not Significant

prices = data['PRICE']
features = data.drop('PRICE', axis=1)

X_train, X_test, Y_train, Y_test = train_test_split(features, prices, test_size=0.2, random_state=10)

x_incl_cons = sm.add_constant(X_train)  # add the intercept column
model = sm.OLS(Y_train, x_incl_cons)  # ordinary least squares
results = model.fit()  # regression results

# results.params
# results.pvalues

pd.DataFrame({'coef': results.params, 'pvalue': results.pvalues.round(3)})

Result:

	coef	pvalue
const	4.059944	0.000
CRIM	-0.010672	0.000
ZN	0.001579	0.009
INDUS	0.002030	0.445
CHAS	0.080331	0.038
NOX	-0.704068	0.000
RM	0.073404	0.000
AGE	0.000763	0.209
DIS	-0.047633	0.000
RAD	0.014565	0.000
TAX	-0.000645	0.000
PTRATIO	-0.034795	0.000
B	0.000516	0.000
LSTAT	-0.031390	0.000

As we have seen above, a feature with p-value > 0.05 is not statistically significant, so we may leave it out of further processing because it adds little explanatory power.

In the result, “INDUS” and “AGE” have p-values > 0.05 (0.445 and 0.209), so we may not consider them for further processing, as these two are not statistically significant.
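As a minimal sketch, the same filtering can be done in pandas. The p-values below are copied by hand from the result table above (the constant is left out because it is not a feature); with a fitted model you would start from results.pvalues instead. The threshold name ALPHA is an illustrative choice.

```python
import pandas as pd

ALPHA = 0.05  # significance cutoff used throughout this tutorial

# p-values copied from the result table above, without the constant term.
pvalues = pd.Series({
    'CRIM': 0.000, 'ZN': 0.009, 'INDUS': 0.445, 'CHAS': 0.038,
    'NOX': 0.000, 'RM': 0.000, 'AGE': 0.209, 'DIS': 0.000,
    'RAD': 0.000, 'TAX': 0.000, 'PTRATIO': 0.000, 'B': 0.000,
    'LSTAT': 0.000,
})

insignificant = pvalues[pvalues > ALPHA].index.tolist()
significant = pvalues[pvalues <= ALPHA].index.tolist()

print("Candidates to drop:", insignificant)
print("Features to keep:", significant)
```

Dropping features only by p-value is a rule of thumb, so it is worth refitting the model without them and checking that the fit does not get noticeably worse.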

That’s how we can use p-value in Machine Learning.

Thanks for Reading!!
