How to Do Feature Selection with Scikit-Learn

Introduction

Feature selection is a fairly important step in feature engineering: too many irrelevant features reduce a model's accuracy, while good feature selection helps reduce overfitting and cut training time. This post gives a brief introduction to doing feature selection with Scikit-Learn.

Scikit-Learn offers several different approaches: first, univariate selection; second, principal component analysis (PCA); third, recursive feature elimination (RFE); and fourth, feature importance ranking.

Univariate Selection

Principle: use statistical tests to select the features most strongly related to the output variable.
Usage: the SelectKBest class in Scikit-Learn applies a chosen statistical test to pick a specified number of top features.
Example:

# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
import pandas
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# load data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores
print(fit.scores_)
#[111.52 1411.887 17.605 53.108 2175.565 127.669 5.393 181.304]
X_new = fit.transform(X)
print(X.shape) #(768,8)
print(X_new.shape) #(768,4)
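
To see which columns were kept, the fitted selector exposes get_support(); a short follow-up sketch reusing the variables above:

# map the selected column indices back to feature names
selected = fit.get_support(indices=True)
print([names[i] for i in selected])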

Principal Component Analysis

Principle: uses linear algebra to compress the dataset; essentially it is a dimensionality reduction technique, producing new components rather than keeping a subset of the original features.
Example:

# Feature Extraction with PCA
from pandas import read_csv
from sklearn.decomposition import PCA
# load data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print(fit.explained_variance_ratio_)
print(fit.components_)
X_new = fit.transform(X)
print(X_new.shape) #(768,3)
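
If you are unsure how many components to keep, n_components also accepts a float in (0, 1); PCA then keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch reusing X from above:

# keep enough components to explain 95% of the variance
pca_95 = PCA(n_components=0.95)
X_95 = pca_95.fit_transform(X)
print(X_95.shape)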

Recursive Feature Elimination (RFE)

Principle: recursively remove features and refit the model on those that remain, using the model's coefficients (or feature importances) to identify which features contribute the most.
Usage:

# Recursive Feature Elimination
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load the iris datasets
dataset = datasets.load_iris()
# create a base classifier used to evaluate a subset of attributes
model = LogisticRegression(max_iter=200)
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(dataset.data, dataset.target)
# summarize the selection of the attributes
print(rfe.support_) #[False True True True]
print(rfe.ranking_) #[2 1 1 1]
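
If you would rather not fix the number of features in advance, RFECV wraps RFE in cross-validation and picks the number automatically; a minimal sketch on the same data:

from sklearn.feature_selection import RFECV
# cross-validated RFE: the number of features is chosen by CV score
rfecv = RFECV(LogisticRegression(max_iter=200), cv=5)
rfecv = rfecv.fit(dataset.data, dataset.target)
print(rfecv.n_features_)
print(rfecv.support_)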

Feature Importance

Method: tree-based ensemble methods such as Random Forest and Extra Trees provide an importance score for each feature directly, and these scores can be used for feature selection. See the documentation for ExtraTreesClassifier for details.
Example:

# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
# load the iris datasets
dataset = datasets.load_iris()
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(dataset.data, dataset.target)
# display the relative importance of each attribute
print(model.feature_importances_)
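
The scores can be turned into an actual selection with SelectFromModel, which by default keeps features whose importance is above the mean; a minimal sketch reusing the fitted model:

from sklearn.feature_selection import SelectFromModel
# prefit=True reuses the already-fitted model instead of refitting it
sfm = SelectFromModel(model, prefit=True)
X_new = sfm.transform(dataset.data)
print(X_new.shape)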

In summary, feature selection methods fall into three broad categories: filter methods, wrapper methods, and embedded methods (regularization techniques such as LASSO, Elastic Net, and Ridge Regression).
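
As a quick illustration of the embedded category, here is a minimal sketch using an L1-penalized logistic regression with SelectFromModel (the C value is illustrative, not tuned):

from sklearn import datasets
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
# the L1 penalty drives the coefficients of weak features to exactly zero
dataset = datasets.load_iris()
estimator = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
selector = SelectFromModel(estimator)
X_new = selector.fit_transform(dataset.data, dataset.target)
print(selector.get_support())
print(X_new.shape)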

Aside

A mistake would be to perform feature selection first to prepare your data, then perform model selection and training on the selected features.

This is a mistake that is easy to make when doing feature selection: selecting features on the full dataset before cross-validation leaks information from the held-out data. Feature selection needs to be nested inside the inner loop of cross-validation, i.e., performed separately on each fold.
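
A safe way to do this in Scikit-Learn is to put the selector and the model into a Pipeline, so that feature selection is re-run on every training fold; a minimal sketch:

from sklearn import datasets
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
dataset = datasets.load_iris()
# the held-out fold never influences which features are selected
pipe = Pipeline([
    ("select", SelectKBest(score_func=chi2, k=2)),
    ("model", LogisticRegression(max_iter=200)),
])
scores = cross_val_score(pipe, dataset.data, dataset.target, cv=5)
print(scores.mean())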

Also worth keeping at hand: a feature selection checklist.
Further reading: An Introduction to Variable and Feature Selection.
