Feature Engineering (IV) -- Feature Selection

Feature Engineering

A saying circulates widely in the industry: data and features determine the ceiling of machine learning, while models and algorithms merely approximate that ceiling. Feature engineering therefore occupies a central place in machine learning, and in practice it is often the key to a successful project.

Feature engineering is the most time- and labor-consuming part of data analysis. Unlike algorithms and models, it does not follow a fixed procedure; it relies on engineering experience and trade-offs, so there is no single standard method. This article summarizes some commonly used techniques.

Feature engineering covers several sub-problems: Data Preprocessing, Feature Extraction, Feature Selection, and Feature Construction.

  • Feature selection
    • Introduction
    • Univariate feature selection
      • Mutual information
      • Correlation coefficient
      • ANOVA
      • Chi-square test
      • Information value (IV)
      • Gini index
      • VIF
      • Pipeline implementation
    • Recursive feature elimination
    • Feature importance
    • Principal component analysis
    • Summary

Import the necessary packages

import numpy as np
import pandas as pd
import re
import sys
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_X_y, check_is_fitted
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer, make_column_transformer, make_column_selector
from sklearn.pipeline import FeatureUnion, make_union, Pipeline, make_pipeline
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.feature_selection import SelectFpr, SelectFdr, SelectFwe
from sklearn.model_selection import cross_val_score
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import gc

# Setting configuration.
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
SEED = 42

Define a timer decorator

def timer(func):
    import time
    import functools

    def strfdelta(tdelta, fmt):
        hours, remainder = divmod(tdelta, 3600)
        minutes, seconds = divmod(remainder, 60)
        return fmt.format(hours, minutes, seconds)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        click = time.time()
        print("Starting time\t", time.strftime("%H:%M:%S", time.localtime()))
        result = func(*args, **kwargs)
        delta = strfdelta(time.time() - click, "{:.0f} hours {:.0f} minutes {:.0f} seconds")
        print(f"{func.__name__} cost {delta}")
        return result
    return wrapper

Feature Selection

Introduction

We now have a large number of candidate features. Some carry rich information, some carry overlapping information, and some are simply irrelevant. Although it is hard to say which features matter before fitting a model, using all of them as training features without any screening often leads to the curse of dimensionality and can even hurt generalization, because weakly informative features drown out the more important ones. We therefore need feature selection: remove useless or redundant features and keep the useful ones as the model's training data.

There are many feature selection methods, usually grouped into three families:

  • Filter methods are the simplest: each feature is scored by a dispersion or relevance metric, and features are kept according to a score threshold or a target number of features.
  • Wrapper methods select or exclude subsets of features in each round according to an objective function, usually a predictive performance score.
  • Embedded methods are slightly more involved: a learning algorithm is trained first, each feature receives a weight, and features are then selected from the largest weights down.
| sklearn.feature_selection | Category | Description |
| --- | --- | --- |
| VarianceThreshold | Filter | Variance-based selection |
| SelectKBest | Filter | Scores features with correlation, chi-square test, or mutual information and keeps the top k |
| SelectPercentile | Filter | Keeps features in the top percentile of scores |
| SelectFpr, SelectFdr, SelectFwe | Filter | Selects features by hypothesis-test p-values |
| RFECV | Wrapper | Recursive feature elimination within cross-validation |
| SequentialFeatureSelector | Wrapper | Forward/backward search |
| SelectFromModel | Embedded | Trains a base model and keeps the features with the largest weights |

Reference: an intuitive explanation of the Family-wise error rate (FWER) and the False discovery rate (FDR)
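To make the three families concrete, here is a minimal sketch on a synthetic dataset; the estimators and thresholds are illustrative choices, not the settings used later in this article.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold, RFECV, SelectFromModel
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Filter: drop (near-)constant features.
X_filtered = VarianceThreshold(threshold=0.01).fit_transform(X_demo)

# Wrapper: recursive feature elimination with cross-validation.
rfecv = RFECV(LogisticRegression(max_iter=1000), step=1, cv=3).fit(X_demo, y_demo)
print('RFECV kept', rfecv.n_features_, 'features')

# Embedded: keep features whose importance exceeds the mean importance.
sfm = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0)).fit(X_demo, y_demo)
print('SelectFromModel kept', sfm.get_support().sum(), 'features')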

df = pd.read_csv('../datasets/Home-Credit-Default-Risk/created_data.csv', index_col='SK_ID_CURR')

Define a helper function to reduce memory usage

@timer
def convert_dtypes(df, verbose=True):
    original_memory = df.memory_usage().sum()
    df = df.apply(pd.to_numeric, errors='ignore')
    # Convert booleans to integers
    boolean_features = df.select_dtypes(bool).columns.tolist()
    df[boolean_features] = df[boolean_features].astype(np.int32)
    # Convert objects to category
    object_features = df.select_dtypes(object).columns.tolist()
    df[object_features] = df[object_features].astype('category')
    # Float64 to float32
    float_features = df.select_dtypes(float).columns.tolist()
    df[float_features] = df[float_features].astype(np.float32)
    # Int64 to int32
    int_features = df.select_dtypes(int).columns.tolist()
    df[int_features] = df[int_features].astype(np.int32)
    new_memory = df.memory_usage().sum()
    if verbose:
        print(f'Original Memory Usage: {round(original_memory / 1e9, 2)} gb.')
        print(f'New Memory Usage: {round(new_memory / 1e9, 2)} gb.')
    return df
df = convert_dtypes(df)
X = df.drop('TARGET', axis=1) 
y = df['TARGET']
Starting time	 20:30:38
Original Memory Usage: 5.34 gb.
New Memory Usage: 2.63 gb.
convert_dtypes cost 0 hours 2 minutes 1 seconds
X.dtypes.value_counts()
float32     2104
int32         16
category       3
category       3
...
category       1
Name: count, dtype: int64
del df
gc.collect()
0
# Encode categorical features
categorical_features = X.select_dtypes(exclude='number').columns.tolist()
X[categorical_features] = X[categorical_features].apply(lambda x: x.cat.codes)
X.dtypes.value_counts()
float32    2104
int8         47
int32        16
Name: count, dtype: int64

Define a dataset evaluation function

@timer
def score_dataset(X, y, categorical_features, nfold=5):
    # Create Dataset object for lightgbm
    dtrain = lgb.Dataset(X, label=y)
    # Use a dictionary to set parameters.
    params = dict(
        objective='binary',
        is_unbalance=True,
        metric='auc',
        n_estimators=500,
        verbose=0
    )
    # Training with 5-fold CV:
    print('Starting training...')
    eval_results = lgb.cv(
        params, dtrain, nfold=nfold,
        categorical_feature=categorical_features,
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(50)],
        return_cvbooster=True
    )
    boosters = eval_results['cvbooster'].boosters
    # Initialize an empty dataframe to hold feature importances
    feature_importances = pd.DataFrame(index=X.columns)
    for i in range(nfold):
        # Record the feature importances
        feature_importances[f'cv_{i}'] = boosters[i].feature_importance()
    feature_importances['score'] = feature_importances.mean(axis=1)
    # Sort features according to importance
    feature_importances = feature_importances.sort_values('score', ascending=False)
    return eval_results, feature_importances
eval_results, feature_importances = score_dataset(X, y, categorical_features)
Starting time	 20:32:42
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50]	cv_agg's valid auc: 0.778018 + 0.00319843
[100]	cv_agg's valid auc: 0.783267 + 0.00307558
[150]	cv_agg's valid auc: 0.783211 + 0.00299384
Early stopping, best iteration is:
[115]	cv_agg's valid auc: 0.783392 + 0.00298777
score_dataset cost 0 hours 6 minutes 9 seconds

Univariate feature selection

Relief (Relevant Features) is a well-known filter-style feature selection method. It assumes that the importance of a feature subset is determined by the sum of the relevance statistics associated with each feature in the subset, so we only need to keep the features with the k largest relevance statistics, or those whose statistics exceed some threshold.

Commonly used filter metrics (a small sketch of two of them follows the table):

| Function | Python module | Description |
| --- | --- | --- |
| VarianceThreshold | sklearn.feature_selection | Variance filtering |
| r_regression | sklearn.feature_selection | Pearson correlation between each feature and the target (regression) |
| f_regression | sklearn.feature_selection | F-statistic of the univariate regression between each feature and the target (regression) |
| mutual_info_regression | sklearn.feature_selection | Estimates mutual information for a continuous target |
| chi2 | sklearn.feature_selection | Chi-square statistics and p-values of non-negative features (classification) |
| f_classif | sklearn.feature_selection | ANOVA F-value between each feature and the target (classification) |
| mutual_info_classif | sklearn.feature_selection | Estimates mutual information for a discrete target |
| df.corr, df.corrwith | pandas | Pearson, Kendall, Spearman correlation |
| calc_gini_scores | self-defined | Gini index |
| variance_inflation_factor | statsmodels.stats.outliers_influence | VIF |
| df.isna().mean() | pandas | Missing rate |
| DropCorrelatedFeatures | feature_engine.selection | Drops collinear features based on correlation |
| SelectByInformationValue | feature_engine.selection | Selects features by information value (IV) |
| DropHighPSIFeatures | feature_engine.selection | Drops unstable features (high PSI) |
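Two of the filters in the table, variance filtering and the missing rate, are not demonstrated in the rest of this article, so here is a small sketch; the thresholds are illustrative values, not tuned for this dataset.

numeric_X = X.select_dtypes('number')

# Constant features (zero variance; pandas ignores NaNs here).
low_variance = numeric_X.columns[numeric_X.var() == 0].tolist()

# Features that are missing in more than 90% of the rows.
high_missing = numeric_X.columns[numeric_X.isna().mean() > 0.9].tolist()

print(f'{len(low_variance)} constant features, {len(high_missing)} mostly-missing features')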

Mutual information

Mutual information measures the relationship between each feature and the target from an information-entropy perspective, capturing both linear and nonlinear dependence.
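For two discrete variables it is defined as

$$I(X;Y)=\sum_{x}\sum_{y}p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}$$

and equals zero exactly when the feature and the target are independent; for continuous features, sklearn estimates it from nearest-neighbor statistics.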

from sklearn.feature_selection import SelectKBest, mutual_info_classif

@timer
def calc_mi_scores(X, y):
    colnames = X.select_dtypes(exclude='number').columns
    X[colnames] = X[colnames].astype("category").apply(lambda x: x.cat.codes)
    discrete = [X[col].nunique() <= 50 for col in X]
    mi_scores = mutual_info_classif(X, y, discrete_features=discrete, random_state=SEED)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

class DropUninformative(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.0):
        self.threshold = threshold
    def fit(self, X, y):
        mi_scores = calc_mi_scores(X, y)
        self.variables = mi_scores[mi_scores > self.threshold].index.tolist()
        return self
    def transform(self, X, y=None):
        return X[self.variables]
    def get_feature_names_out(self, input_features=None):
        return self.variables
init_n = len(X.columns)
selected_features = DropUninformative(threshold=0.0) \
    .fit(X, y) \
    .get_feature_names_out()
print('The number of selected features:', len(selected_features))
print(f'Dropped {init_n - len(selected_features)} uninformative features.')
Starting time	 20:38:51
calc_mi_scores cost 0 hours 17 minutes 49 seconds
The number of selected features: 2050
Dropped 117 uninformative features.
selected_categorical_features = [col for col in categorical_features if col in selected_features]
eval_results, feature_importances = score_dataset(X[selected_features], y, selected_categorical_features)
Starting time	 20:56:46
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50]	cv_agg's valid auc: 0.778311 + 0.00276223
[100]	cv_agg's valid auc: 0.783085 + 0.00266899
[150]	cv_agg's valid auc: 0.783015 + 0.00280856
Early stopping, best iteration is:
[122]	cv_agg's valid auc: 0.783271 + 0.00267406
score_dataset cost 0 hours 6 minutes 7 seconds

Correlation coefficient

The Pearson correlation coefficient is the simplest measure; it helps understand the linear relationship between two continuous variables.
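For samples $x$ and $y$ it is computed as

$$r_{xy}=\frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}}$$

with values in $[-1, 1]$; a value near 0 only rules out a linear relationship, not a nonlinear one.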

import time

def progress(percent=0, width=50, desc="Processing"):
    import math
    tags = math.ceil(width * percent) * "#"
    print(f"\r{desc}: [{tags:-<{width}}]{percent:.1%}", end="", flush=True)
@timer
def drop_correlated_features(X, y, threshold=0.9):
    to_keep = []
    to_drop = []
    categorical = X.select_dtypes(exclude='number').columns.tolist()
    for i, col in enumerate(X.columns):
        if col in categorical:
            continue
        # The correlations
        corr = X[to_keep].corrwith(X[col]).abs()
        # Select columns with correlations above threshold
        if any(corr > threshold):
            to_drop.append(col)
        else:
            to_keep.append(col)
        progress((i + 1) / len(X.columns))
    print("\nThe number of correlated features:", len(to_drop))
    return to_keep

The function above tends to drop whichever correlated feature appears later. To keep as many original features as possible, we reorder the columns so that the original features come first:

original_df = pd.read_csv('../datasets/Home-Credit-Default-Risk/prepared_data.csv', nrows=5)
original_features = [f for f in X.columns if f in original_df.columns]
derived_features = [f for f in X.columns if f not in original_df.columns]
selected_features = [col for col in original_features + derived_features if col in selected_features]

# Drop features that are correlated
# init_n = len(selected_features)
selected_features = drop_correlated_features(X[selected_features], y, threshold=0.9)
print('The number of selected features:', len(selected_features))
print(f'Dropped {init_n - len(selected_features)} correlated features.')
Starting time	 21:03:05
Processing: [##################################################]100.0%
The number of correlated features: 1110
drop_correlated_features cost 0 hours 33 minutes 5 seconds
The number of selected features: 940
Dropped 1227 correlated features.

In day-to-day work, we usually rely on the feature_engine package instead:

# Drops features that are correlated
# from feature_engine.selection import DropCorrelatedFeatures

# init_n = len(selected_features)
# selected_features = DropCorrelatedFeatures(threshold=0.9) \
#     .fit(X[selected_features], y) \
#     .get_feature_names_out()

# print('The number of selected features:', len(selected_features))
# print(f'Dropped {init_n - len(selected_features)} features.')
selected_categorical_features = [col for col in categorical_features if col in selected_features]
eval_results, feature_importances = score_dataset(X[selected_features], y, selected_categorical_features)
Starting time	 21:36:12
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50]	cv_agg's valid auc: 0.776068 + 0.00333724
[100]	cv_agg's valid auc: 0.781097 + 0.00296053
[150]	cv_agg's valid auc: 0.781236 + 0.00298245
Early stopping, best iteration is:
[136]	cv_agg's valid auc: 0.781375 + 0.00302538
score_dataset cost 0 hours 2 minutes 23 seconds

ANOVA

Analysis of variance (ANOVA) is mainly used to assess the relevance of continuous features in classification problems.
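For a feature split across the $K$ target classes, the F statistic compares between-class and within-class variability:

$$F=\frac{\sum_k n_k(\bar{x}_k-\bar{x})^2/(K-1)}{\sum_k\sum_i (x_{ki}-\bar{x}_k)^2/(N-K)}$$

A large F value (small p-value) means the feature's mean differs noticeably across classes.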

from sklearn.feature_selection import f_classif

numeric_features = [col for col in X.columns if col not in categorical_features]
f_statistic, p_values = f_classif(X[numeric_features], y)
anova = pd.DataFrame({
    "f_statistic": f_statistic,
    "p_values": p_values
}, index=numeric_features)
print("The number of irrelevant features for classification:", anova['p_values'].ge(0.05).sum())
The number of irrelevant features for classification: 274

Chi-square test

The chi-square test is a statistical method for measuring the association between two categorical variables.
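The test statistic compares the observed contingency counts $O_{ij}$ with the counts $E_{ij}$ expected under independence:

$$\chi^2=\sum_{i}\sum_{j}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$

A large statistic (small p-value) indicates that the feature and the target are not independent.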

from sklearn.feature_selection import chi2

chi2_stats, p_values = chi2(X[categorical_features], y)
chi2_test = pd.DataFrame({
    "chi2_stats": chi2_stats,
    "p_values": p_values
}, index=categorical_features)
print("The number of irrelevant features for classification:", chi2_test['p_values'].ge(0.05).sum())
The number of irrelevant features for classification: 9

For a classification problem, the two scoring functions f_classif and chi2 complement each other and together cover a complete screening pass: f_classif handles the continuous features and chi2 handles the discrete ones.

feature_selection = make_column_transformer(
    (SelectFdr(score_func=f_classif, alpha=0.05), numeric_features),
    (SelectFdr(score_func=chi2, alpha=0.05), categorical_features),
    verbose=True,
    verbose_feature_names_out=False
)
selected_features_by_fdr = feature_selection.fit(X, y).get_feature_names_out()
print("The number of selected features:", len(selected_features_by_fdr))
print("Dropped {} features.".format(X.shape[1] - len(selected_features_by_fdr)))
[ColumnTransformer] ... (1 of 2) Processing selectfdr-1, total= 2.7min
[ColumnTransformer] ... (2 of 2) Processing selectfdr-2, total=   0.1s
The number of selected features: 1838
Dropped 329 features.
selected_categorical_features_by_fdr = [col for col in categorical_features if col in selected_features_by_fdr]
eval_results, feature_importances = score_dataset(X[selected_features_by_fdr], y, selected_categorical_features_by_fdr)
Starting time	 21:44:08
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50]	cv_agg's valid auc: 0.777829 + 0.00296151
[100]	cv_agg's valid auc: 0.782637 + 0.00263458
[150]	cv_agg's valid auc: 0.782612 + 0.0023263
Early stopping, best iteration is:
[129]	cv_agg's valid auc: 0.782834 + 0.00242003
score_dataset cost 0 hours 5 minutes 41 seconds

IV (Information Value)

IV (Information Value) evaluates how well a discretized feature predicts a binary target. Features with an IV below 0.02 are generally considered useless.
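With the feature binned, each bin $i$ gets a weight of evidence and the IV sums over the bins, matching the calculation in the helper below:

$$\text{WOE}_i=\ln\frac{\text{Positive}_i/\text{Positive}_{\text{total}}}{\text{Negative}_i/\text{Negative}_{\text{total}}},\qquad \text{IV}=\sum_i\Bigl(\frac{\text{Positive}_i}{\text{Positive}_{\text{total}}}-\frac{\text{Negative}_i}{\text{Negative}_{\text{total}}}\Bigr)\times\text{WOE}_i$$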

def calc_iv_scores(X, y, bins=10):
    X = pd.DataFrame(X)
    y = pd.Series(y)
    assert y.nunique() == 2, "y must be binary"
    iv_scores = pd.Series(dtype=float)
    # Find discrete features
    colnames = X.select_dtypes(exclude='number').columns
    X[colnames] = X[colnames].astype("category").apply(lambda x: x.cat.codes)
    discrete = [col for col in X if X[col].nunique() <= 50]
    # Compute information value
    for colname in X.columns:
        if colname in discrete:
            var = X[colname]
        else:
            var = pd.qcut(X[colname], bins, duplicates="drop")
        grouped = y.groupby(var).agg([('Positive', 'sum'), ('All', 'count')])
        grouped['Negative'] = grouped['All'] - grouped['Positive']
        grouped['Positive rate'] = grouped['Positive'] / grouped['Positive'].sum()
        grouped['Negative rate'] = grouped['Negative'] / grouped['Negative'].sum()
        grouped['woe'] = np.log(grouped['Positive rate'] / grouped['Negative rate'])
        grouped['iv'] = (grouped['Positive rate'] - grouped['Negative rate']) * grouped['woe']
        iv_scores[colname] = grouped['iv'].sum()
    return iv_scores.sort_values(ascending=False)

iv_scores = calc_iv_scores(X, y)
print(f"There are {iv_scores.le(0.02).sum()} features with iv <=0.02.")
There are 987 features with iv <=0.02.

Gini index

The Gini index measures how strongly a feature influences the target in a classification problem. It ranges from 0 to 1, and larger values indicate a stronger influence. A common threshold is 0.02: features scoring below it are considered unimportant.
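The underlying quantity is the Gini impurity: for a group with class proportions $p_k$,

$$\text{Gini} = 1 - \sum_k p_k^2$$

The helper below computes a simplified variant: one minus the sum of the squared positive rates over the bins of the (discretized) feature.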

def calc_gini_scores(X, y, bins=10):
    X = pd.DataFrame(X)
    y = pd.Series(y)
    gini_scores = pd.Series(dtype=float)
    # Find discrete features
    colnames = X.select_dtypes(exclude='number').columns
    X[colnames] = X[colnames].astype("category").apply(lambda x: x.cat.codes)
    discrete = [col for col in X if X[col].nunique() <= 50]
    # Compute gini score
    for colname in X.columns:
        if colname in discrete:
            var = X[colname]
        else:
            var = pd.qcut(X[colname], bins, duplicates="drop")
        p = y.groupby(var).mean()
        gini = 1 - p.pow(2).sum()
        gini_scores[colname] = gini
    return gini_scores.sort_values(ascending=False)

gini_scores = calc_gini_scores(X, y)
print(f"There are {gini_scores.le(0.02).sum()} features with gini <= 0.02.")
There are 0 features with gini <= 0.02.

VIF

VIF (variance inflation factor) measures the degree of collinearity among features. A VIF below 5 usually indicates no multicollinearity problem, while a VIF above 10 indicates an obvious one.
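For feature $i$, the VIF is computed from the $R^2$ of regressing that feature on all the other features:

$$\text{VIF}_i=\frac{1}{1-R_i^2}$$

so a feature that is almost a linear combination of the others ($R_i^2\to 1$) gets a very large VIF.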

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

@timer
def calc_vif_scores(X, y=None):
    numeric_X = X.select_dtypes("number")
    numeric_X = add_constant(numeric_X)
    # Collinear features
    vif = pd.Series(dtype=float)
    for i, col in enumerate(numeric_X.columns):
        vif[col] = variance_inflation_factor(numeric_X.values, i)
        progress((i + 1) / numeric_X.shape[1])
    return vif.sort_values(ascending=False)

# vif_scores = calc_vif_scores(X)
# print(f"There are {vif_scores.gt(10).sum()} collinear features (VIF above 10)")

Pipeline implementation

We take the following two steps:

  • First, drop features whose mutual information is 0.
  • Then, for every pair of features with correlation above 0.9, drop one of the two.
from feature_engine.selection import DropCorrelatedFeatures

feature_selection = make_pipeline(
    DropUninformative(threshold=0.0),
    DropCorrelatedFeatures(threshold=0.9),
    verbose=True
)
# init_n = len(X.columns)
# selected_features = feature_selection.fit(X, y).get_feature_names_out()
# print('The number of selected features:', len(selected_features))
# print(f'Dropped {init_n - len(selected_features)} features.')

Only 914 of the 2167 features are kept, which shows that many of the features we created were redundant.

Recursive feature elimination

The most common wrapper method is recursive feature elimination (RFE). RFE trains a machine learning model over multiple rounds; after each round the least important features are eliminated, and the next round is trained on the reduced feature set.

Since RFE consumes a lot of resources, we do not run it here; the code would look like this:

# from sklearn.svm import LinearSVC
# from sklearn.feature_selection import RFECV

# Use SVM as the model
# svc = LinearSVC(dual="auto", penalty="l1")

# Recursive feature elimination with cross-validation to select features.
# rfe = RFECV(svc, step=1, cv=5, verbose=1)
# rfe.fit(X, y)

# The mask of selected features.
# print(zip(X.columns, rfe.support_))
# print("The number of features:", rfe.n_features_in_)
# print("The number of selected features:", rfe.n_features_)

# feature_rank = pd.Series(rfe.ranking_, index=X.columns).sort_values(ascending=False)
# print("Features sorted by their rank:", feature_rank[:10], sep="\n")

Feature importance

Embedded methods also use a model to select features, but unlike RFE they do not repeatedly retrain while discarding features; the model is trained once on the full feature set.

  • The most common approach is to select features with a base model that carries a penalty term ($\ell_1$, $\ell_2$ regularization), such as Lasso or Ridge.
  • Alternatively, simply train a base model and keep the features with the highest weights (a sketch with SelectFromModel follows below).
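A minimal sketch of the first option, an L1-penalized model wrapped in SelectFromModel; the estimator, the C value, and the imputation/scaling steps are illustrative assumptions, not the settings used elsewhere in this article.

from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

l1_selector = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),
    SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear', C=0.1)),
)
# Fitting on the full dataset is expensive, so it is left commented out:
# l1_selector.fit(X[numeric_features], y)
# l1_mask = l1_selector[-1].get_support()
# print('Kept', l1_mask.sum(), 'of', len(numeric_features), 'numeric features')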

We first use the score_dataset function defined earlier to obtain an importance score for every feature:

selected_categorical_features = [col for col in categorical_features if col in selected_features]
eval_results, feature_importances = score_dataset(X[selected_features], y, selected_categorical_features)
Starting time	 21:55:44
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50]	cv_agg's valid auc: 0.776068 + 0.00333724
[100]	cv_agg's valid auc: 0.781097 + 0.00296053
[150]	cv_agg's valid auc: 0.781236 + 0.00298245
Early stopping, best iteration is:
[136]	cv_agg's valid auc: 0.781375 + 0.00302538
score_dataset cost 0 hours 2 minutes 25 seconds
# Sort features according to importance
feature_importances = feature_importances.sort_values('score', ascending=False)
feature_importances['score'].head(15)
AMT_ANNUITY / AMT_CREDIT                           92.6
MODE(previous.PRODUCT_COMBINATION)                 62.8
EXT_SOURCE_2 + EXT_SOURCE_3                        60.4
MODE(installments.previous.PRODUCT_COMBINATION)    52.0
MAX(bureau.DAYS_CREDIT)                            40.4
DAYS_BIRTH / EXT_SOURCE_1                          38.4
MAX(bureau.DAYS_CREDIT_ENDDATE)                    35.8
SUM(bureau.AMT_CREDIT_MAX_OVERDUE)                 34.2
MEAN(bureau.AMT_CREDIT_SUM_DEBT)                   34.0
AMT_GOODS_PRICE / AMT_ANNUITY                      30.6
MODE(cash.previous.PRODUCT_COMBINATION)            29.8
MAX(cash.previous.DAYS_LAST_DUE_1ST_VERSION)       29.8
SUM(bureau.AMT_CREDIT_SUM)                         29.0
MEAN(previous.MEAN(cash.CNT_INSTALMENT_FUTURE))    29.0
AMT_CREDIT - AMT_GOODS_PRICE                       28.2
Name: score, dtype: float64

Many of the features we constructed made the top 15, which should give us confidence that all our hard work was worth it!

Next, we drop the features with zero importance, since they were never used to split a node in any decision tree. Removing them is therefore a very safe choice, at least for this particular model.

# Find the features with zero importance
zero_features = feature_importances.query("score == 0.0").index.tolist()
print(f'\nThere are {len(zero_features)} features with 0.0 importance')
There are 105 features with 0.0 importance
selected_features = [col for col in selected_features if col not in zero_features]
print("The number of selected features:", len(selected_features))
print("Dropped {} features with zero importance.".format(len(zero_features)))
The number of selected features: 835
Dropped 105 features with zero importance.
selected_categorical_features = [col for col in categorical_features if col in selected_features]
eval_results, feature_importances = score_dataset(X[selected_features], y, selected_categorical_features)
Starting time	 21:58:13
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50]	cv_agg's valid auc: 0.77607 + 0.00333823
[100]	cv_agg's valid auc: 0.781042 + 0.00295406
[150]	cv_agg's valid auc: 0.781317 + 0.00303434
[200]	cv_agg's valid auc: 0.780819 + 0.00281177
Early stopping, best iteration is:
[154]	cv_agg's valid auc: 0.781405 + 0.0029417
score_dataset cost 0 hours 2 minutes 34 seconds

After dropping the zero-importance features we still have 835 features. If we feel that this is still a very large number, we can keep removing the least important ones.
The figure below plots cumulative importance against the number of features:

feature_importances = feature_importances.sort_values('score', ascending=False)

sns.lineplot(x=range(1, feature_importances.shape[0] + 1), y=feature_importances['score'].cumsum())
plt.show()


[Figure: cumulative feature importance versus the number of features]

Suppose we choose to keep only the features needed to reach 95% of the total importance:

def select_import_features(scores, thresh=0.95):
    feature_imp = pd.DataFrame(scores, columns=['score'])
    # Sort features according to importance
    feature_imp = feature_imp.sort_values('score', ascending=False)
    # Normalize the feature importances
    feature_imp['score_normalized'] = feature_imp['score'] / feature_imp['score'].sum()
    feature_imp['cumsum'] = feature_imp['score_normalized'].cumsum()
    selected_features = feature_imp.query(f'cumsum <= {thresh}')
    return selected_features.index.tolist()

init_n = len(selected_features)
import_features = select_import_features(feature_importances['score'], thresh=0.95)
print("The number of import features:", len(import_features))
print(f'Dropped {init_n - len(import_features)} features.')
The number of import features: 241
Dropped 594 features.

The remaining 241 features are enough to cover 95% of the total importance.

import_categorical_features = [col for col in categorical_features if col in import_features]
eval_results, feature_importances = score_dataset(X[import_features], y, import_categorical_features)
Starting time	 22:00:49
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50]	cv_agg's valid auc: 0.756425 + 0.0029265
[100]	cv_agg's valid auc: 0.759284 + 0.0029921
[150]	cv_agg's valid auc: 0.759162 + 0.00314089
Early stopping, best iteration is:
[115]	cv_agg's valid auc: 0.759352 + 0.00300464
score_dataset cost 0 hours 0 minutes 21 seconds

Before moving on, we should record the feature selection steps we took, for future reference:

  1. Drop uninformative features with zero mutual information: 117 features removed
  2. Drop collinear variables with a correlation coefficient above 0.9: 1110 features removed
  3. Drop features with 0.0 importance according to the GBM: 105 features removed
  4. (Optional) Keep only the features needed to reach 95% of the feature importance: 594 features removed

Let's look at the composition of the selected features:

original = set(original_features) & set(import_features)
derived = set(import_features) - set(original)

print(f"Selected features: {len(original)} original features, {len(derived)} derived features.")
Selected features: 33 original features, 208 derived features.

Of the 241 features kept, 33 are original features and 208 are derived features.

Principal Component Analysis

Besides models with an L1 penalty, common dimensionality reduction methods include principal component analysis (PCA) and linear discriminant analysis (LDA). The two are similar in spirit; this section focuses on PCA.

| Method | Function | Python package |
| --- | --- | --- |
| Principal component analysis | PCA | sklearn.decomposition |
| Linear discriminant analysis | LinearDiscriminantAnalysis | sklearn.discriminant_analysis |

Apply PCA

from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

# Standardize, then decompose
pca = Pipeline([
    ('standardize', RobustScaler()),
    ('pca', PCA(n_components=None, random_state=SEED)),
], verbose=True)

principal_components = pca.fit_transform(X)
weight_matrix = pca['pca'].components_
[Pipeline] ....... (step 1 of 2) Processing standardize, total= 1.1min
[Pipeline] ............... (step 2 of 2) Processing pca, total=11.8min

Here pca.components_ corresponds to the truncated $V^T$ of the SVD that sklearn's PCA computes. After fitting, pca.components_ is an array of shape (n_components, n_features), where n_components is the requested number of principal components and n_features is the number of original features. Each row of pca.components_ represents one principal component and each column corresponds to one original feature, so each element is the weight of that feature in that component.
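As a quick sanity check of this relationship (an illustrative sketch that assumes the fitted pipeline above is still in memory), the transformed data should equal the scaled data, centered by the PCA mean, projected onto the rows of pca.components_:

X_scaled = pca['standardize'].transform(X)
reconstructed = (X_scaled - pca['pca'].mean_) @ weight_matrix.T
print(np.allclose(reconstructed[:5], principal_components[:5], atol=1e-3))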

Visualize the explained variance

def plot_variance(pca, n_components=10):
    evr = pca.explained_variance_ratio_[:n_components]
    grid = range(1, n_components + 1)
    # Create figure
    plt.figure(figsize=(6, 4))
    # Percentage of variance explained by each component.
    plt.bar(grid, evr, label='Explained Variance')
    # Cumulative variance
    plt.plot(grid, np.cumsum(evr), "o-", label='Cumulative Variance', color='orange')
    plt.xlabel("The number of Components")
    plt.xticks(grid)
    plt.title("Explained Variance Ratio")
    plt.ylim(0.0, 1.1)
    plt.legend(loc='best')

plot_variance(pca['pca'])
plt.show()


[Figure: explained variance ratio and cumulative variance of the first ten principal components]

We visualize the data using the first two principal components:

print("explained variance ratio (first two components): %s"% str(pca['pca'].explained_variance_ratio_[:2])
) sns.kdeplot(x=principal_components[:, 0], y=principal_components[:, 1], hue=y)
plt.xlim(-1e8, 1e8)
plt.ylim(-1e8, 1e8)
explained variance ratio (first two components): [0.43424749 0.33590885](-100000000.0, 100000000.0)


[Figure: density of the first two principal components, colored by TARGET]

The two classes are not well separated, so we would need more principal components.

PCA can effectively reduce the number of dimensions, but it does so by mapping the original samples into a lower-dimensional space, which means the PCA features have no real business meaning. In addition, PCA assumes the data are normally distributed, which may not hold for real data. We therefore only demonstrated how to use PCA and did not actually apply it to the data.

Summary

This chapter introduced many feature selection methods:

  1. Univariate feature selection helps us understand the data and its structure, and can remove irrelevant features, but it cannot detect redundant features.
  2. Regularized linear models can be used both to understand features and to select them, but the features must first be transformed toward normal distributions.
  3. Embedded selection based on feature importance is a very popular and easy-to-use method, but it has two main problems:
    • Important features can receive low scores (the correlated-feature problem)
    • The method favors features with many categories (the bias problem)

With this, classic feature engineering is complete. We evaluate the selected features once more with the LightGBM model.

eval_results, feature_importances = score_dataset(X[selected_features], y, selected_categorical_features)
Starting time	 22:14:44
Starting training...
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
Training until validation scores don't improve for 50 rounds
[50]	cv_agg's valid auc: 0.77607 + 0.00333823
[100]	cv_agg's valid auc: 0.781042 + 0.00295406
[150]	cv_agg's valid auc: 0.781317 + 0.00303434
[200]	cv_agg's valid auc: 0.780819 + 0.00281177
Early stopping, best iteration is:
[154]	cv_agg's valid auc: 0.781405 + 0.0029417
score_dataset cost 0 hours 2 minutes 25 seconds

Feature importances:

# Sort features according to importance
feature_importances['score'].sort_values(ascending=False).head(15)
AMT_ANNUITY / AMT_CREDIT                           98.6
EXT_SOURCE_2 + EXT_SOURCE_3                        66.6
MODE(previous.PRODUCT_COMBINATION)                 66.0
MODE(installments.previous.PRODUCT_COMBINATION)    54.4
MAX(bureau.DAYS_CREDIT)                            41.8
MAX(bureau.DAYS_CREDIT_ENDDATE)                    40.0
DAYS_BIRTH / EXT_SOURCE_1                          39.8
MEAN(bureau.AMT_CREDIT_SUM_DEBT)                   37.2
SUM(bureau.AMT_CREDIT_MAX_OVERDUE)                 35.2
MODE(cash.previous.PRODUCT_COMBINATION)            33.4
AMT_GOODS_PRICE / AMT_ANNUITY                      33.0
SUM(bureau.AMT_CREDIT_SUM)                         30.8
MAX(cash.previous.DAYS_LAST_DUE_1ST_VERSION)       30.8
MEAN(previous.MEAN(cash.CNT_INSTALMENT_FUTURE))    29.8
AMT_CREDIT - AMT_GOODS_PRICE                       29.2
Name: score, dtype: float64
del X, y
gc.collect()

df = pd.read_csv('../datasets/Home-Credit-Default-Risk/created_data.csv', index_col='SK_ID_CURR')
selected_data = df[selected_features + ['TARGET']]
selected_data.to_csv('../datasets/Home-Credit-Default-Risk/selected_data.csv', index=True)
