【Kaggle】练习赛《肥胖风险的多类别预测》

前言

作为机器学习的初学者,Kaggle提供了一个很好的练习和学习平台,其中有一个栏目《PLAYGROUND》,可以理解为游乐场系列赛,提供有趣、平易近人的数据集,以练习他们的机器学习技能,并每个月都会有一场比赛。非常适合新手学习的机会,同时会有大量的高手分享其代码,本期是2024年2月份的题目《Multi-Class Prediction of Obesity Risk》即《肥胖风险的多类别预测》,在此我分享在这个比赛过程的点点滴滴。

题目说明

目标: 预测个体的肥胖风险

数据集介绍:
列名完整含义详细说明
‘id’id每个人的唯一号
‘Gender’性别个人性别
‘Age’年龄年龄在14岁到61岁之间
‘Height’高度高度以米为单位,介于1.45米到1.98米之间
‘Weight’体重体重介于39到165之间,单位为 KG.
‘family_history_with_overweight’家族史
是否有超重问题
‘FAVC’消费频率
对于高热量食物
对于高热量食物,这是否定的问题。我想他们问的问题是你吃高热量的食物吗
‘FCVC’蔬菜消费频率类似于FAVC. 的问题
‘NCP’主餐数类型为浮点数, NCP介于1和4之间
应该是1,2,3,4,但由于数据是合成的,因此为浮点数
‘CAEC’消费 两餐之间的食物共有4个值,有时经常从不总是
‘SMOKE’吸烟的问题, 问题应该是“你抽烟吗?”
‘CH2O’每日摄水量取值在1和3之间,同样是生成的数据,所以是浮点类型
‘SCC’热量消耗监控的问题
‘FAF’体育活动频率FAF在0到3之间,0表示没有体力活动, 3表示高强度锻炼
‘TUE’使用技术设备时间TUE在0到2之间。 问题是“你有多长时间一直在使用技术设备来跟踪您的健康状况。“
‘CALC’饮酒量有 3 值,有时从不经常
‘MTRANS’使用的交通工具MTRANS取5个值 公共交通汽车步行摩托车自行车
‘NObeyesdad’目标这是我们的目标,取7个值,在这个竞赛中,我们必须给予类名(不是概率,大多数竞赛中都是这种情况)
NObeyesdad (目标变量):
  • Insufficient_Weight (体重不足) : 小于18.5
  • Normal_Weight (正常体重) : 18.5 到 24.9
  • Obesity_Type_I (肥胖I级) : 30.0 到 34.9
  • Obesity_Type_II (肥胖II级) : 35.0 到 39.9
  • Obesity_Type_III (肥胖III级): 高于 40
  • Overweight_Level_I(超级肥胖I级), Overweight_Level_II (超级肥胖II级)takes values between 25 to 29

加载库

(略)

加载数据

# 加载所有数据
train = pd.read_csv(os.path.join(FILE_PATH, "train.csv"))
test = pd.read_csv(os.path.join(FILE_PATH, "test.csv"))

探索数据

Train Data
Total number of rows: 20758
Total number of columns: 18

Test Data
Total number of rows: 13840
Total number of columns:17

  • 训练数据统计汇总如下
+-------------+-------+---------+---------+----------+-------+---------------------+---------------------+
| Column Name | count |  dtype  | nunique | %nunique | %null |         min         |         max         |
+-------------+-------+---------+---------+----------+-------+---------------------+---------------------+
|      id     | 20758 |  int64  |  20758  |  100.0   |  0.0  |          0          |        20757        |
|    Gender   | 20758 |  object |    2    |   0.01   |  0.0  |        Female       |         Male        |
|     Age     | 20758 | float64 |   1703  |  8.204   |  0.0  |         14.0        |         61.0        |
|    Height   | 20758 | float64 |   1833  |   8.83   |  0.0  |         1.45        |       1.975663      |
|    Weight   | 20758 | float64 |   1979  |  9.534   |  0.0  |         39.0        |      165.057269     |
|     FHWO    | 20758 |  object |    2    |   0.01   |  0.0  |          no         |         yes         |
|     FAVC    | 20758 |  object |    2    |   0.01   |  0.0  |          no         |         yes         |
|     FCVC    | 20758 | float64 |   934   |  4.499   |  0.0  |         1.0         |         3.0         |
|     NCP     | 20758 | float64 |   689   |  3.319   |  0.0  |         1.0         |         4.0         |
|     CAEC    | 20758 |  object |    4    |  0.019   |  0.0  |        Always       |          no         |
|    SMOKE    | 20758 |  object |    2    |   0.01   |  0.0  |          no         |         yes         |
|     CH2O    | 20758 | float64 |   1506  |  7.255   |  0.0  |         1.0         |         3.0         |
|     SCC     | 20758 |  object |    2    |   0.01   |  0.0  |          no         |         yes         |
|     FAF     | 20758 | float64 |   1360  |  6.552   |  0.0  |         0.0         |         3.0         |
|     TUE     | 20758 | float64 |   1297  |  6.248   |  0.0  |         0.0         |         2.0         |
|     CALC    | 20758 |  object |    3    |  0.014   |  0.0  |      Frequently     |          no         |
|    MTRANS   | 20758 |  object |    5    |  0.024   |  0.0  |      Automobile     |       Walking       |
|  NObeyesdad | 20758 |  object |    7    |  0.034   |  0.0  | Insufficient_Weight | Overweight_Level_II |
+-------------+-------+---------+---------+----------+-------+---------------------+---------------------+
  • 目标值对性别的分类
gender_count%gender_counttarget_class_count%target_class_count
NObeyesdadGender
Insufficient_WeightFemale16210.6425230.12
Male9020.3625230.12
Normal_WeightFemale16600.5430820.15
Male14220.4630820.15
Obesity_Type_IFemale12670.4429100.14
Male16430.5629100.14
Obesity_Type_IIFemale80.0032480.16
Male32401.0032480.16
Obesity_Type_IIIFemale40411.0040460.19
Male50.0040460.19
Overweight_Level_IFemale10700.4424270.12
Male13570.5624270.12
Overweight_Level_IIFemale7550.3025220.12
Male17670.7025220.12
从上表中,我们可以看到
  • Obesity_Type_II中的所有人都是男性,Obesity_Type_III中的所有人为女性
  • Overweight_Level_II由70%的男性组成,Insufficient_Weight由60%以上的女性组成
  • 从这一点我们可以说,性别是肥胖预测的一个重要特征

数据可视化

在本节中,我们将看到:

  • 单个数值图
  • 个体分类图
  • 数值相关图
  • 组合数字图
目标分布与性别
fig, axs = plt.subplots(1,2,figsize = (12,5))
plt.suptitle("Target Distribution")sns.histplot(binwidth=0.5,x=TARGET,data=train,hue='Gender',palette="dark",ax=axs[0],discrete=True)
axs[0].tick_params(axis='x', rotation=60)axs[1].pie(train[TARGET].value_counts(),shadow = True,explode=[.1 for i in range(train[TARGET].nunique())],labels = train[TARGET].value_counts().index,autopct='%1.f%%',)plt.tight_layout()
plt.show()

target

单个数值图
fig,axs = plt.subplots(len(raw_num_cols),1,figsize=(12,len(raw_num_cols)*2.5),sharex=False)
for i, col in enumerate(raw_num_cols):sns.violinplot(x=TARGET, y=col,hue="Gender", data=train,ax = axs[i], split=False)if col in full_form.keys():axs[i].set_ylabel(full_form[col])plt.tight_layout()
plt.show()

plt2

个体分类图
_,axs = plt.subplots(int(len(raw_cat_cols)-1),2,figsize=(12,len(raw_cat_cols)*3),width_ratios=[1, 4])
for i,col in enumerate(raw_cat_cols[1:]):sns.countplot(y=col,data=train,palette="bright",ax=axs[i,0])sns.countplot(x=col,data=train,hue=TARGET,palette="bright",ax=axs[i,1])if col in full_form.keys():axs[i,0].set_ylabel(full_form[col])plt.tight_layout()
plt.show()

plt3

数值相关图
tmp = train[raw_num_cols].corr("pearson")
sns.heatmap(tmp,annot=True,cmap ="crest")

heatmap

组合数字图
  • 身高与体重
sns.jointplot(data=train, x="Height", y="Weight", hue=TARGET,height=6)

height & weight

  • 年龄与身高
    age & height
主成分分析(PCA)和KMeans
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans#PCA
pca = PCA(n_components=2)
pca_top_2 = pca.fit_transform(train[raw_num_cols])tmp = pd.DataFrame(data = pca_top_2, columns = ['pca_1','pca_2'])
tmp['TARGET'] = train[TARGET]fig,axs = plt.subplots(2,1,figsize = (12,6))
sns.scatterplot(data=tmp, y="pca_1", x="pca_2", hue='TARGET',ax=axs[0])
axs[0].set_title("Top 2 Principal Components")#KMeans
kmeans = KMeans(7,random_state=RANDOM_SEED)
kmeans.fit(tmp[['pca_1','pca_2']])
sns.scatterplot( y= tmp['pca_1'],x = tmp['pca_2'],c = kmeans.labels_,cmap='viridis', marker='o', edgecolor='k', s=50, alpha=0.8,ax = axs[1])
axs[1].set_title("Kmean Clustring on First 2 Principal Components")
plt.tight_layout()
plt.show()

pca kmeans

特征工程与处理

#在age_rounder、height_rounder函数中,我们将值相乘
#这有时会提高模型的CV分数
#在提取功能中,我们将功能组合以获得新功能def age_rounder(x):x_copy = x.copy()x_copy['Age'] = (x_copy['Age']*100).astype(np.uint16)return x_copydef height_rounder(x):x_copy = x.copy()x_copy['Height'] = (x_copy['Height']*100).astype(np.uint16)return x_copydef extract_features(x):x_copy = x.copy()x_copy['BMI'] = (x_copy['Weight']/x_copy['Height']**2)
#     x_copy['PseudoTarget'] = pd.cut(x_copy['BMI'],bins = [0,18.4,24.9,29,34.9,39.9,100],labels = [0,1,2,3,4,5],)    return x_copydef col_rounder(x):x_copy = x.copy()cols_to_round = ['FCVC',"NCP","CH2O","FAF","TUE"]for col in cols_to_round:x_copy[col] = round(x_copy[col])x_copy[col] = x_copy[col].astype('int')return x_copyAgeRounder = FunctionTransformer(age_rounder)
HeightRounder = FunctionTransformer(height_rounder)
ExtractFeatures = FunctionTransformer(extract_features)
ColumnRounder = FunctionTransformer(col_rounder)
#使用FeatureDropper,我们可以删除列。这是
#如果我们想传递不同的功能集,这一点很重要
#适用于不同模型
from sklearn.base import BaseEstimator, TransformerMixin
class FeatureDropper(BaseEstimator, TransformerMixin):def __init__(self, cols):self.cols = colsdef fit(self,x,y):return selfdef transform(self, x):return x.drop(self.cols, axis = 1)
接下来,我们将定义“cross_val_model”,它将用于训练和验证我们将在本文中使用的所有模型

cross_val_model函数提供了三个内容:val_scoresvalid_prdictionstest_predictions

  • val_scores:这为我们提供了验证数据的准确性分数。
  • valid_products:这是一个数组,用于在验证集上存储模型预测
  • test_predictions:这提供了按我们使用的分割数平均的测试预测
# 使用交叉验证模型
# 结合 分层 K 折.# 对目标分类进行编码
target_mapping = {'Insufficient_Weight':0,'Normal_Weight':1,'Overweight_Level_I':2,'Overweight_Level_II':3, 'Obesity_Type_I':4,'Obesity_Type_II':5 ,'Obesity_Type_III':6}# 定义分层K折交叉验证方法
skf = StratifiedKFold(n_splits=n_splits)def cross_val_model(estimators,cv = skf, verbose = True):'''estimators : pipeline consists preprocessing, encoder & modelcv : Method for cross validation (default: StratifiedKfold)verbose : print train/valid score (yes/no)'''X = train.copy()y = X.pop(TARGET)y = y.map(target_mapping)test_predictions = np.zeros((len(test),7))valid_predictions = np.zeros((len(X),7))val_scores, train_scores = [],[]for fold, (train_ind, valid_ind) in enumerate(skf.split(X,y)):model = clone(estimators)#define train setX_train = X.iloc[train_ind]y_train = y.iloc[train_ind]#define valid setX_valid = X.iloc[valid_ind]y_valid = y.iloc[valid_ind]model.fit(X_train, y_train)if verbose:print("-" * 100)print(f"Fold: {fold}")print(f"Train Accuracy Score:-{accuracy_score(y_true=y_train,y_pred=model.predict(X_train))}")print(f"Valid Accuracy Score:-{accuracy_score(y_true=y_valid,y_pred=model.predict(X_valid))}")print("-" * 100)test_predictions += model.predict_proba(test)/cv.get_n_splits()valid_predictions[valid_ind] = model.predict_proba(X_valid)val_scores.append(accuracy_score(y_true=y_valid,y_pred=model.predict(X_valid)))if verbose: print(f"Average Mean Accuracy Score:- {np.array(val_scores).mean()}")return val_scores, valid_predictions, test_predictions
# 合并原始和生成数据train.drop(['id'],axis = 1, inplace = True)
test_ids = test['id']
test.drop(['id'],axis = 1, inplace=True)train = pd.concat([train,train_org],axis = 0)
train = train.drop_duplicates()
train.reset_index(drop=True, inplace=True)# 产生空的 dataframe 用于存储 得分,训练 ,测试预测 
score_list, oof_list, predict_list = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

建模

在这场比赛中,与其只关注一个模型,不如将许多高性能模型的预测结合起来。在本文中,我们将训练四种不同类型的模型,并将它们的预测结合起来,以获得最终的提交。

  • 随机森林模型
  • LGBM 模型
  • XGB 模型
  • Catboost 模型
随机森林模型
# Define Random Forest Model PipelineRFC = make_pipeline(ExtractFeatures,MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC','SMOKE','SCC','CALC','MTRANS']),RandomForestClassifier(random_state=RANDOM_SEED))
# 执行随机森林模型
val_scores,val_predictions,test_predictions = cross_val_model(RFC)# 保存相应的结果
for k,v in target_mapping.items():oof_list[f"rfc_{k}"] = val_predictions[:,v]for k,v in target_mapping.items():predict_list[f"rfc_{k}"] = test_predictions[:,v]
# 0.8975337326149792
# 0.9049682643904575
----------------------------------------------------------------------------------------------------
Fold: 0
Train Accuracy Score:-0.9999027237354086
Valid Accuracy Score:-0.8954048140043763
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 1
Train Accuracy Score:-0.9999513618677043
Valid Accuracy Score:-0.9010940919037199
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 2
Train Accuracy Score:-0.9999513618677043
Valid Accuracy Score:-0.8940919037199124
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 3
Train Accuracy Score:-0.9999027237354086
Valid Accuracy Score:-0.8905908096280087
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 4
Train Accuracy Score:-0.9998540856031128
Valid Accuracy Score:-0.9102844638949672
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 5
Train Accuracy Score:-0.9999027284665143
Valid Accuracy Score:-0.8975481611208407
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 6
Train Accuracy Score:-0.9998054569330286
Valid Accuracy Score:-0.8966725043782837
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 7
Train Accuracy Score:-0.9998540926997714
Valid Accuracy Score:-0.9080560420315237
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 8
Train Accuracy Score:-0.9998540926997714
Valid Accuracy Score:-0.9063047285464098
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 9
Train Accuracy Score:-0.9999513642332571
Valid Accuracy Score:-0.9610332749562172
----------------------------------------------------------------------------------------------------
Average Mean Accuracy Score:- 0.9061080794184259
LGBM 模型

由于LGBM模型的参数非常多,很多均是超参数,本文采用 Optuna 进行微调

# 定义 Optuna 函数 微调模型def lgbm_objective(trial):params = {'learning_rate' : trial.suggest_float('learning_rate', .001, .1, log = True),'max_depth' : trial.suggest_int('max_depth', 2, 20),'subsample' : trial.suggest_float('subsample', .5, 1),'min_child_weight' : trial.suggest_float('min_child_weight', .1, 15, log = True),'reg_lambda' : trial.suggest_float('reg_lambda', .1, 20, log = True),'reg_alpha' : trial.suggest_float('reg_alpha', .1, 10, log = True),'n_estimators' : 1000,'random_state' : RANDOM_SEED,'device_type' : "gpu",'num_leaves': trial.suggest_int('num_leaves', 10, 1000),#'boosting_type' : 'dart',}optuna_model = make_pipeline(ExtractFeatures,MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC','SMOKE','SCC','CALC','MTRANS']),LGBMClassifier(**params,verbose=-1))val_scores, _, _ = cross_val_model(optuna_model,verbose = False)return np.array(val_scores).mean()lgbm_study = optuna.create_study(direction = 'maximize',study_name="LGBM")

如果打开微调开关,将会执行很长时间,请谨慎操作。

# 微调开关 
TUNE = Falsewarnings.filterwarnings("ignore")
if TUNE:lgbm_study.optimize(lgbm_objective, 50)

将原数据分类数值和分类型,方便以下不同操作

numerical_columns = train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_columns = train.select_dtypes(include=['object']).columns.tolist()
categorical_columns.remove('NObeyesdad')

以下参数是我微调的结果

best_params = {"objective": "multiclass",          # Objective function for the model"metric": "multi_logloss",          # Evaluation metric"verbosity": -1,                    # Verbosity level (-1 for silent)"boosting_type": "gbdt",            # Gradient boosting type"random_state": 42,       # Random state for reproducibility"num_class": 7,                     # Number of classes in the dataset'learning_rate': 0.030962211546832760,  # Learning rate for gradient boosting'n_estimators': 500,                # Number of boosting iterations'lambda_l1': 0.009667446568254372,  # L1 regularization term'lambda_l2': 0.04018641437301800,   # L2 regularization term'max_depth': 10,                    # Maximum depth of the trees'colsample_bytree': 0.40977129346872643,  # Fraction of features to consider for each tree'subsample': 0.9535797422450176,    # Fraction of samples to consider for each boosting iteration'min_child_samples': 26             # Minimum number of data needed in a leaf
}

类似随机森林的方法进行操作

lgbm = make_pipeline(    ColumnTransformer(transformers=[('num', StandardScaler(), numerical_columns),('cat', OneHotEncoder(handle_unknown="ignore"), categorical_columns)]),LGBMClassifier(**best_params,verbose=-1))
# Train LGBM Modelval_scores,val_predictions,test_predictions = cross_val_model(lgbm)for k,v in target_mapping.items():oof_list[f"lgbm_{k}"] = val_predictions[:,v]for k,v in target_mapping.items():predict_list[f"lgbm_{k}"] = test_predictions[:,v]
----------------------------------------------------------------------------------------------------
Fold: 0
Train Accuracy Score:-0.9771400778210116
Valid Accuracy Score:-0.9089715536105033
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 1
Train Accuracy Score:-0.9767509727626459
Valid Accuracy Score:-0.9076586433260394
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 2
Train Accuracy Score:-0.9776264591439688
Valid Accuracy Score:-0.9059080962800875
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 3
Train Accuracy Score:-0.9775291828793774
Valid Accuracy Score:-0.9089715536105033
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 4
Train Accuracy Score:-0.9770428015564202
Valid Accuracy Score:-0.9164113785557987
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 5
Train Accuracy Score:-0.9779679976654831
Valid Accuracy Score:-0.9076182136602452
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 6
Train Accuracy Score:-0.9779193618987403
Valid Accuracy Score:-0.9058669001751314
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 7
Train Accuracy Score:-0.9779193618987403
Valid Accuracy Score:-0.9194395796847635
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 8
Train Accuracy Score:-0.977676183065026
Valid Accuracy Score:-0.908493870402802
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 9
Train Accuracy Score:-0.9742230436262828
Valid Accuracy Score:-0.9527145359019265
----------------------------------------------------------------------------------------------------
Average Mean Accuracy Score:- 0.91420543252078
XGB 模型

按LGBM方式一样对XGB模型进行操作

# Optuna 处理 xgb
def xgb_objective(trial):params = {'grow_policy': trial.suggest_categorical('grow_policy', ["depthwise", "lossguide"]),'n_estimators': trial.suggest_int('n_estimators', 100, 2000),'learning_rate': trial.suggest_float('learning_rate', 0.01, 1.0),'gamma' : trial.suggest_float('gamma', 1e-9, 1.0),'subsample': trial.suggest_float('subsample', 0.25, 1.0),'colsample_bytree': trial.suggest_float('colsample_bytree', 0.25, 1.0),'max_depth': trial.suggest_int('max_depth', 0, 24),'min_child_weight': trial.suggest_int('min_child_weight', 1, 30),'reg_lambda': trial.suggest_float('reg_lambda', 1e-9, 10.0, log=True),'reg_alpha': trial.suggest_float('reg_alpha', 1e-9, 10.0, log=True),}params['booster'] = 'gbtree'params['objective'] = 'multi:softmax'params["device"] = "cuda"params["verbosity"] = 0params['tree_method'] = "gpu_hist"optuna_model = make_pipeline(
#                     ExtractFeatures,MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC','SMOKE','SCC','CALC','MTRANS']),XGBClassifier(**params,seed=RANDOM_SEED))val_scores, _, _ = cross_val_model(optuna_model,verbose = False)return np.array(val_scores).mean()xgb_study = optuna.create_study(direction = 'maximize')
# Optuna 微调开关
TUNE = False
if TUNE:xgb_study.optimize(xgb_objective, 50)
# XGB Pipelineparams = {'n_estimators': 1312,'learning_rate': 0.018279520260162645,'gamma': 0.0024196354156454324,'reg_alpha': 0.9025931173755949,'reg_lambda': 0.06835667255875388,'max_depth': 5,'min_child_weight': 5,'subsample': 0.883274050086088,'colsample_bytree': 0.6579828557036317
}
# {'eta': 0.018387615982905264, 'max_depth': 29, 'subsample': 0.8149303101087905, 'colsample_bytree': 0.26750463604831476, 'min_child_weight': 0.5292380065098192, 'reg_lambda': 0.18952063379457604, 'reg_alpha': 0.7201451827004944}params = {'grow_policy': 'depthwise', 'n_estimators': 690, 'learning_rate': 0.31829021594473056, 'gamma': 0.6061120644431842, 'subsample': 0.9032243794829076, 'colsample_bytree': 0.44474031945048287,'max_depth': 10, 'min_child_weight': 22, 'reg_lambda': 4.42638097284094,'reg_alpha': 5.927900973354344e-07,'seed':RANDOM_SEED}best_params = {'grow_policy': 'depthwise', 'n_estimators': 982, 'learning_rate': 0.050053726931263504, 'gamma': 0.5354391952653927, 'subsample': 0.7060590452456204, 'colsample_bytree': 0.37939433412123275, 'max_depth': 23, 'min_child_weight': 21, 'reg_lambda': 9.150224029846654e-08,'reg_alpha': 5.671063656994295e-08}
best_params['booster'] = 'gbtree'
best_params['objective'] = 'multi:softmax'
best_params["device"] = "cuda"
best_params["verbosity"] = 0
best_params['tree_method'] = "gpu_hist"XGB = make_pipeline(
#                     ExtractFeatures,
#                     MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
#                                            'SMOKE','SCC','CALC','MTRANS']),
#                     FeatureDropper(['FAVC','FCVC']),
#                     ColumnRounder,
#                     ColumnTransformer(
#                     transformers=[('num', StandardScaler(), numerical_columns),
#                                   ('cat', OneHotEncoder(handle_unknown="ignore"), categorical_columns)]),MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC','SMOKE','SCC','CALC','MTRANS']),XGBClassifier(**best_params,seed=RANDOM_SEED))
# 以上不同参数有不同结果
val_scores,val_predictions,test_predictions = cross_val_model(XGB)for k,v in target_mapping .items():oof_list[f"xgb_{k}"] = val_predictions[:,v]for k,v in target_mapping.items():predict_list[f"xgb_{k}"] = test_predictions[:,v]# 0.90634942296329
#0.9117093455898445 with rounder
#0.9163506382522121
----------------------------------------------------------------------------------------------------
Fold: 0
Train Accuracy Score:-0.9452821011673151
Valid Accuracy Score:-0.9111597374179431
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 1
Train Accuracy Score:-0.945136186770428
Valid Accuracy Score:-0.9063457330415755
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 2
Train Accuracy Score:-0.9449902723735408
Valid Accuracy Score:-0.9080962800875274
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 3
Train Accuracy Score:-0.9454280155642023
Valid Accuracy Score:-0.9059080962800875
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 4
Train Accuracy Score:-0.9432392996108949
Valid Accuracy Score:-0.9199124726477024
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 5
Train Accuracy Score:-0.9460629346821653
Valid Accuracy Score:-0.9128721541155866
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 6
Train Accuracy Score:-0.946160206215651
Valid Accuracy Score:-0.9106830122591943
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 7
Train Accuracy Score:-0.9456252127814795
Valid Accuracy Score:-0.9168126094570929
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 8
Train Accuracy Score:-0.9446524974466223
Valid Accuracy Score:-0.9106830122591943
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 9
Train Accuracy Score:-0.9407130003404504
Valid Accuracy Score:-0.9610332749562172
----------------------------------------------------------------------------------------------------
Average Mean Accuracy Score:- 0.9163506382522121
Catboost 模型

用 Optuna 设参

# Optuna Function For Catboost Model
def cat_objective(trial):params = {'iterations': 1000,  # High number of estimators'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),'depth': trial.suggest_int('depth', 3, 10),'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 0.01, 10.0),'bagging_temperature': trial.suggest_float('bagging_temperature', 0.0, 1.0),'random_seed': RANDOM_SEED,'verbose': False,'task_type':"GPU"}cat_features = ['Gender','family_history_with_overweight','FAVC','FCVC','NCP','CAEC','SMOKE','CH2O','SCC','FAF','TUE','CALC','MTRANS']optuna_model = make_pipeline(ExtractFeatures,
#                         AgeRounder,
#                         HeightRounder,
#                         MEstimateEncoder(cols = raw_cat_cols),CatBoostClassifier(**params,cat_features=cat_features))val_scores,_,_ = cross_val_model(optuna_model,verbose = False)return np.array(val_scores).mean()cat_study = optuna.create_study(direction = 'maximize')

参数结果如下:

params = {'learning_rate': 0.13762007048684638, 'depth': 5, 'l2_leaf_reg': 5.285199432056192, 'bagging_temperature': 0.6029582154263095,'random_seed': RANDOM_SEED,'verbose': False,'task_type':"GPU",'iterations':1000}CB = make_pipeline(
#                         ExtractFeatures,
#                         AgeRounder,
#                         HeightRounder,
#                         MEstimateEncoder(cols = raw_cat_cols),
#                         CatBoostEncoder(cols = cat_features),CatBoostClassifier(**params, cat_features=categorical_columns))

用上述参数训练模型

# Train Catboost Model
val_scores,val_predictions,test_predictions = cross_val_model(CB)
for k,v in target_mapping.items():oof_list[f"cat_{k}"] = val_predictions[:,v]for k,v in target_mapping.items():predict_list[f"cat_{k}"] = test_predictions[:,v]# best 0.91179835368868 with extract features, n_splits = 10
# best 0.9121046227778054 without extract features, n_splits = 10
----------------------------------------------------------------------------------------------------
Fold: 0
Train Accuracy Score:-0.9478599221789883
Valid Accuracy Score:-0.9050328227571116
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 1
Train Accuracy Score:-0.9498540856031128
Valid Accuracy Score:-0.9054704595185996
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 2
Train Accuracy Score:-0.9500972762645914
Valid Accuracy Score:-0.9024070021881838
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 3
Train Accuracy Score:-0.949124513618677
Valid Accuracy Score:-0.9050328227571116
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 4
Train Accuracy Score:-0.9482976653696498
Valid Accuracy Score:-0.912472647702407
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 5
Train Accuracy Score:-0.9502456106220515
Valid Accuracy Score:-0.9089316987740805
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 6
Train Accuracy Score:-0.950780604056223
Valid Accuracy Score:-0.9045534150612959
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 7
Train Accuracy Score:-0.95073196828948
Valid Accuracy Score:-0.9098073555166375
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 8
Train Accuracy Score:-0.9513155974903944
Valid Accuracy Score:-0.9111208406304728
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 9
Train Accuracy Score:-0.9446524974466223
Valid Accuracy Score:-0.957968476357268
----------------------------------------------------------------------------------------------------
Average Mean Accuracy Score:- 0.9122797541263168

模型融合和评估

由以上四个模型,采用不同的权重,进行 融合

# skf = StratifiedKFold(n_splits=5)
weights = {"rfc_":0,"lgbm_":3,"xgb_":1,"cat_":0}
tmp = oof_list.copy()
for k,v in target_mapping.items():tmp[f"{k}"] = (weights['rfc_']*tmp[f"rfc_{k}"] +weights['lgbm_']*tmp[f"lgbm_{k}"]+weights['xgb_']*tmp[f"xgb_{k}"]+weights['cat_']*tmp[f"cat_{k}"])    
tmp['pred'] = tmp[target_mapping.keys()].idxmax(axis = 1)
tmp['label'] = train[TARGET]
print(f"Ensemble Accuracy Scoe: {accuracy_score(train[TARGET],tmp['pred'])}")cm = confusion_matrix(y_true = tmp['label'].map(target_mapping),y_pred = tmp['pred'].map(target_mapping),normalize='true')cm = cm.round(2)
plt.figure(figsize=(8,8))
disp = ConfusionMatrixDisplay(confusion_matrix = cm,display_labels = target_mapping.keys())
disp.plot(xticks_rotation=50)
plt.tight_layout()
plt.show()"""   BEST     """# Best LB [0,1,0,0]
# Average Train Score:0.9142044335854003
# Average Valid Score:0.91420543252078# Best CV [1,3, 1,1]
# Average Train Score:0.9168308163711971
# Average Valid Score:0.9168308163711971
# adding orignal data improves score

在这里插入图片描述

最终提交

for k,v in target_mapping.items():predict_list[f"{k}"] = (weights['rfc_']*predict_list[f"rfc_{k}"]+weights['lgbm_']*predict_list[f"lgbm_{k}"]+weights['xgb_']*predict_list[f"xgb_{k}"]+weights['cat_']*predict_list[f"cat_{k}"])final_pred = predict_list[target_mapping.keys()].idxmax(axis = 1)sample_sub[TARGET] = final_pred
sample_sub.to_csv("submission.csv",index=False)
sample_sub
idNObeyesdad
020758Obesity_Type_II
120759Overweight_Level_I
220760Obesity_Type_III
320761Obesity_Type_I
420762Obesity_Type_III
1383534593Overweight_Level_II
1383634594Normal_Weight
1383734595Insufficient_Weight
1383834596Normal_Weight
1383934597Obesity_Type_II

13840 rows × 2 columns

结论

  1. 全文,从数据探索(EDA),可视化(VIS),特征工程(FE),交叉验证(CV),建模(MOD),模型评估(EV),到最终的提交(SUB),完整的记录整个过程,给机器学习的初学者提供了一个标准的模板;
  2. 本文的题目是解决多分类问题,评估上只用了一个 混淆矩阵(confusion matrix),在实际应用中还有多个工具可以使用,准确率(Accuracy)、精确度(Precision)、召回率(Recall)AUC得分( AUC_score) F1得分(F1_score);
  3. 文中使用模型融合采用了加权法,除此之外,还有stackingBlendingvoting(分为硬投票和软投票),这些内容,可以在我的早期文章找到相关的内容(原理);
  4. 当时提交的得分为0.92+(Public),最终提交的结果分数如下图,排名为34名 达到1%之前。
    rank
    在这里插入图片描述

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.hqwc.cn/news/520194.html

如若内容造成侵权/违法违规/事实不符,请联系编程知识网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

cefsharp(winForm)调用js脚本,js脚本调用c#方法

本博文针对js-csharp交互(相互调用的应用) (一)、js调用c#方法 1.1 类名称:cs_js_obj public class cs_js_obj{//注意,js调用C#,不一定在主线程上调用的,需要用SynchronizationContext来切换到主线程//private System.Threading.SynchronizationContext context;//…

Node 旧淘宝源 HTTPS 过期处理

今天拉取老项目更新依赖,出现 urlshttps%3A%2F%2Fregistry.npm.taobao.org%2Fegg-logger%2Fdownload%2Fegg-logger-2.6.1.tgz: certificate has expired 类似报错。即使删除 node_modules 重新安装,问题依然无法解决。 一、问题演示 二、原因分析 1、淘…

Git误操作补救错失:恢复误删的本地分支、将某个提交从一个分支复制到另一个分支

一、恢复误删的本地分支 作为一枚强迫症,没用的分支总是喜欢及时删删删删掉删掉统统删掉,结果今天发现有些分支还是应该保留。 比如,①前段时间切了个分支用来专门做图表,但因为需求还没有最终确定,已经上线了测试服而…

【黑马程序员】C++项目之机房预约管理系统实战

文章目录 需求系统简介身份介绍机房介绍申请简介系统具体需求 实现菜单与退出功能实现功能测试 创建身份类创建角色基类创建学生类创建教师类创建管理员类 登录模块功能描述登录函数封装各个校色具体登录验证管理员操作界面调用流程 管理员模块构造函数实现管理员子菜单显示添加…

YOLO算法改进Backbone系列之:EdgeViT

摘要:在计算机视觉领域,基于Self-attention的模型(如ViTs)已经成为CNN之外的一种极具竞争力的架构。尽管越来越强的变种具有越来越高的识别精度,但由于Self-attention的二次复杂度,现有的ViT在计算和模型大小方面都有较高的要求。…

人力资源档案和人力资源软件的区别

人力资源档案和人力资源软件是两个不同的概念,可以从以下几个方面进行区分: 1. 定义:人力资源档案是指记录和管理员工个人信息、履历、合同、培训记录等各种人力资源相关文件的集合,主要以纸质或电子形式保存。而人力资源软件是指…

CIA402协议笔记

文章目录 1、对象字典1.1 Mode of Operation( 606 0 h 6060_h 6060h​)1.2 Modes of opration display( 606 1 h ) 6061_h) 6061h​) 2、状态机2.1 控制字(ControlWord、6040h)2.2 状态字(StatusWord、6041h)2.3 shutd…

PyCharm连接远程服务器(保姆级教程)

第一步:配置解释器 File→Settings→Project:xxx→Python Interpreter 增加新的解释器 选择SSH Interpreter 输入远程服务器ip地址、你在服务器上的用户名 输入密码 选择在服务器上配置好的环境 更改映射目录 第二步:部署 Tools→…

网络工程师笔记8

华为VRP系统 设备管理方式 web管理方式 命令行管理方式 修改命令:undo 基础配置命令

进制之间的转换

文章目录 编译过程进制转换1、进制的概念1.1 二进制1.2 八进制1.3 十六进制 进制在程序中的表现方式十进制转二进制将十进制转换成二进制(除2反序取余法)二进制转十进制(权值法) 八进制转十进制将十进制转换成八进制(除8反序取余法…

2024年腾讯云学生服务器活动详细说明、学生机购买流程

2024年腾讯云学生服务器优惠活动「云校园」,学生服务器优惠价格:轻量应用服务器2核2G学生价30元3个月、58元6个月、112元一年,轻量应用服务器4核8G配置191.1元3个月、352.8元6个月、646.8元一年,CVM云服务器2核4G配置842.4元一年&…

大模型相关算法模型学习

对Transformer中Add&Norm层的理解-CSDN博客 GPT模型总结【模型结构及计算过程_详细说明】_gpt结构-CSDN博客 推荐30个以上比较好的中文nlp意图识别模型源码?-CSDN博客 Few-shot learning(少样本学习)入门 - 知乎 (zhihu.com) 零次学习…