目录
- 1. 分类与回归
- 2. 泛化、过拟合与欠拟合
- 模型复杂度与数据集大小的关系
- 3. 监督学习算法
- 3.1 一些样本数据集
- 3.2 k近邻
- 3.2.1 k近邻分类(sklearn.neighbors.KNeighborsClassifier)
- 3.2.2 分析 KNeighborsClassifier
- 3.2.3 k近邻回归(sklearn.neighbors.KNeighborsRegressor)
- 3.2.4 分析 KNeighborsRegressor
- 3.2.5 优点、缺点和参数
- 3.3 线性模型
- 3.3.1 用于回归的线性模型
- 3.3.2 线性回归(普通最小二乘法)(sklearn.linear_model.LinearRegression)
- 3.3.3 岭回归(sklearn.linear_model.Ridge)
- 3.3.4 lasso(sklearn.linear_model.Lasso)
- 3.3.5 用于分类的线性模型
- 3.3.6 用于多分类的线性模型
- 3.3.7 优点、缺点和参数
- 3.4 朴素贝叶斯分类器
- 优点、缺点和参数
- 3.5 决策树
- 3.5.1 构造决策树
- 3.5.2 控制决策树的复杂度
- 3.5.3 分析决策树
- 3.5.4 树的特征重要性
- 3.5.5 优点、缺点和参数
- 3.6 决策树集成
- 3.6.1 随机森林
- 构造随机森林
- 分析随机森林
- 优点、缺点和参数
- 3.6.2 梯度提升回归树(梯度提升机)
- 优点、缺点和参数
- 3.7 核支持向量机(SVM)
- 3.7.1 线性模型与非线性特征
- 3.7.2 核技巧
- 3.7.3 理解SVM
- 3.7.4 SVM调参
- 3.7.5 为SVM预处理数据
- 3.7.6 优点、缺点和参数
- 3.8 神经网络(深度学习,MLP)
- 3.8.1 神经网络模型
- 3.8.2 神经网络调参
- 3.8.3 优点、缺点和参数
- 估计神经网络的复杂度
- 4. 分类器的不确定度估计
- 4.1 决策函数(decision_function)
- 4.2 预测概率(predict_proba)
- 4.3 多分类问题的不确定度
- 总结
1. 分类与回归
-
监督机器学习问题的种类
- 分类
- 目标:预测类别标签
- 这些标签来自预定义的可选列表
- 例
- 鸢尾花分到三个可能的品种之一
- 二分类
- 正类:研究对象
- 反类
- 多分类
- 目标:预测类别标签
- 回归
- 目标:预测一个连续值(浮点数,实数)
-
区分分类和回归的方法:输出是否具有某种连续性
- 具有连续性:回归问题
- 不具有连续性:分类问题
2. 泛化、过拟合与欠拟合
-
如果一个模型能够对没见过的数据做出准确的预测,那么它就能从训练集泛化到测试集
-
简单的模型对新数据的泛化能力更好
-
如果在拟合模型时过分关注训练集的细节,得到了一个在训练集上表现很好、但不能泛化到新数据上的模型,那么就存在过拟合
-
如果模型过于简单,在训练集上的表现很差,那么就存在欠拟合
-
过拟合与欠拟合之间的权衡
模型复杂度与数据集大小的关系
- 模型复杂度与数据集中输入的变化密切相关:数据集中包含的数据点的变化范围越大,在不发生过拟合的前提下可以使用的模型就越复杂
- 收集更多的数据点可以有更大的变化范围,所以更大的数据集可以用来构建更复杂的模型
3. 监督学习算法
3.1 一些样本数据集
- forge 数据集(低维)
import matplotlib.pyplot as plt
import mglearn

X, y = mglearn.datasets.make_forge()
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
plt.legend(["Class 0", "Class 1"], loc=4)
plt.xlabel("First feature")
plt.ylabel("Second feature")
plt.show()
- cancer数据集(高维)
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

print(f"cancer.keys():{cancer.keys()}")
# cancer.keys():dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

print(f"shape: {cancer.data.shape}")
# shape: (569, 30)
# 569个数据点
# 30个特征

print("counts : {}".format({n: v for n, v in zip(cancer.target_names, np.bincount(cancer.target))}))
# counts : {'malignant': 212, 'benign': 357}
# 212个恶性
# 357个良性
3.2 k近邻
3.2.1 k近邻分类(sklearn.neighbors.KNeighborsClassifier)
-
对于每个测试点,查看多少个邻居属于类别0,多少个邻居属于类别1,然后将出现次数更多的类别作为预测结果
-
最近邻模型
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_knn_classification(n_neighbors=1)
plt.tight_layout()
plt.show()
- 所添加的3个数据点由五角星表示
- 连线表示与其距离最近的点
-
3近邻模型
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_knn_classification(n_neighbors=3)
plt.tight_layout()
plt.show()
-
对于多分类问题,查看每个类别分别有多少个邻居,然后将最常见的类别作为预测结果
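为了说明"查看最近的k个邻居并做多数表决"这一思路,下面给出一个不依赖scikit-learn的简化示意(只用numpy和欧氏距离,数据与函数名均为演示假设,非书中代码):
import numpy as np

def knn_predict_one(X_train, y_train, x_new, k=3):
    # 计算新样本到每个训练样本的欧氏距离
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 取距离最近的k个邻居的标签
    nearest_labels = y_train[np.argsort(dists)[:k]]
    # 多数表决:出现次数最多的类别作为预测结果
    values, counts = np.unique(nearest_labels, return_counts=True)
    return values[np.argmax(counts)]

# 用法示例(玩具数据)
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [0.9, 1.2]])
y_train = np.array([0, 1, 0, 1])
print(knn_predict_one(X_train, y_train, np.array([0.1, 0.0]), k=3))  # 预期输出 0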
-
通过scikit-learn应用k近邻算法
-
将数据分为训练集和测试集
import mglearn
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.make_forge()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
-
导入类并将其实例化
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3)  # n_neighbors: 邻居个数
-
利用训练集对这个分类器进行拟合
- 对KNeighborsClassifier来说:保存数据集,以便在预测时计算与邻居之间的距离
clf.fit(X_train, y_train)
-
调用predict方法对测试数据进行预测
print(f"predictions: {clf.predict(X_test)}")
-
评估模型的泛化能力
print(f"set accuracy: {clf.score(X_test, y_test):.2f}")
-
3.2.2 分析 KNeighborsClassifier
-
绘制决策边界
import matplotlib.pyplot as plt
import mglearn
from sklearn.neighbors import KNeighborsClassifier

X, y = mglearn.datasets.make_forge()
fig, axes = plt.subplots(1, 3, figsize=(10, 3))

for n_neighbors, ax in zip([1, 3, 9], axes):
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
    mglearn.plots.plot_2d_separator(clf, X, fill=True, eps=0.5, ax=ax, alpha=0.4)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
    ax.set_title(f"{n_neighbors} neighbor(s)")
    ax.set_xlabel("feature 0")
    ax.set_ylabel("feature 1")

axes[0].legend(loc=3)
plt.tight_layout()
plt.show()
- 随着邻居个数越来越多,决策边界越来越平滑
- 更平滑的边界对应更简单的模型
- 使用更少的邻居对应更高的模型复杂度
- 使用更多的邻居对应更低的模型复杂度
-
模型复杂度和泛化能力之间的关系
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=66)

training_accuracy = []
test_accuracy = []
neighbors_settings = range(1, 11)  # [1,10]

for n_neighbors in neighbors_settings:
    # 构建模型
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    # 记录训练集精度
    training_accuracy.append(clf.score(X_train, y_train))
    # 记录测试集精度
    test_accuracy.append(clf.score(X_test, y_test))

plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.xlabel("n_neighbors")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
3.2.3 k近邻回归(sklearn.neighbors.KNeighborsRegressor)
-
单一近邻回归
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_knn_regression(n_neighbors=1)
plt.tight_layout()
plt.show()
-
3个近邻回归
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_knn_regression(n_neighbors=3)
plt.tight_layout()
plt.show()
-
通过scikit-learn应用回归的k近邻算法
-
将数据分为训练集和测试集
import mglearn
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.make_wave(n_samples=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
-
导入类并将其实例化
from sklearn.neighbors import KNeighborsRegressor

reg = KNeighborsRegressor(n_neighbors=3)  # n_neighbors: 邻居个数
-
利用训练集对这个回归模型进行拟合
reg.fit(X_train, y_train)
-
调用predict方法对测试数据进行预测
print(f"predictions: {reg.predict(X_test)}")
-
评估模型的泛化能力
- 返回的是 $R^2$ 分数(决定系数)
- 回归模型预测的优度度量
- 位于0~1之间
- 1:完美预测
- 0:常数模型
print(f"set accuracy: {reg.score(X_test, y_test):.2f}")
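score 返回的 $R^2$ 可以由定义 $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$ 直接算出,下面是一个简化的对照示例(演示数据,非书中代码):
import numpy as np

def r2_score_manual(y_true, y_pred):
    # 残差平方和
    ss_res = np.sum((y_true - y_pred) ** 2)
    # 总平方和(相对于"永远预测均值"的常数模型)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
print(r2_score_manual(y_true, y_true))                     # 完美预测: 1.0
print(r2_score_manual(y_true, np.full(4, y_true.mean())))  # 常数模型: 0.0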
-
3.2.4 分析 KNeighborsRegressor
import numpy as np
import matplotlib.pyplot as plt
import mglearn
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.make_wave(n_samples=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# 在[-3,3]之间均匀分布的1000个数
line = np.linspace(-3, 3, 1000).reshape(-1, 1)

for n_neighbors, ax in zip([1, 3, 9], axes):
    reg = KNeighborsRegressor(n_neighbors=n_neighbors)
    reg.fit(X_train, y_train)
    ax.plot(line, reg.predict(line))
    ax.plot(X_train, y_train, "^", c=mglearn.cm2(0), markersize=8)
    ax.plot(X_test, y_test, "v", c=mglearn.cm2(1), markersize=8)
    ax.set_title(f"{n_neighbors} neighbor(s)\n train score {reg.score(X_train, y_train):.2f} test score {reg.score(X_test, y_test):.2f}")
    ax.set_xlabel("feature")
    ax.set_ylabel("target")

axes[0].legend(["predictions", "training", "test"], loc="best")
plt.tight_layout()
plt.show()
- 仅使用单一邻居
- 训练集中的每个点都对预测结果有显著影响
- 预测结果的图像经过所有数据点
- 导致预测结果非常不稳定
- 考虑更多的邻居
- 预测结果变得更加平滑
- 对训练数据的拟合也不好
3.2.5 优点、缺点和参数
- KNeighbors分类器有2个重要参数
- 邻居个数
- 使用较小的邻居个数(比如3个或5个)往往可以得到比较好的结果
- 数据点之间距离的度量方法
- 默认使用欧氏距离
- 优点
- 模型很容易理解,通常不需要过多调节就可以得到不错的性能
- 缺点
- 如果训练集很大,预测速度可能会比较慢
- 对于有很多特征的数据集效果不好
- 对于大多数特征的大多数取值都为0的数据集(稀疏数据集)来说效果尤其不好
3.3 线性模型
3.3.1 用于回归的线性模型
-
线性模型预测的一般公式
$\hat{y} = w[0]*x[0] + w[1]*x[1] + \cdots + w[p]*x[p] + b$
- $x[p]$:单个数据点的特征
- $w$:参数
- $b$:参数
- $\hat{y}$:预测结果
-
单一特征的数据集公式
$\hat{y} = w[0]*x[0] + b$
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_linear_regression_wave()
plt.tight_layout()
plt.show()
- 用于回归的线性模型的另一种表示形式
- 单一特征的预测结果:一条直线
- 两个特征的预测结果:一个平面
- 更多特征的预测结果:一个超平面
- 对于有多个特征的数据集而言,线性模型可以非常强大
- 如果特征数量大于训练数据点的数量,任何目标y都可以(在训练集上)用线性函数完美拟合
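下面用一个极小的numpy示例说明这一点(随机数据仅为演示假设,非书中代码):当特征数大于样本数时,最小二乘可以在训练集上做到零误差,这也正是高维数据上线性模型容易过拟合的原因。
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5, 10)   # 5个样本,10个特征(特征数 > 样本数)
y = rng.randn(5)       # 任意目标值

# 最小二乘求解 w(lstsq 给出最小范数解)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ w, y))  # True:训练集上被完美拟合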
3.3.2 线性回归(普通最小二乘法)(sklearn.linear_model.LinearRegression)
- 寻找参数w和b,使得对训练集的预测值与真实的回归目标值y之间的均方误差最小
- 均方误差:预测值与真实值之差的平方和除以样本数
- 没有可供调节的参数(超参数)
- 无法控制模型的复杂度
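为了说明"最小化均方误差"的含义,下面用正规方程直接求解 w 和 b(示例数据为随机生成的演示假设,非书中代码;实践中直接使用LinearRegression即可):
import numpy as np

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(60, 1))
y = 0.4 * X[:, 0] + rng.normal(scale=0.3, size=60)

# 在特征矩阵后附加一列1,用同一个向量同时表示 w 和 b
X1 = np.hstack([X, np.ones((len(X), 1))])
# 正规方程: theta = (X^T X)^{-1} X^T y
theta = np.linalg.solve(X1.T @ X1, X1.T @ y)
w, b = theta[:-1], theta[-1]

mse = np.mean((X1 @ theta - y) ** 2)  # 均方误差:误差平方和除以样本数
print(w, b, mse)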
import mglearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lr = LinearRegression().fit(X_train, y_train)

print(f"{lr.coef_}")
# [0.39390555]
# coef_: 斜率,系数,权重
# NumPy数组

print(f"{lr.intercept_}")
# -0.03180434302675973
# intercept_: 截距
# 浮点数

print(f"training score: {lr.score(X_train, y_train):.2f}")
# training score: 0.67

print(f"test score: {lr.score(X_test, y_test):.2f}")
# test score: 0.66
- 训练集和测试集上的分数接近,且结果不是很好的情况下,说明存在欠拟合
import mglearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)

print(f"training score: {lr.score(X_train, y_train):.2f}")
# training score: 0.95

print(f"test score: {lr.score(X_test, y_test):.2f}")
# test score: 0.61
- 训练集上的预测非常准确,测试集上的预测较差,说明存在过拟合
3.3.3 岭回归(sklearn.linear_model.Ridge)
- 对w的选择不仅要在训练数据上得到好的结果,而且还要拟合附加约束(正则化)
- w的所有元素值都应接近于0
- 使用L2正则化
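岭回归相当于在均方误差之外再加上 $\alpha \sum_j w[j]^2$ 这一L2惩罚项。下面是一个省略截距的闭式解小示意(随机演示数据,非书中代码),可以看到alpha越大,系数整体越小:
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 5)
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=50)

def ridge_fit(X, y, alpha):
    # 闭式解: w = (X^T X + alpha * I)^{-1} X^T y(为简化省略截距)
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

for alpha in [0.0, 1.0, 100.0]:
    w = ridge_fit(X, y, alpha)
    print(alpha, np.round(np.abs(w).sum(), 3))  # alpha越大,|w|之和越小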
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ridge = Ridge().fit(X_train, y_train)

print(f"training score: {ridge.score(X_train, y_train):.2f}")
# training score: 0.89

print(f"test score: {ridge.score(X_test, y_test):.2f}")
# test score: 0.75
-
Ridge模型在模型的简单性与训练集性能之间做出权衡
-
alpha参数:简单性和训练集性能对于模型的重要程度
-
增大alpha:系数更加趋于0,从而降低训练集性能,提高泛化性能
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ridge10 = Ridge(alpha=10).fit(X_train, y_train)

print(f"training score: {ridge10.score(X_train, y_train):.2f}")
# training score: 0.79

print(f"test score: {ridge10.score(X_test, y_test):.2f}")
# test score: 0.64
-
减小alpha
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)

print(f"training score: {ridge01.score(X_train, y_train):.2f}")
# training score: 0.93

print(f"test score: {ridge01.score(X_test, y_test):.2f}")
# test score: 0.77
import matplotlib.pyplot as plt
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
ridge = Ridge().fit(X_train, y_train)
ridge10 = Ridge(alpha=10).fit(X_train, y_train)
ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)

plt.plot(ridge.coef_, 's', label="Ridge alpha=1")
plt.plot(ridge10.coef_, '^', label="Ridge alpha=10")
plt.plot(ridge01.coef_, 'v', label="Ridge alpha=0.1")
plt.plot(lr.coef_, 'o', label="LinearRegression")

plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.hlines(0, 0, len(lr.coef_))
plt.ylim(-25, 25)
plt.legend()
plt.show()
- x轴:coef_的元素(系数)
- y轴:系数的具体值
- alpha的值越小,系数的取值范围越大
-
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_ridge_n_samples()
plt.show()
3.3.4 lasso(sklearn.linear_model.Lasso)
-
约束w使其接近于0
-
使用L1正则化
- 某些w刚好为0
-
自动化的特征选择
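L1惩罚之所以会把一部分系数压到恰好为0,可以用"软阈值"操作直观理解:坐标下降的每一步都会把绝对值小于阈值的系数直接置零。下面是一个只做示意的最小例子(并非完整的Lasso实现,也非书中代码):
import numpy as np

def soft_threshold(z, threshold):
    # |z| 小于阈值时直接返回0,这正是Lasso产生稀疏解的原因
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

z = np.array([3.0, 0.4, -0.2, -2.5])
print(soft_threshold(z, 0.5))  # [ 2.5  0.  -0.  -2. ]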
import numpy as np
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lasso = Lasso().fit(X_train, y_train)

print(f"training score: {lasso.score(X_train, y_train):.2f}")
# training score: 0.29

print(f"test score: {lasso.score(X_test, y_test):.2f}")
# test score: 0.21

print(f"num of features used: {np.sum(lasso.coef_ != 0)}")
# num of features used: 4
-
alpha参数
-
减小alpha
import numpy as np
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lasso001 = Lasso(alpha=0.01, max_iter=10000).fit(X_train, y_train)
# max_iter: 运行迭代的最大次数

print(f"training score: {lasso001.score(X_train, y_train):.2f}")
# training score: 0.90

print(f"test score: {lasso001.score(X_test, y_test):.2f}")
# test score: 0.77

print(f"num of features used: {np.sum(lasso001.coef_ != 0)}")
# num of features used: 33
-
alpha过小
import numpy as np
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lasso00001 = Lasso(alpha=0.0001, max_iter=10000).fit(X_train, y_train)
# max_iter: 运行迭代的最大次数

print(f"training score: {lasso00001.score(X_train, y_train):.2f}")
# training score: 0.95

print(f"test score: {lasso00001.score(X_test, y_test):.2f}")
# test score: 0.64

print(f"num of features used: {np.sum(lasso00001.coef_ != 0)}")
# num of features used: 94
import matplotlib.pyplot as plt
import numpy as np
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lasso = Lasso().fit(X_train, y_train)
lasso001 = Lasso(alpha=0.01, max_iter=10000).fit(X_train, y_train)
lasso00001 = Lasso(alpha=0.0001, max_iter=10000).fit(X_train, y_train)
ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)

plt.plot(lasso.coef_, 's', label="Lasso alpha=1")
plt.plot(lasso001.coef_, '^', label="Lasso alpha=0.01")
plt.plot(lasso00001.coef_, 'v', label="Lasso alpha=0.0001")
plt.plot(ridge01.coef_, 'o', label="Ridge alpha=0.1")

plt.legend(ncol=2, loc=(0, 1.05))
plt.ylim(-25, 25)
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.show()
- x轴:coef_的元素(系数)
- y轴:系数的具体值
- alpha的值越小,系数的取值范围越大
-
-
实践中,首选岭回归,但如果特征很多,且只有其中几个重要的,选择Lasso
-
实践中,可以选择scikit-learn中的ElasticNet类
- 结合Lasso和Ridge的惩罚项
- 在实践中这种组合的效果往往最好
- 需要调节2个参数
- 一个用于L1正则化
- 一个用于L2正则化
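下面给出ElasticNet的一个最小用法示意(参数取值仅为演示假设,非书中代码):alpha控制总体正则化强度,l1_ratio控制L1与L2惩罚的比例。
import numpy as np
import mglearn
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# l1_ratio=1 等价于Lasso,l1_ratio=0 等价于Ridge(此处取中间值仅作演示)
enet = ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10000).fit(X_train, y_train)
print(f"training score: {enet.score(X_train, y_train):.2f}")
print(f"test score: {enet.score(X_test, y_test):.2f}")
print(f"num of features used: {np.sum(enet.coef_ != 0)}")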
3.3.5 用于分类的线性模型
-
二分类的预测公式
$\hat{y} = w[0]*x[0] + w[1]*x[1] + \cdots + w[p]*x[p] + b > 0$
- 如果函数值小于0,则预测类别为-1
- 如果函数值大于0,则预测类别为+1
-
用于回归的线性模型,输出 $\hat{y}$ 是特征的线性函数,是直线、平面或超平面
-
用于分类的线性模型,决策边界是输入的线性函数
- 二元线性分类器是利用直线、平面或超平面来分开类别的分类器
-
学习线性模型算法的区别
- 系数和截距的特定组合对训练数据拟合好坏的度量方法(损失函数)
- 是否使用正则化,以及使用哪种正则化方法
-
常见的线性分类算法
- Logistic回归(sklearn.linear_model.LogisticRegression)
- 线性支持向量机(sklearn.svm.LinearSVC)
import matplotlib.pyplot as plt
import mglearn
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

X, y = mglearn.datasets.make_forge()

fig, axes = plt.subplots(1, 2, figsize=(10, 3))

for model, ax in zip([LinearSVC(), LogisticRegression()], axes):
    clf = model.fit(X, y)
    mglearn.plots.plot_2d_separator(clf, X, fill=False, eps=0.5, ax=ax, alpha=0.7)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
    ax.set_title(f"{clf.__class__.__name__}")
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")

axes[0].legend()
plt.show()
- x轴:forge数据集的第一个特征
- y轴:forge数据集的第二个特征
- 决策边界为直线
- 直线上方的数据点都会被划为类别1
- 直线下方的数据点都会被划为类别0
- 两个模型中都有两个点的分类是错误的
- 两个模型都默认使用L2正则化
-
参数C:决定正则化强度
- C值越大,正则化越弱
- C值增大,模型将尽可能将训练集拟合到最好
- 更强调每个数据点都分类正确的重要性
- C值减小,模型更强调使w接近于0
- 让算法尽量适应“大多数”数据点
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_linear_svc_regularization()
plt.show()
-
强正则化的模型会选择一条相对水平的线
-
C=1
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # max_iter: 最大迭代次数

print(f"training score: {logreg.score(X_train, y_train):.3f}")
# training score: 0.955

print(f"test score: {logreg.score(X_test, y_test):.3f}")
# test score: 0.951
-
C=100
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

logreg100 = LogisticRegression(C=100, max_iter=1000).fit(X_train, y_train)

print(f"training score: {logreg100.score(X_train, y_train):.3f}")
# training score: 0.977

print(f"test score: {logreg100.score(X_test, y_test):.3f}")
# test score: 0.965
-
C=0.01
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

logreg001 = LogisticRegression(C=0.01, max_iter=1000).fit(X_train, y_train)

print(f"training score: {logreg001.score(X_train, y_train):.3f}")
# training score: 0.953

print(f"test score: {logreg001.score(X_test, y_test):.3f}")
# test score: 0.951
- C值越大,正则化越弱
-
Logistic回归的正则化选择
-
L2正则化(默认)
- 系数趋近于0,但永不等于0
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
logreg100 = LogisticRegression(C=100, max_iter=1000).fit(X_train, y_train)
logreg001 = LogisticRegression(C=0.01, max_iter=1000).fit(X_train, y_train)

plt.plot(logreg.coef_.T, 'o', label="C=1")
plt.plot(logreg100.coef_.T, '^', label="C=100")
plt.plot(logreg001.coef_.T, 'v', label="C=0.01")

plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
plt.hlines(0, 0, cancer.data.shape[1])
plt.ylim(-5, 5)

plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")

plt.legend()
plt.tight_layout()
plt.show()
-
L1正则化
-
C=1
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

logreg = LogisticRegression(max_iter=1000, penalty="l1", solver="liblinear").fit(X_train, y_train)
# penalty: 正则化方式
# solver: 求解器,默认为lbfgs(仅支持L2正则化),liblinear支持L1和L2正则化

print(f"training score: {logreg.score(X_train, y_train):.3f}")
# training score: 0.960

print(f"test score: {logreg.score(X_test, y_test):.3f}")
# test score: 0.958
-
C=100
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

logreg100 = LogisticRegression(C=100, max_iter=1000, penalty="l1", solver="liblinear").fit(X_train, y_train)

print(f"training score: {logreg100.score(X_train, y_train):.3f}")
# training score: 0.986

print(f"test score: {logreg100.score(X_test, y_test):.3f}")
# test score: 0.979
-
C=0.01
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

logreg001 = LogisticRegression(C=0.01, max_iter=1000, penalty="l1", solver="liblinear").fit(X_train, y_train)

print(f"training score: {logreg001.score(X_train, y_train):.3f}")
# training score: 0.918

print(f"test score: {logreg001.score(X_test, y_test):.3f}")
# test score: 0.930
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

logreg = LogisticRegression(max_iter=1000, penalty="l1", solver="liblinear").fit(X_train, y_train)
logreg100 = LogisticRegression(C=100, max_iter=1000, penalty="l1", solver="liblinear").fit(X_train, y_train)
logreg001 = LogisticRegression(C=0.01, max_iter=1000, penalty="l1", solver="liblinear").fit(X_train, y_train)

plt.plot(logreg.coef_.T, 'o', label="C=1")
plt.plot(logreg100.coef_.T, '^', label="C=100")
plt.plot(logreg001.coef_.T, 'v', label="C=0.01")

plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
plt.hlines(0, 0, cancer.data.shape[1])
plt.ylim(-5, 5)

plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")

plt.legend()
plt.tight_layout()
plt.show()
-
-
3.3.6 用于多分类的线性模型
-
许多线性分类模型仅适用于二分类问题
- 除Logistic回归
-
将二分类算法推广到多分类算法的常用方法:一对其余(one-vs.-rest)
-
每个类别都学习一个二分类模型,将这个类别与所有其他类别尽量分开
-
在测试点上运行所有二类分类器来进行预测
-
在对应类别上分数最高的分类器“胜出”,将这个类别标签返回作为预测结果
- 分类置信方程
$w[0]*x[0] + w[1]*x[1] + \cdots + w[p]*x[p] + b$
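下面的小示例验证"在对应类别上分数最高的分类器胜出"这一规则:对每个类别的分类置信分数取argmax,结果与predict一致(使用与下文相同的make_blobs三分类数据,代码为演示用,非书中代码):
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(random_state=42)
linear_svm = LinearSVC().fit(X, y)  # 一对其余:每个类别学习一个二分类器

scores = linear_svm.decision_function(X)  # 形状为 (n_samples, 3) 的分类置信分数
manual_pred = np.argmax(scores, axis=1)   # 分数最高的类别"胜出"
print(np.all(manual_pred == linear_svm.predict(X)))  # True:与predict的结果一致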
-
-
三分类
import matplotlib.pyplot as plt import mglearn from sklearn.datasets import make_blobsX, y = make_blobs(random_state=42)mglearn.discrete_scatter(X[:, 0], X[:, 1], y)plt.xlabel("Feature 0") plt.ylabel("Feature 1") plt.legend(["Class 0", "Class 1", "Class 2"]) plt.show()
import numpy as np import matplotlib.pyplot as plt import mglearn from sklearn.datasets import make_blobs from sklearn.svm import LinearSVCX, y = make_blobs(random_state=42)mglearn.discrete_scatter(X[:, 0], X[:, 1], y)linear_svm = LinearSVC().fit(X, y)line = np.linspace(-15, 15)for coef, intercept, color in zip(linear_svm.coef_, linear_svm.intercept_, ["b", "r", "g"]):plt.plot(line, -(line * coef[0] + intercept) / coef[1], c=color)plt.xlim(-10, 8) plt.ylim(-10, 15) plt.xlabel("Feature 0") plt.ylabel("Feature 1") plt.legend(["Class 0", "Class 1", "Class 2", "Line class 0", "Line class 1", "Line class 2"], loc=(1.01, 0.3)) plt.show()
import numpy as np
import matplotlib.pyplot as plt
import mglearn
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(random_state=42)
linear_svm = LinearSVC().fit(X, y)

mglearn.plots.plot_2d_classification(linear_svm, X, fill=True, alpha=0.7)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)

line = np.linspace(-15, 15)
for coef, intercept, color in zip(linear_svm.coef_, linear_svm.intercept_, ["b", "r", "g"]):
    plt.plot(line, -(line * coef[0] + intercept) / coef[1], c=color)

plt.xlim(-10, 8)
plt.ylim(-10, 15)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.legend(["Class 0", "Class 1", "Class 2", "Line class 0", "Line class 1", "Line class 2"], loc=(1.01, 0.3))
plt.tight_layout()
plt.show()
3.3.7 优点、缺点和参数
-
主要参数
- 正则化参数
- 回归模型:alpha
- 分类模型:C
- alpha越大或C越小,模型越简单(泛化能力越强)
- 通常在对数尺度上对alpha和C进行搜索
- 正则化的选择
- L1正则(lasso)
- 只希望保留几个重要的特征
- 希望模型有较高的可解释性
- L2正则(岭回归,Logistic回归,线性支持向量机)
-
优点
- 训练速度快
- 预测速度快
- 适用大量数据集以及稀疏数据集
- 大量数据集:使用岭回归和Logistic回归模型中的solver='sag'选项进行加速
- 容易理解预测的过程
- 适用特征数量大于样本数量的数据集
-
缺点
- 当数据集中包含高度相关的特征,很难对系数做出解释
- 在更低维的空间中,其他模型具有更好的泛化能力
3.4 朴素贝叶斯分类器
-
与线性模型非常相似
-
训练速度较快
- 通过单独查看每个特征来学习参数
- 从每个特征中收集简单的类别统计数据
-
泛化能力(相比Logistic回归和线性支持向量机)较差
-
3种朴素贝叶斯分类器
-
GaussianNB(sklearn.naive_bayes.GaussianNB)
- 应用于任意连续数据
- 保存每个类别中每个特征的平均值和标准差
-
BernoulliNB(sklearn.naive_bayes.BernoulliNB)
- 假定输入数据为二分类数据
- 主要用于文本数据分类
- 计算每个类别中每个特征不为0的元素个数
import numpy as np

X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])

counts = {}
for label in np.unique(y):
    counts[label] = X[y == label].sum(axis=0)

print(f"counts: {counts}")
# counts: {0: array([0, 1, 0, 2]), 1: array([2, 0, 2, 1])}
- 预测公式与线性模型完全相同
-
MultinomialNB(sklearn.naive_bayes.MultinomialNB)
- 假定输入数据为计数数据
- 主要用于文本数据分类
- 计算每个类别中每个特征的平均值
- 预测公式与线性模型完全相同
-
优点、缺点和参数
- 参数(alpha)
- BernoulliNB和MultinomialNB中唯一的参数
- 用于控制模型复杂度
- 工作原理
- 添加alpha个虚拟数据点,且这些点对所有特征都取正值
- 将统计数据平滑化
- alpha越大,平滑化越强,模型复杂度越低
- alpha对模型性能的影响不大
- 但调整alpha通常可以使精度略有提高(alpha的平滑作用见本节末尾的示例)
- 应用
- 高维数据
- GaussianNB
- 稀疏计数数据
- BernoulliNB
- MultinomialNB
- 性能通常优于BernoulliNB
- 特别是在包含很多非零特征的数据集(即大型文档)上
- 优点
- 训练速度快
- 预测速度快
- 训练过程容易理解
- 对高维稀疏数据的效果很好
- 对参数的鲁棒性较好
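前面提到的alpha平滑可以用一个很小的例子直观看到(示例数据为演示假设,非书中代码):alpha越大,学到的每个类别中特征的概率越接近均匀,模型也就越"简单"。
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# 4个样本、3个计数特征的玩具数据
X = np.array([[5, 0, 1],
              [4, 1, 0],
              [0, 6, 2],
              [1, 5, 3]])
y = np.array([0, 0, 1, 1])

for alpha in [0.01, 1.0, 10.0]:
    clf = MultinomialNB(alpha=alpha).fit(X, y)
    # feature_log_prob_ 是每个类别中每个特征的(平滑后的)对数概率
    print(alpha, np.round(np.exp(clf.feature_log_prob_), 2))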
3.5 决策树
- 用于分类和回归任务的模型
- 本质:从一层层的if/else问题中进行学习,得出结论
import mglearn

mglearn.plots.plot_animal_tree()
- 为了区分4类动物,利用3个特征来构建一个模型
- 利用监督学习从数据中学习模型,而无需人为构建模型
3.5.1 构造决策树
-
数据集
- 二维分类数据集two_moons
- 每个类别包含50个数据点
-
if/else问题称为测试
-
算法遍历所有可能的测试,找出对目标变量来说信息量最大的那一个
-
生成决策树的步骤
-
第1个测试
- 在x[1]=0.0596处对数据集进行划分,可以最大程度地将类别0中的点和类别1中的点区分开
- 根结点表示整个数据集
- 左结点对应于顶部区域
- 右结点对应于底部区域
-
第2个测试
- 基于x[0]做出区分
-
反复进行递归划分
-
第4个测试
- 如果某个叶结点所包含数据点的目标值都相同,那么这个叶结点就是纯的
-
-
预测新数据点的步骤
- 分类
- 查看这个点位于特征空间划分的哪个区域
- 将该区域的多数目标值作为预测结果
- 从根结点开始对树进行遍历就可以找到这一区域
- 回归任务
- 基于每个结点的测试对树进行遍历
- 找到新数据点所属的叶结点
- 输出:此叶结点中所有训练点的平均目标值
- 分类
3.5.2 控制决策树的复杂度
-
构造决策树直到所有叶结点都是纯的叶结点,会导致模型非常复杂,并且对训练数据高度过拟合
-
防止过拟合的策略
- 预剪枝:及早停止树的生长
- 限制条件
- 树的最大深度
- 叶结点的最大数目
- 一个结点中数据点的最小数目
- 限制条件
- 后剪枝:先构造树,然后删除或折叠信息量很少的结点
- 预剪枝:及早停止树的生长
-
scikit-learn中的2种决策树模型
- DecisionTreeRegressor(sklearn.tree.DecisionTreeRegressor)
- DecisionTreeClassifier(sklearn.tree.DecisionTreeClassifier)
-
sklearn只实现预剪枝
-
对树进行完全展开
- 树不断分支,直到所有叶结点都是纯的
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

print(f"training score: {tree.score(X_train, y_train):.3f}")
# training score: 1.000

print(f"test score: {tree.score(X_test, y_test):.3f}")
# test score: 0.937
-
对树进行预剪枝
- 设置max_depth=4
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

print(f"training score: {tree.score(X_train, y_train):.3f}")
# training score: 0.988

print(f"test score: {tree.score(X_test, y_test):.3f}")
# test score: 0.951
3.5.3 分析决策树
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

export_graphviz(tree, out_file="tree.dot", class_names=["malignant", "benign"], feature_names=cancer.feature_names, impurity=False, filled=True)
- samples:结点中的样本个数
- value:每个类别的样本个数
- 第一层worst radius <= 16.795分支
- 右侧的子结点中,几乎所有的样本都进入最右侧的叶结点中
- 左侧的子结点中,几乎所有的样本都进入左数第二个叶结点中
3.5.4 树的特征重要性
- 利用一些有用的属性来总结树的工作原理
- 特征重要性
- 为每个特征对树的决策的重要性进行排序
- 对每个特征来说都是介于0和1之间的数字
- 0:没有用到
- 1:完美预测
- 求和始终为1
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, stratify=cancer.target, random_state=42)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

n_features = cancer.data.shape[1]
plt.barh(range(n_features), tree.feature_importances_, align='center')
plt.yticks(np.arange(n_features), cancer.feature_names)
plt.xlabel("Feature importance")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()
- 如果某个特征的特征重要性很小,可能是因为另一个特征也包含了同样的信息
- 特征重要性与该特征所对应的类别没有关系
import matplotlib.pyplot as plt
import mglearn

mglearn.plots.plot_tree_not_monotone()
plt.show()
- 回归决策树
- DecisionTreeRegressor(sklearn.tree.DecisionTreeRegressor)
- 所有基于树的回归模型不能外推,也不能在训练数据范围之外进行预测
import matplotlib.pyplot as plt
import pandas as pd

ram_prices = pd.read_csv("mglearn/data/ram_price.csv")

plt.semilogy(ram_prices.date, ram_prices.price)
plt.xlabel("Year")
plt.ylabel("Price in $/Mbyte")
plt.tight_layout()
plt.show()
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

ram_prices = pd.read_csv("mglearn/data/ram_price.csv")

# 利用2000年前的数据预测2000年后的数据
data_train = ram_prices[ram_prices.date < 2000]
data_test = ram_prices[ram_prices.date >= 2000]

# 基于时间预测价格
X_train = data_train.date.to_numpy()[:, np.newaxis]
# 对数变换后得到数据与目标之间更简单的关系
y_train = np.log(data_train.price)

tree = DecisionTreeRegressor().fit(X_train, y_train)
linear_reg = LinearRegression().fit(X_train, y_train)

# 对所有数据进行预测
X_all = ram_prices.date.to_numpy()[:, np.newaxis]
pred_tree = tree.predict(X_all)
pred_lr = linear_reg.predict(X_all)

price_tree = np.exp(pred_tree)
price_lr = np.exp(pred_lr)

plt.semilogy(data_train.date, data_train.price, label="Training data")
plt.semilogy(data_test.date, data_test.price, label="Test data")
plt.semilogy(ram_prices.date, price_tree, label="Tree prediction")
plt.semilogy(ram_prices.date, price_lr, label="Linear prediction")

plt.legend()
plt.tight_layout()
plt.show()
- 对于回归树模型,一旦输入超出了模型训练数据的范围,模型就只能持续预测最后一个已知数据点
- 树不能在训练数据的范围之外生成“新的”响应
3.5.5 优点、缺点和参数
-
参数
-
预剪枝参数
-
在树完全展开之前停止树的构造
-
防止过拟合
-
max_depth:树的最大深度
-
max_leaf_nodes:叶结点的最大数目
-
min_samples_leaf:一个结点中数据点的最小数目
-
-
-
优点
- 得到的模型容易可视化
- 算法完全不受数据缩放的影响
-
缺点
- 即使做了预剪枝,也经常会出现过拟合,泛化能力很差
3.6 决策树集成
- 集成:合并多个机器学习模型来构建更强大模型的方法
- 有两种集成模型对大量分类和回归的数据集都是有效的
- 随机森林
- 梯度提升决策树
3.6.1 随机森林
-
本质:许多决策树的集合,其中每棵树都与其它树略有不同
-
思想:每棵树都有可能过拟合,如果构造很多树,可以对这些树的结果取平均值来降低过拟合
-
将随机性添加到树的构造过程中,以确保每棵树都各不相同
-
树的随机化方法
- 选择用于构造树的数据点
- 选择每次划分测试的特征
-
scikit-learn中对应的类
- 分类:sklearn.ensemble.RandomForestClassifier
- 回归:sklearn.ensemble.RandomForestRegressor
构造随机森林
- 步骤
- 确定用于构造的树的个数
- 参数:n_estimators
- 构造树
- 对数据进行自助采样,以确保树与树之间是有区别的
- 从n_samples个数据点中有放回地重复随机抽取一个样本,重复n_samples次
- 基于这个新创建的数据集来构造决策树
- 在每个结点处,随机选择特征的一个子集,并在其中寻找最佳测试
- 而不是像单棵决策树那样在所有特征中寻找最佳测试
- 特征个数:max_features
- 如果max_features=n_features,那么每次划分都要考虑数据集的所有特征
- 如果max_features=1,那么在划分时无法选择对哪个特征进行测试,只能对随机选择的某个特征搜索不同的阈值
- 如果max_features较大,那么随机森林中的树将会十分相似,利用最独特的特征可以轻松拟合数据
- 如果max_features较小,那么随机森林中的树差异很大,为了很好地拟合数据,每棵树的深度都要很大
- 在每个结点处,随机选择特征的一个子集,并对其中一个特征寻找最佳测试
- 对数据进行自助采样,以确保树与树之间是有区别的
- 确定用于构造的树的个数
- 对于回归问题,可以对这些树的结果取平均值作为最终预测
- 对于分类问题,使用"软投票"策略:对每棵树给出的类别概率取平均,再将概率最大的类别作为预测结果(见下方示例)
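下面的小示例演示上述两个关键步骤:"自助采样"(有放回抽样)和分类时的"软投票"(对各棵树的predict_proba取平均)。其中的数据和参数仅为演示假设,并非书中代码:
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)

# 自助采样:有放回地抽取 n_samples 个样本
rng = np.random.RandomState(0)
idx = rng.randint(0, len(X), size=len(X))
X_boot, y_boot = X[idx], y[idx]
print(len(np.unique(idx)))  # 通常只包含约63%的不同样本

# 软投票:对森林中每棵树的预测概率取平均,再取概率最大的类别
forest = RandomForestClassifier(n_estimators=5, random_state=2).fit(X, y)
proba = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)
manual_pred = np.argmax(proba, axis=1)
print(np.mean(manual_pred == forest.predict(X)))  # 接近1.0:与forest.predict基本一致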
分析随机森林
import matplotlib.pyplot as plt
import mglearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
# n_samples: 生成的数据点数量
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

forest = RandomForestClassifier(n_estimators=5, random_state=2)
# n_estimators: 用于构造的树的个数
forest.fit(X_train, y_train)

fig, axes = plt.subplots(2, 3, figsize=(20, 10))
for i, (ax, tree) in enumerate(zip(axes.ravel(), forest.estimators_)):
    # estimators_: 森林中的树
    ax.set_title("Tree {}".format(i))
    mglearn.plots.plot_tree_partition(X_train, y_train, tree, ax=ax)

mglearn.plots.plot_2d_separator(forest, X_train, fill=True, ax=axes[-1, -1], alpha=.4)
axes[-1, -1].set_title("Random Forest")
mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)

plt.tight_layout()
plt.show()
- 随机森林比单独每一棵树的过拟合都要小,给出的决策边界也更符合直觉
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(f"training score: {forest.score(X_train, y_train):.3f}")
# training score: 1.000

print(f"test score: {forest.score(X_test, y_test):.3f}")
# test score: 0.972

n_features = cancer.data.shape[1]
plt.barh(range(n_features), forest.feature_importances_, align='center')
plt.yticks(np.arange(n_features), cancer.feature_names)
plt.xlabel("Feature importance")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()
- 随机森林比单棵树更能从总体把握数据的特征
优点、缺点和参数
-
参数
- n_jobs:调节使用的内核个数
- 使用更多的CPU内核,可以让速度线性增加
- 可以设置n_jobs=-1来使用计算机的所有内核
- random_state:随机状态
- 不同的随机状态可以彻底改变构建的模型
- 森林中的树越多,它对随机状态选择的鲁棒性就越好
- n_estimators:用于构造的树的个数
- 越大越好
- 对更多的树取平均可以降低过拟合,从而得到鲁棒性更好的集成
- 不过收益是递减的,而且树越多需要的内存也越多,训练时间也越长
- 在你的时间/内存允许的情况下尽量多
- max_features:特征个数
- 取值越小过拟合越低
- 对于分类,默认值是max_features=sqrt(n_features)
- 对于回归,默认值是max_features=n_features
- 增大max_features或max_leaf_nodes有时也可以提高性能,还可以大大降低用于训练和预测的时间和空间要求
-
优点
- 目前应用最广泛的机器学习方法之一
- 不需要反复调节参数就可以给出很好的结果
- 不需要对数据进行缩放
- 仍然使用决策树的一个原因是需要决策过程的紧凑表示
- 随机森林中树的深度往往比决策树还要大(因为用到了特征子集)
- 即使是非常大的数据集,随机森林的表现通常也很好,训练过程很容易并行在功能强大的计算机的多个CPU内核上
-
缺点
- 对于维度非常高的稀疏数据(文本数据),表现往往不是很好
- 使用线性模型可能更合适
- 需要很大的内存,训练和预测的速度也比线性模型要慢
3.6.2 梯度提升回归树(梯度提升机)
- 合并多个决策树来构建一个更为强大的模型
- 采用连续的方式构造树,每棵树都试图纠正前一棵树的错误
- 默认情况下,没有随机化,而是强预剪枝
- 通常使用深度很小(1~5)的树
- 思想:合并许多简单的模型(深度较小的树)
- 每棵树只能对部分数据做出好的预测
- 添加的树越多,性能就越强
- scikit-learn中对应的类
- 分类:sklearn.ensemble.GradientBoostingClassifier
- 回归:sklearn.ensemble.GradientBoostingRegressor
- 对参数设置敏感
- 预剪枝
- 树的数量:n_estimators
- 学习率:learning_rate
- 控制每棵树纠正前一棵树的错误的强度
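为了说明"每棵树都试图纠正前一棵树的错误"以及learning_rate的作用,下面给出一个针对回归问题、使用平方损失的极简梯度提升示意(每一步用一棵小树拟合当前残差;数据与参数均为演示假设,非书中代码):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

learning_rate = 0.1
n_estimators = 100

# 初始预测取目标的均值
pred = np.full(len(y), y.mean())
trees = []
for _ in range(n_estimators):
    residual = y - pred                      # 当前残差(平方损失的负梯度)
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    pred += learning_rate * tree.predict(X)  # 每棵树只做"小步"纠正
    trees.append(tree)

print(np.mean((y - pred) ** 2))  # 训练均方误差随树的数量增加而下降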
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)

print(f"training score: {gbrt.score(X_train, y_train):.3f}")
# training score: 1.000

print(f"test score: {gbrt.score(X_test, y_test):.3f}")
# test score: 0.958
-
可能发生过拟合
-
限制最大深度
import numpy as np import matplotlib.pyplot as plt from sklearn.ensemble import GradientBoostingClassifier from sklearn.datasets import load_breast_cancer from sklearn.model_selection import train_test_splitcancer = load_breast_cancer() X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)gbrt.fit(X_train, y_train)print(f"training score: {gbrt.score(X_train, y_train):.3f}") # training score: 0.991print(f"test score: {gbrt.score(X_test, y_test):.3f}") # test score: 0.972n_features = cancer.data.shape[1] plt.barh(range(n_features), gbrt.feature_importances_, align='center') plt.yticks(np.arange(n_features), cancer.feature_names) plt.xlabel("Feature importance") plt.ylabel("Feature") plt.tight_layout() plt.show()
-
降低学习率
import numpy as np import matplotlib.pyplot as plt from sklearn.ensemble import GradientBoostingClassifier from sklearn.datasets import load_breast_cancer from sklearn.model_selection import train_test_splitcancer = load_breast_cancer() X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)gbrt = GradientBoostingClassifier(random_state=0, learning_rate=0.01)gbrt.fit(X_train, y_train)print(f"training score: {gbrt.score(X_train, y_train):.3f}") # training score: 0.988print(f"test score: {gbrt.score(X_test, y_test):.3f}") # test score: 0.958n_features = cancer.data.shape[1] plt.barh(range(n_features), gbrt.feature_importances_, align='center') plt.yticks(np.arange(n_features), cancer.feature_names) plt.xlabel("Feature importance") plt.ylabel("Feature") plt.tight_layout() plt.show()
-
优点、缺点和参数
-
参数
- n_estimators:树的数量
- 增大n_estimators会导致模型更加复杂,进而可能导致过拟合
- learning_rate:学习率
- 控制每棵树对前一棵树的错误的纠正强度
- 以上两个参数高度相关
- learning_rate越低,就需要更多的树来构建具有相似复杂度的模型
- 通常的做法是根据时间和内存的预算选择合适的n_estimators,然后对不同的learning_rate进行遍历
- max_depth(或max_leaf_nodes)
- 用于降低每棵树的复杂度
- max_depth通常都设置得很小,一般不超过5
- n_estimators:树的数量
-
优点
- 监督学习中最强大也最常用的模型之一
- 不需要对数据进行缩放就可以表现得很好
- 适用于二元特征与连续特征同时存在的数据集
-
缺点
- 需要仔细调参
- 训练时间可能会比较长
- 通常不适用于高维稀疏数据
3.7 核支持向量机(SVM)
-
核支持向量机是线性支持向量机的扩展,可以推广到无法用输入空间中的超平面定义的更复杂模型
-
scikit-learn中对应的类
- 分类:在SVC中实现(sklearn.svm.SVC)
- 回归:在SVR中实现(sklearn.svm.SVR)
3.7.1 线性模型与非线性特征
- 添加更多的特征可以让线性模型更加灵活
import matplotlib.pyplot as plt
import mglearn
from sklearn.datasets import make_blobsX, y = make_blobs(centers=4, random_state=8)
y %= 2mglearn.discrete_scatter(X[:,0], X[:,1], y)plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.tight_layout()
plt.show()
- 用线性模型进行分类
import matplotlib.pyplot as plt
import mglearn
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVCX, y = make_blobs(centers=4, random_state=8)
y %= 2linear_svm = LinearSVC().fit(X, y)mglearn.plots.plot_2d_separator(linear_svm, X)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.tight_layout()
plt.show()
- 对输入特征进行扩展
- 添加第二个特征的平方,作为一个新特征
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobsX, y = make_blobs(centers=4, random_state=8)
y %= 2
X_new = np.hstack([X, X[:, 1:] ** 2])figure = plt.figure()
ax = figure.add_subplot(projection='3d', elev=-152, azim=-26)# 首先画出所有y==0的点,然后画出所有y==1的点
mask = y == 0
ax.scatter(X_new[mask, 0], X_new[mask, 1], X_new[mask, 2], c='b', s=60)
ax.scatter(X_new[~mask, 0], X_new[~mask, 1], X_new[~mask, 2], c='r', marker='^', s=60)ax.set_xlabel("feature0")
ax.set_ylabel("feature1")
ax.set_zlabel("feature1 ** 2")
plt.tight_layout()
plt.show()
- 用线性模型将两个类分开
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVCX, y = make_blobs(centers=4, random_state=8)
y %= 2
X_new = np.hstack([X, X[:, 1:] ** 2])linear_svm_3d = LinearSVC().fit(X_new, y)
coef = linear_svm_3d.coef_.ravel()
intercept = linear_svm_3d.intercept_# 显示线性决策边界
figure = plt.figure()
ax = figure.add_subplot(projection='3d', elev=-152, azim=-26)xx = np.linspace(X_new[:, 0].min() - 2, X_new[:, 0].max() + 2, 50)
yy = np.linspace(X_new[:, 1].min() - 2, X_new[:, 1].max() + 2, 50)XX, YY = np.meshgrid(xx, yy)
# 生成网格矩阵ZZ = (coef[0] * XX + coef[1] * YY + intercept) / -coef[2]
ax.plot_surface(XX, YY, ZZ, rstride=8, cstride=8, alpha=0.3)mask = y == 0
ax.scatter(X_new[mask, 0], X_new[mask, 1], X_new[mask, 2], c='b', s=60)
ax.scatter(X_new[~mask, 0], X_new[~mask, 1], X_new[~mask, 2], c='r', marker='^', s=60)ax.set_xlabel("feature0")
ax.set_ylabel("feature1")
ax.set_zlabel("feature1 ** 2")plt.tight_layout()
plt.show()
- 将线性SVM模型看作原始特征的函数
- 此时不是线性
import matplotlib.pyplot as plt
import mglearn
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVCX, y = make_blobs(centers=4, random_state=8)
y %= 2
X_new = np.hstack([X, X[:, 1:] ** 2])linear_svm_3d = LinearSVC().fit(X_new, y)xx = np.linspace(X_new[:, 0].min() - 2, X_new[:, 0].max() + 2, 50)
yy = np.linspace(X_new[:, 1].min() - 2, X_new[:, 1].max() + 2, 50)XX, YY = np.meshgrid(xx, yy)
ZZ = YY ** 2dec = linear_svm_3d.decision_function(np.c_[XX.ravel(), YY.ravel(), ZZ.ravel()])plt.contourf(XX, YY, dec.reshape(XX.shape), levels=[dec.min(), 0, dec.max()], alpha=0.5)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)plt.xlabel("Feature 0")
plt.ylabel("Feature 1")plt.tight_layout()
plt.show()
3.7.2 核技巧
-
原理:直接计算扩展特征表示中数据点之间的距离(内积),而不用实际对扩展进行计算
-
将数据映射到更高维空间中常用的方法
- 多项式核:在一定阶数内计算原始特征所有可能的多项式
- 径向基函数核(高斯核,RBF核):考虑所有阶数的所有可能的多项式,但阶数越高,特征的重要性越小
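核技巧的关键在于:不需要显式构造扩展特征,核函数直接给出扩展空间中的内积。下面用二维数据和2次多项式核做一个小的数值验证(演示示例,非书中代码):$k(a, b) = (a \cdot b)^2$ 等于把每个点映射为 $(a_1^2, \sqrt{2} a_1 a_2, a_2^2)$ 之后再做内积。
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([0.5, -1.0])

# 核函数:直接在原始二维空间中计算
k_ab = (a @ b) ** 2

# 显式展开到三维特征空间后再做内积
def expand(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

print(k_ab, expand(a) @ expand(b))  # 两者相等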
3.7.3 理解SVM
-
通常只有一部分训练数据点对于定义决策边界来说很重要:位于类别之间边界上的那些点(支持向量)
-
分类决策:基于新样本点与支持向量之间的距离以及在训练过程中学到的支持向量重要性(dual_coef_)来做出
-
数据点之间的距离(由高斯核给出)
$k_{\mathrm{rbf}}(x_1, x_2) = \exp(-\gamma \|x_1 - x_2\|^2)$
- $x_1$ 和 $x_2$:数据点
- $\|x_1 - x_2\|$:欧氏距离
- $\gamma$:控制高斯核宽度的参数
-
import mglearn
from matplotlib import pyplot as plt
from sklearn.svm import SVCX, y = mglearn.tools.make_handcrafted_dataset()svm = SVC(kernel='rbf', C=10, gamma=0.1).fit(X, y)mglearn.plots.plot_2d_separator(svm, X, eps=.5)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)# 画出支持向量
sv = svm.support_vectors_# 支持向量的类别标签由dual_coef_的正负号给出
sv_labels = svm.dual_coef_.ravel() > 0mglearn.discrete_scatter(sv[:, 0], sv[:, 1], sv_labels, s=15, markeredgewidth=3)plt.xlabel("Feature 0")
plt.ylabel("Feature 1")plt.tight_layout()
plt.show()
3.7.4 SVM调参
- gamma:控制高斯核宽度
- 决定了点与点之间“靠近”是指多大距离
- 小的gamma
- 决策边界变化很慢
- 生成复杂度较低的模型
- 大的gamma
- 决策边界变化很快
- 生成复杂度较高的模型
- C:正则化参数
- 限制每个点的重要性(dual_coef_)
- 小的C
- 误分类的点对边界影响很小
- 大的C
- 误分类的点对边界影响很大
import mglearn
from matplotlib import pyplot as pltfig, axes = plt.subplots(3, 3, figsize=(15, 10))for ax, C in zip(axes, [-1, 0, 3]):for a, gamma in zip(ax, range(-1, 2)):mglearn.plots.plot_svm(log_C=C, log_gamma=gamma, ax=a)axes[0, 0].legend(["class 0", "class 1", "sv class 0", "sv class 1"], ncol=4, loc=(.9, 1.2))plt.show()
from matplotlib import pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVCcancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)svc = SVC()
svc.fit(X_train, y_train)print(f"training score: {svc.score(X_train, y_train):.3f}")
# training score: 0.904print(f"test score: {svc.score(X_test, y_test):.3f}")
# test score: 0.937plt.plot(X_train.min(axis=0), 'o', label="min")
plt.plot(X_train.max(axis=0), '^', label="max")
plt.legend(loc=4)
plt.xlabel("Feature index")
plt.ylabel("Feature magnitude")
plt.yscale("log")plt.tight_layout()
plt.show()
- 数据集的特征具有完全不同的数量级
3.7.5 为SVM预处理数据
-
对每个特征进行缩放,使其大致都位于同一范围
-
将所有特征缩放到0和1之间
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

# 计算训练集中每个特征的最小值
min_on_training = X_train.min(axis=0)
# 计算训练集中每个特征的范围(最大值-最小值)
range_on_training = (X_train - min_on_training).max(axis=0)

# 减去最小值,然后除以范围
# 这样每个特征都是min=0和max=1
X_train_scaled = (X_train - min_on_training) / range_on_training
print("Minimum for each feature\n{}".format(X_train_scaled.min(axis=0)))
print("Maximum for each feature\n {}".format(X_train_scaled.max(axis=0)))

# 利用训练集的最小值和范围对测试集做相同的变换
X_test_scaled = (X_test - min_on_training) / range_on_training

svc = SVC()
svc.fit(X_train_scaled, y_train)

print(f"training score: {svc.score(X_train_scaled, y_train):.3f}")
# training score: 0.984

print(f"test score: {svc.score(X_test_scaled, y_test):.3f}")
# test score: 0.972

svc = SVC(C=50)
svc.fit(X_train_scaled, y_train)

print(f"training score: {svc.score(X_train_scaled, y_train):.3f}")
# training score: 0.995

print(f"test score: {svc.score(X_test_scaled, y_test):.3f}")
# test score: 0.979
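上面手动完成的缩放也可以用 sklearn.preprocessing 里现成的 MinMaxScaler 实现,下面是一个等价写法的示意(注意同样只在训练集上fit,再变换测试集;非书中代码):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

scaler = MinMaxScaler().fit(X_train)        # 只用训练集学习缩放参数
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

svc = SVC().fit(X_train_scaled, y_train)
print(f"test score: {svc.score(X_test_scaled, y_test):.3f}")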
3.7.6 优点、缺点和参数
-
参数
-
正则化参数C
-
核的选择
-
RBF核(默认)
kernel='rbf'
-
-
与核相关的参数
- gamma(RBF核)
- 高斯核宽度的倒数
- gamma(RBF核)
-
-
优点
- 在各种数据集上的表现都很好
- 允许决策边界很复杂,即使数据只有几个特征
- 在低维数据和高维数据(即很少特征和很多特征)上的表现都很好
- 在有多达10000个样本的数据上运行SVM可能表现良好
-
缺点
- 对样本个数的缩放表现不好
- 如果数据量达到100000甚至更大,在运行时间和内存使用方面可能会面临挑战
- 预处理数据和调参都需要非常小心
- 很难检查
- 很难理解为什么会这么预测
- 难以将模型向非专家进行解释
3.8 神经网络(深度学习,MLP)
- 多层感知机(MLP)
- (普通)前馈神经网络
- 简称:神经网络
3.8.1 神经网络模型
-
MLP可以被视为广义的线性模型
-
线性模型的预测公式
$\hat{y} = w[0]*x[0] + w[1]*x[1] + \cdots + w[p]*x[p] + b$
- $\hat{y}$:输入特征 $x[0]$ 到 $x[p]$ 的加权求和
- 权重:学到的系数 $w[0]$ 到 $w[p]$
-
公式可视化
import mglearn

mglearn.plots.plot_logistic_regression_graph()
- 左边的每个结点:一个输入特征
- 连线:学到的系数
- 右边的结点:输出(输入的加权求和)
-
MLP多次重复计算加权求和的过程
- 步骤
- 计算代表中间过程的隐单元
- 计算这些隐单元的加权求和
- 得到最终结果
import mglearn

mglearn.plots.plot_single_hidden_layer_graph()
- 需要学习更多的系数(权重)
- 在每个输入与每个隐单元之间有一个系数
- 在每个隐单元与输出之间有一个系数
-
为了让MLP比线性模型更强大,需要一个技巧
- 计算完每个隐单元的加权求和之后,对结果再应用一个非线性函数
- 校正非线性(relu)
- 正切双曲线(tanh)
- 将这个函数的结果用于加权求和,计算得到输出 y ^ \hat{y} y^
import numpy as np
from matplotlib import pyplot as plt

line = np.linspace(-3, 3, 100)

plt.plot(line, np.tanh(line), label="tanh")
plt.plot(line, np.maximum(line, 0), label="relu")

plt.legend(loc="best")
plt.xlabel("x")
plt.ylabel("relu(x), tanh(x)")

plt.tight_layout()
plt.show()
- relu
- 截断小于0的值
- tanh
- 输入较小时:接近-1
- 输入较大时:接近+1
-
对于小型的神经网络(单隐层的多层感知机),计算回归问题的 y ^ \hat{y} y^的完整公式(使用tanh非线性)
$h[0] = \tanh(w[0,0]*x[0] + w[1,0]*x[1] + w[2,0]*x[2] + w[3,0]*x[3] + b[0])$
$h[1] = \tanh(w[0,1]*x[0] + w[1,1]*x[1] + w[2,1]*x[2] + w[3,1]*x[3] + b[1])$
$h[2] = \tanh(w[0,2]*x[0] + w[1,2]*x[1] + w[2,2]*x[2] + w[3,2]*x[3] + b[2])$
$\hat{y} = v[0]*h[0] + v[1]*h[1] + v[2]*h[2] + b$
- $w$:输入 $x$ 与隐层 $h$ 之间的权重
- $v$:隐层 $h$ 与输出 $\hat{y}$ 之间的权重
- 权重 $w$ 和 $v$ 要从数据中学习得到
- $x$:输入特征
- $\hat{y}$:计算得到的输出
- $h$:计算的中间结果
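上面的公式可以直接用numpy写出来。下面是一个单隐层前向计算的最小示意(权重取随机值,仅演示计算流程,并非训练好的模型,也非书中代码):
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(4)        # 4个输入特征
w = rng.randn(4, 3)     # 输入到隐层的权重 w[i, j]
b = rng.randn(3)        # 隐层的偏置
v = rng.randn(3)        # 隐层到输出的权重
b_out = rng.randn()     # 输出的偏置

h = np.tanh(x @ w + b)  # 隐单元:加权求和后应用tanh非线性
y_hat = h @ v + b_out   # 输出:隐单元的加权求和
print(h, y_hat)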
-
隐层中的结点个数:需要用户设置的一个重要参数
- 对于非常小或非常简单的数据集:结点数可以很小
- 对于非常复杂的数据集:结点数可以很大
-
添加多个隐层
import mglearn

mglearn.plots.plot_two_hidden_layer_graph()
3.8.2 神经网络调参
from matplotlib import pyplot as plt
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

mlp = MLPClassifier(solver='lbfgs', random_state=0).fit(X_train, y_train)

mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=.3)
mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)

plt.xlabel("Feature 0")
plt.ylabel("Feature 1")

plt.tight_layout()
plt.show()
-
神经网络学到的决策边界完全是非线性的,但相对平滑
-
默认情况下使用100个隐结点
-
减少隐结点的数量
- 降低模型复杂度
from matplotlib import pyplot as plt import mglearn from sklearn.model_selection import train_test_split from sklearn.neural_network import MLPClassifier from sklearn.datasets import make_moonsX, y = make_moons(n_samples=100, noise=0.25, random_state=3) X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)mlp = MLPClassifier(solver='lbfgs', random_state=0, hidden_layer_sizes=[10]).fit(X_train, y_train) # hidden_layer_sizes: 隐结点数量mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=.3) mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)plt.xlabel("Feature 0") plt.ylabel("Feature 1")plt.tight_layout() plt.show()
- 决策边界更加参差不齐
-
默认的非线性是relu
-
如果使用单隐层,那么决策函数将由直线段组成
-
得到更加平滑的决策的3种方法
-
添加更多隐结点
from matplotlib import pyplot as plt import mglearn from sklearn.model_selection import train_test_split from sklearn.neural_network import MLPClassifier from sklearn.datasets import make_moonsX, y = make_moons(n_samples=100, noise=0.25, random_state=3) X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)mlp = MLPClassifier(solver='lbfgs', random_state=0, hidden_layer_sizes=[100]).fit(X_train, y_train) # 隐结点数上升为100mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=.3) mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)plt.xlabel("Feature 0") plt.ylabel("Feature 1")plt.tight_layout() plt.show()
-
添加第二个隐层
from matplotlib import pyplot as plt import mglearn from sklearn.model_selection import train_test_split from sklearn.neural_network import MLPClassifier from sklearn.datasets import make_moonsX, y = make_moons(n_samples=100, noise=0.25, random_state=3) X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)mlp = MLPClassifier(solver='lbfgs', random_state=0, hidden_layer_sizes=[10, 10]).fit(X_train, y_train) # 使用2个隐层,每个包含10个单元mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=.3) mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)plt.xlabel("Feature 0") plt.ylabel("Feature 1")plt.tight_layout() plt.show()
-
使用tanh非线性
from matplotlib import pyplot as plt import mglearn from sklearn.model_selection import train_test_split from sklearn.neural_network import MLPClassifier from sklearn.datasets import make_moonsX, y = make_moons(n_samples=100, noise=0.25, random_state=3) X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)mlp = MLPClassifier(solver='lbfgs', activation='tanh', random_state=0, hidden_layer_sizes=[10, 10]).fit(X_train, y_train) # 使用2个隐层,每个包含10个单元,使用tanh非线性mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=.3) mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train)plt.xlabel("Feature 0") plt.ylabel("Feature 1")plt.tight_layout() plt.show()
-
利用L2惩罚使权重趋向于0
- alpha:alpha越大,模型越简单(泛化能力越强)
from matplotlib import pyplot as plt import mglearn from sklearn.model_selection import train_test_split from sklearn.neural_network import MLPClassifier from sklearn.datasets import make_moonsX, y = make_moons(n_samples=100, noise=0.25, random_state=3) X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)fig, axes = plt.subplots(2, 4, figsize=(20, 8)) for axx, n_hidden_nodes in zip(axes, [10, 100]):for ax, alpha in zip(axx, [0.0001, 0.01, 0.1, 1]):mlp = MLPClassifier(solver='lbfgs', random_state=0,hidden_layer_sizes=[n_hidden_nodes, n_hidden_nodes],alpha=alpha)mlp.fit(X_train, y_train)mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=.3, ax=ax)mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train, ax=ax)ax.set_title("n_hidden=[{}, {}]\nalpha={:.4f}".format(n_hidden_nodes, n_hidden_nodes, alpha))plt.tight_layout() plt.show()
-
-
重要性质
- 在开始学习之前其权重是随机的,这种随机初始化会影响学到的模型
- 即使使用完全相同的参数,如果随机种子不同的话,可能会得到非常不一样的模型
from matplotlib import pyplot as plt import mglearn from sklearn.model_selection import train_test_split from sklearn.neural_network import MLPClassifier from sklearn.datasets import make_moonsX, y = make_moons(n_samples=100, noise=0.25, random_state=3) X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)fig, axes = plt.subplots(2, 4, figsize=(20, 8)) for i, ax in enumerate(axes.ravel()):mlp = MLPClassifier(solver='lbfgs', random_state=i, hidden_layer_sizes=[100, 100])mlp.fit(X_train, y_train)mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=.3, ax=ax)mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train, ax=ax)ax.set_title("random_state={}".format(i))plt.tight_layout() plt.show()
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_breast_cancercancer = load_breast_cancer()
print("Cancer data per-feature maxima:\n{}".format(cancer.data.max(axis=0)))
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)mlp = MLPClassifier(random_state=42).fit(X_train, y_train)print(f"training score: {(mlp.score(X_train, y_train)):.3f}")
# training score: 0.939print(f"test score: {mlp.score(X_test, y_test):.3f}")
# test score: 0.916
- MLP要求所有输入特征的变化范围相似
- 最理想情况
- 均值:0
- 方差:1
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

# 计算训练集中每个特征的平均值
mean_on_train = X_train.mean(axis=0)
# 计算训练集中每个特征的标准差
std_on_train = X_train.std(axis=0)

# 减去平均值,然后乘以标准差的倒数
# 如此运算之后,mean=0,std=1
X_train_scaled = (X_train - mean_on_train) / std_on_train
# 对测试集做相同的变换(使用训练集的平均值和标准差)
X_test_scaled = (X_test - mean_on_train) / std_on_train

mlp = MLPClassifier(random_state=0).fit(X_train_scaled, y_train)

print(f"training score: {mlp.score(X_train_scaled, y_train):.3f}")
# training score: 0.991

print(f"test score: {mlp.score(X_test_scaled, y_test):.3f}")
# test score: 0.965

# 增加迭代次数
mlp = MLPClassifier(max_iter=1000, random_state=0).fit(X_train_scaled, y_train)

print(f"training score: {mlp.score(X_train_scaled, y_train):.3f}")
# training score: 1.000

print(f"test score: {mlp.score(X_test_scaled, y_test):.3f}")
# test score: 0.972

# 增大alpha参数,增强泛化性能
mlp = MLPClassifier(max_iter=1000, alpha=1, random_state=0).fit(X_train_scaled, y_train)

print(f"training score: {mlp.score(X_train_scaled, y_train):.3f}")
# training score: 0.988

print(f"test score: {mlp.score(X_test_scaled, y_test):.3f}")
# test score: 0.972
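与SVM一节的MinMaxScaler类似,上面手动做的标准化也可以用 sklearn.preprocessing.StandardScaler 完成,下面是一个等价写法的示意(非书中代码):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

scaler = StandardScaler().fit(X_train)      # 只用训练集计算均值和标准差
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

mlp = MLPClassifier(max_iter=1000, random_state=0).fit(X_train_scaled, y_train)
print(f"test score: {mlp.score(X_test_scaled, y_test):.3f}")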
-
查看模型的权重
from matplotlib import pyplot as plt from sklearn.datasets import load_breast_cancer from sklearn.model_selection import train_test_split from sklearn.neural_network import MLPClassifiercancer = load_breast_cancer() X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)mlp = MLPClassifier(solver='lbfgs', random_state=0).fit(X_train, y_train)plt.figure(figsize=(20, 5)) plt.imshow(mlp.coefs_[0], interpolation='none', cmap='viridis') plt.yticks(range(30), cancer.feature_names)plt.xlabel("Columns in weight matrix") plt.ylabel("Input feature") plt.colorbar()plt.tight_layout() plt.show()
-
其它的深度学习库
- keras
- 可以使用tensorflow和theano作为后端
- lasagna
- 基于theano
- tensorflow
3.8.3 优点、缺点和参数
- 优点
- 能够获取大量数据中包含的信息,并构建无比复杂的模型
- 给定足够的计算时间和数据,并且仔细调节参数,神经网络通常可以打败其他机器学习算法(无论是分类任务还是回归任务)
- 缺点
- 通常需要很长的训练时间
- 需要仔细地预处理数据
- 参数
- 层数
- 每层的隐单元个数
- 在“均匀”数据上的性能最好
- “均匀”:所有特征都具有相似的含义
- 如果数据包含不同种类的特征,那么基于树的模型可能表现得更好
估计神经网络的复杂度
-
首先设置1个或2个隐层,然后可以逐步增加
-
每个隐层的结点个数通常与输入特征个数接近,但在几千个结点时很少会多于特征个数
-
在考虑神经网络的模型复杂度时,一个有用的度量是学到的权重(或系数)的个数
-
如果有一个包含100个特征的二分类数据集,模型有100个隐单元
- 输入层和第一个隐层之间就有100*100=10000个权重
- 隐层和输出层之间还有100*1=100个权重
- 总共约10100个权重
-
如果添加含有100个隐单元的第二个隐层
- 第一个隐层和第二个隐层之间又有100*100=10000个权重
- 总数变为约20100个权重
-
如果使用包含1000个隐单元的单隐层
- 输入层和隐层之间需要学习100*1000=100000个权重
- 隐层到输出层之间需要学习1000*1=1000个权重
- 总共101000个权重
-
如果再添加第二个隐层
- 会增加1000*1000=1000000个权重
- 总数变为巨大的1101000个权重
- 比含有2个隐层、每层100个单元的模型要大50倍
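可以直接从训练好的 MLPClassifier 的 coefs_ 属性数出这些权重,下面是一个小的验证示意(数据为随机生成,迭代次数故意取得很小,只为数参数个数;非书中代码):
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
X = rng.randn(200, 100)          # 100个特征
y = rng.randint(0, 2, size=200)  # 二分类

mlp = MLPClassifier(hidden_layer_sizes=[100], max_iter=20, random_state=0).fit(X, y)

# coefs_ 是各层之间的权重矩阵列表(不含偏置)
print([w.shape for w in mlp.coefs_])    # [(100, 100), (100, 1)]
print(sum(w.size for w in mlp.coefs_))  # 100*100 + 100*1 = 10100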
-
调参的常用方法
- 创建一个大到足以过拟合的网络,确保这个网络可以对任务进行学习
- 缩小网络或增大alpha来增强正则化(提高泛化性能)
-
如何学习模型(用来学习参数的算法)
- 由solver参数设定
- solver有两个好用的选项
- 默认选项是'adam'
- 在大多数情况下效果都很好
- 对数据的缩放相当敏感(因此,始终将数据缩放为均值为0、方差为1是很重要的)
- 另一个选项是'lbfgs'
- 鲁棒性相当好
- 在大型模型或大型数据集上的时间会比较长
4. 分类器的不确定度估计
- 两个获取分类器的不确定度估计的函数
- decision_function
- predict_proba
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)

# 为了便于说明,我们将两个类别重命名为"blue"和"red"
y_named = np.array(["blue", "red"])[y]

# 我们可以对任意个数组调用train_test_split
# 所有数组的划分方式都是一致的
X_train, X_test, y_train_named, y_test_named, y_train, y_test = train_test_split(X, y_named, y, random_state=0)

# 构建梯度提升模型
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train_named)
4.1 决策函数(decision_function)
- 二分类问题,decision_function返回值的形状为(n_samples,)
- 每个样本都返回一个浮点数
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_splitX, y = make_circles(noise=0.25, factor=0.5, random_state=1)y_named = np.array(["blue", "red"])[y]X_train, X_test, y_train_named, y_test_named, y_train, y_test = train_test_split(X, y_named, y, random_state=0)gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train_named)print("X_test.shape: {}".format(X_test.shape))
# X_test.shape: (25, 2)print("Decision function shape: {}".format(gbrt.decision_function(X_test).shape))
# Decision function shape: (25,)
- 正值:对正类(类别1)的偏好
- 负值:对反类(类别0)的偏好
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_splitX, y = make_circles(noise=0.25, factor=0.5, random_state=1)y_named = np.array(["blue", "red"])[y]X_train, X_test, y_train_named, y_test_named, y_train, y_test = train_test_split(X, y_named, y, random_state=0)gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train_named)print("Decision function: {}".format(gbrt.decision_function(X_test)[:6]))
# Decision function: [ 4.13592603 -1.67785652 -3.95106099 -3.62604651 4.28986642 3.66166081]
- 通过查看决策函数的正负号来再现预测值
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "red"])[y]
X_train, X_test, y_train_named, y_test_named, y_train, y_test = train_test_split(X, y_named, y, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train_named)

print("decision function:\n{}".format(gbrt.decision_function(X_test) > 0))
# decision function:
# [ True False False False  True  True False  True  True  True False  True
#   True False  True False False False  True  True  True  True  True False
#  False]

print("predictions:\n{}".format(gbrt.predict(X_test)))
# predictions:
# ['red' 'blue' 'blue' 'blue' 'red' 'red' 'blue' 'red' 'red' 'red' 'blue'
#  'red' 'red' 'blue' 'red' 'blue' 'blue' 'blue' 'red' 'red' 'red' 'red'
#  'red' 'blue' 'blue']
- 再现predict的输出
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "red"])[y]
X_train, X_test, y_train_named, y_test_named, y_train, y_test = train_test_split(X, y_named, y, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train_named)

# 将布尔值True/False转换成0和1
greater_zero = (gbrt.decision_function(X_test) > 0).astype(int)
# 利用0和1作为classes的索引
pred = gbrt.classes_[greater_zero]

print("predictions:\n{}".format(pred))
# predictions:
# ['red' 'blue' 'blue' 'blue' 'red' 'red' 'blue' 'red' 'red' 'red' 'blue'
#  'red' 'red' 'blue' 'red' 'blue' 'blue' 'blue' 'red' 'red' 'red' 'red'
#  'red' 'blue' 'blue']
- decision_function可以在任意范围取值
- 取决于数据与模型参数
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "red"])[y]
X_train, X_test, y_train_named, y_test_named, y_train, y_test = train_test_split(X, y_named, y, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train_named)

decision_function = gbrt.decision_function(X_test)
print("decision function minimum: {:.2f} maximum: {:.2f}".format(np.min(decision_function), np.max(decision_function)))
# decision function minimum: -7.69 maximum: 4.29
- 在二维平面中画出所有点的decision_function与决策边界
from matplotlib import pyplot as plt
import mglearn
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "red"])[y]
X_train, X_test, y_train_named, y_test_named, y_train, y_test = train_test_split(X, y_named, y, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train_named)

fig, axes = plt.subplots(1, 2, figsize=(13, 5))
mglearn.tools.plot_2d_separator(gbrt, X, ax=axes[0], alpha=.4, fill=True, cm=mglearn.cm2)
scores_image = mglearn.tools.plot_2d_scores(gbrt, X, ax=axes[1], alpha=.4, cm=mglearn.ReBl)

for ax in axes:
    # 画出训练点和测试点
    mglearn.discrete_scatter(X_test[:, 0], X_test[:, 1], y_test, markers='^', ax=ax)
    mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train, markers='o', ax=ax)
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")

cbar = plt.colorbar(scores_image, ax=axes.tolist())
axes[1].legend(["Test class 0", "Test class 1", "Train class 0", "Train class 1"], ncol=4, loc=(-.9, 1.1))
plt.show()
4.2 预测概率(predict_proba)
- 二分类问题,predict_proba返回值的形状为(n_samples, 2)
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "red"])[y]
X_train, X_test, y_train_named, y_test_named, y_train, y_test = train_test_split(X, y_named, y, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train_named)

print("Shape of probabilities: {}".format(gbrt.predict_proba(X_test).shape))
# Shape of probabilities: (25, 2)
- 第一个元素:第一个类别的估计概率
- 第二个元素:第二个类别的估计概率
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "red"])[y]
X_train, X_test, y_train_named, y_test_named, y_train, y_test = train_test_split(X, y_named, y, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train_named)

# 显示predict_proba的前几个元素
print("Predicted probabilities:\n{}".format(gbrt.predict_proba(X_test[:6])))
# Predicted probabilities:
# [[0.01573626 0.98426374]
#  [0.84262049 0.15737951]
#  [0.98112869 0.01887131]
#  [0.97406909 0.02593091]
#  [0.01352142 0.98647858]
#  [0.02504637 0.97495363]]
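- 补充说明(基于默认对数损失的假设):对二分类的GradientBoostingClassifier,predict_proba中正类的概率大致相当于对decision_function做logistic(sigmoid)变换;下面的示意代码用scipy.special.expit验证这一点

```python
# 示意代码:验证 predict_proba[:, 1] 与 sigmoid(decision_function) 的对应关系
# (假设使用默认的对数损失,若更换损失函数则不一定成立)
import numpy as np
from scipy.special import expit
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "red"])[y]
X_train, X_test, y_train_named, y_test_named, y_train, y_test = train_test_split(X, y_named, y, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train_named)

# 对decision_function做sigmoid变换,与predict_proba中正类('red')的概率比较
proba_from_decision = expit(gbrt.decision_function(X_test))
print("是否一致: {}".format(np.allclose(proba_from_decision, gbrt.predict_proba(X_test)[:, 1])))
```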
- 预测不确定度的大小实际上取决于模型和参数
  - 过拟合更强的模型:可能给出置信程度更高的预测(即使可能是错的)
  - 复杂度越低的模型:对预测的不确定度通常越大
  - 校正(calibrated)模型:给出的不确定度符合实际情况的模型(下面给出一个简单的示意对比)
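- 下面是一个简单的示意对比(max_depth的取值仅为示例假设):在同一数据上训练较浅和较深的梯度提升模型,比较测试集上预测概率的平均最大值;通常复杂度更高、更容易过拟合的模型给出的置信度更高。另外,如需校正预测概率,scikit-learn提供了sklearn.calibration.CalibratedClassifierCV

```python
# 示意代码:比较不同复杂度模型的预测置信度(max_depth取值仅为示例假设)
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for max_depth in [1, 5]:
    gbrt = GradientBoostingClassifier(max_depth=max_depth, random_state=0)
    gbrt.fit(X_train, y_train)
    # 每个测试样本上预测概率的最大值,可粗略看作模型对该预测的置信度
    mean_confidence = gbrt.predict_proba(X_test).max(axis=1).mean()
    print("max_depth={}: 平均置信度 {:.2f}, 测试集精度 {:.2f}".format(
        max_depth, mean_confidence, gbrt.score(X_test, y_test)))
```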
- 在二维平面中画出所有点的predict_proba与决策边界
from matplotlib import pyplot as plt
import mglearn
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
y_named = np.array(["blue", "red"])[y]
X_train, X_test, y_train_named, y_test_named, y_train, y_test = train_test_split(X, y_named, y, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train_named)

fig, axes = plt.subplots(1, 2, figsize=(13, 5))
mglearn.tools.plot_2d_separator(gbrt, X, ax=axes[0], alpha=.4, fill=True, cm=mglearn.cm2)
scores_image = mglearn.tools.plot_2d_scores(gbrt, X, ax=axes[1], alpha=.4, cm=mglearn.ReBl, function='predict_proba')

for ax in axes:
    # 画出训练点和测试点
    mglearn.discrete_scatter(X_test[:, 0], X_test[:, 1], y_test, markers='^', ax=ax)
    mglearn.discrete_scatter(X_train[:, 0], X_train[:, 1], y_train, markers='o', ax=ax)
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")

cbar = plt.colorbar(scores_image, ax=axes.tolist())
axes[1].legend(["Test class 0", "Test class 1", "Train class 0", "Train class 1"], ncol=4, loc=(-.9, 1.1))
plt.show()
4.3 多分类问题的不确定度
- 三分类问题
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

gbrt = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
gbrt.fit(X_train, y_train)

print("Decision function shape: {}".format(gbrt.decision_function(X_test).shape))
# Decision function shape: (38, 3)

# 显示决策函数的前几个元素
print("Decision function:\n{}".format(gbrt.decision_function(X_test)[:6, :]))
# Decision function:
# [[-1.995715    0.04758267 -1.92720695]
#  [ 0.06146394 -1.90755736 -1.92793758]
#  [-1.99058203 -1.87637861  0.09686725]
#  [-1.995715    0.04758267 -1.92720695]
#  [-1.99730159 -0.13469108 -1.20341483]
#  [ 0.06146394 -1.90755736 -1.92793758]]
- 多分类问题,decision_function返回值的形状为(n_samples, n_classes)
- 每一列对应每个类别的“确定度分数”
- 分数较高的类别可能性更大
- 分数较小的类别可能性更小
- 再现预测结果
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

gbrt = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
gbrt.fit(X_train, y_train)

print("Argmax of decision function: {}".format(np.argmax(gbrt.decision_function(X_test), axis=1)))
# Argmax of decision function: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1 0]

print("Predictions: {}".format(gbrt.predict(X_test)))
# Predictions: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1 0]
- 多分类问题,predict_proba返回值的形状为(n_samples, n_classes)
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

gbrt = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
gbrt.fit(X_train, y_train)

# 显示predict_proba的前几个元素
print("Predicted probabilities:\n{}".format(gbrt.predict_proba(X_test)[:6]))
# Predicted probabilities:
# [[0.10217718 0.78840034 0.10942248]
#  [0.78347147 0.10936745 0.10716108]
#  [0.09818072 0.11005864 0.79176065]
#  [0.10217718 0.78840034 0.10942248]
#  [0.10360005 0.66723901 0.22916094]
#  [0.78347147 0.10936745 0.10716108]]

# 显示每行的和都是1
print("Sums: {}".format(gbrt.predict_proba(X_test)[:6].sum(axis=1)))
# Sums: [1. 1. 1. 1. 1. 1.]
总结
-
decision_function与predict_proba的形状始终相同,均为(n_samples, n_classes)
  - 二分类问题是特例:此时decision_function的形状为(n_samples,),而predict_proba仍为(n_samples, 2)
-
如果有n_classes列,可以对每个样本(每一行)取各列分数的argmax(即np.argmax(..., axis=1))来再现预测结果
-
对比predict的结果与decision_function或predict_proba的结果,需要使用分类器的classes_属性来获取真实的类别名称
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

logreg = LogisticRegression()
named_target = iris.target_names[y_train]
logreg.fit(X_train, named_target)

print("unique classes in training data: {}".format(logreg.classes_))
# unique classes in training data: ['setosa' 'versicolor' 'virginica']

print("predictions: {}".format(logreg.predict(X_test)[:10]))
# predictions: ['versicolor' 'setosa' 'virginica' 'versicolor' 'versicolor' 'setosa' 'versicolor' 'virginica' 'versicolor' 'versicolor']

argmax_dec_func = np.argmax(logreg.decision_function(X_test), axis=1)
print("argmax of decision function: {}".format(argmax_dec_func[:10]))
# argmax of decision function: [1 0 2 1 1 0 1 2 1 1]

print("argmax combined with classes: {}".format(logreg.classes_[argmax_dec_func][:10]))
# argmax combined with classes: ['versicolor' 'setosa' 'virginica' 'versicolor' 'versicolor' 'setosa' 'versicolor' 'virginica' 'versicolor' 'versicolor']