AHU 人工智能实验-CCA-编程知识

神经网络覆盖算法——CCA（基于Ling Zhang 和Bo Zhang论文)

Abstract

在这篇文章中我将介绍基于张铃和张钹学者提出的CCA算法，并实现代码复现，给出使用的数据集，以及实验结果对比。

1. Introduction

1.1 Background

我们知道自从1943年 McCulloch和Pitts第一次提出神经元的数学模型以来，神经网络开始逐步得到长足的发展。一个神经元的数学模型如下：

在这里插入图片描述

1.2 Motivation

但是长久以来人工智能领域的学者都在寻找合理的几何意义，去解释上述神经元。基于此种困境，我国学者张铃和张钹提出覆盖算法（下文都用CCA代替）。

2. Geometrical Representation of Neural Model

2.1 The M-P Neuron

我们先从神经元的功能开始聊起，对于一个神经元数学模型 $\text{sgn}(wx - \theta)$ ，其中 $w$ 和 $x$ 都是 $n$ 维向量， $n$ 取决于输入样本的特征数，而 $\theta$ 是阈值，也可以被称为偏置， $s g n$ 是一个激活函数，它负责将最后运算的结果映射输出，传递到下一个神经元。

2.2 Sphere Neighborhoods

我们先从原来的分类算法的几何意义开始聊起。

假如我们现在有一个数据集，数据集里的数据来自两类，一个是小狗，一个是小猫，小狗的数据样本为 $X = (x 1, x 2, x 3, . . ., x n, L a b e l = 0)$ ，小猫的数据样本为 $X = (x 1, x 2, x 3, . . ., x n, L a b e l = 1)$ ，我们这里便于举例子，将小猫和小狗的特征维度假定为2，那我们在二维的向量空间中就可以标记小猫小狗的样本分布，当然为了让分布在空间中分布的更加均匀，我们首先对所有的数据进行归一化。

归一化公式如下：
$\text{Normalized Value} = \frac{x - \text{Min}(x)}{\text{Max}(x) - \text{Min}(x)}$

我们现在绘出样本在二维空间中的分布：

在这里插入图片描述

由于如果使用超平面直接划分，则至少需要三个超平面，而且几何意义很难解释,同时在输入向量的时候维度可能不同，因此我们选择将数据升维，即将原来的样本空间投射到n+1维的一个向量空间中，这样可以保证样本长度相同。如图所示:

在这里插入图片描述

3. Constructive Algorithm for Neural Networks

下面我们重点介绍本篇文章的重点部分，即CCA的实现步骤。

3.1 归一化输入样本

公式如下
$\text{Normalized Value} = \frac{x - \text{Min}(x)}{\text{Max}(x) - \text{Min}(x)}$
相应代码：

# 对读入的数据进行归一化处理
def normalize_data(data):# 数据集的最后一列是标签，将其排除在归一化之外features = data.iloc[:, :-1]labels = data.iloc[:, -1]scaler = MinMaxScaler()# 将特征数据进行归一化normalized_features = scaler.fit_transform(features)# 构建归一化后的数据集，包含原标签列normalized_data_with_labels = pd.DataFrame(normalized_features, columns=features.columns)normalized_data_with_labels['Label'] = labelsreturn normalized_data_with_labels

3.2 升维，将样本投影至 $n + 1$ 维样本空间

公式如下：

$\rightarrow S^{n+1}, \quad T(x) = \left(x, \sqrt{R^2 - \lVert x \rVert^2}\right)$ ,其中 $R$ 是所有输入样本中模值最大的值。

补充说明， $\lVert x \rVert^2 = (x1^2+x2^2+...+xn^2)$

举例说明

在这里插入图片描述

其中 $R = 8$ , 即是 $x 4$ 的模值，因为它最大，由上述公式我们可得

在这里插入图片描述

升维完毕。

相应代码：

# 升维函数
def uplift_dimension(normalized_data):# 计算模最大值max_norm = torch.norm(torch.tensor(normalized_data.iloc[:, :-1].values, dtype=torch.float32), dim=1).max().item()# 升维uplifted_data = []for i, sample in enumerate(normalized_data.values):original_norm = torch.norm(torch.tensor(sample[:-1].astype(float)))uplifted_value = torch.sqrt(max_norm**2 - original_norm**2).item()uplifted_sample = list(sample[:-1]) + [uplifted_value] + [int(sample[-1])]uplifted_data.append(uplifted_sample)# 创建新的 DataFrameuplifted_columns = list(normalized_data.columns)[:-1] + ['Uplifted'] + [normalized_data.columns[-1]]uplifted_df = pd.DataFrame(uplifted_data, columns=uplifted_columns)return uplifted_df

3.3 构造隐层神经元

我们这里采用的是折中半径算法，这里给出我的一些证明。

我们先随机的选中样本空间中的一个样本记为 $Sample_A$ ，然后找到最小欧几里得距离的异类样本记为 $Sample_B$ ，距离记为 $d_2$ ，如图所示：

在这里插入图片描述

然后我们在此前提下，找出不大于 $d_2$ 的欧几里得距离最大的同类样本 $Sample_C$ ，距离记为 $d_1$ ,然后取折中半径 $d=(d_1+d_2)/2$ , 如图所示：

在这里插入图片描述

但是我们发现在此种情况下，我们可以使用内积来代替欧几里得距离，证明如下：

欧几里得距离：在这里插入图片描述

这里由于数据已经归一化过了，所以所有的样本模长都等于 $R^2$ ,所以上述距离公式可以化简为 $dist(x_k,x_i)= \sqrt{2R^2-2<x_k,x_i>}$ ,其中 $x_k,x_i>$ 代表两者的内积，因为 $R^2$ 是一个固定值，所以求最小距离等价于求最大的内积，同理求最大距离等价于求最小的内积。

所以上述问题转化为求最大内积的异类样本点，求最小内积的同类样本点。

对于覆盖模型的数据结构，我们采用以下结构体：

# 定义覆盖结构体
class CoverStruct:def __init__(self, center, radius, label, index):self.center = centerself.radius = radiusself.label = labelself.index = index

我们每训练得一个覆盖，就记录他的中心和半径，以及该覆盖的标签，和序号。

3.4 测试过程

我们采用十折交叉验证，这里解释一下十折交叉验证（图片来自周志华老师的《机器学习》）：

在这里插入图片描述

验证的过程是，每个测试集测试的时候，我们逐个验证覆盖列表中的每一个覆盖，顺序遍历每一个覆盖，直到找到一个覆盖能包含测试样本为止，然后预测标签就等于覆盖标签，如果遍历所有覆盖我们都没有找能够包含测试样本的覆盖，那么我们选取距离覆盖中心内积值最大的（即距离中心欧几里得距离最小的）覆盖的覆盖标签作为预测标签。

4. Results and Comparison

Database	CCA(accuracy)	My_CCA(accuracy)	Accuracy Difference
Haberman+s+survival	65.30%	70.24%	4.94%
Fertility	80.48%	81.20%	0.72%
Ionosphere	90.77%	91.43%	0.66%

5. Reflection

这是本人自己第一次实现一个简单的神经网络，经历了日日夜夜的写代码，改代码的痛苦过程，当最后的预测结果和论文差不多时，心情是非常激动的，在此之前，我对神经网络的认识仅仅停留在输入层，输出层，Hidden Layer，还有激活函数，优化等名词，我的总结是人工智能这门课，你不能当作一门理论课去学习，而是应当当作一门实验课去学习，甚至是当作一门语言去学习，我指的是你需要写大量的代码来提高自己的能力。做永远比想更为重要！

6.Resource

数据集来源：

https://archive.ics.uci.edu/datasets?skip=10&take=10&sort=desc&orderBy=NumHits&search=Glass

开源代码：

Pytorch的安装方法可以参考CSDN上其他热门答主的回答。

import torch
import torch.nn as nn
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold
import random
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer# 定义覆盖结构体
class CoverStruct:def __init__(self, center, radius, label, index):self.center = centerself.radius = radiusself.label = labelself.index = index# 定义输出结构体
class CoveringResult:def __init__(self, label, result):self.label = labelself.result = result
def get_covered_samples(data, cover):# 找到在给定覆盖中的样本集合covered_samples = set()for i, sample in data.iterrows():sample_values = torch.tensor(sample[:-1].values, dtype=torch.float32)# 计算样本到覆盖中心的内积inner_product = torch.dot(sample_values, torch.tensor(cover.center))# 如果距离大于等于覆盖半径，将该样本加入集合if inner_product >= cover.radius:covered_samples.add(i)return covered_samples# 定义模型
class CoveringNN(nn.Module):def __init__(self, input_size):super(CoveringNN, self).__init__()self.cover_list = []self.input_size = input_size# 我们使用内积来转换欧几里得距离函数，这里我给出自己的一些证明，内积<X1,X2>=X1.1*X2.1+...X1.n*X2.n# 欧几里得距离函数，(X1-X2)^2=X1.1^2+X1.2^2+...+X1.n^2+X2.1^2+X2.2^2+X2.n^2-2*(X1.1*X2.1+...X1.n*X2.n)# 要将此处转换，必须保证的前提是，每个样本的模长相等，我们经过归一化之后，每个样本的模长是R(即是所有样本中最大的模长)# 使用折中半径法，对于没有d1的情况，我们可以使用最大的同类样本距离去度量，即大于最小同类样本内积即可# 对于没有d2的情况，使用最小的异类样本距离，即用小于最大的异类样本内积即可def train_covering_nn(self, data):# 获取样本总数num_samples = len(data)print(num_samples)# 设置已学习样本的集合learned_samples = set()# 记录已学习的样本数量samples_learned_count = 0cover_index = 0# 循环的终止条件是训练集中的所有样本都被学习过while samples_learned_count < num_samples:# 检查是否还有未学习的样本remaining_samples = set(range(num_samples)) - learned_samplesprint(remaining_samples)if not remaining_samples:break  # 所有样本都已学习过，提前结束循环# 随机选择一个未学习的样本Aa_index = random.choice(list(remaining_samples))print(a_index)# 选择样本Aa_sample = torch.tensor(data.iloc[a_index, :-1].values, dtype=torch.float32)print(a_sample)# 找到异类样本B和内积值d1# 从数据集中选择标签与样本A不同的所有样本# 排除已经学习过的样本b_samples = data[(data.index.isin(remaining_samples)) & (data['Label'] != data.iloc[a_index, -1])]# 检查是否有这样的样本if len(b_samples) > 0:# 将所选样本的特征值转换为 PyTorch 张量b_samples_values = torch.tensor(b_samples.iloc[:, :-1].values, dtype=torch.float32)print(b_samples_values)# 计算每个b_samples中的样本与样本A的内积inner_products = torch.matmul(b_samples_values, a_sample)print(inner_products)# 找到b_samples中具有最大内积的样本的索引b_sample_index = inner_products.argmax().item()print("b_sample_index:", b_sample_index)# 提取具有最大内积的样本b_sample = b_samples_values[b_sample_index]print("b_sample:", b_sample)# 提取最大内积的值d1 = inner_products[b_sample_index].item()else:# 如果b_samples中没有样本，则将d1设为0d1 = 0print("Max inner_product", d1)# 找到同类样本C和内积值d2# 从数据集中选择标签与样本A相同的所有样本c_samples = data[(data.index.isin(remaining_samples)) & (data['Label'] == data.iloc[a_index, -1])]# 检查是否有这样的样本# 将所选样本的特征值转换为 PyTorch 张量c_samples_values = torch.tensor(c_samples.iloc[:, :-1].values, dtype=torch.float32)# 计算每个c_samples中的样本与样本A的内积inner_products_c = torch.matmul(c_samples_values, a_sample)valid_inner_products_indices = (inner_products_c > d1).nonzero().view(-1)if len(valid_inner_products_indices) > 0:min_inner_product_index = torch.argmin(inner_products_c[valid_inner_products_indices]).item()# 提取具有最小内积的样本c_sample = c_samples_values[valid_inner_products_indices[min_inner_product_index]]print("c_sample:", c_sample)# 提取最小内积的值d2 = inner_products_c[valid_inner_products_indices[min_inner_product_index]].item()else:# 如果没有满足条件的样本，则将d2设为d1d2 = d1print("Min inner_product", d2)# 如果d2 > d1，则更新覆盖模型if d2 >= d1:d = (d1 + d2) / 2learned_samples.add(a_index)# 创建新的覆盖结构体，其中包括样本A、覆盖半径d和样本A的标签new_cover = CoverStruct(a_sample, d, data.iloc[a_index, -1], cover_index)# 每个覆盖相当于一个神经元# 将新的覆盖添加到模型的覆盖列表中self.cover_list.append(new_cover)# 更新已学习样本的集合，将在新覆盖中的样本标记为已学习learned_samples.update(get_covered_samples(data, new_cover))# 更新已学习的样本数量samples_learned_count = len(learned_samples)print("已学习的样本数量", samples_learned_count)cover_index += 1print("训练完成。")cover_data = {'Center': [cover.center.numpy() for cover in self.cover_list],'Radius': [cover.radius for cover in self.cover_list],'Label': [cover.label for cover in self.cover_list],'Index': [cover.index for cover in self.cover_list]}cover_df = pd.DataFrame(cover_data)print("\n覆盖列表的数据框形式：")print(cover_df)def test_covering_nn(self, test_data):all_outputs = []for _, test_sample in test_data.iterrows():outputs = []test_sample_tensor = torch.tensor(test_sample[:-1].values, dtype=torch.float32)for cover in self.cover_list:inner_product = torch.matmul(test_sample_tensor, cover.center)result = 1 if inner_product > cover.radius else 0output = CoveringResult(cover.label, result)outputs.append(output)# 找到第一个被覆盖的样本covered_sample = Nonefor output in outputs:if output.result == 1:covered_sample = outputbreak# 如果有被覆盖的样本，则返回整个列表if covered_sample is not None:all_outputs.append(covered_sample.label)# 如果没有任何一个覆盖包含测试样本，则返回内积最大的元素的标签else:max_inner_product_index = Nonemax_inner_product_value = float('-inf')for i, output in enumerate(outputs):if output.result > max_inner_product_value:max_inner_product_value = output.resultmax_inner_product_index = iif max_inner_product_index is not None:max_inner_product_label = outputs[max_inner_product_index].labelall_outputs.append(max_inner_product_label)# 使用占位符值 -1 替换 None 值all_outputs = [label if label is not None else -1 for label in all_outputs]# 确保总是返回与测试样本数相匹配的预测标签列表return all_outputs if all_outputs else [-1] * len(test_data)# 首先读入数据集
def read_dataset(file_path):data = pd.read_csv(file_path)# 读取数据，假设第一行是数据# 标签预处理data.iloc[:, -1] = LabelEncoder().fit_transform(data.iloc[:, -1])# 创建一个SimpleImputer实例，使用均值进行填充imputer = SimpleImputer(strategy='mean')data.replace('?', np.nan, inplace=True)data = data.astype(float)data_imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)return data# 对读入的数据进行归一化处理
def normalize_data(data):# 数据集的最后一列是标签，将其排除在归一化之外features = data.iloc[:, :-1]labels = data.iloc[:, -1]scaler = MinMaxScaler()# 将特征数据进行归一化normalized_features = scaler.fit_transform(features)# 构建归一化后的数据集，包含原标签列normalized_data_with_labels = pd.DataFrame(normalized_features, columns=features.columns)normalized_data_with_labels['Label'] = labelsreturn normalized_data_with_labels# 升维函数
def uplift_dimension(normalized_data):# 计算模最大值max_norm = torch.norm(torch.tensor(normalized_data.iloc[:, :-1].values, dtype=torch.float32), dim=1).max().item()# 升维uplifted_data = []for i, sample in enumerate(normalized_data.values):original_norm = torch.norm(torch.tensor(sample[:-1].astype(float)))uplifted_value = torch.sqrt(max_norm**2 - original_norm**2).item()uplifted_sample = list(sample[:-1]) + [uplifted_value] + [int(sample[-1])]uplifted_data.append(uplifted_sample)# 创建新的 DataFrameuplifted_columns = list(normalized_data.columns)[:-1] + ['Uplifted'] + [normalized_data.columns[-1]]uplifted_df = pd.DataFrame(uplifted_data, columns=uplifted_columns)return uplifted_df#读入数据集
file_path = 'example.data'data = read_dataset(file_path)
print(data)#开始归一化处理
normalized_features = normalize_data(data)
print(normalized_features)
#升维
uplifted_data = uplift_dimension(normalized_features)print(uplifted_data)input_size = len(uplifted_data.columns) - 1
model = CoveringNN(input_size)# 初始化十折交叉验证
kf = KFold(n_splits=10, shuffle=True, random_state=42)# 存储每折的性能指标
accuracy_list = []# 开始十折交叉验证
for train_index, test_index in kf.split(uplifted_data):# 划分训练集和测试集train_data, test_data = uplifted_data.iloc[train_index], uplifted_data.iloc[test_index]# 初始化模型model = CoveringNN(input_size)# 训练模型model.train_covering_nn(train_data)# 测试模型test_results = model.test_covering_nn(test_data)results = pd.DataFrame(test_results)merged_data = pd.concat([results, test_data.iloc[:, -1]], axis=1)print(merged_data)# 计算准确率并存储accuracy = accuracy_score(test_data.iloc[:, -1], results)accuracy_list.append(accuracy)# 计算平均准确率
average_accuracy = sum(accuracy_list) / len(accuracy_list)
print(f'Average Accuracy: {average_accuracy * 100:.2f}%')