Data Mining数据挖掘—2. Classification分类

3. Classification

Given a collection of records (training set)
– each record contains a set of attributes
– one of the attributes is the class (label) that should be predicted
Find a model for class attribute as a function of the values of other attributes
Goal: previously unseen records should be assigned a class as accurately as possible
Application area: Direct Marketing, Fraud Detection
Why need it?
Classic programming is not so easy due to missing knowledge and difficult formalization as an algorithm

3.1 Definition

Given: a set of labeled records, consisting of
• data fields (a.k.a. attributes or features)
• a class label (e.g., true/false)
Generate: a function which can be used for classifying previously unseen records
• input: a record
• output: a class label

3.2 k Nearest Neighbors

The k nearest neighbors of a record x are data points that have the k smallest distance to x.
require: stored records, distance metric, value of k
steps: Compute distance to each training record -> Identify k nearest neighbors –> Use class labels of nearest neighbors to determine the class label of unknown record by taking majority vote or weighing the vote according to distance

choose k
Rule of thumb: Test k values between 1 and 10.
k too small -> sensitive to noise point
k too large -> neighborhood may include points from other classes

It’s very accurate but slow as training data needs to be searched. Can handle decision boundaries that are not parallel to the axes.

3.3 Nearest Centroids = Rocchio classifier

assign each data point to nearest centroid (center of all points of that class) 每个类别视为一个点(质心),然后在测试阶段,根据未标记数据点与这些质心的距离来决定数据点所属的类别。Nearest Centroid is a simple classification algorithm that calculates the centroid (average position) of each class based on training data. During testing, it classifies new instances by assigning them to the class whose centroid is closest.

k-NN vs. Nearest Centroid
k-NN

  • slow at classification time (linear in number of data points)
  • requires much memory (storing all data points)
  • robust to outliers

Nearest Centroid

  • fast at classification time (linear in number of classes)
  • requires only little memory (storing only the centroids)
  • robust to label noise
  • robust to class imbalance

Which classifier is better? Strongly depends on the problem at hand!

3.4 Bayes Classifier

P(C|A): conditional probability (How likely is C, given that we observe A)
Conditional Probability and Bayes Theorem

Example1 - Bayes Throrem
Example2 - Bayes Throrem

Estimating the Prior Probability P(C )
counting the records in the training set that are labeled with class Cj, dividing the count by the overall number of records
Estimating the Conditional Probability P(A | C)
Naïve Bayes assumes that all attributes are statistically independent
The independence assumption allows the joint probability P(A|C) to be reformulated as the product of the individual probabilities P(Ai|Cj).
P(A1,A2,…A3|Cj) = P(A1|Cj) * P(A2|Cj) * … * P(An|Cj)
Estimating the Probabilities P(Ai|Cj)
count how often an attribute value co-occurs with class Cj, divide by the overall number of instances in class Cj
Bayes Theorem Example

Handling Numerical Attributes
1.Discretize numerical attributes before learning classifier.
2.Make assumption that numerical attributes have a normal distribution given the class.
Use training data to estimate parameters of the distribution (e.g., mean and standard deviation)
Once the probability distribution is known, it can be used to estimate the conditional probability P(Ai|Cj)
Normal Distribution
Normal Distribution Example 1

Normal Distribution Example 2

Handling Missing Values
Missing values may occur in training and in unseen classification records. 一定要对缺失值进行处理

Zero Frequency Problem
one of the conditional probabilities is zero -> the entire expression becomes zero
It is not unlikely that an exactly same data point has not yet been observed. 有可能存在尚未观察到完全相同的数据点,因此概率有可能是0。
解决方法:Laplace Smoothing Laplace Smoothing

Decision Boundary of Naive Bayes Classifier
Decision Boundary of Naive Bayes Classifier

Summary

  • Robust to isolated noise points
  • Handle missing values by ignoring the instance during probability estimate calculations在概率估计计算过程中,通过忽略具有缺失值的实例来处理缺失值
  • Robust to irrelevant attributes [reasons: the probabilistic framework + conditional independence assumption + probability smoothing techniques]
  • Independence assumption may not hold for some attributes [Use other techniques such as Bayesian Belief Networks (BBN)]

Naïve Bayes works surprisingly well even if independence assumption is clearly violated [Reasons: Robustness to Violations + Effective Classification + Maximum Probability Assignment + Simple and Efficient]

Too many redundant attributes will cause problems. Solution: Select attribute subset as Naïve Bayes often works as well or better with just a fraction of all attributes.

Technical advantages:
(1) Learning Naïve Bayes classifiers is computationally cheap (probabilities are estimated in one pass over the training data)
(2) Storing the probabilities does not require a lot of memory

Redundant Variables
Violate independence assumption in Naive Bayes [Can, at large scale, skew the result]
May also skew the distance measures in k-NN, but the effect is not as drastic (Depends on the distance measure used)

Irrelevant Variables
For Naive Bayes: p(x=v|A) = p(x=v|B) for any value v, since it is random, it does not depend on the class variable. The overall result does not change

kNN vs. Naïve Bayes

kNNNaïve Bayes
computation\faster
dataless sensitive to outliersuse all data, less sensitive to label noise
redundant attributesless problematic\
irrelevant attributes\less problematic
pre-selectionyesyes

3.5 Lazy vs. Eager Learning

K-NN is a “lazy” methods.
They do not build an explicit model! “learning” is only performed on demand for unseen records.
Nearest Centroid and Naive Bayes are simple “eager” methods (also: decision tree, rule sets, …)
classify unseen instances, learn a model

3.6 Model Evaluation

3.6.1 Metrics for Performance Evaluation

Confusion MatrixConfusion Matrix

Accuracy & Error Rate
Accuracy & Error Rate

Baseline: naive guessing(always predict majority class)

3.6.2 Limitation of Accuracy: Unbalanced Data

1. Precision & Recall
Precision & Recall

2. F1-Measure越大越好
F1-Score combines precision and recall into one measure by using harmonic mean (tends to be closer to the smaller of the two)
For F1-value to be large, both p and r must be large. 当 Precision 和 Recall 都很高时,F1 score 也会趋向于高值,表示模型在正类别和负类别的预测中取得了较好的平衡。
F1-Score
F1-Measure Graph

confidence scores: how sure the algorithms is with its prediction
3. ROC Curves:

  1. Sort classifications according to confidence scores对于每个测试样本,模型输出一个置信度分数,例如,在朴素贝叶斯中可能是预测的概率。首先,将所有测试样本按照这些置信度分数从高到低排序。
  2. Evaluate
    correct prediction -> draw one step up如果模型的预测是正确的,将 ROC 曲线向上移动一步(增加真正例率)。
    incorrect prediction -> draw one step to the right如果模型的预测是错误的,将 ROC 曲线向右移动一步(增加假正例率)。
    Interpreting ROC Curves

False Positive Rate是指在所有实际为负例的样本中,被错误地判定为正例的比例,FPR = FP / (FP+TN)
True Positive Rate指在所有实际为正例的样本中,被正确地判定为正例的比例=召回率,TPR = TP / (TP+FN)
曲线越接近左上角: 表示模型性能越好,具有更高的TPR和更低的FPR。
4. Cost Matrixcost matrix
Computing Cost of Classification

3.7 Decision Tree Classifiers

Decision Tree Classifiers

There can be more than one tree that fits the same data!
Decision Boundary
Decision boundary: border line between two neighboring regions of different classes.
Decision boundary is parallel to axes because test condition involves a single attribute at-a-time
Finding an optimal decision tree is NP-hard
Tree building algorithms use a greedy, top-down, recursive partitioning strategy to induce a reasonable solution, also known as: divide and conquer. For example, Hunt’s Algorithm, ID3, CHAID, C4.5

3.7.1 Hunt’s Algorithm

Let Dt be the set of training records that reach a node t.
General Procedure:
If Dt contains only records that belong to the same class yt, then t is a leaf node labeled as yt
If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
Recursively apply the procedure to each subset
Hunt's Algorithm

3.7.2 Split

Splitting Based on Nominal Attributes

Splitting Based on Nominal Attributes
Splitting Based on Ordinal Attributes
Splitting Based on Ordinal Attributes
Splitting Based on Continuous Attributes
Splitting Based on Continuous Attributes

Discretization to form an ordinal categorical attribute
equal-interval binning
equal-frequency binning
binning based on user-provided boundaries

Binary Decision: (A < v) or (A > v)
usually sufficient in practice
consider all possible splits
find the best cut (i.e., the best v) based on a purity measure
can be computationally expensive

3.7.3 Common measures of node impurity

How to determine the Best Split?
Nodes with homogeneous class distribution are preferred. Need a measure of node impurity.
1. Gini Index
GINI
Splitting Based on GINI

Binary Attributes
Binary Attributes: Computing GINI Index

Continuous Attributes
Continuous Attributes: Computing Gini Index

Continuous Attributes: Computing Gini Index

Continuous Attributes: Computing Gini Index

2. Entropy (Information Gain)
Entropy

Splitting Based on Information Gain

How to Find the Best Split

Decision Tree Advantages & Disadvantages
Advantages: Inexpensive to construct; Fast at classifying unknown records; Easy to interpret by humans for small-sized trees; Accuracy is comparable to other classification techniques for many simple data sets
Disadvantages: Decisions are based only one a single attribute at a time; Can only represent decision boundaries that are parallel to the axes

Decision Tree vs. k-NN

k-NNDecision Tree
Decision Boundariesarbitraryrectangular
Sensitivity to Scalesneed normalizationdoes not need normalization
Runtime & Memorycheap to train but expensive for classificationexpensive to train but cheap for classification

3.8 Overfitting

Overfitting: Good accuracy on training data, but poor on test data.
Symptoms: Tree too deep and too many branches
Typical causes of overfitting: too little training data, noise, poor learning algorithm
Which tree do you prefer?
Occam’s Razor
If you have two theories that explain a phenomenon equally well, choose the simpler one!

Learning Curve
Learning Curve

Holdout Method
The holdout method reserves a certain amount for testing and uses the remainder for training
Typical: one third for testing, the rest for training

For unbalanced datasets (few or none instances of some classes), samples might not be representative. -> Stratified sample: balances the data - Make sure that each class is represented with approximately equal proportions in both subsets. Other attributes may also be considered for stratification, e.g., gender, age, …

Leave One Out
Iterate over all examples
– train a model on all examples but the current one
– evaluate on the current one
每个样本都被当作测试集,而其余的样本组成训练集。这个过程会重复执行,直到每个样本都被作为测试集被验证过一次。
Yields a very accurate estimate but is computationally infeasible in most cases

Cross-Validation (k-fold cross-validation)
Compromise of Leave One Out and decent runtime
Cross-validation avoids overlapping test sets
Steps:
Step 1: Data is split into k subsets of equal size (Stratification may be applied)
Step 2: Each subset in turn is used for testing
and the remainder for training
The error estimates are averaged to yield an overall error estimate
Frequently used value for k : 10 (the gold standard for folds was long set to 10)

How to Address Overfitting?

  1. Pre-Pruning (Early Stopping Rule)
    Stop the algorithm before it becomes a fully-grown tree
    Typical stopping conditions for a node: Stop if all instances belong to the same class or Stop if all the attribute values are the same
    Less restrictive conditions: Stop if number of instances within a node is less than some user-specified threshold or Stop if expanding the current node only slightly improves the impurity measure (user-specified threshold)
  2. Post-pruning
    Grow decision tree to its entire size -> Trim the nodes of the decision tree in a bottom-up fashion [ using a validation data set or an estimate of the generalization error ] -> If generalization error improves after trimming [ replace sub-tree by a leaf node / Class label of leaf node is determined from majority class of instances in the sub-tree]

Training vs. Generalization Errors
Training error = resubstitution error = apparent error: errors made in training, misclassified training instances -> can be computed
Generalization error: errors made on unseen data, no apparent evidence -> must be estimated

Estimating the Generalization Error
Estimating the Generalization Error

Example of Post-Pruning
Example of Post-Pruning

3.9 Alternative Classification Methods

Some cases are not nicely expressible in trees and rule sets.(Example: if at least two of Employed, Owns House, and Balance Account are yes → Get Credit is yes)

3.9.1 Artificial Neural Networks (ANN)

ANN

General Structure of ANN

Algorithm for learning ANN

Decision Boundaries of ANN: Arbitrarily shaped objects & Fuzzy boundaries

3.9.2 Support Vector Machines

Find a linear hyperplane (decision boundary) that will separate the data.SVM

What is computed?
a separating hyper plane, defined by its support vectors (hence the name)

“support vectors” refer to the training data points that are crucial in defining the decision boundary (or hyperplane).
Challenges: Computing an optimal separation is expensive and it requires good approximations
Dealing with noisy data: introducing “slack variables” in margin computation

3.9.3 Nonlinear Support Vector Machines

  • Transform data into higher dimensional space将输入数据从原始特征空间映射到更高维的空间
  • Transformation in higher dimensional space
    Kernel function
    Different variants: polynomial function, radial basis function, …
  • Finding a hyperplane in higher dimensional space

3.10 Hyperparameter Selection

A hyperparameter is a parameter which influences the learning process and whose value is set before the learning begins. For example, pruning thresholds for trees and rules; gamma and C for SVMs; learning rate, hidden layers for ANNs.
parameters are learned from the training data. For example, weights in an ANN, probabilities in Naïve Bayes, splits in a tree.

How to determine good hyperparameters?
(1) manually play around with different hyperparameter settings
(2) have your machine automatically test many different settings(Hyperparameter Optimization)

Hyperparameter Optimization
Goal: Find the combination of hyperparameter values that results in learning the model with the lowest generalization error
How to determine the parameter value combinations to be tested?

  • Grid Search: Test all combinations in user-defined ranges
  • Random Search: Test combinations of random parameter values
  • Evolutionary Search: Keep specific parameter values that worked well

3.11 Model Selection

From all learned models M, select the model mbest that is expected to generalize best to unseen records

Model Selection Using a Validation Set

Model Selection using Cross-Validation
Model Evaluation using Nested Cross Validation

grid search for model selection
cross-validation for model evaluation

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.hqwc.cn/news/256949.html

如若内容造成侵权/违法违规/事实不符,请联系编程知识网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

从传统到胜利:广汽集团汽车产业创新之旅

置身于汽车行业百年未有之大变局&#xff0c;作为传统车企中的排头兵&#xff0c;广汽创新可圈可点&#xff0c;广汽近年来取得了骄人业绩&#xff0c;不论是整体产销规模&#xff0c;还是新能源汽车产业化、新技术领域开拓等&#xff0c;都呈现节节攀升的局面。本文奖从产业变…

基于ssm少儿编程管理系统源码和论文

idea 数据库mysql5.7 数据库链接工具&#xff1a;navcat,小海豚等 环境&#xff1a; jdk8 tomcat8.5 开发技术 ssm 基于ssm少儿编程管理系统源码和论文744 摘要 网络的广泛应用给生活带来了十分的便利。所以把少儿编程管理系统与现在网络相结合&#xff0c;利用java技术建设…

使用Caliper对Fabric地basic链码进行性能测试

如果你需要对fabric网络中地合约进行吞吐量、延迟等性能进行评估&#xff0c;可以使用Caliper来实现&#xff0c;会返回给你一份网页版的直观测试报告。下面是对test-network网络地basic链码地测试过程。 目录 1. 建立caliper-workspace文件夹2. 安装npm等3. calipe安装4. 创建…

6-4 是否二叉搜索树 分数 20

bool IsBST(BinTree T) {//空树 or 只有一个结点if (T NULL || (T->Left NULL) && (T->Right NULL))return true;BinTree cur NULL;cur T->Left;if (cur ! NULL){while (cur->Right)cur cur->Right;if (cur->Data > T->Data)return fals…

vue 商品列表案例

my-tag 标签组件的封装 1. 创建组件 - 初始化 2. 实现功能 (1) 双击显示&#xff0c;并且自动聚焦 v-if v-else dbclick 操作 isEdit 自动聚焦&#xff1a; 1. $nextTick > $refs 获取到dom&#xff0c;进行focus获取焦点 2. 封装v-focus指令 (2) 失去焦点&#xff0c;隐藏…

【Flink系列四】Window及Watermark

3.1、window 在 Flink 中 Window 可以将无限流切分成有限流&#xff0c;是处理有限流的核心组件&#xff0c;现在 Flink 中 Window 可以是时间驱动的&#xff08;Time Window&#xff09;&#xff0c;也可以是数据驱动的&#xff08;Count Window&#xff09;。 Flink中的窗口…

Python----多态

1、什么是多态 多态指的是一类事物有多种形态。 定义&#xff1a;多态是一种使用对象的方式&#xff0c;子类重写父类方法&#xff0c;调用不同子类对象的相同父类方法&#xff0c;可以产生不同的执行结果。 ① 多态依赖继承 ② 子类方法必须要重写父类方法 首先定义一个父类…

VC++使用GetProcessTimes获取进程创建时间、销毁时间、用户态时间、内核态时间

一、GetProcessTimes函数简介&#xff08;微软MSDN&#xff09; 微软提供了一个非常有用的API函数GetProcessTimes用来获取进程创建时间、销毁时间、用户态时间、内核态时间&#xff0c;msdn连接为&#xff1a;GetProcessTimes 函数 (processthreadsapi.h) 其函数原型为&#…

【计算机组成体系结构】SRAM和DRAM

RAM — Random Access Memory 随机访问存储器 —指定某一存储单元地址的时候&#xff0c;存储单元的读取速度并不会因为存储单元的物理位置改变 SRAM即为 Static RAM 静态随机访问存储器 — 用于主存DRAM即为 Dynamic RAM 动态随机访问存储器 — 用于Cache 一、SRAM和DRAM的特…

智能外呼常见场景有哪些?

智能外呼常见场景是什么&#xff1f; 智能外呼在各种场景下都有应用&#xff0c;以下是一些常见的场景&#xff1a; 营销推广 通过智能外呼向潜在客户进行产品或服务的宣传和推广&#xff0c;收集客户对产品或服务的反馈。根据客户的反馈自动调整宣传策略&#xff0c;从而提…

JavaSE基础50题:15. 斐波那契数列(迭代实现)

概述 求斐波那契数列的第n项&#xff08;迭代实现&#xff09;。 斐波那契&#xff1a;1 1 2 3 5 8…… 代码 public class P15 {public static int fib(int n) {if (n 1 || n 2) {return 1;}int f1 1;int f2 1;int f3 0;for (int i 3; i < n; i) {f3 f1 f2;f1 …

Java TCP(一对一)聊天简易版

客户端 import java.io.*; import java.net.Socket; import java.util.Date; import javax.swing.*;public class MyClient {private JFrame jf;private JButton jBsend;private JTextArea jTAcontent;private JTextField jText;private JLabel JLcontent;private Date data;p…