Data Preprocessing Phase
1. Feature extraction
1. An object is described by a collection of attributes (features)
2. A feature is a property or characteristic of an object
2. Data cleaning
Extracted data may have erroneous or missing fields
Methods include estimating missing values and eliminating inconsistent values
3. Feature selection & transformation
Many data mining algorithms do not work efficiently on high-dimensional data
1. Methods include:
identify and remove irrelevant features
transform the current set of features to a new data space (e.g., dimensionality reduction)
2. Data transformation
Transform attributes to new attributes
(e.g., numerical age -> { young, middle-aged, elderly })
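The cleaning, selection, and transformation steps above can be sketched on a toy table. This is a minimal illustration, not a prescribed method: the records, the field names, the choice of mean imputation, and the age thresholds (30 and 60) are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical records; None marks a missing field.
records = [
    {"age": 23, "country": "NZ", "species": "human"},
    {"age": None, "country": "NZ", "species": "human"},
    {"age": 47, "country": "nz", "species": "human"},
    {"age": 71, "country": "AU", "species": "human"},
]


def clean(rows):
    """Data cleaning: estimate missing ages, normalise inconsistent country codes."""
    known_ages = [r["age"] for r in rows if r["age"] is not None]
    fill = mean(known_ages)  # mean imputation is one simple choice among many
    return [
        {
            **r,
            "age": fill if r["age"] is None else r["age"],
            "country": r["country"].upper(),
        }
        for r in rows
    ]


def drop_constant_features(rows):
    """Feature selection: drop features that take the same value in every row."""
    keep = [k for k in rows[0] if len({r[k] for r in rows}) > 1]
    return [{k: r[k] for k in keep} for r in rows]


def age_group(age, young_max=30, middle_max=60):
    """Data transformation: numerical age -> category (thresholds are illustrative)."""
    if age < young_max:
        return "young"
    if age < middle_max:
        return "middle-aged"
    return "elderly"


cleaned = clean(records)
selected = drop_constant_features(cleaned)
transformed = [{**r, "age": age_group(r["age"])} for r in selected]

for row in transformed:
    print(row)
```

Here `species` is treated as irrelevant only because it never varies in this toy data; real feature selection would normally use a relevance criterion tied to the mining task.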
Types of Data
1. Nondependency-oriented data:
objects do not have dependencies
Types of data:
1. Numerical or quantitative (values have natural ordering)
integer values (number of petals in a flower)
real values (length of a petal)
2. Categorical or unordered discrete-valued
discrete unordered values/categories (colour of a flower petal)
3. Binary data (two values: 0 and 1)
Can be seen as categorical data (two categories) or as numerical data (0 < 1)
Can be used to represent set data via characteristic vectors
4. Text data
Document as a string (dependency-oriented data type)
Document as a set of words or terms (vector-space representation: frequencies of the words in the document)
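Two of the representations above, characteristic vectors for set data and word-frequency vectors for documents, can be sketched as follows. The function names, the colour universe, and the vocabulary are hypothetical, and tokenisation is simplified to lowercasing and splitting on whitespace.

```python
from collections import Counter


def characteristic_vector(subset, universe):
    """Binary vector with a 1 wherever the universe element belongs to the subset."""
    return [1 if item in subset else 0 for item in universe]


def term_frequency_vector(document, vocabulary):
    """Vector-space representation: count of each vocabulary term in the document."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]


universe = ["red", "green", "blue", "yellow"]
print(characteristic_vector({"red", "blue"}, universe))  # [1, 0, 1, 0]

vocabulary = ["data", "mining", "petal"]
sentence = "Data mining turns raw data into patterns"
print(term_frequency_vector(sentence, vocabulary))  # [2, 1, 0]
```

The frequency vector discards word order, which is why the string form of a document counts as dependency-oriented while the vector form does not.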
2. Dependency-oriented data:
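implicit or explicit dependencies between objects may exist
Networks: nodes (objects) are connected by edges (relationships)
Successive measurements collected from a sensor
1. Implicit dependencies
Relationships are not explicitly specified but are known to exist
For example, a temperature value is measured by a sensor, so the value has an implicit dependency on that sensor
Types of data with implicit dependencies
Examples:
1. Temporal data (time) 2. Spatial data (space)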
2. Explicit dependencies
Edges specify the relationships explicitly
Example: graph or network data
Types of data with explicit dependencies
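As a rough sketch of graph or network data, where every relationship is stated explicitly as an edge, the snippet below builds an adjacency structure from an edge list. The node names and the choice of an undirected graph are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical undirected network: each pair is one explicit relationship.
edges = [("ann", "bob"), ("bob", "cat"), ("ann", "cat"), ("cat", "dan")]


def build_adjacency(edge_list):
    """Map each node to the set of nodes it shares an edge with."""
    adjacency = defaultdict(set)
    for u, v in edge_list:
        adjacency[u].add(v)
        adjacency[v].add(u)  # undirected: record the edge in both directions
    return adjacency


adjacency = build_adjacency(edges)
print(sorted(adjacency["cat"]))  # ['ann', 'bob', 'dan']
```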
Data Representation
Data representation is one of the first things we must do in data mining
What we can mine is largely determined by our data representation
There is no single best data representation method for all data mining tasks