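Machine Learning

Overview

Three Core Concepts of Artificial Intelligence

Artificial Intelligence (AI)
Machine Learning (ML)
Deep Learning (DL)

Containment: artificial intelligence > machine learning > deep learning (each is a subfield of the one before it)

Three Drivers of AI Progress

Data, algorithms, and compute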

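Samples, Features, and Labels

Sample: one row of data is one sample, and multiple samples make up a dataset. A sample is sometimes called a record.
Feature: one column of data is one feature, sometimes called an attribute.
Label/target: the column the model is trained to predict.
Training set: the data used to train the model.
Test set: the data used to evaluate the model. The train-to-test ratio is typically 8:2 or 7:3.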
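To make these terms concrete, here is a minimal sketch using pandas and scikit-learn. The housing-style table and its column names (`area`, `rooms`, `price`) are hypothetical and made up purely for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset: each row is a sample, each column a feature,
# and "price" is the label we want to predict.
df = pd.DataFrame({
    "area": [50, 60, 70, 80, 90, 100, 110, 120, 130, 140],
    "rooms": [1, 2, 2, 3, 3, 3, 4, 4, 5, 5],
    "price": [100, 120, 135, 160, 175, 190, 215, 230, 255, 270],
})

X = df[["area", "rooms"]]   # features
y = df["price"]             # label / target

# 8:2 split between training set and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```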

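Supervised and Unsupervised Learning

Supervised learning: the training data comes with labels.
- Classification: the label takes discrete values
- Regression: the label takes continuous values
Unsupervised learning: the input data is unlabeled, i.e. the class of each sample is unknown. A typical task is clustering, which groups the samples by their similarity to one another.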

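Machine Learning Workflow

1. Data acquisition
2. Basic data processing: handling missing values, outliers, and so on
3. Feature engineering: feature extraction, preprocessing, dimensionality reduction, and so on
4. Model training: linear regression, logistic regression, decision trees, GBDT, and so on
5. Model evaluation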

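Feature Engineering

Feature extraction
Feature preprocessing: removes the effect of differing units and scales, which otherwise makes some features influence the model far more than others
- Normalization: (value - min) / (max - min)
- Standardization: (value - mean) / standard deviation
Dimensionality reduction
Feature selection
Feature combination

Machine Learning Library

The examples use scikit-learn.
Website: https://scikit-learn.org/stable/
Install: pip install scikit-learn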

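The KNN Algorithm

K-Nearest Neighbors (KNN)

Idea

Compute the distance from the query sample to every training sample, sort by distance, and take the K nearest.
For classification, predict the class that occurs most often among those K neighbors (on a tie, pick the class of the closest neighbor).
For regression, predict the mean of the K neighbors' label values.

Distance can be measured in several ways, such as Euclidean distance or Manhattan distance.

Choosing K

Too small a K tends to overfit: the prediction is driven by a handful of neighbors and is sensitive to noise.
Too large a K tends to underfit: distant samples outvote the local structure.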
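The idea above can be sketched from scratch in a few lines of NumPy. This is only an illustrative brute-force version using Euclidean distance; the function name `knn_predict` and its `task` parameter are made up for this sketch and are not part of scikit-learn.

```python
from collections import Counter
import numpy as np

def knn_predict(x_train, y_train, x_query, k=3, task="classification"):
    """Predict one query point with brute-force KNN (Euclidean distance)."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train)
    # 1. Distance from the query to every training sample
    dists = np.sqrt(((x_train - np.asarray(x_query, dtype=float)) ** 2).sum(axis=1))
    # 2. Indices of the k nearest samples, nearest first
    nearest = np.argsort(dists, kind="stable")[:k]
    neighbor_labels = y_train[nearest]
    if task == "regression":
        # 3a. Regression: mean of the neighbors' labels
        return float(neighbor_labels.mean())
    # 3b. Classification: majority vote; on a tie, the tied class that
    #     appears first in distance order (i.e. the closest one) wins
    counts = Counter(neighbor_labels.tolist())
    top = max(counts.values())
    for label in neighbor_labels.tolist():
        if counts[label] == top:
            return label

x_train = [[0], [1], [2], [3]]
y_train = [0, 0, 1, 1]
print(knn_predict(x_train, y_train, [5], k=2))                                   # 1
print(knn_predict(x_train, [0.1, 0.2, 0.3, 0.4], [5], k=2, task="regression"))   # about 0.35
```

Note that library implementations may break ties differently, so results on tied votes can differ from this sketch.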
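One way to see this effect is to compare training and test accuracy for a very small, a moderate, and a very large K. The sketch below uses the iris dataset bundled with scikit-learn; the specific K values are arbitrary choices for illustration, and the exact accuracies depend on the split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

scores = {}
# k=1: each training point is its own nearest neighbor (memorizes the data)
# k=len(x_train): every prediction is a vote over the whole training set
for k in (1, 5, len(x_train)):
    model = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    scores[k] = (model.score(x_train, y_train), model.score(x_test, y_test))
    print(f"k={k:>3}  train acc={scores[k][0]:.3f}  test acc={scores[k][1]:.3f}")
```

With K equal to the training set size, every query gets the same prediction, which is the extreme case of underfitting.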

Code

Classification

from sklearn.neighbors import KNeighborsClassifier

x_train = [[0], [1], [2], [3]]  # training features
y_train = [0, 0, 1, 1]          # training labels
x_test = [[5]]                  # features of the sample to predict

# Create the KNN classifier
estimator = KNeighborsClassifier(n_neighbors=2)

# Train
estimator.fit(x_train, y_train)

# Predict
y_pred = estimator.predict(x_test)
print(y_pred)  # [1]

Regression

from sklearn.neighbors import KNeighborsRegressor

x_train = [[0, 0, 1], [1, 1, 0], [3, 10, 10], [4, 11, 12]]  # training features
y_train = [0.1, 0.2, 0.3, 0.4]                              # training labels
x_test = [[3, 11, 10]]                                      # features of the sample to predict

# Create the KNN regressor
estimator = KNeighborsRegressor(n_neighbors=3)
# Train
estimator.fit(x_train, y_train)

# Predict: the mean label of the 3 nearest training samples
y_pred = estimator.predict(x_test)
print(y_pred)

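Distance Metrics

Euclidean distance: like the Pythagorean theorem; square the per-feature differences, sum them, and take the square root.
Manhattan distance: the sum of the absolute per-dimension differences.
Chebyshev distance: the maximum of the absolute per-dimension differences.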
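The three metrics can be checked by hand on a small example. The two vectors below are arbitrary values chosen so the arithmetic is easy to follow.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
diff = np.abs(a - b)                     # per-dimension absolute differences: [3, 4, 0]

euclidean = np.sqrt((diff ** 2).sum())   # sqrt(9 + 16 + 0) = 5.0
manhattan = diff.sum()                   # 3 + 4 + 0 = 7.0
chebyshev = diff.max()                   # max(3, 4, 0) = 4.0
print(euclidean, manhattan, chebyshev)   # 5.0 7.0 4.0
```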

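Feature Preprocessing

When features have different units, their scales and variances differ widely, which skews the model's results. Preprocessing puts them on a comparable scale.

Normalization: maps values into a given range. It is sensitive to extreme values and is better suited to small datasets.

from sklearn.preprocessing import MinMaxScaler

x_train = [[90, 2, 10, 40], [60, 4, 14, 45], [75, 3, 13, 46]]

# Normalizer; the default range is [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the data and normalize it
x_train_new = scaler.fit_transform(x_train)

print(x_train_new)
# [[1.         0.         0.         0.        ]
#  [0.         1.         1.         0.83333333]
#  [0.5        0.5        0.75       1.        ]]

Standardization: rescales each feature to zero mean and unit standard deviation. It is less affected by extreme values and is better suited to large datasets.

(value - mean) / standard deviation

from sklearn.preprocessing import StandardScaler

x_train = [[90, 2, 10, 40], [60, 4, 14, 45], [75, 3, 13, 46]]

# Standardizer
scaler = StandardScaler()

# Fit on the data and standardize it
x_train_new = scaler.fit_transform(x_train)

print(x_train_new)
# [[ 1.22474487 -1.22474487 -1.37281295 -1.3970014 ]
#  [-1.22474487  1.22474487  0.98058068  0.50800051]
#  [ 0.          0.          0.39223227  0.88900089]]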
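As a cross-check, both formulas can be applied by hand with NumPy and compared against the scikit-learn scalers. This is a small sketch on the same toy matrix; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[90, 2, 10, 40], [60, 4, 14, 45], [75, 3, 13, 46]], dtype=float)

# Normalization: (value - min) / (max - min), computed per column
x_minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Standardization: (value - mean) / std, computed per column.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std().
x_std = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(x_minmax, MinMaxScaler().fit_transform(x)))   # True
print(np.allclose(x_std, StandardScaler().fit_transform(x)))    # True
```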

Example

Iris Classification

from sklearn.datasets import load_iris                      # loads the iris dataset
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split        # splits data into training and test sets
from sklearn.preprocessing import StandardScaler            # standardization
from sklearn.neighbors import KNeighborsClassifier          # KNN classifier
from sklearn.metrics import accuracy_score                  # model evaluation


# Load the dataset
def dm01_load_iris():
    iris_data = load_iris()
    print(iris_data)                # a Bunch object, which behaves like a dict
    print(iris_data.data[:5])       # features
    print(iris_data.target[:5])     # labels
    print(iris_data.feature_names)  # meaning of each feature column
    print(iris_data.target_names)   # class name for each label value

# Draw a scatter plot
def dm02_draw_iris():
    iris_data = load_iris()
    # Wrap the data in a DataFrame
    iris_df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
    iris_df['target'] = iris_data.target
    # print(iris_df)
    # Scatter plot colored by class; fit_reg=False disables the regression line
    sns.lmplot(data=iris_df, x='sepal length (cm)', y='sepal width (cm)', hue='target', fit_reg=False)
    plt.title('iris data')
    plt.show()

# Split into training and test sets
def dm03_split_train_test():
    iris_data = load_iris()
    # 8:2 train/test split; random_state is the random seed
    x_train, x_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target, test_size=0.2, random_state=42)
    # print(len(x_train), len(x_test), len(y_train), len(y_test))
    return x_train, x_test, y_train, y_test

# Train, predict, and evaluate
def dm04_iris_evaluate_test():
    # Load the dataset
    iris_data = load_iris()
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target, test_size=0.2, random_state=42)
    # Preprocessing
    scaler = StandardScaler()
    x_train = scaler.fit_transform(x_train)     # fit_transform: learn the mean and std from the training set, then scale it
    x_test = scaler.transform(x_test)           # transform only: reuse the training-set mean and std

    # Train the model
    estimator = KNeighborsClassifier(n_neighbors=3)
    estimator.fit(x_train, y_train)

    # Predict
    y_pred = estimator.predict(x_test)                          # predict on the test set
    print(f"Test set predictions: {y_pred}")
    y_pred_custom = estimator.predict(scaler.transform([[7.8, 2.1, 3.9, 1.6]])) # a hand-written sample
    print(f"Custom sample prediction: {y_pred_custom}")

    # Evaluate
    print(f"Training set accuracy: {estimator.score(x_train, y_train)}")    # evaluated on the training set
    print(f"Test set accuracy: {accuracy_score(y_test, y_pred)}")           # compares test labels with predictions


if __name__ == '__main__':
    dm04_iris_evaluate_test()

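Hyperparameter Selection

Cross-Validation and Grid Search

Cross-validation:
1. Split the training set evenly into N parts (N is the number of folds); N - 1 parts are used for training and 1 part for validation.
2. Use a different part as the validation set each time, training N times in total.
3. Average the N performance scores; the average is the final evaluation score.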
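The three steps above map onto scikit-learn's `KFold` and `cross_val_score`. The sketch below is a minimal illustration: for brevity it cross-validates on the full iris dataset rather than on a separate training split, and the choice of 4 folds and `n_neighbors=3` is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# shuffle=True because the iris samples are stored sorted by class
kfold = KFold(n_splits=4, shuffle=True, random_state=42)

# One score per fold: train on 3 folds, validate on the remaining fold
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=3), iris.data, iris.target, cv=kfold
)
print(scores)          # 4 validation accuracies
print(scores.mean())   # the cross-validation score
```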

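Grid search:

Pass the candidate values of several hyperparameters to the grid search object, which automatically builds every combination of them.

For example, if hyperparameter A can take the values [0, 1, 2] and hyperparameter B can take [3, 4],

there are 3 * 2 = 6 combinations.

Each combination is evaluated with cross-validation, and the best one is selected.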
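The combination count can be verified with scikit-learn's `ParameterGrid`, which enumerates the same Cartesian product that a grid search walks through. The hyperparameter names `A` and `B` are the placeholders from the example above, not real estimator parameters.

```python
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({"A": [0, 1, 2], "B": [3, 4]})
print(len(grid))       # 6
for params in grid:    # each item is a dict such as {'A': 0, 'B': 3}
    print(params)
```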

Case Study

from sklearn.datasets import load_iris                                # loads the iris dataset
from sklearn.model_selection import train_test_split, GridSearchCV    # train/test split and grid search
from sklearn.preprocessing import StandardScaler                      # standardization
from sklearn.neighbors import KNeighborsClassifier                    # KNN classifier
from sklearn.metrics import accuracy_score                            # model evaluation


# Load the dataset
iris = load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Preprocessing
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# Model
estimator = KNeighborsClassifier()

# Candidate hyperparameter values
param_dict = {'n_neighbors': list(range(1, 11))}
# cv: number of cross-validation folds; GridSearchCV wraps the model and runs the search when fitted
estimator = GridSearchCV(estimator=estimator, param_grid=param_dict, cv=4)
# Train
estimator.fit(x_train, y_train)
print(f'Best score: {estimator.best_score_}')
print(f'Best hyperparameters: {estimator.best_params_}')
print(f'Best estimator: {estimator.best_estimator_}')
print(f'Detailed cross-validation results: {estimator.cv_results_}')
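A possible variation, offered as a sketch rather than as the approach used above, is to put the scaler and the classifier in a `Pipeline` and search over that. The scaler is then re-fitted on the training folds in each cross-validation round, so validation folds do not influence the scaling statistics. The step names `scaler` and `knn` are arbitrary labels chosen here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Scaler + model as one estimator; parameters of a step are addressed
# as "<step name>__<parameter name>"
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(pipe, param_grid={"knn__n_neighbors": list(range(1, 11))}, cv=4)
search.fit(x_train, y_train)

print(search.best_params_)
# With the default refit=True, the best configuration is refitted on the whole
# training set, so the search object can score the held-out test set directly.
print(search.score(x_test, y_test))
```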

Handwritten Digit Recognition

Drawing a Digit

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split    # train/test split
from sklearn.neighbors import KNeighborsClassifier
import joblib                                           # saving and loading models
from collections import Counter                         # counting
from sklearn.metrics import accuracy_score              # model evaluation

# Draw the digit stored at the given row index
def show_digit(idx):
    df = pd.read_csv('./data/手写数字识别.csv')
    if idx < 0 or idx >= len(df):
        print('Index out of range')
        return
    # Every column after the first is a feature
    x = df.iloc[:, 1:]
    # Column 0 is the label
    y = df.iloc[:, 0]
    # Reshape the 784 grayscale values of row idx into a 28*28 image
    x = x.iloc[idx].values.reshape(28, 28)
    # Draw the grayscale image
    plt.imshow(x, cmap='gray')
    plt.axis('off')
    plt.show()

Training and Saving the Model

def train_model():
    df = pd.read_csv('./data/手写数字识别.csv')
    x = df.iloc[:, 1:]
    y = df.iloc[:, 0]
    # Preprocessing: scale pixel values from [0, 255] to [0, 1]
    x = x / 255
    # Train/test split; stratify=y keeps the class proportions the same in both sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)

    # Train the model
    estimator = KNeighborsClassifier(n_neighbors=3)
    estimator.fit(x_train, y_train)

    # Evaluate (both lines compute the same test-set accuracy)
    print(f'Accuracy: {estimator.score(x_test, y_test)}')
    print(f'Accuracy: {accuracy_score(y_test, estimator.predict(x_test))}')
    # The ./model directory must already exist; the extension is arbitrary (pkl, pth, pickle all work)
    joblib.dump(estimator, './model/手写数字识别.pkl')
    print("Model saved")

Using the Saved Model

def use_model():
    # Test image (28 * 28); assumed to be a single-channel grayscale PNG
    img = plt.imread("./data/demo.png")
    # reshape(1, -1) flattens the image into one row; no extra scaling is needed
    # because plt.imread already returns PNG pixel values as floats in [0, 1]
    img = img.reshape(1, -1)
    # Load the saved model
    estimator = joblib.load('./model/手写数字识别.pkl')
    # Predict
    y_pred = estimator.predict(img)
    print(y_pred)

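Linear Regression

Describes the relationship between one or more independent variables (features) and a dependent variable (label) with a linear formula.

It is a supervised learning method whose label is continuous.

Concepts

Simple linear regression: y = wx + b (w is the weight, b is the bias); the target depends on a single independent variable.
Multiple linear regression: y = w1x1 + w2x2 + … + wnxn + b; the target depends on several independent variables.

The goal is to find the weights and bias that minimize the loss function.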

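Loss Function

The loss function measures how well the model fits; the smaller its value, the smaller the error.

Types of Loss Functions

Mean squared error (MSE): sum of (predicted - actual)^2 / number of samples
Mean absolute error (MAE): sum of |predicted - actual| / number of samples
Root mean squared error (RMSE): the square root of the MSE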
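The three formulas can be worked through on a tiny example. The values below are made up so the arithmetic can be checked by hand.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])
errors = y_pred - y_true                 # [-1, 0, 2]

mse = np.mean(errors ** 2)               # (1 + 0 + 4) / 3 = 5/3
mae = np.mean(np.abs(errors))            # (1 + 0 + 2) / 3 = 1.0
rmse = np.sqrt(mse)                      # sqrt(5/3)
print(mse, mae, rmse)
```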

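Finding the Weights and Bias That Minimize the Loss

Least Squares

Least squares computes the optimal w and b directly. It is practical when the dataset is not too large and a closed-form solution exists.

Simple linear regression:
1. Take the partial derivatives of the loss function with respect to w (weight) and b (bias).
2. Set both partial derivatives to 0 and solve the two equations together; the solution minimizes the loss.
Multiple linear regression:
1. The model is y = w1x1 + w2x2 + w3x3 + … + b = w^T x + b.
2. Substituting one sample's feature values for x1, x2, … gives that sample's predicted value.
3. Loss = the sum over all samples of (actual - predicted)^2.
4. Differentiate the loss with respect to w (the vector of all weights, with the bias absorbed by adding a constant-1 feature to X) and set the derivative to 0, which gives the normal equation w = (X^T X)^(-1) X^T y.
5. Solving for w requires (X^T X)^(-1) to exist, so the method does not apply in every case.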
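As a minimal sketch of the multiple-regression case, the closed-form solution can be computed with NumPy by solving the normal equation (X^T X) w = X^T y. The toy data is generated from a known linear rule so the recovered parameters can be checked; the variable names are illustrative.

```python
import numpy as np

# Toy data generated from y = 2*x1 + 3*x2 + 1 (no noise)
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

# Append a column of ones so the bias b is learned as one extra weight
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

# Normal equation: solve (Xb^T Xb) theta = Xb^T y.
# np.linalg.solve raises LinAlgError if Xb^T Xb is singular (not invertible).
theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
w, b = theta[:-1], theta[-1]
print(w, b)   # approximately [2. 3.] and 1.0
```

Using `np.linalg.solve` instead of explicitly inverting the matrix is a common numerical choice; it gives the same answer when the inverse exists.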

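✨Gradient Descent

Find a minimum by repeatedly stepping in the direction in which the loss decreases fastest, i.e. against the gradient.

Simple linear regression:
1. Randomly initialize w and b, choose a learning rate α, and choose the number of iterations.
2. In each iteration, compute the partial derivatives (the gradient) of the loss with respect to w and b at their current values.
3. Update w and b in the direction opposite to the gradient, since the gradient points in the direction of steepest ascent: w = w - α * ∂L/∂w, b = b - α * ∂L/∂b.
4. Stop once the set number of iterations is reached or the loss has converged to within a given threshold.
Multiple linear regression:

Take the partial derivative with respect to every parameter, then update them all simultaneously.

Gradient Descent Example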
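A minimal sketch of gradient descent for simple linear regression with an MSE loss follows. The toy data, learning rate, and iteration count are assumptions chosen for illustration, and w and b start at zero rather than at random values so the run is reproducible.

```python
import numpy as np

# Toy data from y = 2x + 1 (no noise)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

w, b = 0.0, 0.0      # initial values (zeros instead of random, for reproducibility)
alpha = 0.05         # learning rate
for _ in range(2000):
    y_hat = w * x + b
    # Partial derivatives of MSE = mean((y_hat - y)^2) with respect to w and b
    grad_w = 2 * np.mean((y_hat - y) * x)
    grad_b = 2 * np.mean(y_hat - y)
    # Step against the gradient
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b)   # close to 2 and 1
```

If the learning rate is set too high the updates overshoot and diverge; if it is too low, many more iterations are needed.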

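Types of Gradient Descent

The variants differ in how many samples are used to compute the gradient in each iteration.

Batch gradient descent (Batch/Full Gradient Descent, BGD/FGD)

Definition: computes the gradient of the loss over the entire training set.

Advantage: for a convex loss such as MSE, with a suitable learning rate, it converges to the global optimum.

Disadvantage: slow, with high memory usage.

Stochastic gradient descent (SGD)

Definition: each parameter update uses a single randomly chosen training sample to compute the gradient.

Advantage: each update is fast to compute.

Disadvantage: convergence is less stable; it usually oscillates around the optimum rather than settling on it.

✨Mini-batch gradient descent (MBGD)

Definition: each parameter update uses a small batch (mini-batch) of training samples, typically 32, 64, 128, and so on, to compute the gradient.

Advantage: combines the strengths of BGD and SGD, being more stable than SGD and cheaper per update than BGD.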
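The three variants differ only in the batch size, which the following sketch makes explicit. The function name `gradient_descent`, the toy data, and all hyperparameter values are assumptions for illustration; because the data here is noise-free, all three variants should end up near w = 2 and b = 1.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, size=200)
y = 2 * x + 1                      # toy data from y = 2x + 1 (no noise)

def gradient_descent(x, y, batch_size, alpha=0.05, epochs=300):
    """Fit y = w*x + b by gradient descent on MSE with the given batch size."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)             # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = w * x[idx] + b - y[idx]
            # Gradient of the MSE over the current batch only
            w -= alpha * 2 * np.mean(err * x[idx])
            b -= alpha * 2 * np.mean(err)
    return w, b

bgd = gradient_descent(x, y, batch_size=len(x))   # BGD: all samples per update
sgd = gradient_descent(x, y, batch_size=1)        # SGD: one sample per update
mbgd = gradient_descent(x, y, batch_size=32)      # MBGD: 32 samples per update
print(bgd, sgd, mbgd)
```

On noisy real data, SGD would keep fluctuating around the optimum at a constant learning rate, which is one reason mini-batches are commonly preferred.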

Code

Normal Equation

from sklearn.preprocessing import StandardScaler    # feature scaling
from sklearn.model_selection import train_test_split # train/test split
from sklearn.linear_model import LinearRegression   # linear regression solved in closed form
from sklearn.linear_model import SGDRegressor       # linear regression trained by gradient descent
from sklearn.metrics import mean_squared_error, root_mean_squared_error, mean_absolute_error  # evaluation

import pandas as pd
import numpy as np

# Dataset (Boston housing, downloaded from the original source)
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # features
target = raw_df.values[1::2, 2]     # labels

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)

# Preprocessing
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# Linear regression via the normal equation; fit_intercept: whether to learn a bias
estimator = LinearRegression(fit_intercept=True)
estimator.fit(x_train, y_train)

print(f"Weights: {estimator.coef_}")
print(f"Bias: {estimator.intercept_}")

# Predict
y_pred = estimator.predict(x_test)
print(f"Predictions: {y_pred}")

# Evaluate
# MSE: mean squared error
print("MSE: ", mean_squared_error(y_test, y_pred))
# RMSE: root mean squared error
print("RMSE: ", root_mean_squared_error(y_test, y_pred))
# MAE: mean absolute error
print("MAE: ", mean_absolute_error(y_test, y_pred))

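Gradient Descent

# Swap the model in the code above for SGDRegressor, which is trained by stochastic gradient descent
# learning_rate="constant": keep the learning rate fixed; eta0: the learning rate value
estimator = SGDRegressor(fit_intercept=True, learning_rate="constant", eta0=0.01)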
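For a version that runs without downloading data, the same workflow can be sketched on synthetic data. The use of `make_regression` and its settings are assumptions made so the example is self-contained; they are not part of the original Boston housing example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption, used so the sketch runs offline)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Gradient descent is sensitive to feature scale, so standardize first
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# Constant learning rate of 0.01, as in the snippet above
estimator = SGDRegressor(fit_intercept=True, learning_rate="constant", eta0=0.01,
                         max_iter=1000, random_state=42)
estimator.fit(x_train, y_train)

y_pred = estimator.predict(x_test)
print(f"Weights: {estimator.coef_}")
print(f"Bias: {estimator.intercept_}")
print(f"MSE: {mean_squared_error(y_test, y_pred)}")
```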

http://xwww12.github.io/2025/11/24/AI/机器学习/

Published on November 24, 2025
