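A hands-on guide to training a Chinese NER model with spaCy 3

Table of contents

  • Downloading the model files
  • Training the model
    • Preparing the data
    • Converting to DocBin format
    • Training configuration
      • Generating the base config
      • Filling in the full config
    • Running the training
  • Testing the model
  • References

Downloading the model files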

https://github.com/explosion/spacy-models/releases?q=zh&expanded=true
A quick test of the downloaded model's NER output showed that it was simply not usable.

Training the model

Preparing the data

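Each training example is a dict with a text string and an entities list of (start, end, label) tuples, where start and end are character offsets into the text.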
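
As a rough illustration, the data could look like the sample below, assuming the dict layout (text, entities) that the conversion code reads. The sample sentence, the PER and ORG labels, and the check_offsets helper are hypothetical and not taken from the original dataset; the helper flags spans that are out of range or overlap before they reach spaCy.

```python
# Hypothetical sample: each entity is (start, end, label) with character
# offsets into "text", end exclusive.
TRAIN_DATA = [
    {
        "text": "我叫王大,今年买了阿里巴巴的股票",
        "entities": [(2, 4, "PER"), (9, 13, "ORG")],
    },
]


def check_offsets(examples):
    """Return a list of (example_index, start, end, label, reason) problems."""
    problems = []
    for i, example in enumerate(examples):
        text = example["text"]
        last_end = 0
        for start, end, label in sorted(example["entities"]):
            if not (0 <= start < end <= len(text)):
                problems.append((i, start, end, label, "out of range"))
            elif start < last_end:
                problems.append((i, start, end, label, "overlaps previous span"))
            else:
                last_end = end
    return problems


# Print the surface text of every span as a visual sanity check.
for example in TRAIN_DATA:
    for start, end, label in example["entities"]:
        print(label, example["text"][start:end])
```

An empty list from check_offsets(TRAIN_DATA) means every span is in range and non-overlapping.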

Converting to DocBin format

import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans
from tqdm import tqdm

nlp = spacy.blank('zh')   # blank Chinese pipeline
doc_bin = DocBin()
# TRAIN_DATA is the list of annotated examples from the previous step
for training_example in tqdm(TRAIN_DATA):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in labels:
        # char_span returns None when the offsets cannot be mapped to tokens
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    # doc.ents cannot hold overlapping spans, so drop overlaps first
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)
doc_bin.to_disk("train.spacy")
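
Before training, it can be worth reading the file back to confirm that the entities survived the conversion. The sketch below is self-contained: it writes a one-document DocBin to a temporary file and loads it again. The read_entities helper, the sample sentence, and the PER label are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

# The blank Chinese pipeline segments by character by default,
# so no extra segmentation package is needed here.
nlp = spacy.blank("zh")


def read_entities(path):
    """Load a .spacy file and return [(entity_text, label), ...] per document."""
    doc_bin = DocBin().from_disk(path)
    return [
        [(ent.text, ent.label_) for ent in doc.ents]
        for doc in doc_bin.get_docs(nlp.vocab)
    ]


# Tiny demo: write one annotated doc, then read it back.
doc = nlp.make_doc("我叫王大")
doc.ents = [doc.char_span(2, 4, label="PER")]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.spacy"
    DocBin(docs=[doc]).to_disk(demo_path)
    roundtrip = read_entities(demo_path)
print(roundtrip)
```

Calling read_entities("train.spacy") on the real file gives the same kind of listing for the full training set.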

Two files are produced this way: train.spacy for training and dev.spacy for evaluation.
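
The article does not say how the examples were divided between the two files. One possible approach, sketched here with an assumed 80/20 ratio and a fixed seed (both hypothetical choices), is to shuffle the annotated examples once and run each part through the conversion code separately.

```python
import random


def split_train_dev(examples, dev_ratio=0.2, seed=0):
    """Shuffle a copy of the examples and split off a dev portion.

    Returns (train_examples, dev_examples). The input list is not modified.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_ratio))
    return shuffled[n_dev:], shuffled[:n_dev]


# Demo with placeholder items standing in for annotated examples.
train_part, dev_part = split_train_dev(list(range(10)))
print(len(train_part), len(dev_part))
```

The fixed seed keeps the split reproducible, so scores from different training runs stay comparable.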

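Training configuration

Generating the base config

There is no need to write the config file by hand; generate it with the quickstart widget on the official site: https://spacy.io/usage/training#quickstart
After ticking a few options you get a base config file, base_config.cfg: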

# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg
[paths]
train = null
dev = null
vectors = "zh_core_web_lg"

[system]
gpu_allocator = null

[nlp]
lang = "zh"
pipeline = ["tok2vec","ner"]
batch_size = 1000

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"]
rows = [5000, 1000, 2500, 2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

[initialize]
vectors = ${paths.vectors}
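
The training.batcher.size block sets a compounding schedule: the batch size (counted in words, since the batcher is spacy.batch_by_words.v1) starts at 100 and is multiplied by 1.001 at each step until it reaches 1000. The plain-Python generator below is a simplified sketch of that behaviour, not the library's implementation; it assumes start is smaller than stop.

```python
def compounding(start, stop, compound):
    """Yield start, start * compound, start * compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, float(stop))
        value *= compound


# First few batch sizes with the values from the config.
sizes = compounding(100, 1000, 1.001)
first_sizes = [next(sizes) for _ in range(3)]
print(first_sizes)

# Count how many steps it takes to reach the cap.
steps_to_cap = 0
for size in compounding(100, 1000, 1.001):
    if size >= 1000:
        break
    steps_to_cap += 1
print(steps_to_cap)
```

With these settings the cap is reached after a little over 2,300 steps, so early updates use small batches and later ones use the full 1000 words.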

Filling in the full config

Next, run spacy init fill-config [base config] [full config] to expand the base config above into a complete training config:

python -m spacy init fill-config spacy/base_config.cfg spacy/config.cfg

The resulting config.cfg is shown below, with a few manual changes; for example, paths.vectors defaulted to zh_core_web_lg and I changed it to zh_core_web_md.

[paths]
train = null
dev = null
vectors = "zh_core_web_md"
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "zh"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
vectors = {"@vectors":"spacy.Vectors.v1"}

[nlp.tokenizer]
@tokenizers = "spacy.zh.ChineseTokenizer"
segmenter = "char"

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
upstream = "*"

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
rows = [5000,1000,2500,2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
pkuseg_model = null
pkuseg_user_dict = "default"

Running the training

python -m spacy train spacy/config.cfg --output ./spacy/ --paths.train ./train.spacy --paths.dev ./dev.spacy

Parameters:

  • output: the output directory
  • paths.train: the training set file
  • paths.dev: the validation set file
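
The same run can also be started from Python rather than the shell. The sketch below assumes spaCy 3.2 or later, where spacy.cli.train.train is importable; the run_training wrapper and its default paths are hypothetical and simply mirror the command above.

```python
def run_training(
    config_path="spacy/config.cfg",
    output_path="./spacy/",
    train_path="./train.spacy",
    dev_path="./dev.spacy",
    use_gpu=-1,
):
    """Start training with the same settings as the command-line call.

    use_gpu=-1 means CPU; pass a GPU id such as 0 to train on that device.
    """
    # Imported here so the helper can be defined on a machine that only
    # prepares data and does not have spaCy installed.
    from spacy.cli.train import train

    train(
        config_path,
        output_path,
        use_gpu=use_gpu,
        # Dotted keys override config values, like --paths.train on the CLI.
        overrides={"paths.train": train_path, "paths.dev": dev_path},
    )
```

Calling run_training() with no arguments corresponds to the command shown above.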

Training log:

>>> python -m spacy train spacy/config.cfg --output ./spacy/ --paths.train ./train.spacy --paths.dev ./dev.spacy
ℹ Saving to output directory: spacy
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0

=========================== Initializing pipeline ===========================
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     49.29    0.00    0.00    0.00    0.00
  0     200        609.43   3515.82    6.61    7.99    5.63    0.07
  0     400       1104.85   3590.05   10.22   10.26   10.19    0.10
  0     600       1120.82   5038.80   16.23   17.95   14.81    0.16
  0     800       1071.70   5578.76   10.95   14.11    8.95    0.11
  0    1000       1151.26   6506.03   20.62   23.73   18.23    0.21
  0    1200       1100.93   6840.94   26.60   32.95   22.30    0.27
  0    1400       2058.58   7959.36   34.93   39.60   31.25    0.35
  0    1600       1642.29   9632.10   40.32   45.09   36.46    0.40
  1    1800       2580.55  11209.10   38.82   47.18   32.98    0.39
  1    2000       2907.86  13187.84   44.31   52.42   38.38    0.44
  1    2200       3575.63  15214.04   42.97   50.06   37.63    0.43
  2    2400       4790.03  18126.32   48.39   51.29   45.80    0.48
  2    2600       5653.69  17209.21   51.27   54.42   48.47    0.51
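
The log has one row per evaluation, every 200 steps as set by eval_frequency. With patience = 1600 in config.cfg, training stops once 1600 steps pass without a new best score. The sketch below replays the SCORE column against that rule. It is a simplified model of the stopping check, and first_stop_step is a hypothetical helper, not part of spaCy.

```python
# (step, score) pairs copied from the "#" and SCORE columns of the log.
evals = [
    (0, 0.00), (200, 0.07), (400, 0.10), (600, 0.16), (800, 0.11),
    (1000, 0.21), (1200, 0.27), (1400, 0.35), (1600, 0.40), (1800, 0.39),
    (2000, 0.44), (2200, 0.43), (2400, 0.48), (2600, 0.51),
]


def first_stop_step(evals, patience):
    """Return the step where training would stop, or None if it keeps going."""
    best_score, best_step = float("-inf"), 0
    for step, score in evals:
        if score > best_score:
            best_score, best_step = score, step
        # Stop when the best score is at least `patience` steps old.
        if patience and step - best_step >= patience:
            return step
    return None


best_step, best_score = max(evals, key=lambda pair: pair[1])
print("best so far:", best_step, best_score)
print("stop with patience=1600:", first_stop_step(evals, 1600))
print("stop with patience=200:", first_stop_step(evals, 200))
```

In this excerpt the score is still improving at step 2600, so the patience limit has not been reached; a much smaller value such as 200 would have stopped the run at the dip at step 800.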

Testing the model

import spacy

nlp = spacy.load("spacy/model-best")
text = "我的名字是michal johnson,我的手机号是13425456344,我家住在东北松花江上8幢8单元6楼5号房。我叫王大,喜欢去旺角餐厅吃牛角包, 今年买了阿里巴巴的股票,我家住在新洲花园3栋4单元 8988-1室"
doc = nlp(text)
for ent in doc.ents:
    # ent.start / ent.end are token indices;
    # use ent.start_char / ent.end_char for character offsets
    print({
        "start": ent.start,
        "end": ent.end,
        "text": ent.text,
        "entity_group": ent.label_,
    })
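
Printing entities is a useful spot check, but comparing predictions with gold annotations gives a number to track. The sketch below computes exact-match precision, recall and F1 over (start, end, label) tuples, which is the idea behind the ENTS_P, ENTS_R and ENTS_F columns in the training log. The prf helper and the sample spans are hypothetical, and spaCy's own scorer handles more cases than this.

```python
def prf(gold, pred):
    """Exact-match precision, recall and F1 over (start, end, label) tuples."""
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: the second prediction has the wrong end offset,
# so it counts as both a false positive and a false negative.
gold_spans = [(0, 2, "PER"), (4, 8, "ORG")]
pred_spans = [(0, 2, "PER"), (4, 7, "ORG")]
scores = prf(gold_spans, pred_spans)
print(scores)
```

A span only counts as correct when its boundaries and label both match, so a prediction that is off by one character earns no partial credit.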


References

  1. https://ubiai.tools/fine-tuning-spacy-models-customizing-named-entity-recognition/
  2. https://spacy.io/usage/training
  3. https://ner.pythonhumanities.com/03_02_train_spacy_ner_model.html
