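Connectionist Temporal Classification (CTC)

CTC, short for Connectionist Temporal Classification, is an algorithm widely used in speech recognition, text recognition, and related fields. It handles the case where the input and output sequences have different lengths and no alignment between them is available. In CRNN, it serves as the model's loss function (CTC loss).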

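I. Background

Aligning characters with speech is difficult and error-prone, because the boundaries of many phonemes are hard to tell apart. CTC does not require aligned training data, which makes it well suited to problems such as speech recognition and handwriting recognition.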

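We introduce some notation to describe more formally the problem CTC solves. Suppose the input sequence is X = [x_1, x_2, ..., x_T]; in speech recognition, for example, it consists of T frames, each frame x_t being a 39-dimensional MFCC feature vector. The output sequence is Y = [y_1, y_2, ..., y_U]. This task is hard to cast as a simple classification problem because:

  • Both X and Y vary in length
  • The ratio of their lengths also varies (there is no simple proportional correspondence between the lengths of X and Y)
  • The training data contains no alignment between X and Y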

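NLP tasks have clear boundaries: their inputs and outputs are logical symbols whose boundaries are obvious, whereas the boundaries in a speech signal are blurry. CTC addresses these problems. Given an input X, CTC defines a distribution P(Y \vert X) over all possible outputs Y. With this probability we can infer the most likely output or evaluate the probability of a particular Y. During training, we compute a loss function from it to update the parameters. At prediction time, we look for the most likely output, Y^*=\underset{Y}{argmax}P(Y|X). CTC has no exact algorithm that finds this optimum efficiently, but approximate algorithms find a good solution in reasonable time.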

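II. The CTC Algorithm

Given X, the CTC algorithm assigns a probability to every output Y. Computing it requires two ingredients: 1. alignment and 2. the loss function.

1. Alignment

CTC does not require an alignment between input and output. Instead, it enumerates all possible alignments and sums their probabilities. Let us first try a naive alignment in which every input step is assigned one character and repeated characters are collapsed, so that, for example, the alignment [c, c, a, a, a, t] yields the output cat.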

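This naive alignment has two problems:

  • Not every input step necessarily corresponds to an actual output.
    In speech recognition, for instance, there are stretches of silence, and those inputs do not correspond to any output.
  • It cannot produce consecutive identical characters.
    For example, for a hypothetical word caat, the alignment above can only ever produce cat.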

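To solve these problems, CTC introduces a new special symbol \epsilon, the blank token, which is removed from the final output.

If the output contains two consecutive identical characters, at least one blank must appear between them in the alignment; this is how hello can be told apart from helo. CTC alignments have a few notable properties. First, they are monotonic: if we advance to the next input, we can keep the corresponding output the same or advance to the next one. Second, the mapping from input to output is many-to-one: one or more input steps can align to a single output element, but not the other way around. It follows that the input sequence must be at least as long as the output sequence.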
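The collapsing rule described above (merge repeated symbols, then drop blanks) can be sketched in a few lines of Python. This is only an illustration; the function name `collapse` and the choice to represent the blank as the character ϵ are assumptions made for this example.

```python
from itertools import groupby

BLANK = "\u03f5"  # the blank symbol (epsilon); the spelling is an assumption of this sketch

def collapse(alignment, blank=BLANK):
    """Map a CTC alignment to its output: merge repeats first, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(alignment)]
    return [symbol for symbol in merged if symbol != blank]

# A blank between the two l's keeps them apart ...
with_blank = ["h", "h", "e", "l", "l", BLANK, "l", "o"]
# ... whereas without it the repeated l's are merged into one.
without_blank = ["h", "h", "e", "l", "l", "l", "o"]

print("".join(collapse(with_blank)))     # hello
print("".join(collapse(without_blank)))  # helo
```

The order of the two steps matters: removing blanks before merging repeats would turn both alignments into helo.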

2. Loss function

The CTC alignments give us a natural way to go from probabilities at each time-step to the probability of an output sequence.

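The RNN computes, for every time step t, a distribution p_t(a \vert X) over output symbols, i.e. the probability of emitting symbol a at time t. With the five symbols of the hello example (h, e, l, o and \epsilon) and an input of length T, there are in principle 5^T different alignments (paths); many of them have very low probability and contribute little. Some of these paths produce the same output, for example "hello", and adding up their probabilities gives P("hello" \vert X). That is: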

P(Y|X)= \underset{A \in \mathcal{A}_{X,Y}}{\sum} \prod_{t=1}^{T}p_t(a_t|X)     (marginalizes over the set of valid alignments)
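To make the marginalization concrete, here is a brute-force sketch that enumerates every alignment of length T and sums the probabilities of those that collapse to Y. It is exponential in T and only meant for toy inputs. The names `brute_force_ctc` and `collapse`, the convention that the blank has index 0, and the probability table are all assumptions of this example.

```python
from itertools import groupby, product

def collapse(alignment, blank):
    """Merge repeated symbols, then drop blanks."""
    return tuple(s for s, _ in groupby(alignment) if s != blank)

def brute_force_ctc(probs, target, blank=0):
    """Sum the probability of every length-T alignment that collapses to target.

    probs[t][k] plays the role of p_t(k | X); symbols are integer indices.
    """
    T, K = len(probs), len(probs[0])
    total = 0.0
    for alignment in product(range(K), repeat=T):
        if collapse(alignment, blank) == tuple(target):
            p = 1.0
            for t, k in enumerate(alignment):
                p *= probs[t][k]
            total += p
    return total

# Made-up per-step distributions over [blank, a, b] for T = 3.
probs = [[0.2, 0.5, 0.3],
         [0.4, 0.4, 0.2],
         [0.3, 0.3, 0.4]]

print(brute_force_ctc(probs, [1]))     # P(Y = [a] | X)
print(brute_force_ctc(probs, [1, 2]))  # P(Y = [a, b] | X)
```

Because every alignment collapses to exactly one output, the values of P(Y|X) over all possible Y sum to 1.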

Given X and Y, enumerating every path directly would be very inefficient, because the number of paths grows exponentially with T. Thankfully, we can compute the loss much faster with a dynamic programming algorithm. The key insight is that if two alignments have reached the same output at the same step, then we can merge them.

Since we can have an ϵ before or after any token in Y, it’s easier to describe the algorithm using a sequence which includes them. We’ll work with the sequence

Z=[\epsilon, y_1, \epsilon, y_2, ..., y_U, \epsilon]

which is Y with an ϵ at the beginning, end, and between every character.

Let \alpha be the score of the merged alignments at a given node. More precisely, \alpha_{s,t} is the CTC score of the subsequence Z_{1:s} after t input steps. As we’ll see, we’ll compute the final CTC score, P(Y|X), from the \alpha’s at the last time-step. As long as we know the values of \alpha at the previous time-step, we can compute \alpha_{s,t}. There are two cases.

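Case 1: z_{s-1} cannot be skipped.
This happens for one of two reasons:
  • z_{s-1} is an element of Y, i.e. z_s = ϵ
  • z_{s-1} is an ϵ separating two identical elements, i.e. z_s = z_{s-2}
Case 2: z_{s-1} can be skipped.
This happens when z_{s-1} is an ϵ between two distinct elements, i.e. z_s \neq ϵ and z_s \neq z_{s-2}.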

Concretely:

  • Case 1: \alpha_{s,t} = (\alpha_{s-1,t-1}+\alpha_{s,t-1}) \cdot p_t(z_s|X)
  • Case 2: \alpha_{s,t} = (\alpha_{s-2,t-1} + \alpha_{s-1,t-1}+\alpha_{s,t-1}) \cdot p_t(z_s|X)

Below is an example of the computation performed by the dynamic programming algorithm. Every valid alignment has a path in this graph.

There are two valid starting nodes and two valid final nodes since the ϵ at the beginning and end of the sequence is optional. The complete probability is the sum of the two final nodes.
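The recursion, the two starting nodes and the two final nodes can be put together in a short sketch. This is a plain-probability version meant for reading, not a production implementation; the function name `ctc_forward`, the blank index 0 and the toy probabilities are assumptions of this example.

```python
def ctc_forward(probs, target, blank=0):
    """Return P(target | X) using the alpha recursion over Z = [blank, y_1, blank, ..., y_U, blank].

    probs[t][k] plays the role of p_t(k | X); target is a list of non-blank indices.
    """
    z = [blank]
    for y in target:
        z += [y, blank]
    S = len(z)

    # alpha[s] holds the score of node s (0-based) at the current time step.
    alpha = [0.0] * S
    alpha[0] = probs[0][z[0]]        # start in the leading blank ...
    if S > 1:
        alpha[1] = probs[0][z[1]]    # ... or directly in y_1

    for t in range(1, len(probs)):
        prev, alpha = alpha, [0.0] * S
        for s in range(S):
            total = prev[s]                  # stay on the same node
            if s >= 1:
                total += prev[s - 1]         # advance by one node
            # Case 2: the node in between is a blank separating two distinct symbols.
            if s >= 2 and z[s] != blank and z[s] != z[s - 2]:
                total += prev[s - 2]
            alpha[s] = total * probs[t][z[s]]

    # Two valid final nodes: the last symbol and the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)

# Made-up per-step distributions over [blank, a, b] for T = 3.
probs = [[0.2, 0.5, 0.3],
         [0.4, 0.4, 0.2],
         [0.3, 0.3, 0.4]]
print(ctc_forward(probs, [1]))     # P(Y = [a] | X)
print(ctc_forward(probs, [1, 2]))  # P(Y = [a, b] | X)
```

The cost is proportional to T times the length of Z, instead of growing exponentially with T as in direct enumeration.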

Now that we can efficiently compute the loss function, the next step is to compute a gradient and train the model. The CTC loss function is differentiable with respect to the per time-step output probabilities since it’s just sums and products of them. Given this, we can analytically compute the gradient of the loss function with respect to the (unnormalized) output probabilities and from there run backpropagation as usual.

For a training set \mathcal{D}, the model’s parameters are tuned to minimize the negative log-likelihood

L=\underset{(X,Y) \in \mathcal{D}}{\sum}-\log P(Y|X)

instead of maximizing the likelihood directly.
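In practice the product of many per-step probabilities underflows for long inputs, so the negative log-likelihood is usually computed in log space. The following sketch shows one way to do that with the same recursion; `logsumexp`, `ctc_nll` and `dataset_loss` are names chosen for this example, the blank index 0 is an assumption, and the toy data is made up.

```python
import math

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_nll(log_probs, target, blank=0):
    """-log P(target | X); log_probs[t][k] plays the role of log p_t(k | X)."""
    z = [blank]
    for y in target:
        z += [y, blank]
    S = len(z)
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][z[0]]
    if S > 1:
        alpha[1] = log_probs[0][z[1]]
    for t in range(1, len(log_probs)):
        prev, alpha = alpha, [NEG_INF] * S
        for s in range(S):
            terms = [prev[s]]
            if s >= 1:
                terms.append(prev[s - 1])
            if s >= 2 and z[s] != blank and z[s] != z[s - 2]:
                terms.append(prev[s - 2])
            alpha[s] = logsumexp(*terms) + log_probs[t][z[s]]
    final = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -final

def dataset_loss(dataset):
    """Sum of -log P(Y | X) over (log_probs, target) pairs."""
    return sum(ctc_nll(lp, y) for lp, y in dataset)

# Made-up per-step distributions over [blank, a, b] for T = 3.
probs = [[0.2, 0.5, 0.3], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]]
log_probs = [[math.log(p) for p in row] for row in probs]
print(dataset_loss([(log_probs, [1]), (log_probs, [1, 2])]))
```

Minimizing this sum is equivalent to maximizing the product of the likelihoods, but the sum of logarithms is far better behaved numerically.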

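3. Inference

Once the model is trained, we want to use it to find the most likely output. Specifically, we need to solve:

Y^*=\underset{Y}{argmax}P(Y|X)

One option is a greedy algorithm that takes the most probable symbol at each time step, which yields the single most likely alignment. However, this ignores the fact that a single output can have many alignments. For example, assume the alignments [a, a, ϵ] and [a, a, a] individually have lower probability than [b, b, b], but the sum of their probabilities is greater than that of [b, b, b]. The naive heuristic will incorrectly propose Y = [b] as the most likely hypothesis. It should have chosen Y = [a].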
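The effect can be reproduced on a tiny example. The sketch below uses made-up distributions over two time steps (so the alignments differ from the three-step ones named above) and compares the greedy result with the exact output probabilities obtained by enumeration. All names and numbers here are assumptions chosen for illustration.

```python
from itertools import groupby, product

BLANK, A, B = 0, 1, 2

def collapse(alignment):
    """Merge repeated symbols, then drop blanks."""
    return tuple(s for s, _ in groupby(alignment) if s != BLANK)

def greedy_decode(probs):
    """Take the most probable symbol at every step, then collapse."""
    best_path = [max(range(len(row)), key=row.__getitem__) for row in probs]
    return collapse(best_path)

def output_probability(probs, target):
    """Exact P(target | X) by enumerating all alignments (toy sizes only)."""
    total = 0.0
    for alignment in product(range(len(probs[0])), repeat=len(probs)):
        if collapse(alignment) == tuple(target):
            p = 1.0
            for t, k in enumerate(alignment):
                p *= probs[t][k]
            total += p
    return total

# Columns are [blank, a, b]; the numbers are made up.
probs = [[0.25, 0.35, 0.40],
         [0.40, 0.35, 0.25]]

print(greedy_decode(probs))             # (2,), i.e. Y = [b]
print(output_probability(probs, [B]))   # about 0.3225
print(output_probability(probs, [A]))   # about 0.35
```

The best single alignment is [b, ϵ], so greedy decoding returns [b], yet the three alignments of [a] together carry more probability than the three alignments of [b].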

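We can use a modified beam search instead. It is not guaranteed to find the optimal solution, but the beam size lets us trade speed for accuracy: the smaller the beam, the faster the search; the larger the beam, the better the solution found. In the limit, a beam size of 1 makes ordinary beam search equivalent to the greedy algorithm above, while a beam large enough to keep every candidate at every step (a number that grows exponentially with T) amounts to exhaustive search and is guaranteed to find the optimum.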

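Ordinary beam search keeps the best N paths at each time step, expands those N paths at time t+1, selects the best N among all expansions, and continues until the final step T. The figure below shows ordinary beam search with a beam size of 3. In the figure, at t=3 two paths, [a, ϵ] and [ϵ, a], both produce the output a, so they can (potentially) be merged.

We can therefore improve beam search by merging paths that yield the same output. Merging here means collapsing repeated characters into one and removing blanks, then summing the probabilities of all paths with the same output.

The search process of the improved algorithm is shown in the figure below (beam size 3).

At t=3, in the lower part, [b, a, ϵ] and [b, a, a] are merged into the same result [b, a]. Also note that at t=3, in the upper part, extending [a] with another a produces two paths: [a, a] and [a].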

A proposed extension can map to two output prefixes if the character is a repeat. This is shown at T=3 in the figure above where ‘a’ is proposed as an extension to the prefix [a]. Both [a] and [a, a] are valid outputs for this proposed extension.

When we extend [a] to produce [a,a], we only want to include the part of the previous score for alignments which end in ϵ. Remember, the ϵ is required between repeat characters. Similarly, when we don’t extend the prefix and produce [a], we should only include the part of the previous score for alignments which don’t end in ϵ.

Given this, we have to keep track of two probabilities for each prefix in the beam. The probability of all alignments which end in ϵ and the probability of all alignments which don’t end in ϵ. When we rank the hypotheses at each step before pruning the beam, we’ll use their combined scores.
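The bookkeeping with two probabilities per prefix can be sketched as follows. This is a minimal illustration without a language model and in plain probabilities rather than log space; the function name `prefix_beam_search`, the blank index 0 and the toy distributions are assumptions of this example, and it is not the implementation referred to in the text.

```python
from collections import defaultdict

def prefix_beam_search(probs, beam_size=3, blank=0):
    """Return (best_prefix, probability) found by CTC prefix beam search.

    probs[t][k] plays the role of p_t(k | X). Each prefix carries two scores:
    p_b  = probability of its alignments that end in blank,
    p_nb = probability of its alignments that end in a non-blank symbol.
    """
    beam = {(): (1.0, 0.0)}  # the empty prefix counts as "ending in blank"
    for row in probs:
        next_beam = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beam.items():
            for k, p in enumerate(row):
                if k == blank:
                    # A blank never changes the prefix.
                    next_beam[prefix][0] += (p_b + p_nb) * p
                elif prefix and k == prefix[-1]:
                    # Repeat: without a blank in between, the prefix stays the same ...
                    next_beam[prefix][1] += p_nb * p
                    # ... and only alignments ending in blank may grow to [.., a, a].
                    next_beam[prefix + (k,)][1] += p_b * p
                else:
                    next_beam[prefix + (k,)][1] += (p_b + p_nb) * p
        # Rank by the combined score, then prune to the beam size.
        ranked = sorted(next_beam.items(), key=lambda kv: kv[1][0] + kv[1][1], reverse=True)
        beam = {prefix: tuple(scores) for prefix, scores in ranked[:beam_size]}
    best, (p_b, p_nb) = max(beam.items(), key=lambda kv: kv[1][0] + kv[1][1])
    return list(best), p_b + p_nb

# Columns are [blank, a, b]; the numbers are made up.
probs = [[0.25, 0.35, 0.40],
         [0.40, 0.35, 0.25]]
print(prefix_beam_search(probs, beam_size=1))  # finds [2], i.e. Y = [b]
print(prefix_beam_search(probs, beam_size=3))  # finds [1], i.e. Y = [a]
```

On this toy input a beam of 1 prunes the prefix [a] after the first step and ends with [b], while a beam of 3 keeps it and recovers the more probable output [a].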

The implementation of this algorithm doesn’t require much code, but it is dense and tricky to get right. Check out this gist for an example implementation in Python.

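In some problems, such as speech recognition, incorporating a language model over the outputs significantly improves accuracy. We can include the language model as a factor in the inference problem:

Y^*=\underset{Y}{argmax} P(Y|X) \cdot P(Y)^{\alpha} \cdot L(Y)^{\beta}

Here P(Y) is the language model probability of Y. Note that \alpha and \beta in this expression are weighting parameters, unrelated to the forward variable \alpha_{s,t} used earlier.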

The function L(Y) computes the length of Y in terms of the language model tokens and acts as a word insertion bonus. With a word-based language model L(Y) counts the number of words in Y. If we use a character-based language model then L(Y) counts the number of characters in Y. The language model scores are only included when a prefix is extended by a character (or word) and not at every step of the algorithm. This causes the search to favor shorter prefixes, as measured by L(Y), since they don’t include as many language model updates. The word insertion bonus helps with this. The parameters \alpha and \beta are usually set by cross-validation.

The language model scores and word insertion term can be included in the beam search. Whenever we propose to extend a prefix by a character, we can include the language model score for the new character given the prefix so far.
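As a small illustration of how the three factors interact, the sketch below scores complete candidates in log space, assuming the combined objective has the form P(Y|X) · P(Y)^α · L(Y)^β. The function name `combined_score`, the default values of `alpha` and `beta`, and the candidate scores are all hypothetical; in a real decoder the same terms would be added incrementally inside the beam search.

```python
import math

def combined_score(log_p_ctc, log_p_lm, length, alpha=0.5, beta=1.0):
    """log of P(Y|X) * P(Y)^alpha * L(Y)^beta.

    `length` is L(Y), counted in language model tokens, and must be at least 1.
    """
    return log_p_ctc + alpha * log_p_lm + beta * math.log(length)

# Hypothetical candidates: (text, log P(Y|X), log P_LM(Y)).
candidates = [
    ("triple a", -4.0, -6.0),
    ("tripel a", -3.8, -12.0),
]

def rescore(candidates, alpha=0.5, beta=1.0):
    """Return the candidate text with the highest combined score (word-level L(Y))."""
    return max(
        candidates,
        key=lambda c: combined_score(c[1], c[2], len(c[0].split()), alpha, beta),
    )[0]

print(rescore(candidates))  # triple a
```

With these made-up numbers the misspelled candidate has the slightly better CTC score, but the language model term outweighs that difference.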

III. Properties of CTC

We mentioned a few important properties of CTC so far. Here we’ll go into more depth on what these properties are and what trade-offs they offer.

1. Conditional Independence

One drawback of CTC is its conditional independence assumption. The model assumes that every output is conditionally independent of the other outputs given the input. This is a bad assumption for many sequence to sequence problems.

Say we had an audio clip of someone saying “triple A”. Another valid transcription could be “AAA”. If the first letter of the predicted transcription is ‘A’, then the next letter should be ‘A’ with high probability and ‘r’ with low probability. The conditional independence assumption does not allow for this.

If we predict an ‘A’ as the first letter then the suffix ‘AA’ should get much more probability than ‘riple A’. If we predict ‘t’ first, the opposite should be true.

In fact speech recognizers using CTC don’t learn a language model over the output nearly as well as models which are conditionally dependent. However, a separate language model can be included and usually gives a good boost to accuracy.

The conditional independence assumption made by CTC isn’t always a bad thing. Baking in strong beliefs over output interactions makes the model less adaptable to new or altered domains. For example, we might want to use a speech recognizer trained on phone conversations between friends to transcribe customer support calls. The language in the two domains can be quite different even if the acoustic model is similar. With a CTC acoustic model, we can easily swap in a new language model as we change domains.

2. Alignment Properties

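CTC does not require aligned training data; it marginalizes over all alignments that yield the same output. Although CTC does assume that X and Y are related through alignments of a particular form, the model places no constraint on how probability is distributed among them: whether it is spread fairly evenly over all valid paths or concentrated on a few is not determined in advance.

CTC requires alignments to be monotonic. This is a reasonable assumption for speech recognition, but not for other tasks such as machine translation, because word order differs across languages. For example, in the English phrase "a friend of mine", "friend" comes before "mine", whereas in the Chinese equivalent 我的朋友, 我的 ("my") comes before 朋友 ("friend").

CTC also requires the input-to-output mapping to be many-to-one. Some tasks call for a strict one-to-one relationship, such as part-of-speech tagging, and CTC is a poor fit for them. Nor can CTC express a one-to-many relationship, where a single input corresponds to several outputs. In English, for example, "th" is often a single sound, so one input step may need to correspond to the two output characters t and h; CTC cannot represent this.

Finally, CTC implicitly requires that the output be no longer than the input. This is reasonable in speech recognition, where inputs are long, but it may not hold for other tasks.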
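The length requirement can be made precise: because a blank is needed between repeated characters, an output Y needs at least as many input steps as it has characters, plus one for every pair of adjacent identical characters. A minimal sketch, with `min_input_length` as a name chosen for this example:

```python
def min_input_length(target):
    """Smallest T for which `target` has at least one valid CTC alignment."""
    repeats = sum(1 for prev, cur in zip(target, target[1:]) if prev == cur)
    return len(target) + repeats

print(min_input_length("cat"))    # 3
print(min_input_length("hello"))  # 6: one blank is needed between the two l's
```

An input shorter than this value has no valid alignment for the target, so P(Y|X) is zero and the loss is infinite.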

IV. CTC in Context

In this section we’ll discuss how CTC relates to other commonly used algorithms for sequence modeling.

1. HMMs

Hidden Markov Model (HMM) and CTC are actually quite similar. Understanding the relationship between them will help us understand what advantages CTC has over HMM sequence models and give us insight into how CTC could be changed for various use cases.

CTC HMM: The first two nodes are the starting states and the last two nodes are the final states.

 

References

Single articles

CTC理论和实战 - 李理的博客

Sequence Modeling with CTC

  1. Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition
    Chan, W., Jaitly, N., Le, Q.V. and Vinyals, O., 2016. ICASSP.
  2. Exploring Neural Transducers for End-to-End Speech Recognition
    Battenberg, E., Chen, J., Child, R., Coates, A., Gaur, Y., Li, Y., Liu, H., Satheesh, S., Seetapun, D., Sriram, A. and Zhu, Z., 2017.
  3. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks
    Graves, A., Fernandez, S., Gomez, F. and Schmidhuber, J., 2006. Proceedings of the 23rd International Conference on Machine Learning, pp. 369-376. DOI: 10.1145/1143844.1143891

Collections

深度学习理论与实战:提高篇 - 李理的博客
