1. Data and Project Files
The data directory is laid out as shown below. Note that we cannot model the raw audio directly: the audio must first be preprocessed into a set of numerical features, and the model is then trained on those features. The extracted feature data is stored in the processed folder.
2. Environment Setup
pip install librosa
librosa handles loading and preprocessing the audio data.
pip install pysptk
Building this package requires a C toolchain in some environments; on Windows you may need to install Visual Studio.
pip install pyworld
3. Data Preprocessing and Feature Extraction
Run preprocess.py with the argument --dataset VCC2016.
(1) Audio signal preprocessing
- First, resample to 16 kHz, i.e. 16,000 samples per second.
- Then apply pre-emphasis: generally speaking, the high-frequency components carry more useful information, so we boost them to give the high frequencies more weight (y[n] = x[n] - 0.97 * x[n-1]).
- Finally, split the signal into frames (sliding time windows), giving a sequence of short segments.
Implementation: read the audio with librosa.
import os
import numpy as np
import librosa

def load_wavs(dataset: str, sr):
    """`data`: maps each speaker to its audio file paths.
    `resdict`: maps each speaker to the loaded (preprocessed) wav signals.
    """
    data = {}
    with os.scandir(dataset) as it:  # one sub-directory per speaker
        for entry in it:
            if entry.is_dir():
                data[entry.name] = []
                with os.scandir(entry.path) as it_f:
                    for onefile in it_f:
                        if onefile.is_file():
                            data[entry.name].append(onefile.path)
    print(f'* Loaded keys: {data.keys()}')

    resdict = {}
    cnt = 0
    for key, value in data.items():
        resdict[key] = {}
        for one_file in value:
            filename = one_file.split('/')[-1].split('.')[0]
            newkey = f'{filename}'
            # sr: sampling rate, mono: single channel
            wav, _ = librosa.load(one_file, sr=sr, mono=True, dtype=np.float64)
            # trim leading/trailing silence
            y, _ = librosa.effects.trim(wav, top_db=15)
            # pre-emphasis: boost the high frequencies, which usually carry more useful information
            wav = np.append(y[0], y[1:] - 0.97 * y[:-1])
            resdict[key][newkey] = wav
            print('.', end='')
            cnt += 1
    print(f'\n* Total audio files: {cnt}.')
    return resdict
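A minimal usage sketch of the function above; the directory path (one sub-folder per speaker) is only an assumption and should match your own data layout:

wavs = load_wavs('./data/vcc2016_training', sr=16000)
for speaker, utts in wavs.items():
    print(speaker, len(utts), 'utterances')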
(2) Feature overview
Fundamental frequency (F0): a sound can be decomposed into sine waves of different frequencies; the one with the lowest frequency is the fundamental frequency.
Spectral envelope: speech is a time-series signal. For example, an audio file sampled at 16 kHz contains 16,000 samples per second; after framing we obtain many sub-sequences, apply a Fourier transform to each, and get a frequency-amplitude plot. The spectral envelope is the curve that describes the overall trend of this frequency-amplitude plot.
Aperiodic parameters: computed from F0 and the spectral envelope.
Implementation: note that pyworld is used here.
import pyworld as pw

def world_features(wav, sr, fft_size, dim, shiftms):
    # Fundamental frequency (F0): a natural sound is a mixture of sine waves of
    # different frequencies; the lowest-frequency component is the fundamental.
    f0, timeaxis = pw.harvest(wav, sr, frame_period=shiftms)
    # Spectral envelope.
    sp = pw.cheaptrick(wav, f0, timeaxis, sr, fft_size=fft_size)
    # Aperiodic parameters.
    ap = pw.d4c(wav, f0, timeaxis, sr, fft_size=fft_size)
    return f0, timeaxis, sp, ap
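A quick sketch of calling it on a single utterance; the parameter values (1024-point FFT, 5 ms frame shift, 36 dimensions) are assumptions chosen only for illustration:

import numpy as np

wav = np.random.randn(16000).astype(np.float64)  # 1 s of dummy audio at 16 kHz
f0, timeaxis, sp, ap = world_features(wav, sr=16000, fft_size=1024, dim=36, shiftms=5.0)
print(f0.shape, sp.shape, ap.shape)  # (frames,), (frames, 513), (frames, 513)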
(3)MFCC
Pipeline: continuous speech -> pre-emphasis -> windowing and framing -> FFT -> mel filter bank -> log -> DCT
In general, human hearing is more sensitive to low frequencies: going from 100 Hz to 200 Hz we clearly notice the change, while going from 4000 Hz to 4100 Hz the difference is barely perceptible. You can think of this in terms of slope: the perceived-pitch curve looks roughly like a logarithmic function of frequency.
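This relationship is captured by the mel scale; a commonly used mapping is mel(f) = 2595 * log10(1 + f / 700), which is roughly linear below about 1 kHz and grows logarithmically above it, matching the log-like sensitivity curve described above.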
The FFT (Fourier transform) moves the speech into the frequency domain, and the mel filter bank then warps that spectrum to approximate human auditory perception.
Finally, the DCT extracts the envelope of each frame (each frame yields a whole vector of coefficients).
Implementation:
def cal_mcep(wav, sr, dim, fft_size, shiftms, alpha):
    """Calculate MCEPs given a wav signal."""
    f0, timeaxis, sp, ap = world_features(wav, sr, fft_size, dim, shiftms)
    # MFCC pipeline: speech -> pre-emphasis -> framing/windowing -> FFT -> mel filter bank -> log -> DCT
    mcep = mcep_from_spec(sp, dim, alpha)
    mcep = mcep.T
    return f0, ap, mcep
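mcep_from_spec is defined elsewhere in preprocess.py; a plausible minimal sketch, assuming it simply wraps pysptk.sp2mc (an assumption, not necessarily the project's exact implementation):

import pysptk

def mcep_from_spec(sp, dim, alpha):
    # Assumed wrapper: convert the WORLD spectral envelope to mel-cepstral
    # coefficients of order `dim`, using the all-pass constant `alpha` for warping.
    return pysptk.sp2mc(sp, dim, alpha)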
4. Network Architecture
(1) Generator
In the generator, 2D convolutions first downsample the input, which is then reshaped to 1D; after AdaIN and GLU gated units, 1D residual blocks extract features. The result is then reshaped back to 2D, upsampled through GLU-gated blocks, and the generated output is produced.
Code:
# DownsampleBlock, UpSampleBlock and ResidualBlock are defined elsewhere in model.py.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, num_speakers=4):
        super(Generator, self).__init__()
        self.num_speakers = num_speakers
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

        # Initial layer.
        self.conv_layer_1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=128, kernel_size=(5, 15),
                      stride=(1, 1), padding=(2, 7)),
            nn.GLU(dim=1))

        # Down-sampling layers.
        self.down_sample_1 = DownsampleBlock(dim_in=64, dim_out=256, kernel_size=(5, 5),
                                             stride=(2, 2), padding=(2, 2), bias=False)
        self.down_sample_2 = DownsampleBlock(dim_in=128, dim_out=512, kernel_size=(5, 5),
                                             stride=(2, 2), padding=(2, 2), bias=False)

        # Reshape to 1D (done in forward), then down-conversion.
        self.down_conversion = nn.Sequential(
            nn.Conv1d(in_channels=2304, out_channels=256, kernel_size=1,
                      stride=1, padding=0, bias=False),
            nn.InstanceNorm1d(num_features=256, affine=True))

        # Bottleneck: nine 1D residual blocks conditioned on the concatenated speaker codes.
        self.residual_1 = ResidualBlock(dim_in=256, dim_out=512, kernel_size=5, stride=1,
                                        padding=2, style_num=self.num_speakers * 2)
        self.residual_2 = ResidualBlock(dim_in=256, dim_out=512, kernel_size=5, stride=1,
                                        padding=2, style_num=self.num_speakers * 2)
        self.residual_3 = ResidualBlock(dim_in=256, dim_out=512, kernel_size=5, stride=1,
                                        padding=2, style_num=self.num_speakers * 2)
        self.residual_4 = ResidualBlock(dim_in=256, dim_out=512, kernel_size=5, stride=1,
                                        padding=2, style_num=self.num_speakers * 2)
        self.residual_5 = ResidualBlock(dim_in=256, dim_out=512, kernel_size=5, stride=1,
                                        padding=2, style_num=self.num_speakers * 2)
        self.residual_6 = ResidualBlock(dim_in=256, dim_out=512, kernel_size=5, stride=1,
                                        padding=2, style_num=self.num_speakers * 2)
        self.residual_7 = ResidualBlock(dim_in=256, dim_out=512, kernel_size=5, stride=1,
                                        padding=2, style_num=self.num_speakers * 2)
        self.residual_8 = ResidualBlock(dim_in=256, dim_out=512, kernel_size=5, stride=1,
                                        padding=2, style_num=self.num_speakers * 2)
        self.residual_9 = ResidualBlock(dim_in=256, dim_out=512, kernel_size=5, stride=1,
                                        padding=2, style_num=self.num_speakers * 2)

        # Up-conversion, then reshape back to 2D (done in forward).
        self.up_conversion = nn.Conv1d(in_channels=256, out_channels=2304, kernel_size=1,
                                       stride=1, padding=0, bias=False)

        # Up-sampling layers.
        self.up_sample_1 = UpSampleBlock(dim_in=256, dim_out=1024, kernel_size=(5, 5),
                                         stride=(1, 1), padding=2, bias=False)
        self.up_sample_2 = UpSampleBlock(dim_in=128, dim_out=512, kernel_size=(5, 5),
                                         stride=(1, 1), padding=2, bias=False)

        # TODO: The last layer differs from the paper.
        self.out = nn.Conv2d(in_channels=64, out_channels=1,  # 35 in paper
                             kernel_size=(5, 15), stride=(1, 1), padding=(2, 7), bias=False)

    def forward(self, x, c, c_):
        # Concatenate source and target speaker one-hot codes as the style condition.
        c_onehot = torch.cat((c, c_), dim=1).to(self.device)
        width_size = x.size(3)

        x = self.conv_layer_1(x)
        x = self.down_sample_1(x)
        x = self.down_sample_2(x)

        x = x.contiguous().view(-1, 2304, width_size // 4)
        x = self.down_conversion(x)

        x = self.residual_1(x, c_onehot)
        x = self.residual_2(x, c_onehot)
        x = self.residual_3(x, c_onehot)
        x = self.residual_4(x, c_onehot)
        x = self.residual_5(x, c_onehot)
        x = self.residual_6(x, c_onehot)
        x = self.residual_7(x, c_onehot)
        x = self.residual_8(x, c_onehot)
        x = self.residual_9(x, c_onehot)

        x = self.up_conversion(x)
        x = x.view(-1, 256, 9, width_size // 4)

        x = self.up_sample_1(x)
        x = self.up_sample_2(x)

        out = self.out(x)
        out_reshaped = out[:, :, :-1, :]
        return out_reshaped
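A quick shape-check sketch of the generator; the input shape (batch, 1, 36 MCEP dims, 256 frames) and the speaker indices below are assumptions for illustration:

import torch
import torch.nn.functional as F

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
G = Generator(num_speakers=4).to(device)

x = torch.randn(2, 1, 36, 256).to(device)                    # assumed (batch, channel, mcep dim, frames)
c  = F.one_hot(torch.tensor([0, 1]), 4).float().to(device)   # source speaker codes
c_ = F.one_hot(torch.tensor([2, 3]), 4).float().to(device)   # target speaker codes

fake = G(x, c, c_)
print(fake.shape)  # about (2, 1, 35, 256); the exact shape depends on the up-sampling blocks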
(2) Label handling
First, the one-hot vectors for the source and target speakers are concatenated; the concatenated vector is then used in AdaIN to learn the scale (weight) and bias parameters.
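A minimal sketch of building this condition vector (the speaker indices and batch size are illustrative assumptions):

import torch
import torch.nn.functional as F

num_speakers = 4
src_idx = torch.tensor([0])                    # source speaker index (illustrative)
trg_idx = torch.tensor([2])                    # target speaker index (illustrative)
c  = F.one_hot(src_idx, num_speakers).float()  # shape (B, num_speakers)
c_ = F.one_hot(trg_idx, num_speakers).float()
c_onehot = torch.cat((c, c_), dim=1)           # shape (B, 2 * num_speakers), fed to AdaIN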
(3) Discriminator
The input features x first pass through a convolution and a GLU gate, then through several downsampling blocks; finally a fully connected (FC) layer produces a B*1 vector.
Label handling: each domain is first one-hot encoded, giving a B*d vector; the source and target codes are concatenated and projected to a B*C vector. A global sum pooling (GSP) layer collapses the B*C*H*W feature map into a B*C vector. The element-wise product of these two B*C vectors is summed over the channel dimension to give a B*1 term, which is added to the B*1 output of the FC layer to produce the final prediction.
Code
class Discriminator(nn.Module):
    def __init__(self, num_speakers=4):
        super(Discriminator, self).__init__()
        self.num_speakers = num_speakers
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

        # Initial gated convolution.
        self.conv_layer_1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=128, kernel_size=(3, 3),
                      stride=(1, 1), padding=1),
            nn.GLU(dim=1))
        self.conv_gated_1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=128, kernel_size=(3, 3),
                      stride=(1, 1), padding=1),
            nn.GLU(dim=1))

        # Down-sampling layers.
        self.down_sample_1 = DownsampleBlock(dim_in=64, dim_out=256, kernel_size=(3, 3),
                                             stride=(2, 2), padding=1, bias=False)
        self.down_sample_2 = DownsampleBlock(dim_in=128, dim_out=512, kernel_size=(3, 3),
                                             stride=(2, 2), padding=1, bias=False)
        self.down_sample_3 = DownsampleBlock(dim_in=256, dim_out=1024, kernel_size=(3, 3),
                                             stride=(2, 2), padding=1, bias=False)
        self.down_sample_4 = DownsampleBlock(dim_in=512, dim_out=1024, kernel_size=(1, 5),
                                             stride=(1, 1), padding=(0, 2), bias=False)

        # Fully connected layer.
        self.fully_connected = nn.Linear(in_features=512, out_features=1)

        # Projection of the concatenated speaker codes.
        self.projection = nn.Linear(self.num_speakers * 2, 512)

    def forward(self, x, c, c_):
        c_onehot = torch.cat((c, c_), dim=1).to(self.device)

        x = self.conv_layer_1(x) * torch.sigmoid(self.conv_gated_1(x))
        x = self.down_sample_1(x)
        x = self.down_sample_2(x)
        x = self.down_sample_3(x)
        x_ = self.down_sample_4(x)

        h = torch.sum(x_, dim=(2, 3))                # global sum pooling: B*C*H*W -> B*C
        x = self.fully_connected(h)                  # B*1
        p = self.projection(c_onehot)                # B*C
        x += torch.sum(p * h, dim=1, keepdim=True)   # projection term added to the FC output
        return x
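A matching shape-check sketch (the input shape and speaker indices are again illustrative assumptions):

import torch
import torch.nn.functional as F

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
D = Discriminator(num_speakers=4).to(device)

x = torch.randn(2, 1, 36, 256).to(device)                    # assumed MCEP input shape
c  = F.one_hot(torch.tensor([0, 1]), 4).float().to(device)   # source speaker codes
c_ = F.one_hot(torch.tensor([2, 3]), 4).float().to(device)   # target speaker codes

score = D(x, c, c_)
print(score.shape)  # (2, 1): one realness score per sample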
5. Loss Functions
(1) Classification loss
Given the original and the generated features, a classifier predicts the corresponding domain (speaker) label.
(2)Cycle-consistency loss
As in CycleGAN, we need to keep the linguistic content of the speech unchanged. So the converted speech is converted back to the source domain, and the reconstruction must be sufficiently close to the original speech.
(3)Identity-mapping loss
Given an utterance whose domain is c, if we ask the model to convert it to its own domain c, the output should be sufficiently close to the input itself.
(4)Adversarial loss
For the adversarial loss, the discriminator should output a probability close to 1 for real speech and close to 0 for converted (generated) speech. Note that here the discriminator is conditioned not only on the source-domain code but also on the target-domain code; see the label handling above.
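A minimal sketch of how these terms might be combined for the generator update (the L1 form of the cycle/identity terms, the loss weights, and the domain classifier C are assumptions for illustration, not the project's exact code):

import torch
import torch.nn.functional as F

def generator_loss(G, D, C, x_real, c_src, c_trg, trg_label,
                   lambda_cyc=10.0, lambda_id=5.0, lambda_cls=1.0):
    """Illustrative combination of the four loss terms; the weights are assumptions."""
    x_fake  = G(x_real, c_src, c_trg)   # source -> target conversion
    x_cycle = G(x_fake, c_trg, c_src)   # convert back: cycle consistency
    x_id    = G(x_real, c_src, c_src)   # map to its own domain: identity mapping

    d_fake = D(x_fake, c_src, c_trg)    # conditioned on both domain codes
    loss_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    loss_cyc = torch.mean(torch.abs(x_real - x_cycle))
    loss_id  = torch.mean(torch.abs(x_real - x_id))
    loss_cls = F.cross_entropy(C(x_fake), trg_label)  # hypothetical domain classifier C

    return loss_adv + lambda_cyc * loss_cyc + lambda_id * loss_id + lambda_cls * loss_cls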
6. Model Training and Testing
Training arguments: --dataset VCC2016
Conversion arguments: --mode convert --src_speaker "VCC2SM1" --trg_speaker "['VCC2SM1', 'VCC2SF1']" --test_iters 100000 --dataset VCC2018
Data and code: https://pan.baidu.com/s/1aNlghgo6mtD4iWqNgMOWOQ?pwd=s206
Extraction code: s206