Hugging Face Chinese Fill-in-the-Blank (Study Notes)


Dataset Introduction

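This chapter again uses the sentiment classification dataset. Each record contains one shopping review and a label indicating whether the review is positive.
The only difference is that one token in each sentence is replaced with a mask, and the model is trained to fill in the blank.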

Implementation

Installing the Environment

%pip install -q transformers datasets torchtext

Preparing the Dataset

Using the Tokenizer

First, load the tokenizer.

from transformers import BertTokenizer
token = BertTokenizer.from_pretrained('bert-base-chinese')
token

Run a trial encoding.

out = token.batch_encode_plus(
    batch_text_or_text_pairs=['轻轻的我走了，正如我轻轻地来。', '我轻轻的招手，作别西天的云彩。'],
    truncation=True,
    padding='max_length',
    max_length=18,
    return_tensors='pt',
    return_length=True)
# Inspect the encoded outputs
for k, v in out.items():
    print(k, v.shape)
# Decode the ids back into a sentence
print(token.decode(out['input_ids'][0]))

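Defining the Dataset

This task again uses the ChnSentiCorp dataset, but the data needs some processing to turn it into a fill-in-the-blank dataset.
Before any processing, load the dataset: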

from datasets import load_dataset
dataset = load_dataset('lansinuote/ChnSentiCorp')
dataset

Next, encode the text so it is ready for the later steps:

def f(data):
    return token.batch_encode_plus(batch_text_or_text_pairs=data['text'],
                                   truncation=True,
                                   padding='max_length',
                                   max_length=30,
                                   return_length=True)
# Drop the 'text' and 'label' fields
dataset = dataset.map(function=f,
                      batched=True,
                      batch_size=1000,
                      num_proc=4,
                      remove_columns=['text', 'label'])
dataset

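truncation=True together with max_length=30 means the encoded result is never longer than 30 tokens; anything beyond 30 tokens is cut off.
padding='max_length' means shorter results are filled with [PAD] up to 30 tokens.
return_length=True adds a length field to the encoded result, giving the length of that record. Because [PAD] tokens are not counted, length is always less than or equal to 30. This field makes the later filtering step easy.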
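
To make the interaction of these three arguments concrete, here is a minimal pure-Python sketch. It is not the tokenizer's actual implementation: the helper name `truncate_and_pad` is hypothetical, and the ids 101, 102, and 0 for [CLS], [SEP], and [PAD] are assumed to match the bert-base-chinese vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token ids


def truncate_and_pad(token_ids, max_length=30):
    """Mimic truncation=True, padding='max_length', return_length=True.

    token_ids holds the ids of the text only, without special tokens.
    """
    # Truncation leaves room for [CLS] and [SEP].
    body = token_ids[:max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # The length is recorded before padding, so it never exceeds max_length.
    length = len(input_ids)
    n_pad = max_length - length
    attention_mask = [1] * length + [0] * n_pad
    input_ids = input_ids + [PAD_ID] * n_pad
    return {'input_ids': input_ids,
            'attention_mask': attention_mask,
            'length': length}


short_enc = truncate_and_pad(list(range(1000, 1010)))  # 10 text tokens
long_enc = truncate_and_pad(list(range(1000, 1050)))   # 50 text tokens
print(short_enc['length'], len(short_enc['input_ids']))  # 12 30
print(long_enc['length'], len(long_enc['input_ids']))    # 30 30
```

Both results are 30 ids long, but only `long_enc` has a length of 30; `short_enc` is mostly padding, which is exactly what the length field lets us detect.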

Next, we filter out the sentences shorter than 30 tokens.

def f(data):
    return [i >= 30 for i in data['length']]
dataset = dataset.filter(function=f, batched=True, batch_size=1000, num_proc=4)
dataset

The training set lost 314 records and the test set lost 47.

Defining the Compute Device

import torch
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
device

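Defining the Collate Function

We cut out the token at index 15 of every sentence, replace it with [MASK], and keep the original token aside as the correct answer.
The model then has to predict the original token at the [MASK] position.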

def collate_fn(data):
    # Collect the encoded fields
    input_ids = [i['input_ids'] for i in data]
    attention_mask = [i['attention_mask'] for i in data]
    token_type_ids = [i['token_type_ids'] for i in data]
    # Convert to tensors
    input_ids = torch.LongTensor(input_ids)
    attention_mask = torch.LongTensor(attention_mask)
    token_type_ids = torch.LongTensor(token_type_ids)
    # Save the token at index 15 as the label, then replace it with [MASK]
    labels = input_ids[:, 15].reshape(-1).clone()
    input_ids[:, 15] = token.get_vocab()[token.mask_token]
    # Move everything to the compute device
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)
    token_type_ids = token_type_ids.to(device)
    labels = labels.to(device)
    return input_ids, attention_mask, token_type_ids, labels

Trial run.

data = [{
    'input_ids': [
        101, 2769, 3221, 3791, 6427, 1159, 2110, 5442, 117, 2110, 749, 8409,
        702, 6440, 3198, 4638, 1159, 5277, 4408, 119, 1728, 711, 2769, 3221,
        5439, 2399, 782, 117, 3791, 102
    ],
    'token_type_ids': [0] * 30,
    'attention_mask': [1] * 30
}, {
    'input_ids': [
        101, 679, 7231, 8024, 2376, 3301, 1351, 6848, 4638, 8024, 3301, 1351,
        3683, 6772, 4007, 2692, 8024, 2218, 3221, 100, 2970, 1366, 2208, 749,
        8024, 5445, 684, 1059, 3221, 102
    ],
    'token_type_ids': [0] * 30,
    'attention_mask': [1] * 30
}]
# Trial run
input_ids, attention_mask, token_type_ids, labels = collate_fn(data)
# Decode the ids back into a sentence
print(token.decode(input_ids[0]))
print(token.decode(labels[0]))
input_ids.shape, attention_mask.shape, token_type_ids.shape, labels
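
The masking step can also be checked without the tokenizer or PyTorch. The sketch below works on plain lists; `mask_position` is a hypothetical helper, and 103 is assumed to be the id of [MASK] in the bert-base-chinese vocabulary. In the tensor version, `input_ids[:, 15]` is a view of `input_ids`, so the `.clone()` call is what keeps the labels from being overwritten by the mask id.

```python
MASK_ID = 103  # assumed id of [MASK] in the bert-base-chinese vocabulary


def mask_position(batch_ids, position=15, mask_id=MASK_ID):
    """Return (masked copies of the sequences, original ids at `position`)."""
    # Save the answers first; overwriting before copying would lose them.
    labels = [ids[position] for ids in batch_ids]
    masked = [ids[:position] + [mask_id] + ids[position + 1:]
              for ids in batch_ids]
    return masked, labels


batch = [list(range(200, 230)), list(range(300, 330))]  # toy ids, length 30
masked, answers = mask_position(batch)
print(answers)           # [215, 315]
print(masked[0][14:17])  # [214, 103, 216]
```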

Defining the Data Loader

loader = torch.utils.data.DataLoader(dataset=dataset['train'],
                                     batch_size=16,
                                     collate_fn=collate_fn,
                                     shuffle=True,
                                     drop_last=True)
len(loader)

for i, (input_ids, attention_mask, token_type_ids,
        labels) in enumerate(loader):
    break
print(token.decode(input_ids[0]))
print(token.decode(labels[0]))
input_ids.shape, attention_mask.shape, token_type_ids.shape, labels

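Defining the Model

Loading the Pretrained Model

# Load the pretrained model
from transformers import BertModel
pretrained = BertModel.from_pretrained('bert-base-chinese')
# Count the parameters, in units of 10,000
sum(i.numel() for i in pretrained.parameters()) / 10000

The model has roughly 100 million parameters.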
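
As a cross-check on that figure, the parameter count can be worked out by hand. The sketch below assumes the standard BERT-base configuration (12 layers, hidden size 768, feed-forward size 3072, 512 positions, 2 token types) together with the 21128-token vocabulary; these values are assumptions, since the text does not print the model config.

```python
# Assumed BERT-base configuration (not printed in the text above).
vocab_size = 21128
hidden = 768
num_layers = 12
intermediate = 3072
max_positions = 512
type_vocab = 2


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift

# Word, position, and token-type embeddings, followed by a LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
# Query, key, value, and output projections, followed by a LayerNorm.
attention = 4 * linear(hidden, hidden) + layer_norm
# Two dense layers of the feed-forward block, followed by a LayerNorm.
feed_forward = (linear(hidden, intermediate)
                + linear(intermediate, hidden) + layer_norm)
encoder = num_layers * (attention + feed_forward)
pooler = linear(hidden, hidden)

total = embeddings + encoder + pooler
print(total, total / 10000)  # 102267648 10226.7648
```

Under these assumptions the total is about 102 million, consistent with the rough figure of 100 million.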

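# Freeze the pretrained model; it needs no gradients
for param in pretrained.parameters():
    param.requires_grad_(False)

With the pretrained model frozen, we can run a trial computation.

# Trial run of the pretrained model
# Move the model to the compute device
pretrained.to(device)
# Forward pass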
out = pretrained(input_ids=input_ids,
                 attention_mask=attention_mask,
                 token_type_ids=token_type_ids)
out.last_hidden_state.shape

The output covers 16 sentences of 30 tokens each, and every token is encoded as a 768-dimensional vector.

Defining the Downstream Task Model

# Define the downstream task model
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.decoder = torch.nn.Linear(in_features=768,
                                       out_features=token.vocab_size,
                                       bias=False)
        # Re-initialize the decoder bias as all zeros
        self.bias = torch.nn.Parameter(data=torch.zeros(token.vocab_size))
        self.decoder.bias = self.bias
        # Dropout layer to reduce overfitting
        self.dropout = torch.nn.Dropout(p=0.5)
    def forward(self, input_ids, attention_mask, token_type_ids):
        # Extract features with the frozen pretrained model
        with torch.no_grad():
            out = pretrained(input_ids=input_ids,
                             attention_mask=attention_mask,
                             token_type_ids=token_type_ids)
        # Project the features of the token at index 15 onto the whole vocabulary
        out = self.dropout(out.last_hidden_state[:, 15])
        out = self.decoder(out)
        return out
model = Model()
# Move the model to the compute device
model.to(device)
# Trial run
model(input_ids=input_ids,
      attention_mask=attention_mask,
      token_type_ids=token_type_ids).shape

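This code defines the downstream task model. It contains a single fully connected linear layer whose weight matrix is 768x21128, so it maps a 768-dimensional vector into a 21128-dimensional space.
In other words, it can turn the features extracted by the backbone back into any token in the vocabulary. The downstream model computes as follows: given a batch of data, the backbone extracts a feature tensor of shape 16x30x768, as already seen in the trial run of the pretrained model. The three dimensions stand for 16 sentences, 30 tokens, and a 768-dimensional feature vector.
In this fill-in-the-blank task the blank always sits at index 15 of each sentence, so we take only the features of that token from each sentence and project them onto the whole vocabulary space, that is, map them back to some token in the vocabulary.
Because 768x21128 is a large matrix, projecting directly onto the vocabulary space easily leads to overfitting. The features extracted by the backbone therefore pass through a Dropout layer first, which sets each value to 0 with a given probability to reduce overfitting.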
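
As a rough illustration of what the Dropout layer does, here is a pure-Python sketch of inverted dropout, the scheme `torch.nn.Dropout` follows: during training each value is zeroed with probability p and the survivors are scaled by 1/(1-p), while in evaluation mode the input passes through unchanged. The function `dropout` and its `rng` argument are illustrative names, and the sketch assumes 0 <= p < 1.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout on a list of floats (assumes 0 <= p < 1)."""
    if not training or p == 0.0:
        return list(values)
    # Scale the kept values so the expected sum stays the same.
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


features = [1.0] * 8
print(dropout(features, p=0.5, rng=random.Random(0)))  # each entry is 0.0 or 2.0
print(dropout(features, training=False))               # unchanged in eval mode
```

This is also why `model.train()` and `model.eval()` matter: the same layer behaves differently in the two modes.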
The code ends with a trial run of the model; for a batch of 16 sentences the output has shape 16x21128.

The result holds the fill-in predictions for 16 sentences. Applying Softmax() to it gives the probability of every token in the vocabulary.
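
A small sketch of that last step, using a toy four-token vocabulary instead of the real 21128 tokens. The `softmax` helper and the logits are illustrative only; in practice `torch.softmax` would be applied along the vocabulary dimension.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1, -1.0]  # toy scores over a 4-token vocabulary
probs = softmax(logits)
# The predicted token is the index with the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(sum(probs), 6))  # 0 1.0
```

Softmax does not change the ranking of the scores, so taking argmax over the raw outputs, as the training and test code does, picks the same token.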

Training and Testing

The training code is as follows:

# transformers' own AdamW is deprecated in favor of torch.optim.AdamW
from torch.optim import AdamW
from transformers.optimization import get_scheduler
def train():
    # Define the optimizer
    optimizer = AdamW(model.parameters(), lr=5e-4, weight_decay=1.0)
    # Define the loss function
    criterion = torch.nn.CrossEntropyLoss()
    # Define the learning rate scheduler
    scheduler = get_scheduler(name='linear',
                              num_warmup_steps=0,
                              num_training_steps=len(loader) * 5,
                              optimizer=optimizer)
    # Switch the model to training mode
    model.train()
    # Train for 5 epochs
    for epoch in range(5):
        # Iterate over the training set batch by batch
        for i, (input_ids, attention_mask, token_type_ids,
                labels) in enumerate(loader):
            # Forward pass
            out = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
            # Compute the loss and update the parameters by gradient descent
            loss = criterion(out, labels)
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            # Print progress every 50 batches
            if i % 50 == 0:
                out = out.argmax(dim=1)
                accuracy = (out == labels).sum().item() / len(labels)
                lr = optimizer.state_dict()['param_groups'][0]['lr']
                print(epoch, i, loss.item(), lr, accuracy)
train()
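
The lr value printed during training falls steadily because of the 'linear' scheduler. The sketch below reproduces the shape of that schedule in plain Python: a linear warmup (unused here, since num_warmup_steps=0) followed by a linear decay to zero. The function `linear_lr` is a hypothetical stand-in, and `total_steps = 1000` is a made-up value; the training code uses len(loader) * 5.

```python
def linear_lr(step, base_lr=5e-4, num_training_steps=1000, num_warmup_steps=0):
    """Learning rate after `step` scheduler steps, linear warmup then decay."""
    if step < num_warmup_steps:
        # Warmup: grow linearly from 0 to base_lr.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay: shrink linearly from base_lr to 0, never going negative.
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


total_steps = 1000  # hypothetical; stands in for len(loader) * 5
for step in (0, 250, 500, 1000):
    print(step, linear_lr(step, num_training_steps=total_steps))
```

With no warmup, the rate starts at the full 5e-4, is halved at the midpoint of training, and reaches 0 at the final step.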

Testing

def test():
    # Define the data loader for the test set
    loader_test = torch.utils.data.DataLoader(dataset=dataset['test'],
                                              batch_size=32,
                                              collate_fn=collate_fn,
                                              shuffle=True,
                                              drop_last=True)
    # Switch the downstream model to evaluation mode
    model.eval()
    correct = 0
    total = 0
    # Iterate over the test set batch by batch
    for i, (input_ids, attention_mask, token_type_ids,
            labels) in enumerate(loader_test):
        # 15 batches are enough; no need to go through the whole set
        if i == 15:
            break
        print(i)
        # Forward pass
        with torch.no_grad():
            out = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        # Accumulate the counts for the accuracy
        out = out.argmax(dim=1)
        correct += (out == labels).sum().item()
        total += len(labels)
    print(correct / total)
test()

The accuracy measured on the test set is as follows:
