
Using Large Models and Fine-Tuning Them for Downstream Tasks

This post is about using large models and fine-tuning them for downstream tasks: a small attempt made while standing on the shoulders of giants. Many mainstream large models are now open source, and this spirit of dedication to advancing human progress is admirable!

Using the Models and Fine-Tuning Them for Downstream Tasks

Task 1: The multimodal large model fuyu-8b

This model has strong image understanding: it can interpret photos, charts, PDFs, UI screenshots, and more, and it is fast; the research team reports that results for large images can be returned within 100 milliseconds. It is also lightweight: the model has fewer than ten billion parameters and does not use a separate image encoder.

1. Model structure

(Figure: fuyu-8b model architecture diagram)

The model uses only the decoder part of the Transformer; its input is the image patches together with the text token embeddings. The model has 8B parameters and needs one or two RTX 3090 GPUs for training.
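To make this concrete, here is a minimal toy sketch (my own illustration, not the actual fuyu-8b code) of the idea of skipping a separate image encoder: raw image patches are linearly projected into the same embedding space as the text tokens, and the decoder simply attends over the concatenated sequence. The class name, patch size, and hidden size below are illustrative assumptions.

import torch
from torch import nn

class ToyFuyuStyleInput(nn.Module):
    """Toy sketch: image patches are linearly projected and prepended to the text embeddings."""
    def __init__(self, patch_size=30, hidden_size=4096, vocab_size=32000):
        super().__init__()
        # Instead of an image encoder, raw patches go through a single linear projection
        self.patch_proj = nn.Linear(3 * patch_size * patch_size, hidden_size)
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        self.patch_size = patch_size

    def forward(self, image, input_ids):
        # image: (B, 3, H, W) -> non-overlapping patches -> (B, num_patches, 3*p*p)
        p = self.patch_size
        patches = image.unfold(2, p, p).unfold(3, p, p)         # (B, 3, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(3)  # (B, H/p, W/p, 3*p*p)
        patches = patches.flatten(1, 2)                         # (B, num_patches, 3*p*p)
        image_tokens = self.patch_proj(patches)                 # (B, num_patches, hidden)
        text_tokens = self.token_emb(input_ids)                 # (B, seq_len, hidden)
        # A decoder-only Transformer would then attend over [image tokens; text tokens]
        return torch.cat([image_tokens, text_tokens], dim=1)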

2. Loading the model

import torch
from transformers import FuyuProcessor, FuyuForCausalLM
from PIL import Image

# load model and processor
model_id = "F:/LLMs/模型2_多模态大模型/fuyu_8b"
processor = FuyuProcessor.from_pretrained(model_id)
# device_map controls how the model is placed across GPUs; torch_dtype=torch.bfloat16 makes the model lighter in memory.
model = FuyuForCausalLM.from_pretrained(model_id, device_map="sequential", torch_dtype=torch.bfloat16)

3. Model input

text_prompt = "Generate a coco-style caption.\n"  # text prompt
url = "F:/LLMs/模型2_多模态大模型/fuyu_8b/bus.png"  # image file (a local path, so it is opened directly rather than downloaded)
image = Image.open(url)
inputs = processor(text=text_prompt, images=image, return_tensors="pt").to("cuda:0")

4. Model output

generation_output = model.generate(**inputs, max_new_tokens=7)
generation_text = processor.batch_decode(generation_output[:, -7:], skip_special_tokens=True)
# max_new_tokens sets the length of the generated output; the default is 20 if it is not specified
print(generation_text)
assert generation_text == ['A blue bus parked on the side of a road.']  # check that the generated caption matches the expected one

5. Use cases

1) Judging AI-generated faces

Here the model is shown an AI-generated face and asked two kinds of questions: a yes/no question (is it generated or not) and a choice question (which one is it). The input, output, and answers are shown below.

Image

(Figure: AI-generated face image)

This is a face image generated by StyleGAN2.

Q&A
Q: Is this face generated?
A: Yes, the man's face is artificially created.  # a fairly reasonable answer

Q: Is this face real, or is it generated?
A: A man's face is smiling, with his eyes closed, and he is wearing a hat. However, the image appears to be photoshopped, as there is no visible facial hair, and his eyes are closed.
# The model describes the face and uses the phrase "appears to be photoshopped", reasoning that there is no facial hair and that the eyes are closed (they are actually open).
Summary

For judging generated faces: when asked the yes/no question (is it generated or not), the model did give an answer, but it might well also answer "generated" when shown a real face. When asked the choice question (which one is it), the model first described the whole image and then gave a rather vague answer, and part of the description deviated from reality. For this task, therefore, the model would need to be fine-tuned on a suitable dataset.

2) Question answering and description for financial charts

Using the model to describe and answer questions about a financial chart:

Image

(Figure: Chinese-language chart of China's GDP statistics)

Q&A
Q: Explain the information in this chart in detail.
A: The graph displays how China's GDP has fluctuated over 20 years. In 1995, China's GDP was 100 billion U.S. dollars. By 2012, China's GDP had grown to 207.1 billion U.S. dollars. In 2017, China's GDP was 207.0 billion U.S. dollars.
# Some of the time points are correct, but it is unclear where the figures come from; they may stem from the pretraining data.

Q: In which year was the growth rate the highest?
A: 1995.

Perhaps this chart is simply too complex for the model, so it was swapped for an English chart of male and female birth rates.

Image

(Figure: English-language chart of male and female birth rates)

Q&A:
Q: In which year was the male birth rate the highest?
A: In 2012, the highest life expectancy for men worldwide was 80.2 years.

Summary: for the complex Chinese chart, the model could only pick out the main time points and the overall topic; it could not extract the other information. For the English chart, the model did noticeably better and answered well.

3) Asking about and describing image content

Using the model to answer questions about the content of ordinary images:

Image 1

(Figure: photo of the Golden Gate Bridge)

Q&A 1
Q: Which bridge is shown in the picture?
A: The image features the Golden Gate Bridge, which is a prominent suspension bridge spanning the Golden Gate strait in San Francisco, California.
Image 2

(Figure: photo of a boy skateboarding in a skate park)

Q&A 2
Q: What is the boy doing?
A: The young man is skateboarding in a skate park, performing tricks and jumps.
Image 3

(Figure: photo of a red sports car)

Q&A 3
Q: What car is this?
A: The red sports car is a Maserati.
Summary

For ordinary questions about image content, the model performs very well.

Task 2: Fine-tuning a large model for news classification

Reference: "保姆级教程,用PyTorch和BERT进行文本分类" (a step-by-step tutorial on text classification with PyTorch and BERT), Zhihu (zhihu.com)

Many pretrained large models are already available and solve all kinds of problems well. In practice, however, adapting them to a specific application scenario requires fine-tuning.

1. Model structure

Google developed a powerful Transformer-based machine learning model for NLP applications that outperformed previous language models on a range of benchmark datasets. That model is BERT.

The BERT architecture consists of multiple Transformer encoders stacked on top of each other. Each Transformer encoder contains two sub-layers: a self-attention layer and a feed-forward layer.

(Figure: BERT architecture diagram)
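To make the two sub-layers concrete, here is a minimal sketch of a single encoder layer built from PyTorch's standard modules. It is my own illustration with BERT-base-like dimensions, not the Hugging Face BertLayer implementation.

import torch
from torch import nn

class ToyEncoderLayer(nn.Module):
    """One Transformer encoder layer: self-attention followed by a feed-forward network."""
    def __init__(self, hidden=768, heads=12, ffn=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Sub-layer 1: self-attention with a residual connection and layer norm
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.drop(attn_out))
        # Sub-layer 2: position-wise feed-forward network, again with residual + norm
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x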

2. Loading the model

from transformers import BertModel, BertTokenizer

BERT_PATH = 'F:/LLMs/bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(BERT_PATH)
print(tokenizer.tokenize('I have a good time, thank you.'))
bert = BertModel.from_pretrained(BERT_PATH)
print('load bert model over')

Output:

['I', 'have', 'a', 'good', 'time', ',', 'thank', 'you', '.']
load bert model over
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('F:/LLMs/bert-base-cased')
example_text = 'I will watch Memento tonight'
bert_input = tokenizer(example_text,
                       padding='max_length',
                       max_length=10,
                       truncation=True,
                       return_tensors="pt")
print(bert_input['input_ids'])
print(bert_input['token_type_ids'])
print(bert_input['attention_mask'])

Output:

tensor([[  101,   146,  1209,  2824,  2508, 26173,  3568,   102,     0,     0]])
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])

3. Fine-tuning

Import libraries

import torch
import numpy as np
import pandas as pd
from torch import nn
from torch.optim import Adam   # used by the training loop below
from tqdm import tqdm          # progress bar used by the training loop below
from transformers import BertTokenizer, BertModel

Load the tokenizer and define the labels

tokenizer = BertTokenizer.from_pretrained('F:/LLMs/bert-base-cased')
labels = {'business': 0,
          'entertainment': 1,
          'sport': 2,
          'tech': 3,
          'politics': 4
          }

Build the dataset

class Dataset(torch.utils.data.Dataset):
    def __init__(self, df):
        self.labels = [labels[label] for label in df['category']]
        self.texts = [tokenizer(text,
                                padding='max_length',
                                max_length=512,
                                truncation=True,
                                return_tensors="pt")
                      for text in df['text']]

    def classes(self):
        return self.labels

    def __len__(self):
        return len(self.labels)

    def get_batch_labels(self, idx):
        # Fetch a batch of labels
        return np.array(self.labels[idx])

    def get_batch_texts(self, idx):
        # Fetch a batch of inputs
        return self.texts[idx]

    def __getitem__(self, idx):
        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)
        return batch_texts, batch_y

Modifying the model: adding a classification head on top of BERT

class BertClassifier(nn.Module):
    def __init__(self, dropout=0.5):
        super(BertClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('F:/LLMs/bert-base-cased')
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, 5)
        self.relu = nn.ReLU()

    def forward(self, input_id, mask):
        _, pooled_output = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        final_layer = self.relu(linear_output)
        return final_layer

Training

def train(model, train_data, val_data, learning_rate, epochs):
    # Build the training and validation sets with the Dataset class
    train, val = Dataset(train_data), Dataset(val_data)
    # DataLoader fetches batches of batch_size; shuffle the samples during training
    train_dataloader = torch.utils.data.DataLoader(train, batch_size=2, shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(val, batch_size=2)
    # Check whether a GPU is available
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    # Define the loss function and the optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=learning_rate)

    if use_cuda:
        model = model.cuda()
        criterion = criterion.cuda()
    # Training loop
    for epoch_num in range(epochs):
        # Accumulators for the training accuracy and loss
        total_acc_train = 0
        total_loss_train = 0
        # tqdm shows a progress bar
        for train_input, train_label in tqdm(train_dataloader):
            train_label = train_label.to(device)
            mask = train_input['attention_mask'].to(device)
            input_id = train_input['input_ids'].squeeze(1).to(device)
            # Forward pass
            output = model(input_id, mask)
            # Compute the loss
            batch_loss = criterion(output, train_label)
            total_loss_train += batch_loss.item()
            # Compute the accuracy
            acc = (output.argmax(dim=1) == train_label).sum().item()
            total_acc_train += acc
            # Update the model
            model.zero_grad()
            batch_loss.backward()
            optimizer.step()
        # ------ Validation -----------
        # Accumulators for the validation accuracy and loss
        total_acc_val = 0
        total_loss_val = 0
        # No gradients needed
        with torch.no_grad():
            # Loop over the validation set with the model trained so far
            for val_input, val_label in val_dataloader:
                # Move data to the GPU if available; the steps mirror training
                val_label = val_label.to(device)
                mask = val_input['attention_mask'].to(device)
                input_id = val_input['input_ids'].squeeze(1).to(device)

                output = model(input_id, mask)

                batch_loss = criterion(output, val_label)
                total_loss_val += batch_loss.item()

                acc = (output.argmax(dim=1) == val_label).sum().item()
                total_acc_val += acc

        print(
            f'''Epochs: {epoch_num + 1}
            | Train Loss: {total_loss_train / len(train_data): .3f}
            | Train Accuracy: {total_acc_train / len(train_data): .3f}
            | Val Loss: {total_loss_val / len(val_data): .3f}
            | Val Accuracy: {total_acc_val / len(val_data): .3f}''')
    torch.save(model.state_dict(), './weight.pth')

Main program

if __name__ == '__main__':
    # Split the dataset
    bbc_text_df = pd.read_csv('F:/LLMs/dataset/bbc-text.csv')
    df = pd.DataFrame(bbc_text_df)
    np.random.seed(112)
    df_train, df_val, df_test = np.split(df.sample(frac=1, random_state=42),
                                         [int(.8 * len(df)), int(.9 * len(df))])
    print(len(df_train), len(df_val), len(df_test))
    # Train
    model = BertClassifier()
    train(model, df_train, df_val, 1e-6, 5)

4. Test results

Here the code is tested on news articles I found online:

import torch
import numpy as np
import pandas as pd
from torch import nn
from transformers import BertTokenizer, BertModel


class BertClassifier(nn.Module):
    def __init__(self, dropout=0.5):
        super(BertClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('F:/LLMs/bert-base-cased')
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, 5)
        self.relu = nn.ReLU()

    def forward(self, input_id, mask):
        _, pooled_output = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        final_layer = self.relu(linear_output)
        return final_layer


def evaluate(model, test_data):
    model = BertClassifier()
    tokenizer = BertTokenizer.from_pretrained('F:/LLMs/bert-base-cased')
    text = '新闻的输入'  # replace with the news text to classify
    test_input = tokenizer(text,
                           padding='max_length',
                           max_length=512,
                           truncation=True,
                           return_tensors="pt")

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    if use_cuda:
        model = model.cuda()
    model.load_state_dict(torch.load('./weight.pth'), strict=False)  # weights saved by train()
    model.eval()

    with torch.no_grad():
        mask = test_input['attention_mask'].to(device)
        input_id = test_input['input_ids'].squeeze(1).to(device)
        output = model(input_id, mask)
        print(output)  # the predicted class index is output.argmax(dim=1)

Input:

On April 4, General Secretary Xi Jinping stressed during his inspection tour of Tsinghua University that he should adhere to the goal of building a world-class university with Chinese characteristics and contribute to serving the prosperity of the country and the rejuvenation of the nation and the happiness of the people. On December 19, General Secretary Xi Jinping presided over the 12rd meeting of the Central Committee for Comprehensively Deepening Reform to deliberate and adopt the “Several Opinions on Deepening the Construction of World-class Universities and First-class Disciplines”. (Xi Jinping's visit to Tsinghua University and the adoption of the “Several Opinions on Deepening the Construction of World-class Universities and First-class Disciplines”)

Output:

0 (business): this prediction is wrong; it should be education.

Input:

On December 12, Beijing time, the Portland Trail Blazers hosted the Phoenix Suns, and the two teams fought fiercely for four quarters in this game, and finally the Suns lost to the Trail Blazers 20-104.Kevin Durant scored 40 points, plus five assists and four rebounds, Booker had 5 points, seven assists and three rebounds, Allen had 4 points and nine rebounds, and Nurkic had nine points, 26 rebounds, three assists, two steals and two blocks.Simmons had 23 points, 7 assists and 3 rebounds, Grant had 22 points, 4 assists and 2 blocks, Ayton had 16 points, 15 rebounds and 3 assists, and Brogdon had 14 points, 4 rebounds and 4 assists.After the opening of the game, Durant made consecutive shots to help the Suns rebound, and they established a 16-point advantage in the first quarter.However, as the game progressed, Simmons began to exert his power in the second half, leading the team to a 38-20 attack wave in a single quarter, and achieved a comeback in one fell swoop. The two teams battled fiercely for 12 minutes in the final quarter, but the Suns still failed to complete the comeback, and finally lost to the Trail Blazers 104-109. (a recent NBA game)

Output:

2(Sport)

Input:

Wall Street noted that among the top ten U.S. bond holding countries and regions announced by TIC, including Chinese mainland, a total of four reductions were reduced in October, Belgium, which ranked seventh, reduced its holdings by $10.316 billion, Luxembourg, which held fourth, reduced its holdings by $282.44 billion, Switzerland, which held ninth place, reduced its holdings by $200.100 billion, and among the six countries and regions that increased their holdings, the United Kingdom and Japan both increased their holdings by more than $<> billion, and the others increased their holdings by less than $<> billion. (Wall Street)

Output:

0(business)

Summary: overall, the model reaches an accuracy of 0.991 on the test set of the provided dataset; on the news articles I collected myself, it got 2 out of 3 right.

(Figure: screenshot of the test results)

Task 3: Fine-tuning a large model for movie review classification

Reference: "Huggingface 超详细介绍" (a detailed introduction to Hugging Face), Zhihu (zhihu.com)

1. Code changes relative to Task 2

This task is similar to Task 2; the main difference is the dataset. It uses aclImdb (the Stanford movie review dataset, available at http://ai.stanford.edu/~amaas/data/sentiment/; after downloading and unpacking it you will see two folders, test and train). For this dataset, the data-loading part of the code needs to be modified:

import os
import re
from random import sample

from torch.utils.data import Dataset

# data_base_path should point to the root of the unpacked aclImdb folder
data_base_path = 'F:/LLMs/dataset/aclImdb'


class ImdbDataset(Dataset):
    def __init__(self, mode, testNumber=10000, validNumber=5000):
        # The dataset can be loaded in three modes: "train" returns the full 50,000 examples by default,
        # "test" returns 10,000 randomly sampled examples by default,
        # and "valid" returns a random sample of the corresponding size
        super(ImdbDataset, self).__init__()

        # Collect all the data folders
        text_path = [os.path.join(data_base_path, i) for i in ["test/neg", "test/pos"]]
        text_path.extend([os.path.join(data_base_path, i) for i in ["train/neg", "train/pos"]])

        if mode == "train":
            self.total_file_path_list = []
            # Use the full data; 50,000 examples is not that large, so no sampling is done here
            for i in text_path:
                self.total_file_path_list.extend([os.path.join(i, j) for j in os.listdir(i)])
        if mode == "test":
            self.total_file_path_list = []
            # Test set: 10,000 examples by default
            for i in text_path:
                self.total_file_path_list.extend([os.path.join(i, j) for j in os.listdir(i)])
            self.total_file_path_list = sample(self.total_file_path_list, testNumber)

        if mode == "valid":
            self.total_file_path_list = []
            # Validation set: 5,000 examples by default
            for i in text_path:
                self.total_file_path_list.extend([os.path.join(i, j) for j in os.listdir(i)])
            self.total_file_path_list = sample(self.total_file_path_list, validNumber)

    def tokenize(self, text):
        # Which characters to filter out depends on the quality of your text
        # This filter removes meaningless characters: punctuation, HTML fragments, and so on
        fileters = ['!', '"', '#', '$', '%', '&', '\(', '\)', '\*', '\+', ',', '-', '\.', '/', ':', ';', '<', '=', '>',
                    '\?', '@'
                    , '\[', '\\', '\]', '^', '_', '`', '\{', '\|', '\}', '~', '\t', '\n', '\x97', '\x96', '”', '“', ]
        # re.sub performs the replacement
        text = re.sub("<.*?>", " ", text, flags=re.S)  # drop everything inside <...>, mainly tags such as <br/>
        text = re.sub("|".join(fileters), " ", text, flags=re.S)  # replace the special characters; '|' joins all the patterns to match
        return text

    def __getitem__(self, idx):
        cur_path = self.total_file_path_list[idx]
        # os.path.basename returns the last component of the path (the second element of os.path.split(path));
        # it is empty if the path ends with / or \
        # cur_filename is a name such as "0_3.txt"
        cur_filename = os.path.basename(cur_path)
        # Filenames have the form 3_4.txt: the 3 before the underscore is the index, the 4 after it is the rating
        # Reviews rated 5 or lower are negative and get label 0; otherwise the label is 1
        labels = []
        sentences = []
        if int(cur_filename.split("_")[-1].split(".")[0]) <= 5:
            label = 0
        else:
            label = 1
        labels.append(label)
        text = self.tokenize(open(cur_path, encoding='UTF-8').read().strip())  # clean up odd characters in the text
        sentences.append(text)
        # Two lists are returned: the first holds the sentence, the second holds the label (0 or 1)
        return sentences, labels

    def __len__(self):
        return len(self.total_file_path_list)

Also, since this is a binary classification task, the 5-class setup from Task 2 needs to be changed to 2 classes, as sketched below.
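A minimal sketch of that change, reusing the Task 2 classifier structure (the class name BertBinaryClassifier is my own; the post does not show the modified class itself): the output layer shrinks from 5 units to 2, with label 0 for negative and 1 for positive reviews.

from torch import nn
from transformers import BertModel

class BertBinaryClassifier(nn.Module):
    """Same structure as the Task 2 BertClassifier, but with a 2-way output layer."""
    def __init__(self, dropout=0.5):
        super().__init__()
        self.bert = BertModel.from_pretrained('F:/LLMs/bert-base-cased')
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, 2)   # 2 classes: 0 = negative, 1 = positive
        self.relu = nn.ReLU()

    def forward(self, input_id, mask):
        _, pooled_output = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)
        return self.relu(self.linear(self.dropout(pooled_output)))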

2. Test results

The final accuracy on the 10,000-example test set is 0.998:

(Figure: screenshot of the test accuracy)

Here are some reviews I typed in myself:

Input:

A recent TV series, 《三大队》 (the original Chinese review is shown first, followed by the English version that was fed to the model):

秦昊这几年真的是开了挂了,迷雾出品的这品质也是一年比一年顶,这次这部《三大队》,不论从演员配置还是剧情水准上都是一如既往的稳,尤其这个“神仙打架”的演员阵容,任谁看了不得大呼一句“卧槽”!

Qin Hao has really been hanging up in the past few years, and the quality of the mist production is also getting better and better year by year, this time this “Three Teams”, both in terms of actor configuration and plot level, is as stable as ever, especially the cast of this “fairy fight”, no one can shout “oh my god”!

Output:

1 (positive)

Input:

A recent film, Napoleon (the original Chinese review is shown first, followed by the English version that was fed to the model):

法国人争争气呢?别的也就算了,把拿破仑交给英国人拍,几乎每个人物都很无趣——每个历史关键节点的人物动机、决策因素与不可抗拒的历史进程作用在个体时的张力完全都没有体现。

What about the French fighting? Forget anything else, handing over Napoleon to the British, almost every character is boring - the tension between the motivations, decision-making factors, and irresistible historical processes at each key point in history is completely unrepresented.

Output:

0 (negative)

Summary:

For training, the training set had 50,000 examples (with 5,000 for validation) and the test set 10,000. The relatively large amount of data allowed the model to handle the task well.

------------- End of post. Thank you for reading. -------------