柚子的挖挖机

RAG单选题-大模型微调实验

2024-12-04

一、背景

最近参加一个竞赛：大模型根据规则内容做单选题。
比赛的时候利用纯工程方式实现（RAG单选题-技术方案），并没有考虑使用模型微调（GPU服务器资源紧张，性价比一般来说比工程方式低）。
现进行复盘，基于羊毛薅不秃就往秃里薅的基本原则，赛事官方给的标注数据利用起来，进行一些实验。为后续积累经验。
利用竞赛官方标注数据，进行模型微调，对比测试几种微调方式（lora sft、lora dpo、full fintune）的效果。

二、实验结论

进行了几组模型微调之后的对比测试实验，在此类选择题推理场景下，结论如下

prompt选择方面：sft类型prompt效果比较好（任务描述放system里）
模型微调方面：效果 lora sft > lora dpo（大多数情况）
训练数据量方面：本次标注数据量还是太少了（400），全参微调效果不好
对于尺寸小的模型（1.5B），baseline效果不好（0.72）的模型，通过lora微调往往能有比较大的收益（0.9）
基模选择方面：intern2.5 7b chat为同尺寸类型中指标最优
1. 第一梯队：intern2.5 7b chat ，qwen2 7b instruct
1. 第二梯队：Yi-1.5-6B-Chat，llama 3.1 8b instruct，chatglm3-6b

三、实验数据详情

序号	模型		accuracy（dev 100）	备注
1	Qwen/Qwen2-7B-Instruct		0.92(sft prompt 0.93)	baseline
2	Qwen/Qwen2-7B-Instruct-lora-dpo	400，单卡v100，1h50min	0.92 (sft prompt 0.93)
3	Qwen/Qwen2-7B-Instruct-lora-sft	400，单卡v100，1h30min	0.91 (sft prompt 0.9)
4	Qwen/Qwen2-7B-Instruct-full-sft	400，双卡H20，7min	0.89 (sft prompt 0.89)
5	Qwen/Qwen2-1.5B-Instruct		0.72 (sft prompt 0.72)	baseline
6	Qwen/Qwen2-1.5B-Instruct-lora-dpo	400，单卡v100，30min	0.81 (sft prompt 0.85)
7	Qwen/Qwen2-1.5B-Instruct-lora-sft	400，单卡v100，30min	0.9 (sft prompt 0.89)
8	Qwen/Qwen2-1.5B-Instruct-full-sft	400，双卡v100，30min	0.88 (sft prompt 0.9)
9	Qwen/Qwen2-1.5B-Instruct-full-dpo	400，单卡H20，10min
10	internlm/internlm2_5-7b-chat		0.92 (sft prompt 0.94)	baseline
11	internlm/internlm2_5-7b-chat-lora-dpo	400，单卡H20，10min	0.92 (sft prompt 0.93)
12	internlm/internlm2_5-7b-chat-lora-sft	400，单卡H20，10min	0.95 (sft prompt 0.96)
13	internlm/internlm2_5-7b-chat-full-sft	400，双卡H20，7min	0.92 (sft prompt 0.92)
14	Qwen/Qwen2.5-32B-Instruct		0.93
15	Qwen/Qwen2.5-72B-Instruct		0.95
16	01-ai/Yi-1.5-6B-Chat		0.85
17	meta-llama/Meta-Llama-3.1-8B-Instruct		0.77
18	THUDM/chatglm3-6b		0.76

Post Views: 293

发表回复取消回复