一、引言
Gemma 是 Google 推出的轻量级、先进的开放模型系列,采用与 Gemini 模型相同的研究成果和技术构建而成。它们是仅使用解码器的文本到文本大型语言模型(提供英语版本),为预训练变体和指令调整变体具有开放权重。Gemma 模型非常适合各种文本生成任务,包括问题解答、摘要和推理。由于它们相对较小,因此可以将其部署在资源有限的环境(如笔记本电脑、桌面设备或您自己的云基础架构)中,让更多人能够使用先进的 AI 模型,并帮助促进每个人的创新。
Gemma2与他的上一代Gemma以及Qwen2等均采用decoder-only网络结构,主要参数情况如下:
与Gemma相同点:
与Gemma不同点:
分组查询注意力 (Grouped Query Attention) 是一种在大型语言模型中的多查询注意力 (MQA) 和多头注意力 (MHA) 之间进行插值的方法,它的目标是在保持 MQA 速度的同时实现 MHA 的质量
效果对比:
Gemma2 9B模型在多个维度超过近尺寸的Llama3 8B,27B尺寸模型在多个评价标准下超过314B的Grok-1:
通过AutoModelForCausalLM模型头查看模型结构:
Gemma2ForCausalLM(
(model): Gemma2Model(
(embed_tokens): Embedding(256000, 4608, padding_idx=0)
(layers): ModuleList(
(0-45): 46 x Gemma2DecoderLayer(
(self_attn): Gemma2SdpaAttention(
(q_proj): Linear(in_features=4608, out_features=4096, bias=False)
(k_proj): Linear(in_features=4608, out_features=2048, bias=False)
(v_proj): Linear(in_features=4608, out_features=2048, bias=False)
(o_proj): Linear(in_features=4096, out_features=4608, bias=False)
(rotary_emb): Gemma2RotaryEmbedding()
)
(mlp): Gemma2MLP(
(gate_proj): Linear(in_features=4608, out_features=36864, bias=False)
(up_proj): Linear(in_features=4608, out_features=36864, bias=False)
(down_proj): Linear(in_features=36864, out_features=4608, bias=False)
(act_fn): PytorchGELUTanh()
)
(input_layernorm): Gemma2RMSNorm()
(post_attention_layernorm): Gemma2RMSNorm()
(pre_feedforward_layernorm): Gemma2RMSNorm()
(post_feedforward_layernorm): Gemma2RMSNorm()
)
)
(norm): Gemma2RMSNorm()
)
(lm_head): Linear(in_features=4608, out_features=256000, bias=False)
)
在之前的文章中,我介绍过采用LlamaFactory的webui以及命令行进行模型训练,今天基于transformers库原生微调Gemma2。
我们仍然秉承一贯的作风,为网络不稳定的同学提供了modelscope下载方案:
from modelscope import snapshot_download
model_dir = snapshot_download('LLM-Research/gemma-2-27b-it')
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM,BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_4bit=True, # 或者 load_in_8bit=True,根据需要设置
llm_int8_enable_fp32_cpu_offload=True,
bnb_4bit_compute_dtype=torch.bfloat16,#虽然我们以4位加载和存储模型,但我们在需要时会部分反量化他,并以16位精度进行计算
bnb_4bit_quant_type="nf4",#nf量化类型
bnb_4bit_use_double_quant=True,#双重量化,量化一次后再量化,进一步解决显存
)
tokenizer = AutoTokenizer.from_pretrained(model_dir,trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir,trust_remote_code=True, device_map=device,torch_dtype=torch.bfloat16,quantization_config=quantization_config,attn_implementation='eager')
model.gradient_checkpointing_enable
from peft import LoraConfig,get_peft_model,prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
config = LoraConfig(
r=32,
lora_alpha=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj","down_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
from datasets import load_dataset,load_from_disk
data = load_dataset('json',data_files="./quotes.jsonl")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
print(data)
trainer = transformers.Trainer(
model=model,
train_dataset=data["train"],
args=transformers.TrainingArguments(
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
warmup_steps=10,
max_steps=50,
learning_rate=3e-4,
fp16=True,
logging_steps=1,
output_dir="outputs/checkpoint-1"+time_str,
optim="paged_adamw_8bit",
save_strategy = 'steps',
save_steps = 10,
),
data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
trainer.train()
trainer.save_model(trainer.args.output_dir)
注意:
from datetime import datetime
now = datetime.now()
time_str = now.strftime('%Y-%m-%d %H:%M:%S')
print(time_str)
#0,download model
from modelscope import snapshot_download
model_dir = snapshot_download('LLM-Research/gemma-2-27b-it')
#model_dir = snapshot_download('qwen/Qwen2-7B-Instruct')
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM,BitsAndBytesConfig
device = "auto"
quantization_config = BitsAndBytesConfig(
load_in_4bit=True, # 或者 load_in_8bit=True,根据需要设置
llm_int8_enable_fp32_cpu_offload=True,
bnb_4bit_compute_dtype=torch.bfloat16,#虽然我们以4位加载和存储模型,但我们在需要时会部分反量化他,并以16位精度进行计算
bnb_4bit_quant_type="nf4",#nf量化类型
bnb_4bit_use_double_quant=True,#双重量化,量化一次后再量化,进一步解决显存
)
tokenizer = AutoTokenizer.from_pretrained(model_dir,trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir,trust_remote_code=True, device_map=device,torch_dtype=torch.bfloat16,quantization_config=quantization_config,attn_implementation='eager')
model.gradient_checkpointing_enable
from peft import LoraConfig,get_peft_model,prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
config = LoraConfig(
r=32,
lora_alpha=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj","down_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
from datasets import load_dataset,load_from_disk
data = load_dataset('json',data_files="./quotes.jsonl")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
print(data)
trainer = transformers.Trainer(
model=model,
train_dataset=data["train"],
args=transformers.TrainingArguments(
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
warmup_steps=10,
max_steps=50,
learning_rate=3e-4,
fp16=True,
logging_steps=1,
output_dir="outputs/checkpoint-1"+time_str,
optim="paged_adamw_8bit",
save_strategy = 'steps',
save_steps = 10,
),
data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
trainer.train()
trainer.save_model(trainer.args.output_dir)
采用CUDA_VISIBLE_DEVICES=1,2,3 python gemma2_train.py 启动
3张显卡启动:针对27B尺寸模型进行int4位微调,占用显存约28.9G。如果bf16微调,大约需要54G。相比于LLama3、Qwen2等72B尺寸模型的优势就是仅消耗单卡A100即可bf16微调训练。
这里比较重要的是peft中的PeftModel和PeftConfig,PeftModel用于合并基座与微调模型,PeftConfig用于提取Peft微调模型的配置文件
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
peft_model_dir = trainer.args.output_dir
config = PeftConfig.from_pretrained(peft_model_dir)
print(config)
model = AutoModelForCausalLM.from_pretrained(
config.base_model_name_or_path, return_dict=True, device_map=device,
torch_dtype=torch.float16, quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_dir)
chat=[
{"role": "user", "content": "详细介绍一下大语言模型,评价下与深度学习的差异"},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True,return_tensors="pt").to(model.device)
outputs = model.generate(prompt,max_length=2500)
outputs = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(prompt, outputs)
]
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
基座模型和微调模型合并后,大约需要40G??
from datetime import datetime
now = datetime.now()
time_str = now.strftime('%Y-%m-%d %H:%M:%S')
print(time_str)
#0,download model
from modelscope import snapshot_download
model_dir = snapshot_download('LLM-Research/gemma-2-27b-it')
#model_dir = snapshot_download('qwen/Qwen2-7B-Instruct')
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM,BitsAndBytesConfig
device = "auto"
quantization_config = BitsAndBytesConfig(
load_in_4bit=True, # 或者 load_in_8bit=True,根据需要设置
llm_int8_enable_fp32_cpu_offload=True,
bnb_4bit_compute_dtype=torch.bfloat16,#虽然我们以4位加载和存储模型,但我们在需要时会部分反量化他,并以16位精度进行计算
bnb_4bit_quant_type="nf4",#nf量化类型
bnb_4bit_use_double_quant=True,#双重量化,量化一次后再量化,进一步解决显存
)
tokenizer = AutoTokenizer.from_pretrained(model_dir,trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir,trust_remote_code=True, device_map=device,torch_dtype=torch.bfloat16,quantization_config=quantization_config,attn_implementation='eager')
model.gradient_checkpointing_enable
from peft import LoraConfig,get_peft_model,prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
config = LoraConfig(
r=32,
lora_alpha=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj","down_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
from datasets import load_dataset,load_from_disk
data = load_dataset('json',data_files="./quotes.jsonl")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
print(data)
trainer = transformers.Trainer(
model=model,
train_dataset=data["train"],
args=transformers.TrainingArguments(
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
warmup_steps=10,
max_steps=50,
learning_rate=3e-4,
fp16=True,
logging_steps=1,
output_dir="outputs/checkpoint-1"+time_str,
optim="paged_adamw_8bit",
save_strategy = 'steps',
save_steps = 10,
),
data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
#trainer.train()
trainer.save_model(trainer.args.output_dir)
# merge model and inference
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
#peft_model_dir = trainer.args.output_dir
peft_model_dir = "/aigc_dev/gemma2/outputs/checkpoint-12024-07-04 21:57:45"
config = PeftConfig.from_pretrained(peft_model_dir)
print(config)
model = AutoModelForCausalLM.from_pretrained(
config.base_model_name_or_path, return_dict=True, device_map=device,
torch_dtype=torch.bfloat16, quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_dir)
chat=[
{"role": "user", "content": "详细介绍一下大语言模型,评价下与深度学习的差异"},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True,return_tensors="pt").to(model.device)
outputs = model.generate(prompt,max_length=2500)
outputs = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(prompt, outputs)
]
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
在模型结构上,Gemma2与Qwen2非常相似,除了decoder-only、RoPE、分组查询注意力机制等技术相同,线性层(Lora的目标层)均为
["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj","down_proj"]
中文对话效果上经过多个样例测试个人感觉不如国产的Qwen2、GLM4、DeepSeek等。
GOOGLE作为互联网技术老大哥,在大模型的角逐中,并没有那么强势。可叹啊!