Fine-Tuning the Qwen2 Large Model: A Hands-On Introduction (Complete Code Included, Very Detailed), from Zero Background to Mastery in One Post

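This tutorial walks through fine-tuning the Qwen2 large language model from scratch: installing the environment, preparing the dataset, loading the model, and configuring training. It trains on the Fudan Chinese news dataset and includes the complete code. SwanLab is integrated to monitor the run, so the training details can be inspected directly. The code is organized into data preprocessing, model training, result display, and inference with the trained model.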

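Environment Setup

This example requires Python >= 3.8 and the following libraries:

pip install swanlab==0.3.9 modelscope==1.14.0 transformers==4.41.2 datasets==2.18.0 peft==0.11.1 accelerate==0.30.1 pandas==1.4.2

PyTorch must also be installed, since the code below uses torch.bfloat16.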

Tested configuration:

modelscope: 1.14.0
transformers: 4.41.2
datasets: 2.18.0
peft: 0.11.1
accelerate: 0.30.1
swanlab: 0.3.9
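
Before moving on, it can help to confirm that the installed packages match the tested configuration. The following is a minimal sketch that uses only the standard library; the helper name check_versions and the EXPECTED table are illustrative and not part of the original tutorial.

```python
from importlib import metadata

# Versions copied from the tested configuration listed above.
EXPECTED = {
    "swanlab": "0.3.9",
    "modelscope": "1.14.0",
    "transformers": "4.41.2",
    "datasets": "2.18.0",
    "peft": "0.11.1",
    "accelerate": "0.30.1",
}


def check_versions(expected):
    """Return {package: (installed_or_None, wanted)} for every package that differs."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (installed, wanted)
    return mismatches


if __name__ == "__main__":
    for name, (installed, wanted) in check_versions(EXPECTED).items():
        print(f"{name}: installed={installed}, tested with {wanted}")
```

A mismatch does not necessarily mean the code will fail; it only shows where the local environment differs from the one the tutorial was tested with.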

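Preparing the Dataset

Training uses the Fudan Chinese news dataset. To obtain it:

Download train.jsonl and test.jsonl from the ModelScope community.
Place both files in the local root directory, which is where the code below looks for them.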
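
The preprocessing code later in this tutorial reads the keys text, category, and output from every line of these files. A quick sanity check after downloading could look like the sketch below; check_jsonl and its limit parameter are hypothetical helpers added for illustration, and the check assumes the files keep the names given above.

```python
import json

# Keys that the tutorial's preprocessing step reads from each record.
REQUIRED_FIELDS = ("text", "category", "output")


def check_jsonl(path, limit=5):
    """Parse up to `limit` non-empty lines and return them, raising KeyError on a missing key."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if len(records) >= limit:
                break
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [key for key in REQUIRED_FIELDS if key not in record]
            if missing:
                raise KeyError(f"{path}:{line_no} is missing {missing}")
            records.append(record)
    return records
```

Calling check_jsonl("train.jsonl") and check_jsonl("test.jsonl") from the directory that holds the files should return the first few records without raising.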

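Loading the Model

Download the Qwen2-1.5B-Instruct model from the ModelScope model hub:

from modelscope import snapshot_download

# Downloads the model files and returns the local directory that holds them.
model_dir = snapshot_download("qwen/Qwen2-1.5B-Instruct")

Load the tokenizer and the model:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", torch_dtype=torch.bfloat16)
# Enable gradients on the input embeddings (useful with gradient checkpointing or adapter training).
model.enable_input_require_grads()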

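Configuring the Training Visualization Tool

SwanLab is integrated to monitor the training run:

from swanlab.integration.huggingface import SwanLabCallback

# Passed to Trainer(callbacks=[...]) in the training code.
swanlab_callback = SwanLabCallback()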

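Complete Code

Data Preprocessing

import json

from datasets import Dataset

# The Chinese prompt used below reads: "You are an expert in text classification. You will receive
# a text and several candidate categories; output the correct category of the text."
def dataset_jsonl_transfer(origin_path, new_path):
    """Convert the raw dataset into instruction/input/output records, one JSON object per line."""
    messages = []
    with open(origin_path, 'r', encoding='utf-8') as file:
        for line in file:
            data = json.loads(line)
            text, category, output = data['text'], data['category'], data['output']
            messages.append({
                "instruction": "你是一个文本分类领域的专家，你会接收到一段文本和几个潜在的分类选项，请输出文本内容的正确类型",
                "input": f"文本:{text},类型选型:{category}",
                "output": output,
            })
    with open(new_path, 'w', encoding='utf-8') as file:
        file.write('\n'.join(json.dumps(message, ensure_ascii=False) for message in messages))

def process_func(example):
    """Tokenize one record into input_ids, attention_mask, and labels for training."""
    MAX_LENGTH = 384  # longer sequences are truncated
    # Uses the module-level tokenizer; process_text is defined in the training section below.
    instruction = tokenizer(f"系统提示\n你是一个文本分类领域的专家，你会接收到一段文本和几个潜在的分类选项，请输出文本内容的正确类型\n用户提问\n{example['input']}\n系统回答\n")
    response = tokenizer(f"{example['output']}")
    input_ids, attention_mask, labels = process_text(instruction, response)
    if len(input_ids) > MAX_LENGTH:
        input_ids = input_ids[:MAX_LENGTH]
        attention_mask = attention_mask[:MAX_LENGTH]
        labels = labels[:MAX_LENGTH]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}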
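
process_func hands the tokenized prompt and answer to process_text, which concatenates them and sets the label of every prompt token to -100. That value is ignored by the cross-entropy loss, so the model is only trained to produce the answer. The toy sketch below reproduces the concatenate, mask, and truncate steps on plain Python lists; the token ids and the build_example helper are made up for illustration and no real tokenizer is involved.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def build_example(prompt_ids, answer_ids, end_id, max_length):
    """Concatenate prompt and answer ids, mask the prompt in the labels, then truncate."""
    input_ids = prompt_ids + answer_ids + [end_id]
    attention_mask = [1] * len(input_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids + [end_id]
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }


example = build_example(prompt_ids=[11, 12, 13], answer_ids=[21, 22], end_id=0, max_length=384)
print(example["input_ids"])  # [11, 12, 13, 21, 22, 0]
print(example["labels"])     # [-100, -100, -100, 21, 22, 0]
```

Note that truncation cuts from the end, so a prompt longer than the limit would leave no answer tokens in the labels at all.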

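Model Training

import pandas as pd
import torch
from modelscope import snapshot_download
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)

def process_text(instruction, response):
    # Concatenate prompt and answer tokens, then append one closing token (pad_token_id here).
    input_ids = instruction['input_ids'] + response['input_ids'] + [tokenizer.pad_token_id]
    attention_mask = instruction['attention_mask'] + response['attention_mask'] + [1]
    # -100 masks the prompt tokens so that the loss is computed only on the answer.
    labels = [-100] * len(instruction['input_ids']) + response['input_ids'] + [tokenizer.pad_token_id]
    return input_ids, attention_mask, labels

def main():
    # process_func and process_text read the module-level tokenizer, so bind these names globally.
    global tokenizer, model

    dataset_jsonl_transfer('train.jsonl', 'new_train.jsonl')
    dataset_jsonl_transfer('test.jsonl', 'new_test.jsonl')
    # snapshot_download returns a local path; unless a local directory with that name exists,
    # transformers would resolve a bare model ID against the Hugging Face Hub instead of ModelScope.
    model_dir = snapshot_download("qwen/Qwen2-1.5B-Instruct")
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", torch_dtype=torch.bfloat16)
    model.enable_input_require_grads()

    train_df = pd.read_json('new_train.jsonl', lines=True)
    train_ds = Dataset.from_pandas(train_df)
    train_dataset = train_ds.map(process_func, remove_columns=train_ds.column_names)

    trainer = Trainer(
        model=model,
        # output_dir is required in transformers 4.41.2; "./output" is a placeholder path.
        # All other hyperparameters stay at the library defaults, as in the original listing.
        args=TrainingArguments(output_dir="./output"),
        train_dataset=train_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
        callbacks=[swanlab_callback],
    )
    trainer.train()

    # predict() must be defined before main() runs: it takes (model, tokenizer, messages)
    # and returns the generated reply as text.
    test_df = pd.read_json('new_test.jsonl', lines=True)
    for index, row in test_df.iterrows():
        instruction, input_value = row['instruction'], row['input']
        messages = [{"role": "system", "content": f"{instruction}"},
                    {"role": "user", "content": f"{input_value}"}]
        response = predict(model, tokenizer, messages)
        print(response)

if __name__ == "__main__":
    main()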
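
The evaluation loop calls predict(model, tokenizer, messages), but the tutorial does not show that function. One plausible version is sketched below, assuming the standard transformers chat workflow: render the messages with apply_chat_template, call generate, and decode only the newly generated tokens. The max_new_tokens default of 512 is an assumption for illustration, not a value taken from the tutorial.

```python
def predict(model, tokenizer, messages, max_new_tokens=512):
    """Generate a reply for a list of chat messages and return it as text.

    Sketch only: the signature matches the calls in this tutorial, but the body
    and the max_new_tokens default are assumptions.
    """
    # Render the messages with the model's chat template, ending with the assistant prompt.
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        model_inputs.input_ids,
        attention_mask=model_inputs.attention_mask,
        max_new_tokens=max_new_tokens,
    )
    # generate() returns prompt plus continuation; keep only the continuation.
    new_tokens = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

Using model.device instead of a hard-coded "cuda" keeps the helper usable on whatever device the model was loaded onto.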

Training Results

Open the run in SwanLab to monitor training and view the logged results.

Inference with the Trained Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace 'your_model_directory' with the directory that holds the fine-tuned weights.
model = AutoModelForCausalLM.from_pretrained('your_model_directory', device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained('your_model_directory')

instruction = "你是一个文本分类领域的专家，你会接收到一段文本和几个潜在的分类选项，请输出文本内容的正确类型"
input_value = "文本:航空动力学报JOURNAL OF AEROSPACE POWER1998年 第4期 No.4 1998科技期刊管路系统敷设的并行工程模型研究*陈志英*　*　马　枚北京航空航天大学【摘要】　提出了一种应用于并行工程模型转换研究的标号法，该法是将现行串行设计过程(As-is)转换为并行设计过程(To-be)。本文应用该法将发动机外部管路系统敷设过程模型进行了串并行转换，应用并行工程过程重构的手段，得到了管路敷设并行过程模型。"

messages = [{"role": "system", "content": f"{instruction}"},
            {"role": "user", "content": f"{input_value}"}]

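# predict() is the same generation helper that main() calls; it must be defined or imported before this line.
response = predict(model, tokenizer, messages)
print(response)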
