DeepSeek-R1 Model API Deployment

2025/2/21 20:30:35 / Source: https://www.cnblogs.com/saas-open/p/18725915


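Contents
1. Project Background
2. Project Requirements
3. Project Environment
3.1 Required Dependencies
3.2 Installing Dependencies
3.3 Loading the Model
(1) models.py model file
(2) setting.py configuration file
3.4 Writing the FastAPI Application
(1) main.py file
3.5 Deploying to a Server
(1) Starting the FastAPI service
(2) Configuring the firewall
(3) Deploying on a cloud server
4. Summary
4.1 Future Improvements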

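1. Project Background

This project deploys a DeepSeek model as a Web API so that users can interact with it over HTTP. The model is served through the FastAPI framework: a user submits a question and receives the answer the model generates. This article describes how to configure, implement, and deploy the system.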

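2. Project Requirements

DeepSeek model: a pretrained DeepSeek-R1-Distill-Qwen-1.5B checkpoint.
Environment: a backend built with FastAPI and PyTorch.
API: a POST endpoint that accepts a user's question and returns the generated answer.
Deployment: the model and the service run on a server, and the API is reachable from outside.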

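3. Project Environment

3.1 Required Dependencies

Python version: 3.8+
FastAPI: builds the Web API service.
Uvicorn: the ASGI server that runs the FastAPI application.
Transformers: loads and runs pretrained Hugging Face models.
Torch: runs model inference, with GPU acceleration when available.
Pydantic: validates and serializes request data.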

3.2 Installing Dependencies

Install the required libraries:


"""
pip install fastapi uvicorn torch transformers pydantic

"""

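Model files: make sure the DeepSeek-R1-Distill-Qwen-1.5B model and its tokenizer have been downloaded and placed in the expected location in the project, which setting.py defines as models/DeepSeek-R1-Distill-Qwen-1.5B under the project root.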
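The article does not show how the files are downloaded. One possible approach is sketched below, assuming the checkpoint is published on the Hugging Face Hub under the repository id `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` and that `huggingface_hub` is available (it is normally installed together with `transformers`). The helper names `local_model_dir` and `download_model` are hypothetical and not part of the original project.

```python
from pathlib import Path

# Assumption: the checkpoint is hosted on the Hugging Face Hub under this id.
REPO_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"


def local_model_dir(project_root: str) -> Path:
    """Return <project_root>/models/<model name>, the layout setting.py expects."""
    return Path(project_root) / "models" / REPO_ID.split("/")[-1]


def download_model(project_root: str = ".") -> Path:
    """Download the model weights and tokenizer files into the project tree."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_model_dir(project_root)
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target
```

Calling `download_model()` once from the project root should leave the files where `setting.py` looks for them. The download is several gigabytes, so it is worth running before the service is started for the first time.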

3.3 Loading the Model

(1) `models.py` model file

The models.py file contains the logic for loading the model and the tokenizer:


import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from config import setting


def load_model_and_tokenizer():
    """Check the configured model path, then load the model and tokenizer."""
    # Model path defined in config/setting.py
    model_dir = setting.model_dir

    if not os.path.exists(model_dir):
        raise ValueError(f"Model directory does not exist at {model_dir}")

    print("Loading model and tokenizer...")
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)

    return model, tokenizer

(2) `setting.py` configuration file

The setting.py file configures the model path.


# Root path
import os

# Directory that contains this file
current_dir = os.path.dirname(__file__)

# Project root (the parent of that directory)
project_root = os.path.dirname(current_dir)

# Path to the model and tokenizer
model_dir = os.path.join(project_root, 'models', 'DeepSeek-R1-Distill-Qwen-1.5B')

print(model_dir)

# Make sure the model and tokenizer path exists
if not os.path.exists(model_dir):
    raise ValueError(f"Model directory does not exist at {model_dir}")
else:
    print("Model directory exists, proceeding with loading.")

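This file first obtains the directory of the current file, derives the project root from it, and joins the root with the model's storage path. It then checks that the path exists. The actual loading of the model and tokenizer happens in models.py.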
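To make the path derivation concrete, here is a small sketch that mirrors the two `os.path.dirname` steps. It assumes setting.py lives in a `config/` package directly under the project root, which is what the `from config import setting` import in models.py suggests. The function `resolve_model_dir` and the example path `/srv/app` are hypothetical.

```python
from pathlib import PurePosixPath

MODEL_NAME = "DeepSeek-R1-Distill-Qwen-1.5B"


def resolve_model_dir(settings_file: str) -> PurePosixPath:
    """Mirror setting.py: go up two levels from the file, then into models/."""
    current_dir = PurePosixPath(settings_file).parent  # e.g. <root>/config
    project_root = current_dir.parent                  # e.g. <root>
    return project_root / "models" / MODEL_NAME


print(resolve_model_dir("/srv/app/config/setting.py"))
# prints /srv/app/models/DeepSeek-R1-Distill-Qwen-1.5B
```

If setting.py were moved to a different depth in the project tree, the derived root would change with it, so the model directory would have to move too.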

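3.4 Writing the FastAPI Application

(1) `main.py` file

Main steps:

Load the model and tokenizer: call load_model_and_tokenizer() from models.py.

Select the device: choose cuda or cpu depending on whether a GPU is available.

Define the request model: use a Pydantic model to describe the request body, which contains the user's text input_text and an instruction text instruction (with a default value).

Generate the response: build a structured prompt, pass it to the model, and return the generated answer.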
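The prompt-building step can be sketched on its own, separate from the web framework. The snippet below is an illustration rather than the author's exact code: `build_prompt` and `DEFAULT_INSTRUCTION` are hypothetical names, and the template text is an English rendering of the one used in main.py. The trailing `<think>` tag nudges the model to begin with its reasoning.

```python
DEFAULT_INSTRUCTION = "You are a helpful assistant"  # hypothetical constant


def build_prompt(input_text: str, instruction: str = DEFAULT_INSTRUCTION) -> str:
    """Assemble the structured prompt: task line, instruction, question, response stub."""
    return (
        "Write an appropriate response that completes the current conversation task.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Question:\n{input_text}\n\n"
        "### Response:\n<think>"
    )


prompt = build_prompt("What is FastAPI?")
print(prompt)
```

Keeping the template in one function makes it easy to check that no stray text, such as a code comment placed inside the f-string, ends up in what the model sees.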

main.py is the core file of the FastAPI application. The implementation is as follows:


import uvicorn
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import torch
from models import models

# Load the model and tokenizer
model, tokenizer = models.load_model_and_tokenizer()

# Move the model to the right device (GPU if available)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Define the FastAPI application
app = FastAPI()


# Pydantic model that describes the request body
class RequestModel(BaseModel):
    input_text: str
    instruction: str = "You are a helpful assistant"  # optional instruction text


# API route that generates a response
@app.post("/generate-response/")
async def generate_response(request: RequestModel):
    input_text = request.input_text
    instruction_text = request.instruction

    print(f"User input: {input_text}")
    print(f"Instruction: {instruction_text}")

    # Build the prompt from the instruction and the question. Comments stay
    # outside the string so that they are not sent to the model.
    prompt_style_chat = (
        "Write an appropriate response that completes the current conversation task.\n\n"
        f"### Instruction:\n{instruction_text}\n\n"
        f"### Question:\n{input_text}\n\n"
        "### Response:\n<think>"
    )

    # Tokenize the prompt
    inputs = tokenizer(prompt_style_chat, return_tensors="pt", padding=True, truncation=True, max_length=512)

    # Use the tokenizer's pad_token_id, falling back to the model config
    pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else model.config.pad_token_id

    try:
        # Generate the model output
        with torch.no_grad():
            outputs = model.generate(
                inputs["input_ids"].to(device),
                attention_mask=inputs["attention_mask"].to(device),
                max_new_tokens=1200,  # maximum number of generated tokens
                do_sample=True,  # sampling must be enabled for temperature/top_p to apply
                temperature=1.0,
                top_p=0.9,
                pad_token_id=pad_token_id,
            )
        print("Model.generate() completed.")
    except Exception as e:
        print(f"Error generating response: {e}")
        raise HTTPException(status_code=500, detail=f"Error generating response: {e}")

    try:
        # Decode the output; the decoded text contains the prompt followed by the completion
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(f"Generated response: {response}")
        return {"response": response}
    except Exception as e:
        print(f"Error decoding output: {e}")
        raise HTTPException(status_code=500, detail=f"Error decoding output: {e}")


if __name__ == "__main__":
    uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)
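Once the service is running, the endpoint can be exercised from any HTTP client. The sketch below uses only the Python standard library and assumes the server is reachable at `http://127.0.0.1:8000`. The helpers `build_request`, `ask`, and `extract_completion` are hypothetical. Because main.py decodes the whole output sequence, the returned text starts with the prompt, so `extract_completion` keeps only what follows the first `### Response:` marker.

```python
import json
import urllib.request

# Assumption: the service from main.py is running locally on port 8000.
API_URL = "http://127.0.0.1:8000/generate-response/"
RESPONSE_MARKER = "### Response:"


def build_request(input_text, instruction=None, url=API_URL):
    """Build the POST request; 'instruction' is omitted to use the server default."""
    payload = {"input_text": input_text}
    if instruction is not None:
        payload["instruction"] = instruction
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_completion(response_text):
    """Drop the echoed prompt by keeping the text after the first marker."""
    _, found, completion = response_text.partition(RESPONSE_MARKER)
    return completion.strip() if found else response_text.strip()


def ask(input_text, instruction=None, timeout=300.0):
    """Send a question to the API and return the generated text without the prompt."""
    request = build_request(input_text, instruction)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return extract_completion(body["response"])
```

With the server up, `ask("What is FastAPI?")` should return the generated text. The generous timeout reflects that producing up to 1200 new tokens on a CPU can take several minutes.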