
[Long Read] [InternVL] A Complete Guide to Deploying the InternVL2-26B Model

Original · 知冷煖 · Published 2025-01-16 16:45:00

一、Introduction to LMDeploy [Model Deployment Framework]

1-1、About LMDeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. Its core features are:

  • Efficient inference engine (TurboMind): implements key features such as persistent batching (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels, delivering high-throughput, low-latency LLM inference.
  • Interactive inference mode: by caching the attention k/v of multi-turn conversations, the engine remembers the dialogue history and avoids reprocessing earlier turns.
  • Quantization: LMDeploy supports several quantization methods and efficient inference of quantized models; quantization reliability has been verified on models of various sizes.

TurboMind on the CUDA platform supports most mainstream LLM and VLM families (including the InternLM and InternVL series); see the LMDeploy documentation for the full supported-model list.

1-2、LLM Inference

A minimal pipeline only needs the model name:
Code (Python):
import lmdeploy
pipe = lmdeploy.pipeline("internlm/internlm2_5-7b-chat")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)

When constructing a pipeline, if you do not specify whether to use the TurboMind engine or the PyTorch engine, LMDeploy picks one automatically based on what each engine supports, preferring the TurboMind engine by default. The TurboMind engine can also be configured explicitly:

Code (Python):
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=TurbomindEngineConfig(
                    max_batch_size=32,
                    enable_prefix_caching=True,
                    cache_max_entry_count=0.8,
                    session_len=8192,
                ))

Alternatively, you can explicitly select the PyTorch engine:

Code (Python):
from lmdeploy import pipeline, PytorchEngineConfig
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=PytorchEngineConfig(
                    max_batch_size=32,
                    enable_prefix_caching=True,
                    cache_max_entry_count=0.8,
                    session_len=8192,
                ))
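
Decoding parameters (max new tokens, temperature, and so on) are passed per request rather than in the engine config. A minimal sketch, assuming the GenerationConfig interface of recent lmdeploy releases (field names may differ slightly across versions):

Code (Python):
from lmdeploy import GenerationConfig, pipeline

pipe = pipeline('internlm/internlm2_5-7b-chat')
# Per-request decoding settings (assumed field names from recent lmdeploy releases).
gen_config = GenerationConfig(max_new_tokens=256, temperature=0.7, top_p=0.8)
print(pipe(['Hi, pls intro yourself'], gen_config=gen_config))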

1-3、VLM Inference

The VLM inference pipeline is similar to the LLM one, with the added ability to feed images to the pipeline. For example, you can run inference with an InternVL model using the following snippet:

Code (Python):
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2-8B')
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)

二、InternVL2-26B [Overview, Loading & Inference]

2-1、About InternVL 2.0

InternVL 2.0 is the latest release in the InternVL series of multimodal large language models. It provides a range of instruction-tuned models, with parameter counts from 1B up to 108B.

Key features:

  • Compared with state-of-the-art open-source multimodal LLMs, InternVL 2.0 surpasses most of them and is competitive with closed-source commercial models across a wide range of capabilities, including document and chart comprehension, infographics QA, scene-text understanding and OCR, scientific and mathematical problem solving, and cultural understanding and integrated multimodal abilities.
  • InternVL 2.0 is trained with an 8k context window on data that includes long texts, multiple images, and videos, which markedly improves its ability to handle these kinds of input compared with InternVL 1.5.

The InternVL 2.0 family covers models from 1B to 76B parameters (InternVL2-1B/2B/4B/8B/26B/40B and InternVL2-Llama3-76B).

Among models of a similar size, InternVL2-26B is highly competitive on public benchmarks.

2-2、16-bit / 8-bit Loading

Install the required package:

Code (shell):
pip install transformers==4.37.2

16-bit (bfloat16) loading code:

Code (Python):
import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL2-26B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()

8-bit quantized loading code:

Code (Python):
import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL2-26B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()

Notice: replace path here with the directory the model was downloaded to. On Linux, ModelScope usually caches models under:

/root/.cache/modelscope/hub/
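
If the weights were pulled with ModelScope, you can also let it resolve the local directory for you. A minimal sketch, assuming the modelscope package is installed and exports snapshot_download at the top level (the import path may differ between versions):

Code (Python):
from modelscope import snapshot_download  # assumed top-level export

# Downloads the model if needed (otherwise reuses the cache) and returns the local path,
# typically a directory under /root/.cache/modelscope/hub/ on Linux.
path = snapshot_download('OpenGVLab/InternVL2-26B')
print(path)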

2-3、Multi-GPU Loading

The multi-GPU loading code is shown below. Key points:

  • split_model: distributes the model's layers across multiple GPUs according to the model size to make better use of GPU memory; world_size counts the available GPUs.
  • num_layers_per_gpu: the number of model layers assigned to each GPU.
  • num_layers_per_gpu[0]: because the first GPU also hosts the ViT (the vision part of the model), its share of layers is reduced to lighten the load on GPU 0.
  • device_map entries: note that by default the vision model and a few other components are pinned to the first GPU; if its memory is not enough, you can spread them over the other GPUs to relieve GPU 0 (the demo in 2-5 does exactly this).
Code (Python):
import math
import torch
from transformers import AutoTokenizer, AutoModel

def split_model(model_name):
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = {
        'InternVL2-1B': 24, 'InternVL2-2B': 24, 'InternVL2-4B': 32, 'InternVL2-8B': 32,
        'InternVL2-26B': 48, 'InternVL2-40B': 60, 'InternVL2-Llama3-76B': 80}[model_name]
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "OpenGVLab/InternVL2-26B"
device_map = split_model('InternVL2-26B')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
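
Before loading, it can be worth sanity-checking how split_model spreads the layers. A small sketch reusing the split_model defined above; the exact counts depend on torch.cuda.device_count():

Code (Python):
from collections import Counter

device_map = split_model('InternVL2-26B')
# Number of modules assigned to each GPU index.
print(Counter(device_map.values()))
# The vision tower (plus embeddings, norm, and output head) is pinned to GPU 0 by default.
print(device_map['vision_model'])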

2-4、Inference with Transformers

The build_transform function builds a chain of image transforms, including:

  • Convert the image to RGB.
  • Resize it to the specified input size.
  • Convert it to a tensor and normalize it with the ImageNet mean/std.

The find_closest_aspect_ratio function searches a set of candidate aspect ratios for the one closest to the image's aspect ratio and returns the best match.

The dynamic_preprocess function dynamically preprocesses an image, splitting it into several tiles so that it fits the model's input requirements. Its main steps are:

  • Compute the image's aspect ratio and find the closest target ratio.
  • Resize the image and split it into the computed number of tiles.
  • Optionally append a thumbnail of the whole image as additional context (a small worked example follows this list).
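
As a concrete example, an 800×600 input with image_size=448 and max_num=12 matches the 4:3 target ratio, so it is resized to 1792×1344 and cut into 12 tiles of 448×448, plus one thumbnail. A small sketch that reuses the dynamic_preprocess defined in the full listing below (the blank test image is only for illustration):

Code (Python):
from PIL import Image

img = Image.new('RGB', (800, 600))  # hypothetical test image
tiles = dynamic_preprocess(img, image_size=448, max_num=12, use_thumbnail=True)
print(len(tiles))      # 13 -> 12 grid tiles (4 x 3) + 1 thumbnail
print(tiles[0].size)   # (448, 448)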

The load_image function:

  • Opens the image file and converts it to RGB.
  • Applies build_transform and dynamic_preprocess.
  • Converts the tiles to tensors and stacks them, ready to be fed to the model.

The complete code is shown below:

Code (Python):
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# If you have an 80G A100 GPU, you can put the entire model on a single GPU.
# Otherwise, you need to load a model using multiple GPUs, please refer to the `Multiple GPUs` section.
path = 'OpenGVLab/InternVL2-26B'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation (纯文本对话)
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# single-image single-round conversation (单图单轮对话)
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')

# single-image multi-round conversation (单图多轮对话)
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

# batch inference, single image per sample (单图批处理)
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
                             num_patches_list=num_patches_list,
                             questions=questions,
                             generation_config=generation_config)
for question, response in zip(questions, responses):
    print(f'User: {question}\nAssistant: {response}')

# video multi-round conversation (视频多轮对话)
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
    if bound:
        start, end = bound[0], bound[1]
    else:
        start, end = -100000, 100000
    start_idx = max(first_idx, round(start * fps))
    end_idx = min(round(end * fps), max_frame)
    seg_size = float(end_idx - start_idx) / num_segments
    frame_indices = np.array([
        int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
        for idx in range(num_segments)
    ])
    return frame_indices

def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    max_frame = len(vr) - 1
    fps = float(vr.get_avg_fps())

    pixel_values_list, num_patches_list = [], []
    transform = build_transform(input_size=input_size)
    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
    for frame_index in frame_indices:
        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
        pixel_values = [transform(tile) for tile in img]
        pixel_values = torch.stack(pixel_values)
        num_patches_list.append(pixel_values.shape[0])
        pixel_values_list.append(pixel_values)
    pixel_values = torch.cat(pixel_values_list)
    return pixel_values, num_patches_list

video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'Describe this video in detail. Don\'t repeat.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')

2-5、Demo

Notice: a small demo I wrote myself; take a look if you are interested.

Code (Python):
import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    """
    源代码
    """
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

def split_model(model_name):
    """
    源代码
    """
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = {
        'InternVL2-1B': 24, 'InternVL2-2B': 24, 'InternVL2-4B': 32, 'InternVL2-8B': 32,
        'InternVL2-26B': 48, 'InternVL2-40B': 60, 'InternVL2-Llama3-76B': 80}[model_name]
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    
    # Because GPU 0 also has to host the ViT, give it far fewer language-model layers.
    # num_layers_per_gpu lists the layer count per GPU; after rounding, GPU 0 only gets
    # about 10% of the average share (a single layer here).
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.1)
    #print(num_layers_per_gpu)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
            
    # Assign the remaining components to the GPUs one by one; vision_model uses the most
    # memory, so place it carefully. Note that the hard-coded indices below assume at
    # least 8 visible GPUs.
    device_map['vision_model'] = 0
    device_map['mlp1'] = 1
    device_map['language_model.model.tok_embeddings'] = 2
    device_map['language_model.model.embed_tokens'] = 3
    device_map['language_model.output'] = 4
    device_map['language_model.model.norm'] = 5
    device_map['language_model.lm_head'] = 6
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 7

    return device_map

def main():
    path = "/root/.cache/modelscope/hub/OpenGVLab/InternVL2-26B"
    device_map = split_model('InternVL2-26B')
    model = AutoModel.from_pretrained(
        path,
        torch_dtype=torch.bfloat16,
#    load_in_8bit=True,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
        device_map=device_map).eval()
    tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
    pixel_values = load_image('./1.png', max_num=12).to(torch.bfloat16).cuda()
    generation_config = dict(max_new_tokens=1024, do_sample=False)

    question = """<image>\n
    [任务] 给出图中衣服或者其他物品的:
    1、详细描述。
    2、该物品的风格。
    3、适合的场景。
    4、搭配建议。
    [注意] 风格和场景参考以下给出的列表。物品适合的风格和场景可以有多个。
    [风格列表]极简风 (Minimalist), 波西米亚风 (Bohemian), 街头风 (Streetwear), 复古风 (Vintage), 中性风 (Androgynous), 运动休闲风 (Athleisure), 雅痞风 (Preppy), 学院风 (Collegiate), 朋克风 (Punk), 哥特风 (Gothic), 奢华风 (Luxurious), 优雅风 (Elegant), 商务休闲风 (Business Casual), 高街风 (High Street), 工装风 (Utility/Workwear), 军事风 (Military), 户外风 (Outdoor), 田园风 (Cottagecore), 摩登风 (Modern), 法式风 (French Chic), 英伦风 (British), 乡村风 (Country Style), 嘻哈风 (Hip Hop), 简约优雅风 (Sophisticated Minimalism), 浪漫风 (Romantic), 海军风 (Nautical), 洛丽塔风 (Lolita), 未来风 (Futuristic), 摩托风 (Biker), 朋克洛丽塔风 (Punk Lolita), 宫廷风 (Baroque/Rococo), 东方风 (Oriental/Asian), 性感风 (Sexy), 度假风 (Resort), 摇滚风 (Rock), 艺术家风 (Artsy), 超模风 (Model-Off-Duty), 探险风 (Explorer), 丛林风 (Safari), 热带风 (Tropical), 工艺风 (Artisan), 环保风 (Sustainable), 日本原宿风 (Harajuku), 时尚运动风 (Sportswear Chic), 街头时尚风 (Urban Fashion), 经典正式风 (Classic Formal), 沙滩风 (Beachwear), 俱乐部风 (Clubwear), 黑暗童话风 (Dark Fairytale), 复古迪斯科风 (Retro Disco) 
    [场景列表]办公室 (Office), 商务会议 (Business Meeting), 正式宴会 (Formal Dinner), 婚礼 (Wedding), 鸡尾酒会 (Cocktail Party), 面试 (Job Interview), 约会 (Date), 度假 (Vacation), 沙滩派对 (Beach Party), 音乐节 (Music Festival), 音乐会 (Concert), 户外野餐 (Outdoor Picnic), 健身房 (Gym), 瑜伽课 (Yoga Class), 婚礼伴娘 / 伴郎 (Bridesmaid/Groomsman at Wedding), 毕业典礼 (Graduation Ceremony), 生日派对 (Birthday Party), 家庭聚会 (Family Gathering), 教堂 / 宗教仪式 (Church/Religious Ceremony), 朋友聚会 (Friends’ Get-together), 高尔夫场 (Golf Course), 剧院 / 歌剧院 (Theatre/Opera), 飞机旅行 (Airplane Travel), 购物逛街 (Shopping), 商务午餐 (Business Lunch), 下午茶 (Afternoon Tea), 红毯活动 (Red Carpet Event), 舞会 (Ball/Gala), 晚宴 (Dinner Party), 滑雪度假 (Ski Resort), 新年派对 (New Year’s Eve Party), 宠物聚会 (Pet Party), 音乐节日巡游 (Festival Parade), 公司年会 (Corporate Annual Meeting), 主题派对 (Theme Party), 开幕酒会 (Art Gallery Opening), 慈善晚宴 (Charity Dinner), 约见客户 (Client Meeting), 产后派对 (Baby Shower), 运动赛事观赛 (Sporting Event Spectator), 小型私人派对 (Intimate House Party), 夜店 (Nightclub), 商务展会 (Business Expo), 户外露营 (Outdoor Camping), 游艇派对 (Yacht Party), 时尚发布会 (Fashion Show), 博物馆参观 (Museum Visit), 酒吧聚会 (Bar Gathering), 读书会 (Book Club), 出差旅行 (Business Travel)"""
    response = model.chat(tokenizer, pixel_values, question, generation_config)
    print(f'User: {question}\nAssistant: {response}')

if __name__ == "__main__":
    main()

三、Model Deployment

3-1、Installation

Code (shell):
pip install lmdeploy==0.5.3

Additional dependencies LMDeploy needs for this model:

Code (shell):
pip install timm
# It is recommended to pick a prebuilt whl matching your environment from https://github.com/Dao-AILab/flash-attention/releases
pip install flash-attn

3-2、api_server

Parameter overview:

  • backend: the inference engine to use (turbomind).
  • server-port: the port to serve on.
  • tp: tensor parallelism degree; set it to the number of GPUs when using multiple cards.
Code (shell):
lmdeploy serve api_server OpenGVLab/InternVL2-26B --backend turbomind --server-port 23333

GPU memory after deployment: the bf16 model needs roughly 200 GB of VRAM in total, typically spread across multiple GPUs via --tp. If concurrency is high or multi-turn history must be kept, budget at least another 50 GB.

API documentation page: once the server is up, the available endpoints (such as /v1/chat/completions) can be browsed and tried out interactively in the browser.

Deploying the 4-bit AWQ-quantized model:

Code (shell):
lmdeploy serve api_server OpenGVLab/InternVL2-26B-AWQ --backend turbomind --server-port 23333 --model-format awq

3-3、Client Calls

Demo 1: use the client code shipped with LMDeploy (APIClient) to call the /v1/chat/completions endpoint.

Code (Python):
from lmdeploy.serve.openai.api_client import APIClient

api_client = APIClient(f'http://0.0.0.0:23333')
model_name = api_client.available_models[0]
messages = [{
    'role':
    'user',
    'content': [{
        'type': 'text',
        'text': 'Describe the image please',
    }, {
        'type': 'image_url',
        'image_url': {
            'url':
            'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
        },
    }]
}]
for item in api_client.chat_completions_v1(model=model_name,
                                           messages=messages):
    print(item)

Demo 2: build a plain requests call against the /v1/chat/completions endpoint.

Code (Python):
import requests
import json
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, filename='app.log', filemode='a',
                    format='%(name)s - %(levelname)s - %(message)s')

# URL of the deployed api_server
url = 'http://0.0.0.0:23333/v1/chat/completions'

# JSON payload of the request
data = {
    "model": "/root/.cache/modelscope/hub/OpenGVLab/InternVL2-26B/",  # replace with your model name or path
    "messages": [{
        "role": "user",
        "content": [{
            "type": "text",
            "text": "描述一下这张图片的内容"
            }, 
            {
            "type": "image_url",
            "image_url": {
                "url": ""
            }
        }]
    }],
    "temperature": 0.8,
    "top_p": 0.8
}

# Request headers
headers = {'Content-Type': 'application/json'}

try:
    for i in range(1):
        # Send the POST request
        response = requests.post(url, data=json.dumps(data), headers=headers)
        
        # Check whether the request succeeded
        if response.status_code == 200:
            logging.info("请求成功!响应内容:%s", response.json())
            print(f"请求成功!响应内容:\n{i}\n{response.json()}")
        else:
            logging.error("请求失败,状态码:%s", response.status_code)
            print(f"请求失败,状态码:{response.status_code}")
except requests.exceptions.RequestException as e:
    logging.error("请求异常:%s", str(e))
    print(f"请求异常:{e}")
except Exception as e:
    logging.error("发生错误:%s", str(e))
    print(f"发生错误:{e}")

Appendix

1、Checking GPU memory usage

One-off check:

Code (shell):
nvidia-smi

Continuous monitoring:

Code (shell):
watch -n 0.5 nvidia-smi


2、Notes on the /v1/chat/interactive endpoint

Notice: be careful with this interactive endpoint. The documentation does state that:

  • In interactive mode, the chat history is kept on the server; set interactive_mode = True.
  • In normal mode, no chat history is kept on the server; set interactive_mode = False.

In practice, however, even with interactive_mode = False, GPU memory on the server keeps growing as API calls are made and eventually runs out, so use this endpoint with caution! The /v1/chat/completions endpoint does not have this problem.

Copyright notice: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

For infringement concerns, please contact cloudcommunity@tencent.com for removal.

