LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. Its core capabilities cover efficient inference, quantization, and model serving.
The models supported by the TurboMind engine on the CUDA platform are listed below. A minimal quick-start example with the pipeline API:
import lmdeploy
pipe = lmdeploy.pipeline("internlm/internlm2_5-7b-chat")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
When constructing a pipeline, if you do not specify whether to use the TurboMind engine or the PyTorch engine for inference, LMDeploy automatically picks one based on their respective capabilities, preferring the TurboMind engine by default.
from lmdeploy import pipeline, TurbomindEngineConfig
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=TurbomindEngineConfig(
                    max_batch_size=32,
                    enable_prefix_caching=True,
                    cache_max_entry_count=0.8,
                    session_len=8192,
                ))
from lmdeploy import pipeline, PytorchEngineConfig
pipe = pipeline('internlm/internlm2_5-7b-chat',
backend_config=PytorchEngineConfig(
max_batch_size=32,
enable_prefix_caching=True,
cache_max_entry_count=0.8,
session_len=8192,
))
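If you need per-request control over sampling, LMDeploy also provides a GenerationConfig that can be passed to the pipeline call. A minimal sketch (the parameter values are illustrative only):
from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

# Per-request sampling settings; the values below are illustrative, not recommendations.
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=TurbomindEngineConfig(session_len=8192))
gen_config = GenerationConfig(max_new_tokens=512, top_p=0.8, temperature=0.7)
response = pipe(['Hi, pls intro yourself'], gen_config=gen_config)
print(response)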
The VLM inference pipeline is similar to the LLM one, with the added ability to feed image data through the pipeline. For example, you can run inference on an InternVL model with the following snippet:
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2-8B')
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
InternVL 2.0: the latest generation of the InternVL series of multimodal large language models. InternVL 2.0 provides a range of instruction-tuned models, with parameter counts ranging from 1 billion to 108 billion.
Its key features are as follows:
The InternVL 2.0 model variants are listed below:
As shown in the figure below, InternVL2-26B is highly competitive compared with other models of its class.
Install the required package:
pip install transformers==4.37.2
Code for loading the model in 16-bit (bfloat16) precision:
import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL2-26B"
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
use_flash_attn=True,
trust_remote_code=True).eval().cuda()
Code for loading the model with 8-bit quantization:
import torch
from transformers import AutoTokenizer, AutoModel
path = "OpenGVLab/InternVL2-26B"
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
load_in_8bit=True,
low_cpu_mem_usage=True,
use_flash_attn=True,
trust_remote_code=True).eval()
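Note that load_in_8bit relies on the bitsandbytes package. In recent transformers releases the same thing can be expressed with an explicit quantization config; the following is a sketch rather than the snippet from the model card:
import torch
from transformers import AutoModel, BitsAndBytesConfig

path = "OpenGVLab/InternVL2-26B"
# Explicit 8-bit quantization config; requires `pip install bitsandbytes`.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval()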
Notice: replace `path` here with the directory the model was downloaded to; on Linux this is usually under:
/root/.cache/modelscope/hub/
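If you are unsure where the weights were downloaded to, one way to obtain the local path is modelscope's snapshot_download, which downloads the model (or reuses the cache) and returns the directory. A minimal sketch, assuming the modelscope package is installed:
from modelscope import snapshot_download

# Downloads the weights (or reuses the local cache) and returns the directory,
# which can then be passed as `path` to from_pretrained.
path = snapshot_download('OpenGVLab/InternVL2-26B')
print(path)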
The code for loading the model across multiple GPUs is as follows:
import math
import torch
from transformers import AutoTokenizer, AutoModel
def split_model(model_name):
device_map = {}
world_size = torch.cuda.device_count()
num_layers = {
'InternVL2-1B': 24, 'InternVL2-2B': 24, 'InternVL2-4B': 32, 'InternVL2-8B': 32,
'InternVL2-26B': 48, 'InternVL2-40B': 60, 'InternVL2-Llama3-76B': 80}[model_name]
# Since the first GPU will be used for ViT, treat it as half a GPU.
num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
num_layers_per_gpu = [num_layers_per_gpu] * world_size
num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
layer_cnt = 0
for i, num_layer in enumerate(num_layers_per_gpu):
for j in range(num_layer):
device_map[f'language_model.model.layers.{layer_cnt}'] = i
layer_cnt += 1
device_map['vision_model'] = 0
device_map['mlp1'] = 0
device_map['language_model.model.tok_embeddings'] = 0
device_map['language_model.model.embed_tokens'] = 0
device_map['language_model.output'] = 0
device_map['language_model.model.norm'] = 0
device_map['language_model.lm_head'] = 0
device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
return device_map
path = "OpenGVLab/InternVL2-26B"
device_map = split_model('InternVL2-26B')
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
use_flash_attn=True,
trust_remote_code=True,
device_map=device_map).eval()
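As a quick sanity check of the mapping produced by split_model, you can count how many decoder layers land on each GPU (a small sketch that reuses the device_map built above):
from collections import Counter

# Count how many decoder layers the device_map above assigns to each GPU.
layers_per_gpu = Counter(
    dev for name, dev in device_map.items()
    if name.startswith('language_model.model.layers.')
)
for gpu, n in sorted(layers_per_gpu.items()):
    print(f'GPU {gpu}: {n} decoder layers')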
The build_transform function builds a sequence of image transforms: converting the image to RGB, resizing it to input_size x input_size with bicubic interpolation, converting it to a tensor, and normalizing it with the ImageNet mean and standard deviation.
The find_closest_aspect_ratio function searches a set of candidate aspect ratios for the one closest to the aspect ratio of the given image and returns the best match.
The dynamic_preprocess function dynamically preprocesses an image by splitting it into tiles that match the model's input requirements. Its main steps are: compute the image's aspect ratio, enumerate candidate tiling grids whose tile count lies between min_num and max_num, pick the grid whose aspect ratio is closest to the image's, resize the image to that grid, crop it into image_size x image_size tiles, and optionally append a thumbnail of the whole image.
The load_image function opens an image file, applies dynamic_preprocess and build_transform, and stacks the resulting tiles into a single pixel tensor.
The complete code is shown below:
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
def build_transform(input_size):
MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
transform = T.Compose([
T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
T.ToTensor(),
T.Normalize(mean=MEAN, std=STD)
])
return transform
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
best_ratio_diff = float('inf')
best_ratio = (1, 1)
area = width * height
for ratio in target_ratios:
target_aspect_ratio = ratio[0] / ratio[1]
ratio_diff = abs(aspect_ratio - target_aspect_ratio)
if ratio_diff < best_ratio_diff:
best_ratio_diff = ratio_diff
best_ratio = ratio
elif ratio_diff == best_ratio_diff:
if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
best_ratio = ratio
return best_ratio
def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
orig_width, orig_height = image.size
aspect_ratio = orig_width / orig_height
# calculate the existing image aspect ratio
target_ratios = set(
(i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
i * j <= max_num and i * j >= min_num)
target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
# find the closest aspect ratio to the target
target_aspect_ratio = find_closest_aspect_ratio(
aspect_ratio, target_ratios, orig_width, orig_height, image_size)
# calculate the target width and height
target_width = image_size * target_aspect_ratio[0]
target_height = image_size * target_aspect_ratio[1]
blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
# resize the image
resized_img = image.resize((target_width, target_height))
processed_images = []
for i in range(blocks):
box = (
(i % (target_width // image_size)) * image_size,
(i // (target_width // image_size)) * image_size,
((i % (target_width // image_size)) + 1) * image_size,
((i // (target_width // image_size)) + 1) * image_size
)
# split the image
split_img = resized_img.crop(box)
processed_images.append(split_img)
assert len(processed_images) == blocks
if use_thumbnail and len(processed_images) != 1:
thumbnail_img = image.resize((image_size, image_size))
processed_images.append(thumbnail_img)
return processed_images
def load_image(image_file, input_size=448, max_num=12):
image = Image.open(image_file).convert('RGB')
transform = build_transform(input_size=input_size)
images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
pixel_values = [transform(image) for image in images]
pixel_values = torch.stack(pixel_values)
return pixel_values
# If you have an 80G A100 GPU, you can put the entire model on a single GPU.
# Otherwise, you need to load a model using multiple GPUs, please refer to the `Multiple GPUs` section.
path = 'OpenGVLab/InternVL2-26B'
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
use_flash_attn=True,
trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)
# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
question = 'Can you tell me a story?'
response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
# single-image single-round conversation
question = '<image>\nPlease describe the image shortly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')
# single-image multi-round conversation
question = '<image>\nPlease describe the image in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
question = 'Please write a poem according to the image.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
# multi-image multi-round conversation, combined images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
question = '<image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
# multi-image multi-round conversation, separate images
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
num_patches_list=num_patches_list,
history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
question = 'What are the similarities and differences between these two images.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
num_patches_list=num_patches_list,
history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
# batch inference, single image per sample
pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
responses = model.batch_chat(tokenizer, pixel_values,
num_patches_list=num_patches_list,
questions=questions,
generation_config=generation_config)
for question, response in zip(questions, responses):
print(f'User: {question}\nAssistant: {response}')
# video multi-round conversation
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
if bound:
start, end = bound[0], bound[1]
else:
start, end = -100000, 100000
start_idx = max(first_idx, round(start * fps))
end_idx = min(round(end * fps), max_frame)
seg_size = float(end_idx - start_idx) / num_segments
frame_indices = np.array([
int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
for idx in range(num_segments)
])
return frame_indices
def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
max_frame = len(vr) - 1
fps = float(vr.get_avg_fps())
pixel_values_list, num_patches_list = [], []
transform = build_transform(input_size=input_size)
frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
for frame_index in frame_indices:
img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
pixel_values = [transform(tile) for tile in img]
pixel_values = torch.stack(pixel_values)
num_patches_list.append(pixel_values.shape[0])
pixel_values_list.append(pixel_values)
pixel_values = torch.cat(pixel_values_list)
return pixel_values, num_patches_list
video_path = './examples/red-panda.mp4'
pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
pixel_values = pixel_values.to(torch.bfloat16).cuda()
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
question = video_prefix + 'What is the red panda doing?'
# Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
num_patches_list=num_patches_list, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
question = 'Describe this video in detail. Don\'t repeat.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
num_patches_list=num_patches_list, history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')
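To get a feel for what the preprocessing produces, the snippet below pushes a synthetic image through dynamic_preprocess and build_transform (a sketch that assumes the functions defined above are in scope; the 1600x900 size is arbitrary):
import torch
from PIL import Image

# A synthetic 1600x900 RGB image, just to exercise the tiling logic.
img = Image.new('RGB', (1600, 900), color=(128, 128, 128))
tiles = dynamic_preprocess(img, image_size=448, use_thumbnail=True, max_num=12)
print('number of tiles (incl. thumbnail):', len(tiles))

# Each tile becomes a 3x448x448 tensor after build_transform.
transform = build_transform(input_size=448)
pixel_values = torch.stack([transform(t) for t in tiles])
print('pixel_values shape:', tuple(pixel_values.shape))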
Notice: what follows is a small demo I wrote myself; take a look if you are interested.
import math
import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
def build_transform(input_size):
MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
transform = T.Compose([
T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
T.ToTensor(),
T.Normalize(mean=MEAN, std=STD)
])
return transform
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
best_ratio_diff = float('inf')
best_ratio = (1, 1)
area = width * height
for ratio in target_ratios:
target_aspect_ratio = ratio[0] / ratio[1]
ratio_diff = abs(aspect_ratio - target_aspect_ratio)
if ratio_diff < best_ratio_diff:
best_ratio_diff = ratio_diff
best_ratio = ratio
elif ratio_diff == best_ratio_diff:
if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
best_ratio = ratio
return best_ratio
def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
"""
    Unchanged from the original dynamic_preprocess implementation above.
"""
orig_width, orig_height = image.size
aspect_ratio = orig_width / orig_height
# calculate the existing image aspect ratio
target_ratios = set(
(i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
i * j <= max_num and i * j >= min_num)
target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
# find the closest aspect ratio to the target
target_aspect_ratio = find_closest_aspect_ratio(
aspect_ratio, target_ratios, orig_width, orig_height, image_size)
# calculate the target width and height
target_width = image_size * target_aspect_ratio[0]
target_height = image_size * target_aspect_ratio[1]
blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
# resize the image
resized_img = image.resize((target_width, target_height))
processed_images = []
for i in range(blocks):
box = (
(i % (target_width // image_size)) * image_size,
(i // (target_width // image_size)) * image_size,
((i % (target_width // image_size)) + 1) * image_size,
((i // (target_width // image_size)) + 1) * image_size
)
# split the image
split_img = resized_img.crop(box)
processed_images.append(split_img)
assert len(processed_images) == blocks
if use_thumbnail and len(processed_images) != 1:
thumbnail_img = image.resize((image_size, image_size))
processed_images.append(thumbnail_img)
return processed_images
def load_image(image_file, input_size=448, max_num=12):
image = Image.open(image_file).convert('RGB')
transform = build_transform(input_size=input_size)
images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
pixel_values = [transform(image) for image in images]
pixel_values = torch.stack(pixel_values)
return pixel_values
def split_model(model_name):
"""
    Based on the original split_model above, but with hand-tuned device assignments (see below).
"""
device_map = {}
world_size = torch.cuda.device_count()
num_layers = {
'InternVL2-1B': 24, 'InternVL2-2B': 24, 'InternVL2-4B': 32, 'InternVL2-8B': 32,
'InternVL2-26B': 48, 'InternVL2-40B': 60, 'InternVL2-Llama3-76B': 80}[model_name]
# Since the first GPU will be used for ViT, treat it as half a GPU.
num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
num_layers_per_gpu = [num_layers_per_gpu] * world_size
    # The first GPU also hosts the ViT part, so it is given fewer decoder layers.
    # num_layers_per_gpu lists how many decoder layers each GPU receives; because of the
    # first GPU's special role, it only gets a small share (the factor below was chosen by hand).
num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.1)
#print(num_layers_per_gpu)
layer_cnt = 0
for i, num_layer in enumerate(num_layers_per_gpu):
for j in range(num_layer):
device_map[f'language_model.model.layers.{layer_cnt}'] = i
layer_cnt += 1
    # Assign the model components to the GPUs one by one; vision_model uses the most memory, so distribute carefully.
device_map['vision_model'] = 0
device_map['mlp1'] = 1
device_map['language_model.model.tok_embeddings'] = 2
device_map['language_model.model.embed_tokens'] = 3
device_map['language_model.output'] = 4
device_map['language_model.model.norm'] = 5
device_map['language_model.lm_head'] = 6
device_map[f'language_model.model.layers.{num_layers - 1}'] = 7
return device_map
def main():
path = "/root/.cache/modelscope/hub/OpenGVLab/InternVL2-26B"
device_map = split_model('InternVL2-26B')
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
# load_in_8bit=True,
low_cpu_mem_usage=True,
trust_remote_code=True,
device_map=device_map).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
pixel_values = load_image('./1.png', max_num=12).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=False)
question = """<image>\n
[任务] 给出图中衣服或者其他物品的:
1、详细描述。
2、该物品的风格。
3、适合的场景。
4、搭配建议。
[注意] 风格和场景参考以下给出的列表。物品适合的风格和场景可以有多个。
[风格列表]极简风 (Minimalist), 波西米亚风 (Bohemian), 街头风 (Streetwear), 复古风 (Vintage), 中性风 (Androgynous), 运动休闲风 (Athleisure), 雅痞风 (Preppy), 学院风 (Collegiate), 朋克风 (Punk), 哥特风 (Gothic), 奢华风 (Luxurious), 优雅风 (Elegant), 商务休闲风 (Business Casual), 高街风 (High Street), 工装风 (Utility/Workwear), 军事风 (Military), 户外风 (Outdoor), 田园风 (Cottagecore), 摩登风 (Modern), 法式风 (French Chic), 英伦风 (British), 乡村风 (Country Style), 嘻哈风 (Hip Hop), 简约优雅风 (Sophisticated Minimalism), 浪漫风 (Romantic), 海军风 (Nautical), 洛丽塔风 (Lolita), 未来风 (Futuristic), 摩托风 (Biker), 朋克洛丽塔风 (Punk Lolita), 宫廷风 (Baroque/Rococo), 东方风 (Oriental/Asian), 性感风 (Sexy), 度假风 (Resort), 摇滚风 (Rock), 艺术家风 (Artsy), 超模风 (Model-Off-Duty), 探险风 (Explorer), 丛林风 (Safari), 热带风 (Tropical), 工艺风 (Artisan), 环保风 (Sustainable), 日本原宿风 (Harajuku), 时尚运动风 (Sportswear Chic), 街头时尚风 (Urban Fashion), 经典正式风 (Classic Formal), 沙滩风 (Beachwear), 俱乐部风 (Clubwear), 黑暗童话风 (Dark Fairytale), 复古迪斯科风 (Retro Disco)
[场景列表]办公室 (Office), 商务会议 (Business Meeting), 正式宴会 (Formal Dinner), 婚礼 (Wedding), 鸡尾酒会 (Cocktail Party), 面试 (Job Interview), 约会 (Date), 度假 (Vacation), 沙滩派对 (Beach Party), 音乐节 (Music Festival), 音乐会 (Concert), 户外野餐 (Outdoor Picnic), 健身房 (Gym), 瑜伽课 (Yoga Class), 婚礼伴娘 / 伴郎 (Bridesmaid/Groomsman at Wedding), 毕业典礼 (Graduation Ceremony), 生日派对 (Birthday Party), 家庭聚会 (Family Gathering), 教堂 / 宗教仪式 (Church/Religious Ceremony), 朋友聚会 (Friends’ Get-together), 高尔夫场 (Golf Course), 剧院 / 歌剧院 (Theatre/Opera), 飞机旅行 (Airplane Travel), 购物逛街 (Shopping), 商务午餐 (Business Lunch), 下午茶 (Afternoon Tea), 红毯活动 (Red Carpet Event), 舞会 (Ball/Gala), 晚宴 (Dinner Party), 滑雪度假 (Ski Resort), 新年派对 (New Year’s Eve Party), 宠物聚会 (Pet Party), 音乐节日巡游 (Festival Parade), 公司年会 (Corporate Annual Meeting), 主题派对 (Theme Party), 开幕酒会 (Art Gallery Opening), 慈善晚宴 (Charity Dinner), 约见客户 (Client Meeting), 产后派对 (Baby Shower), 运动赛事观赛 (Sporting Event Spectator), 小型私人派对 (Intimate House Party), 夜店 (Nightclub), 商务展会 (Business Expo), 户外露营 (Outdoor Camping), 游艇派对 (Yacht Party), 时尚发布会 (Fashion Show), 博物馆参观 (Museum Visit), 酒吧聚会 (Bar Gathering), 读书会 (Book Club), 出差旅行 (Business Travel)"""
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')
if __name__ == "__main__":
main()
Install LMDeploy:
pip install lmdeploy==0.5.3
Other dependencies needed to serve this model with LMDeploy:
pip install timm
# It is recommended to pick a prebuilt whl matching your environment from https://github.com/Dao-AILab/flash-attention/releases
pip install flash-attn
Launch command and parameter overview:
lmdeploy serve api_server OpenGVLab/InternVL2-26B --backend turbomind --server-port 23333
GPU memory usage after deployment: roughly 200 GB of GPU memory is required; if concurrency is high or you need multi-turn interaction with conversation history, plan on at least another 50 GB.
The API interface page:
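Besides opening the interface page in a browser, you can confirm from Python that the server is reachable; the OpenAI-compatible server exposes a model-listing endpoint (a small sketch, assuming the default --server-port 23333 used above):
import requests

# List the models served by the OpenAI-compatible server started above.
resp = requests.get('http://0.0.0.0:23333/v1/models', timeout=10)
print(resp.status_code)
print(resp.json())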
Deploying the 4-bit AWQ-quantized model:
lmdeploy serve api_server OpenGVLab/InternVL2-26B-AWQ --backend turbomind --server-port 23333 --model-format awq
Demo 1: use the client code provided by LMDeploy to call the /v1/chat/completions endpoint.
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient(f'http://0.0.0.0:23333')
model_name = api_client.available_models[0]
messages = [{
'role':
'user',
'content': [{
'type': 'text',
'text': 'Describe the image please',
}, {
'type': 'image_url',
'image_url': {
'url':
'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg',
},
}]
}]
for item in api_client.chat_completions_v1(model=model_name,
messages=messages):
print(item)
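Since the server is OpenAI-compatible, the same /v1/chat/completions endpoint can also be called with the official openai Python client. A sketch (assumes pip install openai; the api_key value is a placeholder because the server above was started without API keys):
from openai import OpenAI

client = OpenAI(api_key='not-needed', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
resp = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Describe the image please'},
            {'type': 'image_url',
             'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
    temperature=0.8,
)
print(resp.choices[0].message.content)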
Demo 2: build a custom request with the requests library and call the /v1/chat/completions endpoint.
import requests
import json
import logging
# Configure logging
logging.basicConfig(level=logging.INFO, filename='app.log', filemode='a',
format='%(name)s - %(levelname)s - %(message)s')
# The URL of the /v1/chat/completions endpoint
url = 'http://0.0.0.0:23333/v1/chat/completions'
# The JSON payload for the request
data = {
    "model": "/root/.cache/modelscope/hub/OpenGVLab/InternVL2-26B/",  # replace with your model name or ID
"messages": [{
"role": "user",
"content": [{
"type": "text",
"text": "描述一下这张图片的内容"
},
{
"type": "image_url",
"image_url": {
"url": ""
}
}]
}],
"temperature": 0.8,
"top_p": 0.8
}
# Request headers; the payload itself is serialized to JSON when the request is sent
headers = {'Content-Type': 'application/json'}
try:
    for i in range(1):
        # Send the POST request
        response = requests.post(url, data=json.dumps(data), headers=headers)
        # Check whether the request succeeded
        if response.status_code == 200:
            logging.info("Request succeeded. Response: %s", response.json())
            print(f"Request succeeded. Response:\n{i}\n{response.json()}")
        else:
            logging.error("Request failed with status code %s", response.status_code)
            print(f"Request failed with status code {response.status_code}")
except requests.exceptions.RequestException as e:
    logging.error("Request exception: %s", str(e))
    print(f"Request exception: {e}")
except Exception as e:
    logging.error("Unexpected error: %s", str(e))
    print(f"Unexpected error: {e}")
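The "url" field in the payload above is intentionally left empty. If you want to send a local image rather than a public URL, one option is to inline it as a base64 data URL, which the OpenAI-style image_url format supports. A hypothetical helper (image_to_data_url is my own name, not part of any library):
import base64

def image_to_data_url(path: str) -> str:
    # Read a local image and encode it as a data URL suitable for the "url" field.
    with open(path, 'rb') as f:
        encoded = base64.b64encode(f.read()).decode('utf-8')
    return f'data:image/jpeg;base64,{encoded}'

# For example: data['messages'][0]['content'][1]['image_url']['url'] = image_to_data_url('./1.png')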
Check GPU memory usage once:
nvidia-smi
Watch GPU memory usage continuously:
watch -n 0.5 nvidia-smi
As shown below:
Notice: be careful with the interactive endpoint (/v1/chat/interactive). Although the documentation explicitly states that interactive mode can be disabled, practice shows that even with interactive_mode = False, the GPU memory on the server keeps growing while the API is being called and eventually runs out, so use it with caution! The completions endpoints do not have this problem.