ONNX Inference Acceleration: Technical Notes


0. Preface

Taking advantage of the Dragon Boat Festival holiday, I am organizing some notes I wrote down earlier. As always: the palest ink beats the best memory, and writing articles is output and input at once~

1. Model File Conversion

1.1 Converting .pth files to ONNX

PyTorch integrates an onnx module into the framework, so export is officially supported, and ONNX covers most of PyTorch's operators. Converting a .pth model file to an .onnx file is therefore very simple; a code example follows. Note that before converting, you must freeze the input size of the .pth model, for example:

batch_size = 1
dummy_input = torch.randn(batch_size, 3, 240, 320).to(device)

Once the input is frozen, the model has a fixed batch_size, and when you run inference with the converted .onnx file the input batch_size must match the one used at freezing time. For this example, you can only run inference with batch_size=1. If you need a different batch_size at inference time, say 10, you have to modify the frozen input node before saving the ONNX model, as follows:

batch_size = 10
dummy_input = torch.randn(batch_size, 3, 240, 320).to(device)

With that, you have an ONNX model with batch_size=10. Exporting the .onnx file only takes a call to torch.onnx.export():

model_name = model_path.split("/")[-1].split(".")[0]
model_path = f"inference/ulfd/onnx/{model_name}-batch-{batch_size}.onnx"

dummy_input = torch.randn(batch_size, 3, 240, 320).to(device)
# dummy_input = torch.randn(1, 3, 480, 640).to("cuda")  # if input size is 640*480
torch.onnx.export(net, dummy_input, model_path,
                  verbose=False, input_names=['input'],
                  output_names=['scores', 'boxes'])
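
If re-exporting for every batch size is inconvenient, note that torch.onnx.export() also accepts a dynamic_axes argument that marks the batch dimension as variable, so a single ONNX file can serve any batch size. A minimal sketch (the axis label 'batch_size' is just an illustrative name):

# sketch: export once with a dynamic batch dimension
dummy_input = torch.randn(1, 3, 240, 320).to(device)
torch.onnx.export(net, dummy_input, model_path,
                  verbose=False, input_names=['input'],
                  output_names=['scores', 'boxes'],
                  dynamic_axes={'input': {0: 'batch_size'},
                                'scores': {0: 'batch_size'},
                                'boxes': {0: 'batch_size'}})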

The full conversion script:

# -*- coding: utf-8 -*-
"""
This code is used to convert the pytorch model into an onnx format model.
"""
import argparse
import sys
import torch.onnx
from models.ulfd.lib.ssd.config.fd_config import define_img_size

input_img_size = 320  # define input size; options: 128/160/320/480/640/1280
define_img_size(input_img_size)
from models.ulfd.lib.ssd.mb_tiny_RFB_fd import create_Mb_Tiny_RFB_fd
from models.ulfd.lib.ssd.mb_tiny_fd import create_mb_tiny_fd


def get_args():
    parser = argparse.ArgumentParser(description='convert model to onnx')
    parser.add_argument("--net", dest='net_type', default="RFB",
                        type=str, help='net type.')
    parser.add_argument('--batch', dest='batch_size', default=1,
                        type=int, help='batch size for input.')
    args_ = parser.parse_args()

    return args_


if __name__ == '__main__':

    # net_type = "slim"  # faster inference, lower precision
    args = get_args()

    net_type = args.net_type  # "RFB": slower inference, higher precision
    batch_size = args.batch_size

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    label_path = "models/ulfd/voc-model-labels.txt"
    class_names = [name.strip() for name in open(label_path).readlines()]
    num_classes = len(class_names)

    if net_type == 'slim':
        model_path = "baseline/ulfd/version-slim-320.pth"
        # model_path = "models/pretrained/version-slim-640.pth"
        net = create_mb_tiny_fd(len(class_names), is_test=True, device=device)
    elif net_type == 'RFB':
        model_path = "baseline/ulfd/version-RFB-320.pth"
        # model_path = "models/pretrained/version-RFB-640.pth"
        net = create_Mb_Tiny_RFB_fd(len(class_names), is_test=True, device=device)

    else:
        print("unsupport network type.")
        sys.exit(1)
    net.load(model_path)
    net.eval()
    net.to(device)

    model_name = model_path.split("/")[-1].split(".")[0]
    model_path = f"inference/ulfd/onnx/{model_name}-batch-{batch_size}.onnx"

    dummy_input = torch.randn(batch_size, 3, 240, 320).to(device)
    # dummy_input = torch.randn(1, 3, 480, 640).to("cuda") #if input size is 640*480
    torch.onnx.export(net, dummy_input, model_path,
                      verbose=False, input_names=['input'],
                      output_names=['scores', 'boxes'])
    print('onnx model saved ', model_path)

    """
    PYTHONPATH=. python3 inference/ulfd/pth_to_onnx.py --net RFB --batch 16

    PYTHONPATH=. python inference/ulfd/pth_to_onnx.py --net RFB --batch 3
    """

1.2 Converting .pb files to ONNX

A .pb file can be converted to ONNX with the tf2onnx library, though it must be said that TensorFlow does not officially support ONNX: tf2onnx is a third-party library. It converts a TensorFlow .pb file into an ONNX-format file. Install tf2onnx (reference: tensorrt-cubelab-docs, tf2onnx installation):

pip install tf2onnx

Conversion command:

python -m tf2onnx.convert \
    --input ./checkpoints/new_model.pb \
    --inputs intent_network/inputs:0,intent_network/seq_len:0 \
    --outputs logits:0 \
    --output ./pb_models/model.onnx \
    --fold_const

# saved_model
# Save the graph as a saved_model first (shown here with TF-TRT's converter):
from tensorflow.python.compiler.tensorrt import trt_convert as trt
converter = trt.TrtGraphConverter(input_saved_model_dir=input_saved_model_dir)
converter.convert()
converter.save(output_saved_model_dir)

python -m tf2onnx.convert --saved_model saved_model_dir --output model.onnx

# .pb file
python -m tf2onnx.convert --input frozen_graph.pb  --inputs X:0,X1:0 --outputs output:0 --output model.onnx --fold_const

# .ckpt file
python -m tf2onnx.convert --checkpoint checkpoint.meta  --inputs X:0 --outputs output:0 --output model.onnx --fold_const
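
Note that the .pb route assumes a frozen graph, i.e. the variables have already been folded into constants. If you only have a checkpoint, one way to produce frozen_graph.pb is the TF1-style freezing API; a sketch, where the checkpoint files and the output node name 'output' are illustrative:

import tensorflow as tf

with tf.Session() as sess:
    saver = tf.train.import_meta_graph('checkpoint.meta')
    saver.restore(sess, tf.train.latest_checkpoint('./'))
    # fold variables into constants so the graph is self-contained
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=['output'])
    with tf.gfile.GFile('frozen_graph.pb', 'wb') as f:
        f.write(frozen.SerializeToString())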

1.3 Converting ONNX to .pb

Sometimes we need to convert a model across frameworks. For example, you trained a model in PyTorch but need to integrate it into TensorFlow to stay consistent with other models and simplify deployment. In that case you can convert the .pth to ONNX and then convert the ONNX to a .pb file; if the conversion succeeds, you can run inference on the .pb file in TensorFlow. The "if" matters because TensorFlow does not officially support ONNX, and operator incompatibilities may break the converted .pb at TF inference time. Converting ONNX to .pb can be done with the onnx-tf library. Install it:

pip install onnx-tf

The full conversion script:

# -*- coding: utf-8 -*-
"""
@File  : onnx_to_pb.py
@Author: qiuyanjun
@Date  : 2020-01-10 19:22
@Desc  : 
"""
import cv2
import numpy as np
import onnx
import tensorflow as tf
from onnx_tf.backend import prepare
import onnx_tf


model = onnx.load('models/onnx/version-RFB-320.onnx')
tf_rep = prepare(model)

img = cv2.imread('imgs/1.jpg')
image = cv2.resize(img, (320, 240))
# test that inference works
image_mean = np.array([127, 127, 127])
image = (image - image_mean) / 128
image = np.transpose(image, [2, 0, 1])
image = np.expand_dims(image, axis=0)
image = image.astype(np.float32)
output = tf_rep.run(image)

print("output mat: \n", output)
print("output type ", type(output))

# create a Session and fetch the input/output node info
with tf.Session() as persisted_sess:
    print("load graph")
    persisted_sess.graph.as_default()
    tf.import_graph_def(tf_rep.graph.as_graph_def(), name='')
    inp = persisted_sess.graph.get_tensor_by_name(
        tf_rep.tensor_dict[tf_rep.inputs[0]].name
    )
    print('input_name: ', tf_rep.tensor_dict[tf_rep.inputs[0]].name)
    print('input_names: ', tf_rep.inputs)
    out = persisted_sess.graph.get_tensor_by_name(
        tf_rep.tensor_dict[tf_rep.outputs[0]].name
    )
    print('output_name_0: ', tf_rep.tensor_dict[tf_rep.outputs[0]].name)
    print('output_name_1: ', tf_rep.tensor_dict[tf_rep.outputs[1]].name)
    print('output_names: ', tf_rep.outputs)
    res = persisted_sess.run(out, {inp: image})
    print("result is ", res)

# export to a .pb file
tf_rep.export_graph('version-RFB-320.pb')
print('onnx to pb done.')

"""cmd
PYTHONPATH=. python3 onnx_to_pb.py
"""

2. Running Inference Directly with ONNX

An .onnx file can be used for inference directly, at which point the code is framework-agnostic and decoupled from the training stage. But for inference to proceed you still need to choose a backend for ONNX; take TensorFlow as an example.

import onnx
import tensorflow as tf
from onnx_tf.backend import prepare
import onnx_tf

...
# wrap a TF backend
predictor = onnx.load(onnx_path)
onnx.checker.check_model(predictor)
onnx.helper.printable_graph(predictor.graph)
tf_rep = prepare(predictor, device="CUDA:0")  # default CPU

# run prediction with TF
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)  # default 0.5
tfconfig = tf.ConfigProto(allow_soft_placement=True, gpu_options=gpu_options)

... 
with tf.Session(config=tfconfig) as persisted_sess:
    persisted_sess.graph.as_default()
    tf.import_graph_def(tf_rep.graph.as_graph_def(), name='')
    tf_input = persisted_sess.graph.get_tensor_by_name(
        tf_rep.tensor_dict[tf_rep.inputs[0]].name
    )
    tf_scores = persisted_sess.graph.get_tensor_by_name(
        tf_rep.tensor_dict[tf_rep.outputs[0]].name
    )
    tf_boxes = persisted_sess.graph.get_tensor_by_name(
        tf_rep.tensor_dict[tf_rep.outputs[1]].name
    )

    for file_path in listdir:
        ...
        confidences, boxes = persisted_sess.run([tf_scores, tf_boxes], {tf_input: image})
        ...

3. Accelerating Inference with onnxruntime

In fact, we can use ONNX even more efficiently. onnxruntime is a library that provides accelerated inference for ONNX models, supporting both CPU and GPU: the GPU-accelerated build is onnxruntime-gpu, and the default package is the CPU build. Installation:

pip install onnxruntime      # CPU
pip install onnxruntime-gpu  # GPU
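
The two packages install the same onnxruntime module, so only one of them should be present in a given environment. To confirm which build is active and which execution providers it exposes, a quick check:

import onnxruntime as ort

print(ort.get_device())                # 'CPU' or 'GPU'
print(ort.get_available_providers())   # e.g. ['CPUExecutionProvider']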

Accelerating an ONNX model with onnxruntime is very simple and takes only a few lines of code. Here is an example:

from abc import ABCMeta

import numpy as np
import onnxruntime as ort


class NLFDOnnxCpuInferBase:
    """CPU-only inference, accelerated with onnxruntime."""

    __metaclass__ = ABCMeta
    ...

    def __init__(self,
                 onnx_path=ONNX_PATH):
        """PyTorch and ONNX work well together.
        :param onnx_path: path to the .onnx file
        """
        self._onnx_path = onnx_path
        # initialize an ort session from the onnx model
        self._ort_session = ort.InferenceSession(self._onnx_path)
        self._input_img = self._ort_session.get_inputs()[0].name
    ...

    # run inference via session.run
    def _detect_img_utils(self, img: np.ndarray):
        """batch is ok."""
        feed_dict = {self._input_img: img}
        scores_before_nms, rois_before_nms = \
            self._ort_session.run(None, input_feed=feed_dict)
        return rois_before_nms, scores_before_nms

onnxruntime automatically checks the ONNX graph for unused nodes and removes them, and it also applies graph optimizations from several acceleration libraries to speed up inference. Some logs:

 python3 inference/ulfd/onnx_cpu_infer.py
2020-01-16 12:03:49.259044 [W:onnxruntime:, graph.cc:2412 CleanUnusedInitializers] Removing initializer 'base_net.9.4.num_batches_tracked'. It is not used by any node and should be removed from the model.
2020-01-16 12:03:49.259478 [W:onnxruntime:, graph.cc:2412 CleanUnusedInitializers] Removing initializer 'base_net.9.1.num_batches_tracked'. It is not used by any node and should be removed from the model.
2020-01-16 12:03:49.259492 [W:onnxruntime:, graph.cc:2412 CleanUnusedInitializers] Removing initializer 'base_net.8.4.num_batches_tracked'. It is not used by any node and should be removed from the model.
2020-01-16 12:03:49.259501 [W:onnxruntime:, graph.cc:2412 CleanUnusedInitializers] Removing initializer 'base_net.8.1.num_batches_tracked'. It is not used by any node and should be removed from the model.
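
The graph-level cleanup and optimization behind these logs can also be controlled explicitly through ort.SessionOptions. A minimal sketch, where onnx_path is whatever model path you are loading:

import onnxruntime as ort

sess_options = ort.SessionOptions()
# enable the full set of graph optimizations
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# optionally cap the number of intra-op threads
sess_options.intra_op_num_threads = 4
session = ort.InferenceSession(onnx_path, sess_options)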

4. Experimental Results

On the ULFD face detection model, without onnxruntime acceleration, inference on a 320×240 image takes 50~60 ms on CPU; with onnxruntime acceleration, it takes 8~11 ms on CPU.

5. Please Use NumPy Elegantly

Normalization comes up all the time in image processing, and it is still needed at inference, where performance matters. I recently found that different ways of writing the same NumPy tensor computation can have a big impact on performance. If the mean you subtract is the same for every channel, say 127:

# The common approach (please do not do this):
image_mean = np.array([127, 127, 127])
image = (image - image_mean) / 128
# The broadcast against the integer mean array actually costs extra time.
# Instead, keep the dtypes consistent and subtract a constant (much faster):
image = (image - 127.) / 128.
  • Test code (identical means)
# coding: utf-8
import cv2
import time
import numpy as np


if __name__ == '__main__':
    test_w, test_h = 500, 500

    test_path = 'logs/test0.jpg'
    test_img = cv2.imread(test_path)
    resize_img = cv2.resize(test_img, (test_w, test_h))


    test_count = 1000
    print('width: {0}, height: {1}, test_count: {2}'.format(test_w, test_h, test_count))

    t1 = time.time()
    image_mean = np.array([127, 127, 127])
    for _ in range(test_count):
        image = (resize_img - image_mean) / 128
    t2 = time.time()
    print('total_time_ugly: {0}s, mean_time_ugly: {1}ms'.format(
        (t2-t1), (t2-t1)*1000/test_count
    ))
    t3 = time.time()
    for _ in range(test_count):
        image = (resize_img - 127.) / 128.
    t4 = time.time()
    print('total_time_elegant: {0}s, mean_time_elegant: {1}ms'.format(
        (t4 - t3), (t4 - t3) * 1000 / test_count
    ))

Results:

But what about when you really do need a different mean per channel? Please still do it this way; below is another test, with a vectorized alternative sketched after the results.

  • Test code (different means)
# coding: utf-8
import cv2
import time
import numpy as np


if __name__ == '__main__':
    test_w, test_h = 100, 100

    test_path = 'logs/test0.jpg'
    test_img = cv2.imread(test_path)
    resize_img = cv2.resize(test_img, (test_w, test_h))


    test_count = 100
    print('width: {0}, height: {1}, test_count: {2}'.format(test_w, test_h, test_count))
    print('-'*100)
    t1 = time.time()
    image_mean = np.array([127, 120, 107])
    for _ in range(test_count):
        image = (resize_img - image_mean) / 128
    t2 = time.time()
    print('total_time_ugly: {0}s, mean_time_ugly: {1}ms'.format(
        (t2 - t1), (t2 - t1) * 1000 / test_count
    ))
    t3 = time.time()
    # use a float32 buffer so the per-channel results are not truncated to uint8
    image = np.zeros(resize_img.shape, dtype=np.float32)
    for _ in range(test_count):
        image[:, :, 0] = (resize_img[:, :, 0] - 127.) / 128.
        image[:, :, 1] = (resize_img[:, :, 1] - 120.) / 128.
        image[:, :, 2] = (resize_img[:, :, 2] - 107.) / 128.
    t4 = time.time()
    print('total_time_elegant: {0}s, mean_time_elegant: {1}ms'.format(
        (t4 - t3), (t4 - t3) * 1000 / test_count
    ))

Results:
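
Incidentally, the per-channel loop can also be written as a single vectorized expression; as long as the dtypes stay consistent (float32 image, float32 means), you avoid the expensive type-promoting broadcast. A sketch worth benchmarking on your own data:

# keep everything float32 so NumPy does not promote to a wider dtype
image_mean = np.array([127., 120., 107.], dtype=np.float32)
image = (resize_img.astype(np.float32) - image_mean) / 128.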

Simply put, if you are willing to change a few lines of code, you can gain 5 ms~15 ms of performance. That is far simpler than bringing in acceleration tools like TensorRT or ONNX.

6. Other References

TensorFlow+TensorRT C++
