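Measuring object detection speed on each DPUCZDX8G architecture

The DPU (DPUCZDX8G) is available in several architectures.

I compared the execution speed of eight of them: B512, B800, B1024, B1152, B1600, B2304, B3136, and B4096.

The comparison measures the DPU execution time of YOLOX object detection on the KR260.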

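Most builds use a 150 MHz DPU clock, and some were also checked at 300 MHz.

In general, the larger the architecture, the shorter the inference time.

This post covers everything from building the models to the program used for the measurement.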

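Building each architecture and model

For how to build each architecture and model, see the articles below.

Running YOLOX object detection requires the ".bit", ".xclbin", and ".hwh" files plus the ".xmodel" file.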
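
Before loading a build on the board, it can help to confirm that every required file is actually present. The following is a minimal sketch, not part of the original notebooks: the helper name `missing_overlay_files` is hypothetical, and it assumes the three overlay files share one base name in the same directory while the model file name is passed in separately.

```python
from pathlib import Path

# Overlay file extensions named in the text above.
# Assumption: all three share the same base name (e.g. dpu.bit, dpu.hwh, dpu.xclbin).
OVERLAY_EXTS = (".bit", ".hwh", ".xclbin")


def missing_overlay_files(directory, base_name, model_name):
    """Return the names of required files that are absent from `directory`."""
    root = Path(directory)
    wanted = [root / f"{base_name}{ext}" for ext in OVERLAY_EXTS]
    wanted.append(root / model_name)  # compiled model file for this architecture
    return [path.name for path in wanted if not path.is_file()]


if __name__ == "__main__":
    # Hypothetical names; replace them with the files of the build under test.
    missing = missing_overlay_files(".", "dpu", "yolox_model_file")
    print("missing files:", missing or "none")
```

An empty list means the directory has everything the text lists. Since the model must be rebuilt whenever the DPU size changes, running this check per architecture directory is a cheap way to catch a mismatched set early.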

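Synthesizing the DPU IP and running it on PYNQ

I used Vivado and Vitis to create a project that synthesizes the DPU IP. On the KR260, the resulting DPU is used from PYNQ to run Vitis AI object detection (YOLOv3). On top of that, GPIO (PWM) is also driven on the KR260...

Object detection with YOLOX-nano + PyTorch on PYNQ

I ran object detection using the DPU on the KR260, with the lightweight YOLOX-nano + PyTorch model. The article goes as far as comparing the execution speed of YOLOv3 and YOLOX. Object detection with YOLOX-nano + PyTorch on PYNQ...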

The files built for B512, B800, B1024, B1152, B1600, B2304, B3136, and B4096 are available at the link below.

For B512, B3136, and B4096, I also built versions with the DPU clock raised from 150 MHz to 300 MHz.

https://github.com/iotengineer22/AMD-Pervasive-AI-Developer-Contest/tree/main/jupyter_notebooks/pynq-benchmark

The DPU configuration (dpu_conf.vh) is basically left at its defaults.

Only the DPU size setting is changed for each architecture.

/*====== Architecture Options ======*/
// |------------------------------------------------------|
// | Support 8 DPU size
// | It relates to model. if change, must update model
// +------------------------------------------------------+
// | `define B512               
// +------------------------------------------------------+
// | `define B800                 
// +------------------------------------------------------+
// | `define B1024                 
// +------------------------------------------------------+
// | `define B1152                 
// +------------------------------------------------------+
// | `define B1600                 
// +------------------------------------------------------+
// | `define B2304                 
// +------------------------------------------------------+
// | `define B3136                 
// +------------------------------------------------------+
// | `define B4096                 
// |------------------------------------------------------|

`define B3136
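
Since only the active `define` line differs between builds, the change can be scripted. The sketch below is an illustration rather than the procedure used in the article: `set_dpu_size` is a hypothetical helper that takes the text of dpu_conf.vh and swaps the single uncommented size define, leaving the commented option table untouched.

```python
import re

# Sizes listed in the dpu_conf.vh comment block shown above.
DPU_SIZES = ("B512", "B800", "B1024", "B1152", "B1600", "B2304", "B3136", "B4096")

# Matches an uncommented line such as "`define B3136".
# Lines in the option table start with "//" and are therefore skipped.
ACTIVE_DEFINE = re.compile(r"^`define[ \t]+B\d+[ \t]*$", re.MULTILINE)


def set_dpu_size(conf_text, size):
    """Return conf_text with the active DPU size define replaced by `size`."""
    if size not in DPU_SIZES:
        raise ValueError(f"unsupported DPU size: {size}")
    new_text, count = ACTIVE_DEFINE.subn(f"`define {size}", conf_text)
    if count != 1:
        raise ValueError(f"expected exactly one active size define, found {count}")
    return new_text


if __name__ == "__main__":
    sample = "// | `define B4096\n\n`define B3136\n"
    print(set_dpu_size(sample, "B1600"))
```

As the header comment notes, the size relates to the model, so the model has to be recompiled for the new size after this change.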

As the DPU size grew, BRAM alone no longer provided enough capacity.

For that reason, URAM is enabled from B2304 upward.

// |------------------------------------------------------|
// | If the FPGA has Uram. You can define URAM_EN parameter               
// | if change, Don't need update model
// +------------------------------------------------------+
// | for zcu104 : `define URAM_ENABLE               
// +------------------------------------------------------+
// | for zcu102 : `define URAM_DISABLE                 
// |------------------------------------------------------|

//`define URAM_DISABLE 
`define URAM_ENABLE

B4096_300MHz could not be built with the default parameters.

It uses the parameters from the article below.

Synthesizing the DPU IP and running it on PYNQ

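YOLOX results for each architecture

I measured the DPU inference time when running YOLOX object detection on a single photo.

As shown below, the measured time excludes pre-processing and post-processing.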

    # Fetch data to DPU and trigger it
    dpu_start = time.time()
    job_id = dpu.execute_async(input_data, output_data)
    dpu.wait(job_id)
    dpu_end = time.time()
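
The snippet above times a single run. As a variation (not the measurement used for the results in this post), the sketch below repeats the call and summarizes the spread, which makes small differences between architectures easier to trust. `time_inference` and `summarize` are hypothetical helpers, and `time.perf_counter()` is swapped in for `time.time()` because it is a monotonic clock intended for short intervals.

```python
import statistics
import time


def time_inference(run_once, repeats=10, warmup=1):
    """Call run_once() repeatedly and return the per-call durations in seconds."""
    for _ in range(warmup):
        run_once()  # warm-up calls are executed but not timed
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return mean, min, and max of the measured durations."""
    return {
        "mean": statistics.mean(durations),
        "min": min(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a DPU.
    stats = summarize(time_inference(lambda: sum(range(10000))))
    print({name: f"{value * 1000:.3f} ms" for name, value in stats.items()})
```

On the board, the workload would be the same two calls as in the snippet above, for example `time_inference(lambda: dpu.wait(dpu.execute_async(input_data, output_data)))`.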

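The main takeaways are, unsurprisingly, as follows.

A larger architecture is generally faster. (B1152 was the one exception in these runs, slightly slower than B1024.)
A faster DPU clock is also generally faster.

Looking at the details:

As the size grows toward B3136 and B4096, the differences in execution time get smaller.
For B512, raising the clock from 150 MHz to 300 MHz shortened the DPU execution time from 34.8 ms to 22.4 ms. (For B3136 and B4096 the effect was smaller than for B512: 19.5 ms to 14.7 ms and 17.0 ms to 13.7 ms.)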
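
To put numbers on these observations, the sketch below computes speedup ratios from the DPU execution times printed in the logs of this post. The dictionary and function names are illustrative, and each value comes from a single run, so the ratios are rough indications only.

```python
# DPU execution times in milliseconds, copied from the logs in this post.
DPU_MS_150 = {
    "B512": 34.8, "B800": 30.0, "B1024": 25.5, "B1152": 27.6,
    "B1600": 22.9, "B2304": 20.7, "B3136": 19.5, "B4096": 17.0,
}
DPU_MS_300 = {"B512": 22.4, "B3136": 14.7, "B4096": 13.7}


def size_speedup(arch):
    """Speedup over B512 at 150 MHz (values above 1.0 mean faster than B512)."""
    return DPU_MS_150["B512"] / DPU_MS_150[arch]


def clock_speedup(arch):
    """Speedup from raising the DPU clock from 150 MHz to 300 MHz."""
    return DPU_MS_150[arch] / DPU_MS_300[arch]


if __name__ == "__main__":
    for arch in DPU_MS_150:
        print(f"{arch:>6} @150MHz: {size_speedup(arch):.2f}x vs B512")
    for arch in DPU_MS_300:
        print(f"{arch:>6} 150->300MHz: {clock_speedup(arch):.2f}x")
```

Doubling the clock gives well under a 2x gain in every case, and the gain shrinks as the architecture grows, which matches the observation above that the larger sizes benefit less.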

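The speed results for each architecture on the KR260 are pasted below, together with the resource utilization.

Details such as DPU resource usage are also given in the official documentation below.

AMD Technical Information Portal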

B512_150MHz

Results for B512_150MHz.

Details of detected objects: [49. 60.]
Pre-processing time: 0.0078 seconds
DPU execution time: 0.0348 seconds
Post-process time: 0.0291 seconds
Total run time: 0.0717 seconds
Performance: 13.942717336382735 FPS

(array([[ 458.11553955,  125.8078537 ,  821.88452148,  489.57681274],
        [  40.24644089,    0.        , 1239.75366211,  720.        ]]),
 array([0.56179011, 0.11786249]),
 array([49., 60.]))
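
The log lines relate to each other in a simple way, which the sketch below checks using the B512_150MHz values above. It assumes, based only on the printed numbers, that the total is the sum of the three stages and that the reported performance is the reciprocal of the total run time; `parse_times` is a hypothetical helper.

```python
import re

# Timing lines copied from the B512_150MHz log above.
LOG = """\
Pre-processing time: 0.0078 seconds
DPU execution time: 0.0348 seconds
Post-process time: 0.0291 seconds
Total run time: 0.0717 seconds
"""

# Matches lines of the form "<label>: <seconds> seconds".
TIME_LINE = re.compile(r"^(?P<label>[^:\n]+): (?P<sec>\d+\.\d+) seconds$", re.MULTILINE)


def parse_times(log):
    """Map each timing label in the log to its value in seconds."""
    return {m["label"]: float(m["sec"]) for m in TIME_LINE.finditer(log)}


times = parse_times(LOG)
stage_sum = (
    times["Pre-processing time"]
    + times["DPU execution time"]
    + times["Post-process time"]
)
fps = 1.0 / times["Total run time"]
dpu_share = times["DPU execution time"] / times["Total run time"]

print(f"sum of stages: {stage_sum:.4f} s")
print(f"FPS from total: {fps:.2f}")
print(f"DPU share of total: {dpu_share:.0%}")
```

The recomputed FPS differs slightly from the logged value because the log prints the total rounded to four decimals. Pre-processing and post-processing stay at roughly 37 ms in every run in this post, so they alone would cap the end-to-end rate at about 27 FPS no matter how fast the DPU becomes.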

B800_150MHz

Results for B800_150MHz.

Details of detected objects: [49. 60.]
Pre-processing time: 0.0079 seconds
DPU execution time: 0.0300 seconds
Post-process time: 0.0288 seconds
Total run time: 0.0667 seconds
Performance: 14.988864588247067 FPS

(array([[ 458.11553955,  125.8078537 ,  821.88452148,  489.57681274],
        [  40.24644089,    0.        , 1239.75366211,  720.        ]]),
 array([0.56179011, 0.11786249]),
 array([49., 60.]))

B1024_150MHz

Results for B1024_150MHz.

Details of detected objects: [49. 60.]
Pre-processing time: 0.0077 seconds
DPU execution time: 0.0255 seconds
Post-process time: 0.0289 seconds
Total run time: 0.0620 seconds
Performance: 16.12263694022679 FPS

(array([[ 458.11553955,  125.8078537 ,  821.88452148,  489.57681274],
        [  40.24644089,    0.        , 1239.75366211,  720.        ]]),
 array([0.56179011, 0.11786249]),
 array([49., 60.]))

B1152_150MHz

Results for B1152_150MHz.

Details of detected objects: [49. 60.]
Pre-processing time: 0.0077 seconds
DPU execution time: 0.0276 seconds
Post-process time: 0.0291 seconds
Total run time: 0.0644 seconds
Performance: 15.529397825893783 FPS

(array([[ 458.11553955,  125.8078537 ,  821.88452148,  489.57681274],
        [  40.24644089,    0.        , 1239.75366211,  720.        ]]),
 array([0.56179011, 0.11786249]),
 array([49., 60.]))

B1600_150MHz

Results for B1600_150MHz.

With the default DPU parameters, BRAM capacity reached its limit at this size.

Details of detected objects: [49. 60.]
Pre-processing time: 0.0078 seconds
DPU execution time: 0.0229 seconds
Post-process time: 0.0290 seconds
Total run time: 0.0597 seconds
Performance: 16.762933980248828 FPS

(array([[ 458.11553955,  125.8078537 ,  821.88452148,  489.57681274],
        [  40.24644089,    0.        , 1239.75366211,  720.        ]]),
 array([0.56179011, 0.11786249]),
 array([49., 60.]))

B2304_150MHz

Results for B2304_150MHz.

From B2304 onward, URAM is enabled in the DPU configuration.

Details of detected objects: [49. 60.]
Pre-processing time: 0.0079 seconds
DPU execution time: 0.0207 seconds
Post-process time: 0.0290 seconds
Total run time: 0.0575 seconds
Performance: 17.37865654573479 FPS

(array([[ 458.11553955,  125.8078537 ,  821.88452148,  489.57681274],
        [  40.24644089,    0.        , 1239.75366211,  720.        ]]),
 array([0.56179011, 0.11786249]),
 array([49., 60.]))

B3136_150MHz

Results for B3136_150MHz.

Details of detected objects: [49. 60.]
Pre-processing time: 0.0080 seconds
DPU execution time: 0.0195 seconds
Post-process time: 0.0307 seconds
Total run time: 0.0583 seconds
Performance: 17.16086428188584 FPS

(array([[ 458.11553955,  125.8078537 ,  821.88452148,  489.57681274],
        [  40.24644089,    0.        , 1239.75366211,  720.        ]]),
 array([0.56179011, 0.11786249]),
 array([49., 60.]))

B4096_150MHz

Results for B4096_150MHz.

Details of detected objects: [49. 60.]
Pre-processing time: 0.0080 seconds
DPU execution time: 0.0170 seconds
Post-process time: 0.0292 seconds
Total run time: 0.0542 seconds
Performance: 18.46564028510925 FPS

(array([[ 458.11553955,  125.8078537 ,  821.88452148,  489.57681274],
        [  40.24644089,    0.        , 1239.75366211,  720.        ]]),
 array([0.56179011, 0.11786249]),
 array([49., 60.]))

B512_300MHz

Results for B512_300MHz.

Only the clock was raised, so resource usage is largely unchanged from B512_150MHz.

Details of detected objects: [49. 60.]
Pre-processing time: 0.0081 seconds
DPU execution time: 0.0224 seconds
Post-process time: 0.0290 seconds
Total run time: 0.0596 seconds
Performance: 16.7869524324108 FPS

(array([[ 458.11553955,  125.8078537 ,  821.88452148,  489.57681274],
        [  40.24644089,    0.        , 1239.75366211,  720.        ]]),
 array([0.56179011, 0.11786249]),
 array([49., 60.]))

B3136_300MHz

Results for B3136_300MHz.

Details of detected objects: [49. 60.]
Pre-processing time: 0.0077 seconds
DPU execution time: 0.0147 seconds
Post-process time: 0.0290 seconds
Total run time: 0.0514 seconds
Performance: 19.447244941486304 FPS

(array([[ 458.11553955,  125.8078537 ,  821.88452148,  489.57681274],
        [  40.24644089,    0.        , 1239.75366211,  720.        ]]),
 array([0.56179011, 0.11786249]),
 array([49., 60.]))

B4096_300MHz

Results for B4096_300MHz.

Details of detected objects: [49. 60.]
Pre-processing time: 0.0079 seconds
DPU execution time: 0.0137 seconds
Post-process time: 0.0291 seconds
Total run time: 0.0507 seconds
Performance: 19.72305087933791 FPS

(array([[ 458.11553955,  125.8078537 ,  821.88452148,  489.57681274],
        [  40.24644089,    0.        , 1239.75366211,  720.        ]]),
 array([0.56179011, 0.11786249]),
 array([49., 60.]))