
news 2025/7/5 13:35:58


Contents

Abstract
Installing the base environment
- Creating a virtual environment
- Installing PyTorch
- Installing openmim, mmengine, and mmcv
- Installing MMDetection
- Verifying the installation
- Configuring the OV-DINO environment
MM-Grounding-DINO in MMDetection: a detailed introduction
- Test results
  - Zero-Shot COCO results and models
  - Zero-Shot LVIS Results
  - Zero-Shot ODinW (Object Detection in the Wild) results
    - ODinW13 results and models
    - ODinW35 results and models
  - Zero-shot referring expression comprehension results
  - Zero-shot Description Detection Dataset (DOD)
  - Pretrain Flickr30k Results
  - Validating the generalization of the pretrained model through fine-tuning
    - RTTS
    - RUOD
    - Brain Tumor
    - Cityscapes
    - People in Painting
    - COCO
    - LVIS 1.0
    - RefEXP
      - RefCOCO
      - RefCOCO+
      - RefCOCOg
      - gRefCOCO
MM-GDINO-T pretraining data preparation and processing
- Datasets used
  - 1 Objects365 v1
  - 2 COCO 2017
  - 3 GoldG
  - 4 GRIT-20M
  - 5 V3Det
  - 6 Data splitting and visualization
MM-GDINO-L pretraining data preparation and processing
- Datasets used
  - 1 Objects365 v2
  - 2 OpenImages v6
  - 3 V3Det
  - 4 LVIS 1.0
  - 5 COCO2017 OD
  - 6 GoldG
  - 7 COCO2014 VG
  - 8 Referring Expression Comprehension
  - 9 GRIT-20M
- Evaluation dataset preparation
  - 1 COCO 2017
  - 2 LVIS 1.0
  - 3 ODinW
  - 4 DOD
  - 5 Flickr30k Entities
  - 6 Referring Expression Comprehension
- Fine-tuning dataset preparation
  - 1 COCO 2017
  - 2 LVIS 1.0
  - 3 RTTS
  - 4 RUOD
  - 5 Brain Tumor
  - 6 Cityscapes
  - 7 People in Painting
  - 8 Referring Expression Comprehension
Inference and fine-tuning
- Downloading the MM Grounding DINO-T model weights
- Inference
- Evaluation
- Visualizing results on the evaluation datasets
- Model training
  - Custom format for pretraining
- Fine-tuning on a custom dataset: a worked example
  - 1 Data preparation
  - 2 Config preparation
  - 3 Visualization and zero-shot evaluation
  - 4 Model training
- Self-training pipeline: iterative pseudo-label generation and refinement
  - 1 Object detection format
  - 2 Phrase Grounding format

Abstract

Base environment: Ubuntu 22.04, CUDA 11.7

Installing the base environment

Creating a virtual environment

conda create --name openmm python=3.9

Enter y when prompted.

Installing PyTorch

conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.7 -c pytorch -c nvidia

Installing openmim, mmengine, and mmcv

pip install -U openmim
mim install mmengine
mim install "mmcv==2.0.0rc4"

Do not use >= here: it installs the latest mmcv by default, which is incompatible. Pin the minimum required version with == instead.
Note: in MMCV v2.x, mmcv-full was renamed to mmcv. If you want the lite version without CUDA operators, install it with mim install "mmcv-lite>=2.0.0rc1".

Compiling mmcv during installation takes a long time. If you would rather not compile it, use a prebuilt package; see:
https://mmcv.readthedocs.io/en/latest/get_started/installation.html
Choose the package that matches your local environment.

Installing MMDetection

Download MMDetection from https://github.com/open-mmlab/mmdetection, extract it, and enter the root directory. Alternatively, fetch the source directly with git:

git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection
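pip install -v -e .
# "-v" means verbose, i.e. more output
# "-e" installs the project in editable mode, so any local changes to the code take effect without reinstalling.

Alternatively, to use mmdet as a dependency or third-party Python package, install it with MIM: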

mim install mmdet
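
Because the pinned versions matter here, it can help to confirm what actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper names installed_version and check_pins are hypothetical, and the pins are copied from the install commands above.

```python
from importlib import metadata

# Pins copied from the install commands in this guide.
PINNED = {"torch": "2.0.0", "mmcv": "2.0.0rc4"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins):
    """Map each package name to (installed, expected, matches)."""
    report = {}
    for name, expected in pins.items():
        found = installed_version(name)
        # Some builds append a local tag such as "+cu117"; compare the base version only.
        base = found.split("+")[0] if found else None
        report[name] = (found, expected, base == expected)
    return report


if __name__ == "__main__":
    for name, (found, expected, ok) in check_pins(PINNED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={found} expected={expected} {status}")
```

A MISMATCH line for mmcv usually means a newer release was pulled in, which is the incompatibility the note about == warns against.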

Verifying the installation

To verify that MMDetection is installed correctly, we provide some sample code that runs model inference.

Step 1. Download the config file and the model weights.

mim download mmdet --config rtmdet_tiny_8xb32-300e_coco --dest .

The download takes a few seconds or longer, depending on your network. When it finishes, the current folder contains two files: rtmdet_tiny_8xb32-300e_coco.py and rtmdet_tiny_8xb32-300e_coco_20220902_112414-78e30dcc.pth.

Step 2. Verify inference.

If you installed MMDetection from source, run the following command directly:

python demo/image_demo.py demo/demo.jpg rtmdet_tiny_8xb32-300e_coco.py --weights rtmdet_tiny_8xb32-300e_coco_20220902_112414-78e30dcc.pth --device cpu

A new image demo.jpg appears in the outputs/vis folder under the current directory, with the predicted bounding boxes drawn on it.

If you installed MMDetection with MIM, open a Python interpreter and paste the following code:

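import mmcv
from mmdet.apis import init_detector, inference_detector
from mmdet.registry import VISUALIZERS

# Config and checkpoint files downloaded in step 1
config_file = 'rtmdet_tiny_8xb32-300e_coco.py'
checkpoint_file = 'rtmdet_tiny_8xb32-300e_coco_20220902_112414-78e30dcc.pth'

# Build the model from the config and checkpoint files
model = init_detector(config_file, checkpoint_file, device='cuda:0')

# MMDetection 3.x no longer has model.show_result; results are drawn with a visualizer
visualizer = VISUALIZERS.build(model.cfg.visualizer)
visualizer.dataset_meta = model.dataset_meta

# Run inference on a single image (read once, then reused for drawing)
img = mmcv.imread('test.jpg')
result = inference_detector(model, img)
print(result.pred_instances)

# Save the visualization to a file (the visualizer expects RGB input)
visualizer.add_datasample('result', mmcv.imconvert(img, 'bgr', 'rgb'), data_sample=result, draw_gt=False, show=False, out_file='result.jpg')

# Run inference on a video and display each frame
video = mmcv.VideoReader('video.mp4')
for frame in video:
    result = inference_detector(model, frame)
    visualizer.add_datasample('video', mmcv.imconvert(frame, 'bgr', 'rgb'), data_sample=result, draw_gt=False, show=True, wait_time=1)

inference_detector returns a DetDataSample (or a list of them when given a list of images). The predictions are in pred_instances, which holds the bounding boxes, labels, and scores.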

Configuring the OV-DINO environment

Install the libraries that OV-DINO needs, as follows.
Start with numpy. The default install is a 2.x release, which does not work, so switch to a 1.x release:

pip install numpy==1.24.3

Install the remaining libraries:

pip install terminaltables
pip install pycocotools
pip install shapely
pip install scipy
pip install fairscale

The model uses BERT, so the transformers library is also required:

pip install transformers

Hugging Face cannot be reached directly from mainland China, so a mirror is needed.
Add the mirror endpoint to the image_demo script:

import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

Then run the command:

python demo/image_demo.py demo/0016.jpg configs/mm_grounding_dino/grounding_dino_swin-b_pretrain_obj365_goldg_v3det.py --weights grounding_dino_swin-b_pretrain_all-f9818a7c.pth --texts 'the standing man . the squatting man' -c --device 'cuda:1'


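MM-Grounding-DINO in MMDetection: a detailed introduction

Grounding-DINO is a state-of-the-art open-set detection model that handles multiple vision tasks, including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to wide adoption as a mainstream architecture for various downstream applications. Despite its significance, however, the original Grounding-DINO model lacks comprehensive public technical details because its training code is unavailable. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline built on the MMDetection toolbox. It uses abundant vision datasets for pretraining and a variety of detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and provide detailed settings for reproduction. Extensive experiments on the benchmarks mentioned show that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community.

Test results

Zero-Shot COCO results and models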

Model	Backbone	Style	COCO mAP	Pre-Train Data	Config	Download
GDINO-T	Swin-T	Zero-shot	46.7	O365
GDINO-T	Swin-T	Zero-shot	48.1	O365,GoldG
GDINO-T	Swin-T	Zero-shot	48.4	O365,GoldG,Cap4M	config	model
MM-GDINO-T	Swin-T	Zero-shot	48.5(+1.8)	O365	config
MM-GDINO-T	Swin-T	Zero-shot	50.4(+2.3)	O365,GoldG	config	model | log
MM-GDINO-T	Swin-T	Zero-shot	50.5(+2.1)	O365,GoldG,GRIT	config	model | log
MM-GDINO-T	Swin-T	Zero-shot	50.6(+2.2)	O365,GoldG,V3Det	config	model | log
MM-GDINO-T	Swin-T	Zero-shot	50.4(+2.0)	O365,GoldG,GRIT,V3Det	config	model | log
MM-GDINO-B	Swin-B	Zero-shot	52.5	O365,GoldG,V3Det	config	model | log
MM-GDINO-B*	Swin-B	-	59.5	O365,ALL	config	model | log
MM-GDINO-L	Swin-L	Zero-shot	53.0	O365V2,OpenImageV6,GoldG	config	model | log
MM-GDINO-L*	Swin-L	-	60.3	O365V2,OpenImageV6,ALL	config	model | log

* indicates that the model has not been fully trained yet. The final weights will be released in the future.
ALL: GoldG, V3Det, COCO2017, LVISV1, COCO2014, GRIT, RefCOCO, RefCOCO+, RefCOCOg, gRefCOCO.

Zero-Shot LVIS Results

Model	MiniVal APr	MiniVal APc	MiniVal APf	MiniVal AP	Val1.0 APr	Val1.0 APc	Val1.0 APf	Val1.0 AP	Pre-Train Data
GDINO-T	18.8	24.2	34.7	28.8	10.1	15.3	29.9	20.1	O365,GoldG,Cap4M
MM-GDINO-T	28.1	30.2	42.0	35.7(+6.9)	17.1	22.4	36.5	27.0(+6.9)	O365,GoldG
MM-GDINO-T	26.6	32.4	41.8	36.5(+7.7)	17.3	22.6	36.4	27.1(+7.0)	O365,GoldG,GRIT
MM-GDINO-T	33.0	36.0	45.9	40.5(+11.7)	21.5	25.5	40.2	30.6(+10.5)	O365,GoldG,V3Det
MM-GDINO-T	34.2	37.4	46.2	41.4(+12.6)	23.6	27.6	40.5	31.9(+11.8)	O365,GoldG,GRIT,V3Det

The MM-GDINO-T config files are mini-lvis and lvis 1.0.

Zero-Shot ODinW (Object Detection in the Wild) results

ODinW13 results and models

Method	GDINO-T (O365,GoldG,Cap4M)	MM-GDINO-T (O365,GoldG)	MM-GDINO-T (O365,GoldG,GRIT)	MM-GDINO-T (O365,GoldG,V3Det)	MM-GDINO-T (O365,GoldG,GRIT,V3Det)
AerialMaritimeDrone	0.173	0.133	0.155	0.177	0.151
Aquarium	0.195	0.252	0.261	0.266	0.283
CottontailRabbits	0.799	0.771	0.810	0.778	0.786
EgoHands	0.608	0.499	0.537	0.506	0.519
NorthAmericaMushrooms	0.507	0.331	0.462	0.669	0.767
Packages	0.687	0.707	0.687	0.710	0.706
PascalVOC	0.563	0.565	0.580	0.556	0.566
pistols	0.726	0.585	0.709	0.671	0.729
pothole	0.215	0.136	0.285	0.199	0.243
Raccoon	0.549	0.469	0.511	0.553	0.535
ShellfishOpenImages	0.393	0.321	0.437	0.519	0.488
thermalDogsAndPeople	0.657	0.556	0.603	0.493	0.542
VehiclesOpenImages	0.613	0.566	0.603	0.614	0.615
Average	0.514	0.453	0.511	0.516	0.533

The MM-GDINO-T config file is odinw13.
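
The Average row appears to be the unweighted mean of the 13 per-dataset values. That is an assumption, since the source does not say how it is computed, but it can be checked with a short sketch using the last column of the ODinW13 table (O365,GoldG,GRIT,V3Det):

```python
# Per-dataset values copied from the last column of the ODinW13 table above.
odinw13_ap = {
    "AerialMaritimeDrone": 0.151,
    "Aquarium": 0.283,
    "CottontailRabbits": 0.786,
    "EgoHands": 0.519,
    "NorthAmericaMushrooms": 0.767,
    "Packages": 0.706,
    "PascalVOC": 0.566,
    "pistols": 0.729,
    "pothole": 0.243,
    "Raccoon": 0.535,
    "ShellfishOpenImages": 0.488,
    "thermalDogsAndPeople": 0.542,
    "VehiclesOpenImages": 0.615,
}

# Unweighted mean: every dataset counts equally, regardless of its size.
mean_ap = sum(odinw13_ap.values()) / len(odinw13_ap)
print(f"{mean_ap:.3f}")  # prints 0.533, matching the Average row
```

Because the mean is unweighted, one small dataset with a large swing (NorthAmericaMushrooms, for example) moves the average as much as a large one does.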

ODinW35 results and models

Method	GDINO-T (O365,GoldG,Cap4M)	MM-GDINO-T (O365,GoldG)	MM-GDINO-T (O365,GoldG,GRIT)	MM-GDINO-T (O365,GoldG,V3Det)	MM-GDINO-T (O365,GoldG,GRIT,V3Det)
AerialMaritimeDrone_large	0.173	0.133	0.155	0.177	0.151
AerialMaritimeDrone_tiled	0.206	0.170	0.225	0.184	0.206
AmericanSignLanguageLetters	0.002	0.016	0.020	0.011	0.007
Aquarium	0.195	0.252	0.261	0.266	0.283
BCCD	0.161	0.069	0.118	0.083	0.077
boggleBoards	0.000	0.002	0.001	0.001	0.002
brackishUnderwater	0.021	0.033	0.021	0.025	0.025
ChessPieces	0.000	0.000	0.000	0.000	0.000
CottontailRabbits	0.806	0.771	0.810	0.778	0.786
dice	0.004	0.002	0.005	0.001	0.001
DroneControl	0.042	0.047	0.097	0.088	0.074
EgoHands_generic	0.608	0.527	0.537	0.506	0.519
EgoHands_specific	0.002	0.001	0.005	0.007	0.003
HardHatWorkers	0.046	0.048	0.070	0.070	0.108
MaskWearing	0.004	0.009	0.004	0.011	0.009
MountainDewCommercial	0.430	0.453	0.465	0.194	0.430
NorthAmericaMushrooms	0.471	0.331	0.462	0.669	0.767
openPoetryVision	0.000	0.001	0.000	0.000	0.000
OxfordPets_by_breed	0.003	0.002	0.004	0.006	0.004
OxfordPets_by_species	0.011	0.019	0.016	0.020	0.015
PKLot	0.001	0.004	0.002	0.008	0.007
Packages	0.695	0.707	0.687	0.710	0.706
PascalVOC	0.563	0.565	0.580	0.566	0.566
pistols	0.726	0.585	0.709	0.671	0.729
plantdoc	0.005	0.005	0.007	0.008	0.011
pothole	0.215	0.136	0.219	0.077	0.168
Raccoons	0.549	0.469	0.511	0.553	0.535
selfdrivingCar	0.089	0.091	0.076	0.094	0.083
ShellfishOpenImages	0.393	0.321	0.437	0.519	0.488
ThermalCheetah	0.087	0.063	0.081	0.030	0.045
thermalDogsAndPeople	0.657	0.556	0.603	0.493	0.543
UnoCards	0.006	0.012	0.010	0.009	0.005
VehiclesOpenImages	0.613	0.566	0.603	0.614	0.615
WildfireSmoke	0.134	0.106	0.154	0.042	0.127
websiteScreenshots	0.012	0.02	0.016	0.016	0.016
Average	0.227	0.202	0.228	0.214	0.284

The MM-GDINO-T config file is odinw35.

Zero-shot referring expression comprehension results

Method	GDINO-T (O365,GoldG,Cap4M)	MM-GDINO-T (O365,GoldG)	MM-GDINO-T (O365,GoldG,GRIT)	MM-GDINO-T (O365,GoldG,V3Det)	MM-GDINO-T (O365,GoldG,GRIT,V3Det)
RefCOCO val @1,5,10	50.8/89.5/94.9	53.1/89.9/94.7	53.4/90.3/95.5	52.1/89.8/95.0	53.1/89.7/95.1
RefCOCO testA @1,5,10	57.4/91.3/95.6	59.7/91.5/95.9	58.8/91.70/96.2	58.4/86.8/95.6	59.1/91.0/95.5
RefCOCO testB @1,5,10	45.0/86.5/92.9	46.4/86.9/92.2	46.8/87.7/93.3	45.4/86.2/92.6	46.8/87.8/93.6
RefCOCO+ val @1,5,10	51.6/86.4/92.6	53.1/87.0/92.8	53.5/88.0/93.7	52.5/86.8/93.2	52.7/87.7/93.5
RefCOCO+ testA @1,5,10	57.3/86.7/92.7	58.9/87.3/92.9	59.0/88.1/93.7	58.1/86.7/93.5	58.7/87.2/93.1
RefCOCO+ testB @1,5,10	46.4/84.1/90.7	47.9/84.3/91.0	47.9/85.5/92.7	46.9/83.7/91.5	48.4/85.8/92.1
RefCOCOg val @1,5,10	60.4/92.1/96.2	61.2/92.6/96.1	62.7/93.3/97.0	61.7/92.9/96.6	62.9/93.3/97.2
RefCOCOg test @1,5,10	59.7/92.1/96.3	61.1/93.3/96.7	62.6/94.9/97.1	61.0/93.1/96.8	62.9/93.9/97.4

Method	thresh_score	GDINO-T (O365,GoldG,Cap4M)	MM-GDINO-T (O365,GoldG)	MM-GDINO-T (O365,GoldG,GRIT)	MM-GDINO-T (O365,GoldG,V3Det)	MM-GDINO-T (O365,GoldG,GRIT,V3Det)
gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc	0.5	39.3/70.4				39.4/67.5
gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc	0.6	40.5/83.8				40.6/83.1
gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc	0.7	41.3/91.8	39.8/84.7	40.7/89.7	40.3/88.8	41.0/91.3
gRefCOCO val Pr@(F1=1, IoU≥0.5),N-acc	0.8	41.5/96.8				41.1/96.4
gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc	0.5	31.9/70.4				33.1/69.5
gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc	0.6	29.3/82.9				29.2/84.3
gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc	0.7	27.2/90.2	26.3/89.0	26.0/91.9	25.4/91.8	26.1/93.0
gRefCOCO testA Pr@(F1=1, IoU≥0.5),N-acc	0.8	25.1/96.3				23.8/97.2
gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc	0.5	30.9/72.5				33.0/69.6
gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc	0.6	30.0/86.1				31.6/96.7
gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc	0.7	29.7/93.5	31.3/84.8	30.6/90.2	30.7/89.9	30.4/92.3
gRefCOCO testB Pr@(F1=1, IoU≥0.5),N-acc	0.8	29.1/97.4				29.5/84.2

The MM-GDINO-T config file is refcoco/grounding_dino_swin-t_pretrain_zeroshot_refexp.py

Zero-shot Description Detection Dataset (DOD)

pip install ddd-dataset

Method	mode	GDINO-T (O365,GoldG,Cap4M)	MM-GDINO-T (O365,GoldG)	MM-GDINO-T (O365,GoldG,GRIT)	MM-GDINO-T (O365,GoldG,V3Det)	MM-GDINO-T (O365,GoldG,GRIT,V3Det)
FULL/short/middle/long/very long	concat	17.2/18.0/18.7/14.8/16.3	15.6/17.3/16.7/14.3/13.1	17.0/17.7/18.0/15.7/15.7	16.2/17.4/16.8/14.9/15.4	17.5/23.4/18.3/14.7/13.8
FULL/short/middle/long/very long	parallel	22.3/28.2/24.8/19.1/13.9	21.7/24.7/24.0/20.2/13.7	22.5/25.6/25.1/20.5/14.9	22.3/25.6/24.5/20.6/14.7	22.9/28.1/25.4/20.4/14.4
PRES/short/middle/long/very long	concat	17.8/18.3/19.2/15.2/17.3	16.4/18.4/17.3/14.5/14.2	17.9/19.0/18.3/16.5/17.5	16.6/18.8/17.1/15.1/15.0	18.0/23.7/18.6/15.4/13.3
PRES/short/middle/long/very long	parallel	21.0/27.0/22.8/17.5/12.5	21.3/25.5/22.8/19.2/12.9	21.5/25.2/23.0/19.0/15.0	21.6/25.7/23.0/19.5/14.8	21.9/27.4/23.2/19.1/14.2
ABS/short/middle/long/very long	concat	15.4/17.1/16.4/13.6/14.9	13.4/13.4/14.5/13.5/11.9	14.5/13.1/16.7/13.6/13.3	14.8/12.5/15.6/14.3/15.8	15.9/22.2/17.1/12.5/14.4
ABS/short/middle/long/very long	parallel	26.0/32.0/33.0/23.6/15.5	22.8/22.2/28.7/22.9/14.7	25.6/26.8/33.9/24.5/14.7	24.1/24.9/30.7/23.8/14.7	26.0/30.3/34.1/23.9/14.6

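Notes:

Inter-scenario evaluation takes a very long time and performs poorly, so it is not supported for now. The metrics above are for the intra-scenario setting.
concat is the default inference mode of Grounding DINO: it joins multiple sub-sentences with periods (.) into a single sentence for inference. The parallel mode instead runs inference on each sub-sentence separately in a for loop.
The MM-GDINO-T config files are concat_dod: dod/grounding_dino_swin-t_pretrain_zeroshot_concat_dod.py and parallel_dod: dod/grounding_dino_swin-t_pretrain_zeroshot_parallel_dod.py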
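
The difference between the two modes can be sketched in a few lines. This is an illustration only, not MMDetection code: infer stands for a hypothetical callable that takes one text prompt and returns a list of detections.

```python
def concat_prompt(phrases):
    """concat mode: join all sub-sentences with ' . ' into one prompt."""
    cleaned = [p.strip().rstrip(".").strip() for p in phrases if p.strip()]
    return " . ".join(cleaned)


def parallel_prompts(phrases):
    """parallel mode: one prompt per sub-sentence."""
    return [p.strip() for p in phrases if p.strip()]


def run(phrases, infer, mode="concat"):
    """Run the hypothetical `infer` callable once (concat) or once per phrase (parallel)."""
    if mode == "concat":
        return infer(concat_prompt(phrases))
    results = []
    for prompt in parallel_prompts(phrases):
        results.extend(infer(prompt))
    return results


phrases = ["the standing man", "the squatting man"]
print(concat_prompt(phrases))  # the standing man . the squatting man
```

concat costs a single forward pass but makes all phrases share one text context. parallel costs one pass per phrase, which fits the longer evaluation time that comes with its generally higher scores in the table above.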

Pretrain Flickr30k Results

Model	Pre-Train Data	Val R@1	Val R@5	Val R@10	Test R@1	Test R@5	Test R@10
GLIP-T	O365,GoldG	84.9	94.9	96.3	85.6	95.4	96.7
GLIP-T	O365,GoldG,CC3M,SBU	85.3	95.5	96.9	86.0	95.9	97.2
GDINO-T	O365,GoldG,Cap4M	87.8	96.6	98.0	88.1	96.9	98.2
MM-GDINO-T	O365,GoldG	85.5	95.6	97.2	86.2	95.7	97.4
MM-GDINO-T	O365,GoldG,GRIT	86.7	95.8	97.6	87.0	96.2	97.7
MM-GDINO-T	O365,GoldG,V3Det	85.9	95.7	97.4	86.3	95.7	97.4
MM-GDINO-T	O365,GoldG,GRIT,V3Det	86.7	96.0	97.6	87.2	96.2	97.7

Notes:

@1,5,10 refers to the precision at the top 1, 5, and 10 positions of the predicted ranked list.
The MM-GDINO-T config file is flickr30k/grounding_dino_swin-t-pretrain_flickr30k.py
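
As a rough illustration of the @k metric, the sketch below counts a phrase as a hit when any of its top-k ranked boxes overlaps the ground-truth box. The IoU threshold of 0.5 and the single ground-truth box per phrase are simplifying assumptions, not details taken from the source, and the function names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def hit_at_k(ranked_boxes, gt_box, k, iou_thr=0.5):
    """True if any of the top-k predictions overlaps the ground truth enough."""
    return any(iou(box, gt_box) >= iou_thr for box in ranked_boxes[:k])


def recall_at_k(samples, k, iou_thr=0.5):
    """samples: list of (ranked_boxes, gt_box), boxes sorted by descending score."""
    if not samples:
        return 0.0
    hits = sum(hit_at_k(boxes, gt, k, iou_thr) for boxes, gt in samples)
    return 100.0 * hits / len(samples)
```

Since the top-5 list contains the top-1 box, the score can only stay equal or rise as k grows, which is why every row in the table increases from @1 to @10.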

Validating the generalization of the pretrained model through fine-tuning

RTTS

Architecture	Backbone	Lr schd	box AP
Faster R-CNN	R-50	1x	48.1
Cascade R-CNN	R-50	1x	50.8
ATSS	R-50	1x	48.2
TOOD	R-50	1X	50.8
MM-GDINO(zero-shot)	Swin-T		49.8
MM-GDINO	Swin-T	1x	69.1

Reference metrics are from https://github.com/BIGWangYuDong/lqit/tree/main/configs/detection/rtts_dataset
The MM-GDINO-T config file is rtts/grounding_dino_swin-t_finetune_8xb4_1x_rtts.py

RUOD

Architecture	Backbone	Lr schd	box AP
Faster R-CNN	R-50	1x	52.4
Cascade R-CNN	R-50	1x	55.3
ATSS	R-50	1x	55.7
TOOD	R-50	1X	57.4
MM-GDINO(zero-shot)	Swin-T		29.8
MM-GDINO	Swin-T	1x	65.5

Reference metrics are from https://github.com/BIGWangYuDong/lqit/tree/main/configs/detection/ruod_dataset
The MM-GDINO-T config file is ruod/grounding_dino_swin-t_finetune_8xb4_1x_ruod.py

Brain Tumor

Architecture	Backbone	Lr schd	box AP
Faster R-CNN	R-50	50e	43.5
Cascade R-CNN	R-50	50e	46.2
DINO	R-50	50e	46.4
Cascade-DINO	R-50	50e	48.6
MM-GDINO	Swin-T	50e	47.5

Reference metrics are from https://arxiv.org/abs/2307.11035
The MM-GDINO-T config file is brain_tumor/grounding_dino_swin-t_finetune_8xb4_50e_brain_tumor.py

Cityscapes

Architecture	Backbone	Lr schd	box AP
Faster R-CNN	R-50	50e	30.1
Cascade R-CNN	R-50	50e	31.8
DINO	R-50	50e	34.5
Cascade-DINO	R-50	50e	34.8
MM-GDINO(zero-shot)	Swin-T		34.2
MM-GDINO	Swin-T	50e	51.5

Reference metrics are from https://arxiv.org/abs/2307.11035
The MM-GDINO-T config file is cityscapes/grounding_dino_swin-t_finetune_8xb4_50e_cityscapes.py

People in Painting

Architecture	Backbone	Lr schd	box AP
Faster R-CNN	R-50	50e	17.0
Cascade R-CNN	R-50	50e	18.0
DINO	R-50	50e	12.0
Cascade-DINO	R-50	50e	13.4
MM-GDINO(zero-shot)	Swin-T		23.1
MM-GDINO	Swin-T	50e	38.9

Reference metrics are from https://arxiv.org/abs/2307.11035
The MM-GDINO-T config file is people_in_painting/grounding_dino_swin-t_finetune_8xb4_50e_people_in_painting.py

COCO

(1) Closed-set performance

Architecture	Backbone	Lr schd	box AP
Faster R-CNN	R-50	1x	37.4
Cascade R-CNN	R-50	1x	40.3
ATSS	R-50	1x	39.4
TOOD	R-50	1X	42.4
DINO	R-50	1X	50.1
GLIP(zero-shot)	Swin-T		46.6
GDINO(zero-shot)	Swin-T		48.5
MM-GDINO(zero-shot)	Swin-T		50.4
GLIP	Swin-T	1x	55.4
GDINO	Swin-T	1x	58.1
MM-GDINO	Swin-T	1x	58.2

The MM-GDINO-T config file is coco/grounding_dino_swin-t_finetune_16xb4_1x_coco.py

(2) Open-set continued pretraining performance

Architecture	Backbone	Lr schd	box AP
GLIP(zero-shot)	Swin-T		46.7
GDINO(zero-shot)	Swin-T		48.5
MM-GDINO(zero-shot)	Swin-T		50.4
MM-GDINO	Swin-T	1x	54.7

The MM-GDINO-T config file is coco/grounding_dino_swin-t_finetune_16xb4_1x_sft_coco.py
Because COCO is relatively small, continuing pretraining on COCO alone easily leads to overfitting. The result shown above is from the third epoch. I do not recommend training this way.

(3) Open-vocabulary performance

Architecture	Backbone	Lr schd	box AP	Base box AP	Novel box AP	box AP@50	Base box AP@50	Novel box AP@50
MM-GDINO(zero-shot)	Swin-T		51.1	48.4	58.9	66.7	64.0	74.2
MM-GDINO	Swin-T	1x	57.2	56.1	60.4	73.6	73.0	75.3

The MM-GDINO-T config file is coco/grounding_dino_swin-t_finetune_16xb4_1x_coco_48_17.py

LVIS 1.0

(1) Open-set continued pretraining performance

Architecture	Backbone	Lr schd	MiniVal APr	MiniVal APc	MiniVal APf	MiniVal AP	Val1.0 APr	Val1.0 APc	Val1.0 APf	Val1.0 AP
GLIP(zero-shot)	Swin-T		18.1	21.2	33.1	26.7	10.8	14.7	29.0	19.6
GDINO(zero-shot)	Swin-T		18.8	24.2	34.7	28.8	10.1	15.3	29.9	20.1
MM-GDINO(zero-shot)	Swin-T		34.2	37.4	46.2	41.4	23.6	27.6	40.5	31.9
MM-GDINO	Swin-T	1x	50.7	58.8	60.1	58.7	45.2	50.2	56.1	51.7

The MM-GDINO-T config file is lvis/grounding_dino_swin-t_finetune_16xb4_1x_lvis.py

(2) Open-vocabulary performance

Architecture	Backbone	Lr schd	MiniVal APr	MiniVal APc	MiniVal APf	MiniVal AP
MM-GDINO(zero-shot)	Swin-T		34.2	37.4	46.2	41.4
MM-GDINO	Swin-T	1x	43.2	57.4	59.3	57.1

The MM-GDINO-T config file is lvis/grounding_dino_swin-t_finetune_16xb4_1x_lvis_866_337.py

RefEXP

RefCOCO

Architecture	Backbone	Lr schd	val @1	val @5	val @10	testA @1	testA @5	testA @10	testB @1	testB @5	testB @10
GDINO(zero-shot)	Swin-T		50.8	89.5	94.9	57.5	91.3	95.6	45.0	86.5	92.9
MM-GDINO(zero-shot)	Swin-T		53.1	89.7	95.1	59.1	91.0	95.5	46.8	87.8	93.6
GDINO	Swin-T	UNK	89.2			91.9			86.0
MM-GDINO	Swin-T	5e	89.5	98.6	99.4	91.4	99.2	99.8	86.6	97.9	99.1

The MM-GDINO-T config file is refcoco/grounding_dino_swin-t_finetune_8xb4_5e_refcoco.py

RefCOCO+

Architecture	Backbone	Lr schd	val @1	val @5	val @10	testA @1	testA @5	testA @10	testB @1	testB @5	testB @10
GDINO(zero-shot)	Swin-T		51.6	86.4	92.6	57.3	86.7	92.7	46.4	84.1	90.7
MM-GDINO(zero-shot)	Swin-T		52.7	87.7	93.5	58.7	87.2	93.1	48.4	85.8	92.1
GDINO	Swin-T	UNK	81.1			87.4			74.7
MM-GDINO	Swin-T	5e	82.1	97.8	99.2	87.5	99.2	99.7	74.0	96.3	96.4

The MM-GDINO-T config file is refcoco/grounding_dino_swin-t_finetune_8xb4_5e_refcoco_plus.py

RefCOCOg

Architecture	Backbone	Lr schd	val @1	val @5	val @10	test @1	test @5	test @10
GDINO(zero-shot)	Swin-T		60.4	92.1	96.2	59.7	92.1	96.3
MM-GDINO(zero-shot)	Swin-T		62.9	93.3	97.2	62.9	93.9	97.4
GDINO	Swin-T	UNK	84.2			84.9
MM-GDINO	Swin-T	5e	85.5	98.4	99.4	85.8	98.6	99.4

The MM-GDINO-T config file is refcoco/grounding_dino_swin-t_finetune_8xb4_5e_refcocog.py

gRefCOCO

Architecture	Backbone	Lr schd	val Pr@(F1=1, IoU≥0.5)	val N-acc	testA Pr@(F1=1, IoU≥0.5)	testA N-acc	testB Pr@(F1=1, IoU≥0.5)	testB N-acc
GDINO(zero-shot)	Swin-T		41.3	91.8	27.2	90.2	29.7	93.5
MM-GDINO(zero-shot)	Swin-T		41.0	91.3	26.1	93.0	30.4	92.3
MM-GDINO	Swin-T	5e	45.1	64.7	42.5	65.5	40.3	63.2

The MM-GDINO-T config file is refcoco/grounding_dino_swin-t_finetune_8xb4_5e_grefcoco.py

MM-GDINO-T pretraining data preparation and processing

For the MM-GDINO-T model we provide pretraining configs for 5 different data combinations. The datasets are added incrementally from one config to the next, so you only need to prepare the data for the config you actually plan to use.

Datasets used

1 Objects365 v1

Corresponding training config: ./grounding_dino_swin-t_pretrain_obj365.py

Objects365_v1 can be downloaded from https://opendatalab.com/OpenDataLab/Objects365_v1, which offers two download methods, CLI and SDK.

After downloading and extracting, place it in or symlink it to data/objects365v1. The directory structure is:

mmdetection
├── configs
├── data
│   ├── objects365v1
│   │   ├── objects365_train.json
│   │   ├── objects365_val.json
│   │   ├── train
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...
│   │   ├── test

Then use coco2odvg.py to convert it to the ODVG format required for training:

python tools/dataset_converters/coco2odvg.py data/objects365v1/objects365_train.json -d o365v1

When the program finishes, it creates two new files, o365v1_train_od.json and o365v1_label_map.json, under data/objects365v1. The full structure is:

mmdetection
├── configs
├── data
│   ├── objects365v1
│   │   ├── objects365_train.json
│   │   ├── objects365_val.json
│   │   ├── o365v1_train_od.json
│   │   ├── o365v1_label_map.json
│   │   ├── train
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...
│   │   ├── test
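
To give a feel for what the conversion produces, here is a simplified sketch of a COCO-to-ODVG transformation. It is not the actual coco2odvg.py script: the output field names (filename, detection, instances, label, category) are assumptions based on the ODVG format, and the functions coco_to_odvg and dump_jsonl are hypothetical.

```python
import json


def coco_to_odvg(coco):
    """Convert a COCO-style dict to (ODVG-style records, label map). Simplified sketch."""
    # Remap possibly sparse COCO category ids to contiguous labels starting at 0.
    cats = sorted(coco["categories"], key=lambda c: c["id"])
    id_to_label = {c["id"]: i for i, c in enumerate(cats)}
    label_map = {str(i): c["name"] for i, c in enumerate(cats)}

    anns_by_image = {}
    for ann in coco["annotations"]:
        anns_by_image.setdefault(ann["image_id"], []).append(ann)

    records = []
    for img in coco["images"]:
        instances = []
        for ann in anns_by_image.get(img["id"], []):
            x, y, w, h = ann["bbox"]
            if w <= 0 or h <= 0:
                continue  # skip degenerate boxes
            label = id_to_label[ann["category_id"]]
            instances.append({
                "bbox": [x, y, x + w, y + h],  # COCO xywh -> xyxy
                "label": label,
                "category": label_map[str(label)],
            })
        records.append({
            "filename": img["file_name"],
            "height": img["height"],
            "width": img["width"],
            "detection": {"instances": instances},
        })
    return records, label_map


def dump_jsonl(records):
    """Serialize as JSON lines: one image record per line."""
    return "\n".join(json.dumps(r) for r in records)
```

The two outputs correspond to the two files named above: the per-image records play the role of the _od.json file and the label map that of the _label_map.json file.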

2 COCO 2017

The configs above evaluate on COCO 2017 during training, so the COCO 2017 dataset must be prepared. You can download it from the official COCO website or from opendatalab.

After downloading and extracting, place it in or symlink it to data/coco. The directory structure is:

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...

3 GoldG

Once this dataset is downloaded, you can train the ./grounding_dino_swin-t_pretrain_obj365_goldg.py config.

GoldG consists of the GQA and Flickr30k datasets. It comes from the MixedGrounding dataset described in the GLIP paper, with COCO excluded. The download link is mdetr_annotations; the files currently needed are mdetr_annotations/final_mixed_train_no_coco.json and mdetr_annotations/final_flickr_separateGT_train.json.

Next, download the GQA images. After downloading and extracting, place them in or symlink them to data/gqa. The directory structure is:

mmdetection
├── configs
├── data
│   ├── gqa
|   |   ├── final_mixed_train_no_coco.json
│   │   ├── images
│   │   │   ├── xxx.jpg
│   │   │   ├── ...

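Next, download the Flickr30k images. Access to this dataset must be requested first; the download link is provided once the request is approved. After downloading and extracting, place them in or symlink them to data/flickr30k_entities. The directory structure is: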

mmdetection
├── configs
├── data
│   ├── flickr30k_entities
│   │   ├── final_flickr_separateGT_train.json
│   │   ├── flickr30k_images
│   │   │   ├── xxx.jpg
│   │   │   ├── ...

For the GQA dataset, use goldg2odvg.py to convert it to the ODVG format required for training:

python tools/dataset_converters/goldg2odvg.py data/gqa/final_mixed_train_no_coco.json

When the program finishes, it creates a new file final_mixed_train_no_coco_vg.json under data/gqa. The full structure is:

mmdetection
├── configs
├── data
│   ├── gqa
|   |   ├── final_mixed_train_no_coco.json
|   |   ├── final_mixed_train_no_coco_vg.json
│   │   ├── images
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
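
Grounding annotations differ from detection annotations in that each box is tied to a phrase inside a caption rather than to a class id. The sketch below shows the idea under stated assumptions: boxes carry character spans in a tokens_positive field, and the output layout (grounding, caption, regions, phrase) is an assumed ODVG-style structure. It is not the goldg2odvg.py implementation, and both function names are hypothetical.

```python
def phrase_from_spans(caption, spans):
    """Join the caption substrings selected by [start, end) character spans."""
    return " ".join(caption[start:end] for start, end in spans)


def to_vg_record(filename, height, width, caption, boxes):
    """boxes: dicts with 'bbox' as [x, y, w, h] and 'tokens_positive' character spans."""
    regions = []
    for box in boxes:
        x, y, w, h = box["bbox"]
        regions.append({
            "bbox": [x, y, x + w, y + h],  # xywh -> xyxy
            "phrase": phrase_from_spans(caption, box["tokens_positive"]),
            "tokens_positive": box["tokens_positive"],
        })
    return {
        "filename": filename,
        "height": height,
        "width": width,
        "grounding": {"caption": caption, "regions": regions},
    }


record = to_vg_record(
    "example.jpg", 480, 640, "a man walking a dog",
    [{"bbox": [10, 20, 100, 200], "tokens_positive": [[0, 5]]},
     {"bbox": [150, 220, 80, 60], "tokens_positive": [[14, 19]]}],
)
print([r["phrase"] for r in record["grounding"]["regions"]])  # ['a man', 'a dog']
```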

For the Flickr30k dataset, use goldg2odvg.py to convert it to the ODVG format required for training:

python tools/dataset_converters/goldg2odvg.py data/flickr30k_entities/final_flickr_separateGT_train.json

When the program finishes, it creates a new file final_flickr_separateGT_train_vg.json under data/flickr30k_entities. The full structure is:

mmdetection
├── configs
├── data
│   ├── flickr30k_entities
│   │   ├── final_flickr_separateGT_train.json
│   │   ├── final_flickr_separateGT_train_vg.json
│   │   ├── flickr30k_images
│   │   │   ├── xxx.jpg
│   │   │   ├── ...

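4 GRIT-20M

Corresponding training config: grounding_dino_swin-t_pretrain_obj365_goldg_grit9m

The GRIT dataset can be downloaded from GRIT with the img2dataset package. With the default command the downloaded dataset is about 1.1T, and downloading plus processing needs at least 2T of disk space, so download according to your disk capacity. The raw format after download is: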

mmdetection
├── configs
├── data
│    ├── grit_raw
│    │    ├── 00000_stats.json
│    │    ├── 00000.parquet
│    │    ├── 00000.tar
│    │    ├── 00001_stats.json
│    │    ├── 00001.parquet
│    │    ├── 00001.tar
│    │    ├── ...

The downloaded data then needs further processing:

python tools/dataset_converters/grit_processing.py data/grit_raw data/grit_processed

The processed format is:

mmdetection
├── configs
├── data
│    ├── grit_processed
│    │    ├── annotations
│    │    │   ├── 00000.json
│    │    │   ├── 00001.json
│    │    │   ├── ...
│    │    ├── images
│    │    │   ├── 00000
│    │    │   │   ├── 000000000.jpg
│    │    │   │   ├── 000000003.jpg
│    │    │   │   ├── 000000004.jpg
│    │    │   │   ├── ...
│    │    │   ├── 00001
│    │    │   ├── ...

For the GRIT dataset, use grit2odvg.py to convert it to the required ODVG format:

python tools/dataset_converters/grit2odvg.py data/grit_processed/

When the program finishes, it creates a new file grit20m_vg.json under data/grit_processed, containing roughly 9M records. The full structure is:

mmdetection
├── configs
├── data
│    ├── grit_processed
|    |    ├── grit20m_vg.json
│    │    ├── annotations
│    │    │   ├── 00000.json
│    │    │   ├── 00001.json
│    │    │   ├── ...
│    │    ├── images
│    │    │   ├── 00000
│    │    │   │   ├── 000000000.jpg
│    │    │   │   ├── 000000003.jpg
│    │    │   │   ├── 000000004.jpg
│    │    │   │   ├── ...
│    │    │   ├── 00001
│    │    │   ├── ...

5 V3Det

Corresponding training configs:

grounding_dino_swin-t_pretrain_obj365_goldg_v3det
grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det

The V3Det dataset can be downloaded from opendatalab. After downloading and extracting, place it in or symlink it to data/v3det. The directory structure is:

mmdetection
├── configs
├── data
│   ├── v3det
│   │   ├── annotations
│   │   |   ├── v3det_2023_v1_train.json
│   │   ├── images
│   │   │   ├── a00000066
│   │   │   │   ├── xxx.jpg
│   │   │   ├── ...

Then use coco2odvg.py to convert it to the ODVG format required for training:

python tools/dataset_converters/coco2odvg.py data/v3det/annotations/v3det_2023_v1_train.json -d v3det

When the program finishes, it creates two new files, v3det_2023_v1_train_od.json and v3det_2023_v1_label_map.json, under data/v3det/annotations. The full structure is:

mmdetection
├── configs
├── data
│   ├── v3det
│   │   ├── annotations
│   │   |   ├── v3det_2023_v1_train.json
│   │   |   ├── v3det_2023_v1_train_od.json
│   │   |   ├── v3det_2023_v1_label_map.json
│   │   ├── images
│   │   │   ├── a00000066
│   │   │   │   ├── xxx.jpg
│   │   │   ├── ...

6 Data splitting and visualization

With so many datasets to prepare, it is hard to check the images and annotations before training. We therefore provide a splitting and visualization tool: it cuts a dataset down to a tiny version, and a visualization script can then be used to check that the images and labels are correct.

Splitting a dataset

The script is tools/misc/split_odvg.py. Taking Objects365 v1 as an example, the command to split the dataset is:

python tools/misc/split_odvg.py data/objects365v1/ o365v1_train_od.json train your_output_dir --label-map-file o365v1_label_map.json -n 200

After the script runs, your_output_dir contains the same folder structure as data/objects365v1/, but with only 200 training images and the corresponding json, which makes inspection easy.
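
The core of such a split can be sketched as taking the first n records of the JSON-lines annotation file and collecting the image files they reference. This is a simplified illustration, not the split_odvg.py script; whether the real tool takes the first n records or samples them is not stated in the source, and the function names and the filename field are assumptions.

```python
import json


def take_first_n(jsonl_text, n):
    """Keep the first n non-empty records of a JSON-lines annotation file."""
    kept = []
    if n <= 0:
        return kept
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        kept.append(json.loads(line))
        if len(kept) >= n:
            break
    return kept


def referenced_images(records):
    """File names that would need to be copied alongside the reduced annotation file."""
    return [r["filename"] for r in records]
```

A tiny split like this keeps the annotation file and the image folder consistent with each other, so the same training config can be pointed at it for a quick smoke test.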

Visualizing the raw dataset

The script is tools/analysis_tools/browse_grounding_raw.py. Taking Objects365 v1 as an example, the command to visualize the dataset is:

python tools/analysis_tools/browse_grounding_raw.py data/objects365v1/ o365v1_train_od.json train --label-map-file o365v1_label_map.json -o your_output_dir --not-show

After the script runs, your_output_dir contains images with the labels drawn on them for inspection.

Visualizing the dataset output

The script is tools/analysis_tools/browse_grounding_dataset.py. It shows what the dataset class actually outputs, that is, the samples after data augmentation. Taking Objects365 v1 as an example, the command is:

python tools/analysis_tools/browse_grounding_dataset.py configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py  -o your_output_dir --not-show

After the script runs, your_output_dir contains images with the labels drawn on them for inspection.

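MM-GDINO-L pretraining data preparation and processing

Datasets used

1 Objects365 v2

Objects365_v2 can be downloaded from opendatalab, which offers two download methods, CLI and SDK.

After downloading and extracting, place it in or symlink it to data/objects365v2. The directory structure is: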

mmdetection
├── configs
├── data
│   ├── objects365v2
│   │   ├── annotations
│   │   │   ├── zhiyuan_objv2_train.json
│   │   ├── train
│   │   │   ├── patch0
│   │   │   │   ├── xxx.jpg
│   │   │   ├── ...

Some class names in objects365v2 are wrong, so they need to be corrected first.

python tools/dataset_converters/fix_o365_names.py

This generates a new annotation file zhiyuan_objv2_train_fixname.json under data/objects365v2/annotations.

Then use coco2odvg.py to convert it to the ODVG format required for training:

python tools/dataset_converters/coco2odvg.py data/objects365v2/annotations/zhiyuan_objv2_train_fixname.json -d o365v2

When the program finishes, it creates two new files, zhiyuan_objv2_train_fixname_od.json and o365v2_label_map.json, under data/objects365v2/annotations. The full structure is:

mmdetection
├── configs
├── data
│   ├── objects365v2
│   │   ├── annotations
│   │   │   ├── zhiyuan_objv2_train.json
│   │   │   ├── zhiyuan_objv2_train_fixname.json
│   │   │   ├── zhiyuan_objv2_train_fixname_od.json
│   │   │   ├── o365v2_label_map.json
│   │   ├── train
│   │   │   ├── patch0
│   │   │   │   ├── xxx.jpg
│   │   │   ├── ...

2 OpenImages v6

OpenImages v6 can be downloaded from the official website. The dataset is large, so the download takes a while. After it completes, the file structure is:

mmdetection
├── configs
├── data
│   ├── OpenImages
│   │   ├── annotations
|   │   │   ├── oidv6-train-annotations-bbox.csv
|   │   │   ├── class-descriptions-boxable.csv
│   │   ├── OpenImages
│   │   │   ├── train
│   │   │   │   ├── xxx.jpg
│   │   │   ├── ...

Then use openimages2odvg.py to convert it to the ODVG format required for training:

python tools/dataset_converters/openimages2odvg.py data/OpenImages/annotations

When the program finishes, it creates two new files, oidv6-train-annotations_od.json and openimages_label_map.json, under data/OpenImages/annotations. The full structure is:

mmdetection
├── configs
├── data
│   ├── OpenImages
│   │   ├── annotations
|   │   │   ├── oidv6-train-annotations-bbox.csv
|   │   │   ├── class-descriptions-boxable.csv
|   │   │   ├── oidv6-train-annotations_od.json
|   │   │   ├── openimages_label_map.json
│   │   ├── OpenImages
│   │   │   ├── train
│   │   │   │   ├── xxx.jpg
│   │   │   ├── ...

3 V3Det

See the MM-GDINO-T pretraining data preparation and processing section above. The full dataset structure is:

mmdetection
├── configs
├── data
│   ├── v3det
│   │   ├── annotations
│   │   |   ├── v3det_2023_v1_train.json
│   │   |   ├── v3det_2023_v1_train_od.json
│   │   |   ├── v3det_2023_v1_label_map.json
│   │   ├── images
│   │   │   ├── a00000066
│   │   │   │   ├── xxx.jpg
│   │   │   ├── ...

4 LVIS 1.0

See 2 LVIS 1.0 under Fine-tuning dataset preparation later in this document. The full dataset structure is:

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── lvis_v1_train.json
│   │   │   ├── lvis_v1_val.json
│   │   │   ├── lvis_v1_train_od.json
│   │   │   ├── lvis_v1_label_map.json
│   │   │   ├── instances_val2017.json
│   │   │   ├── lvis_v1_minival_inserted_image_name.json
│   │   │   ├── lvis_od_val.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...

5 COCO2017 OD

For data preparation, see the MM-GDINO-T pretraining data preparation and processing section above. To simplify later processing, symlink or move the downloaded mdetr_annotations folder to data/coco.
The full dataset structure is:

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   ├── mdetr_annotations
│   │   │   ├── final_refexp_val.json
│   │   │   ├── finetune_refcoco_testA.json
│   │   │   ├── ...
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...

COCO2017 train partially overlaps with the RefCOCO/RefCOCO+/RefCOCOg/gRefCOCO val sets. Unless the overlap is removed beforehand, evaluating RefExp suffers from data leakage.

python tools/dataset_converters/remove_cocotrain2017_from_refcoco.py data/coco/mdetr_annotations data/coco/annotations/instances_train2017.json

This creates a new file instances_train2017_norefval.json under data/coco/annotations. Finally, use coco2odvg.py to convert it to the ODVG format required for training:

python tools/dataset_converters/coco2odvg.py data/coco/annotations/instances_train2017_norefval.json -d coco

This creates two new files, instances_train2017_norefval_od.json and coco_label_map.json, under data/coco/annotations. The full structure is:

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   │   ├── instances_train2017_norefval_od.json
│   │   │   ├── coco_label_map.json
│   │   ├── mdetr_annotations
│   │   │   ├── final_refexp_val.json
│   │   │   ├── finetune_refcoco_testA.json
│   │   │   ├── ...
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...

Note: COCO2017 train and LVIS 1.0 val share 15,000 images, so once COCO2017 train is used for training, evaluation results on LVIS 1.0 val are affected by data leakage. LVIS 1.0 minival does not have this problem.
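Both leakage warnings in this subsection come down to images shared between a training set and an evaluation set. The filtering idea can be sketched as follows. This is a minimal illustration and not the logic of remove_cocotrain2017_from_refcoco.py: the function name drop_overlapping_images and the toy dictionaries are hypothetical, and it assumes a COCO-style layout with images and annotations lists where overlap is decided by image id.

```python
def drop_overlapping_images(coco, excluded_image_ids):
    """Return a copy of a COCO-style dict without the excluded images."""
    excluded = set(excluded_image_ids)
    images = [im for im in coco["images"] if im["id"] not in excluded]
    kept_ids = {im["id"] for im in images}
    annotations = [a for a in coco["annotations"] if a["image_id"] in kept_ids]
    # Copy every other top-level key (categories, info, ...) unchanged.
    filtered = dict(coco)
    filtered["images"] = images
    filtered["annotations"] = annotations
    return filtered


# Toy data standing in for a training annotation file (hypothetical values).
train = {
    "images": [{"id": 1}, {"id": 2}, {"id": 3}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1},
        {"id": 11, "image_id": 2, "category_id": 1},
        {"id": 12, "image_id": 3, "category_id": 2},
    ],
    "categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "bicycle"}],
}

# Pretend image 2 also appears in an evaluation split.
filtered = drop_overlapping_images(train, {2})
print([im["id"] for im in filtered["images"]])
```

The input dictionary is left untouched, so the original annotation file can still be used wherever leakage is not a concern.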

6 GoldG

See the MM-GDINO-T pretraining data preparation and processing section. The resulting structure is as follows:

mmdetection
├── configs
├── data
│   ├── flickr30k_entities
│   │   ├── final_flickr_separateGT_train.json
│   │   ├── final_flickr_separateGT_train_vg.json
│   │   ├── flickr30k_images
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   ├── gqa
│   │   ├── final_mixed_train_no_coco.json
│   │   ├── final_mixed_train_no_coco_vg.json
│   │   ├── images
│   │   │   ├── xxx.jpg
│   │   │   ├── ...

7 COCO2014 VG

MDETR provides a Phrase Grounding version of the COCO2014 train annotations. The original annotation file is final_mixed_train.json. As before, the file structure is as follows:

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   ├── mdetr_annotations
│   │   │   ├── final_mixed_train.json
│   │   │   ├── ...
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── train2014
│   │   │   ├── xxx.jpg
│   │   │   ├── ...

We can extract the COCO portion of the data from final_mixed_train.json:

python tools/dataset_converters/extract_coco_from_mixed.py data/coco/mdetr_annotations/final_mixed_train.json

This creates a new file, final_mixed_train_only_coco.json, in data/coco/mdetr_annotations. Finally, use goldg2odvg.py to convert it to the ODVG format required for training:

python tools/dataset_converters/goldg2odvg.py data/coco/mdetr_annotations/final_mixed_train_only_coco.json

This creates a new file, final_mixed_train_only_coco_vg.json, in data/coco/mdetr_annotations. The complete structure is as follows:

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   ├── mdetr_annotations
│   │   │   ├── final_mixed_train.json
│   │   │   ├── final_mixed_train_only_coco.json
│   │   │   ├── final_mixed_train_only_coco_vg.json
│   │   │   ├── ...
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── train2014
│   │   │   ├── xxx.jpg
│   │   │   ├── ...

Note: COCO2014 train and COCO2017 val have no images in common, so there is no data leakage concern for COCO evaluation.

8 Referring Expression Comprehension

This task comprises 4 datasets in total. For data preparation, see the fine-tuning dataset preparation section.

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   │   ├── instances_train2014.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...
│   │   ├── train2014
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── mdetr_annotations
│   │   │   ├── final_refexp_val.json
│   │   │   ├── finetune_refcoco_testA.json
│   │   │   ├── finetune_refcoco_testB.json
│   │   │   ├── finetune_refcoco+_testA.json
│   │   │   ├── finetune_refcoco+_testB.json
│   │   │   ├── finetune_refcocog_test.json
│   │   │   ├── finetune_refcoco_train_vg.json
│   │   │   ├── finetune_refcoco+_train_vg.json
│   │   │   ├── finetune_refcocog_train_vg.json
│   │   │   ├── finetune_grefcoco_train_vg.json

9 GRIT-20M

See the MM-GDINO-T pretraining data preparation and processing section.

Evaluation dataset preparation

1 COCO 2017

The preparation procedure is the same as described above. The final structure is as follows:

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...

2 LVIS 1.0

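The LVIS 1.0 val dataset comes in a mini version and a full version. The mini version exists for two reasons:

The full LVIS val set is large, so a single evaluation run takes a long time.
The full LVIS val set contains 15,000 images from COCO2017 train, so a model trained on COCO2017 data would suffer from data leakage.

LVIS 1.0 uses exactly the same images as COCO2017 and only provides new annotations. The minival annotation file can be downloaded from here, and the val 1.0 annotation file can be downloaded from here. The final structure is as follows: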

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   │   ├── lvis_v1_minival_inserted_image_name.json
│   │   │   ├── lvis_od_val.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...

3 ODinW

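ODinW stands for Object Detection in the Wild. It is a dataset for verifying how well grounding-pretrained models generalize across different real-world scenarios. It has two subsets, ODinW13 and ODinW35, made up of 13 and 35 datasets respectively. You can download them from here and then extract each file. The final structure is as follows: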

mmdetection
├── configs
├── data
│   ├── odinw
│   │   ├── AerialMaritimeDrone
│   │   |   |── large
│   │   |   |   ├── test
│   │   |   |   ├── train
│   │   |   |   ├── valid
│   │   |   |── tiled
│   │   ├── AmericanSignLanguageLetters
│   │   ├── Aquarium
│   │   ├── BCCD
│   │   ├── ...

Evaluating ODinW35 requires custom prompts, so the annotation JSON files must be processed in advance. You can use the override_category.py script for this; it generates new annotation files and does not overwrite the originals.

python configs/mm_grounding_dino/odinw/override_category.py data/odinw/

4 DOD

DOD comes from Described Object Detection: Liberating Object Detection with Flexible Expressions. The dataset can be downloaded from here. The final dataset structure is as follows:

mmdetection
├── configs
├── data
│   ├── d3
│   │   ├── d3_images
│   │   ├── d3_json
│   │   ├── d3_pkl

5 Flickr30k Entities

The files needed for Flickr30k training were already downloaded in the GoldG data preparation section above. Evaluation requires 2 JSON files, which you can download from here and here. The final dataset structure is as follows:

mmdetection
├── configs
├── data
│   ├── flickr30k_entities
│   │   ├── final_flickr_separateGT_train.json
│   │   ├── final_flickr_separateGT_val.json
│   │   ├── final_flickr_separateGT_test.json
│   │   ├── final_flickr_separateGT_train_vg.json
│   │   ├── flickr30k_images
│   │   │   ├── xxx.jpg
│   │   │   ├── ...

6 Referring Expression Comprehension

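Referring expression comprehension covers 4 datasets: RefCOCO, RefCOCO+, RefCOCOg, and gRefCOCO. All 4 use images from COCO2014 train. As with COCO2017, you can download the images from the official COCO site or from opendatalab, and the annotations can be downloaded directly from here. The mdetr_annotations folder contains many other annotations; if that is too much, you can download only the JSON files you need. The final dataset structure is as follows: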

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   │   ├── instances_train2014.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...
│   │   ├── train2014
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── mdetr_annotations
│   │   │   ├── final_refexp_val.json
│   │   │   ├── finetune_refcoco_testA.json
│   │   │   ├── finetune_refcoco_testB.json
│   │   │   ├── finetune_refcoco+_testA.json
│   │   │   ├── finetune_refcoco+_testB.json
│   │   │   ├── finetune_refcocog_test.json

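Note that gRefCOCO was introduced in GREC: Generalized Referring Expression Comprehension and is not included in the mdetr_annotations folder, so it has to be processed separately. The steps are:

Download gRefCOCO and extract it into the data/coco/ folder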

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   │   ├── instances_train2014.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...
│   │   ├── train2014
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── mdetr_annotations
│   │   ├── grefs
│   │   │   ├── grefs(unc).json
│   │   │   ├── instances.json

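Convert to COCO format

You can use the conversion script provided in the official gRefCOCO repository. Note that you need to uncomment line 161 and comment out line 160 to obtain the full JSON file.

# Clone the official repo first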
git clone https://github.com/henghuiding/gRefCOCO.git
cd gRefCOCO/mdetr
python scripts/fine-tuning/grefexp_coco_format.py --data_path ../../data/coco/grefs --out_path ../../data/coco/mdetr_annotations/ --coco_path ../../data/coco

This generates 4 JSON files in the data/coco/mdetr_annotations/ folder. The complete dataset structure is as follows:

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   │   ├── instances_train2014.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...
│   │   ├── train2014
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── mdetr_annotations
│   │   │   ├── final_refexp_val.json
│   │   │   ├── finetune_refcoco_testA.json
│   │   │   ├── finetune_refcoco_testB.json
│   │   │   ├── finetune_grefcoco_train.json
│   │   │   ├── finetune_grefcoco_val.json
│   │   │   ├── finetune_grefcoco_testA.json
│   │   │   ├── finetune_grefcoco_testB.json

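Fine-tuning dataset preparation

1 COCO 2017

COCO is the most commonly used dataset in detection, and we want to explore its fine-tuning modes more thoroughly. At present there are 3 ways to fine-tune:

Closed-set fine-tuning: after fine-tuning, the text descriptions can no longer be changed and the model becomes a closed-set algorithm. This maximizes performance on COCO but gives up generality.
Open-set continued-pretraining fine-tuning: the COCO dataset is trained with the same procedure as pretraining. There are two variants. The first lowers the learning rate, freezes certain modules, and trains only on COCO data. The second mixes COCO data with part of the pretraining data. Both aim to improve COCO performance while losing as little generalization as possible.
Open-vocabulary fine-tuning: following common practice in OVD, the COCO categories are split into base and novel classes. Training uses only the base classes, and evaluation covers both base and novel classes. This verifies OVD ability on COCO and likewise aims to improve COCO performance while losing as little generalization as possible.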
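The base/novel split used in open-vocabulary fine-tuning can be sketched as below. This is only an illustration of the idea, not what coco2ovd.py does internally: the helper names split_base_novel and keep_base_annotations are hypothetical, and the choice of "dog" as a novel class is made up for the example rather than taken from any official split.

```python
def split_base_novel(categories, novel_names):
    """Partition COCO-style categories into (base, novel) by class name."""
    novel_names = set(novel_names)
    base = [c for c in categories if c["name"] not in novel_names]
    novel = [c for c in categories if c["name"] in novel_names]
    return base, novel


def keep_base_annotations(annotations, base):
    """Keep only annotations whose category is a base class (training view)."""
    base_ids = {c["id"] for c in base}
    return [a for a in annotations if a["category_id"] in base_ids]


# Toy categories and annotations (hypothetical values).
categories = [
    {"id": 1, "name": "person"},
    {"id": 2, "name": "cat"},
    {"id": 3, "name": "dog"},
]
annotations = [
    {"id": 100, "image_id": 1, "category_id": 1},
    {"id": 101, "image_id": 1, "category_id": 3},
    {"id": 102, "image_id": 2, "category_id": 2},
]

base, novel = split_base_novel(categories, {"dog"})
train_annotations = keep_base_annotations(annotations, base)
print([c["name"] for c in base], [c["name"] for c in novel])
```

Training would see only train_annotations, while evaluation would still use every annotation so that novel classes are scored as well.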

(1) Closed-set fine-tuning

No extra data preparation is needed here; use the data prepared earlier.

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...

(2) Open-set continued-pretraining fine-tuning
This mode requires converting the COCO training data to ODVG format. You can use the following command:

python tools/dataset_converters/coco2odvg.py data/coco/annotations/instances_train2017.json -d coco

This generates instances_train2017_od.json and coco2017_label_map.json in data/coco/annotations/. The complete dataset structure is as follows:

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_train2017_od.json
│   │   │   ├── coco2017_label_map.json
│   │   │   ├── instances_val2017.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...

Once the data is ready, you can choose between standalone pretraining and mixed pretraining.

(3) Open-vocabulary fine-tuning
This mode requires converting the COCO training data to OVD format. You can use the following command:

python tools/dataset_converters/coco2ovd.py data/coco/

This generates instances_val2017_all_2.json and instances_val2017_seen_2.json in data/coco/annotations/. The complete dataset structure is as follows:

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_train2017_od.json
│   │   │   ├── instances_val2017_all_2.json
│   │   │   ├── instances_val2017_seen_2.json
│   │   │   ├── coco2017_label_map.json
│   │   │   ├── instances_val2017.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...

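You can then train and test directly with the provided config.

2 LVIS 1.0

LVIS is a dataset with 1203 categories, and it is also a long-tailed federated dataset, so fine-tuning on it is worthwhile. Because it has so many categories, closed-set fine-tuning is not feasible; only open-set continued-pretraining fine-tuning and open-vocabulary fine-tuning are used.

First prepare the LVIS training JSON files, which you can download from here. Only lvis_v1_train.json and lvis_v1_val.json are needed. Place them in data/coco/annotations/; the directory should then look as follows: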

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── lvis_v1_train.json
│   │   │   ├── lvis_v1_val.json
│   │   │   ├── instances_val2017.json
│   │   │   ├── lvis_v1_minival_inserted_image_name.json
│   │   │   ├── lvis_od_val.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...

(1) Open-set continued-pretraining fine-tuning

Convert to ODVG format with the following command:

python tools/dataset_converters/lvis2odvg.py data/coco/annotations/lvis_v1_train.json

This generates lvis_v1_train_od.json and lvis_v1_label_map.json in data/coco/annotations/. The complete dataset structure is as follows:

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── lvis_v1_train.json
│   │   │   ├── lvis_v1_val.json
│   │   │   ├── lvis_v1_train_od.json
│   │   │   ├── lvis_v1_label_map.json
│   │   │   ├── instances_val2017.json
│   │   │   ├── lvis_v1_minival_inserted_image_name.json
│   │   │   ├── lvis_od_val.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...

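You can then train and test directly with the provided config, or modify the config to mix this data with part of the pretraining datasets.

(2) Open-vocabulary fine-tuning

Convert to OVD format with the following command: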

python tools/dataset_converters/lvis2ovd.py data/coco/

This generates lvis_v1_train_od_norare.json and lvis_v1_label_map_norare.json in data/coco/annotations/. The complete dataset structure is as follows:

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── lvis_v1_train.json
│   │   │   ├── lvis_v1_val.json
│   │   │   ├── lvis_v1_train_od.json
│   │   │   ├── lvis_v1_label_map.json
│   │   │   ├── instances_val2017.json
│   │   │   ├── lvis_v1_minival_inserted_image_name.json
│   │   │   ├── lvis_od_val.json
│   │   │   ├── lvis_v1_train_od_norare.json
│   │   │   ├── lvis_v1_label_map_norare.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...

You can then train and test directly with the provided config.

3 RTTS

RTTS is a dense-fog weather dataset containing 4,322 foggy images in five classes: bicycle, bus, car, motorbike, and person. You can download it from here and extract it into the data/RTTS/ folder. The complete dataset structure is as follows:

mmdetection
├── configs
├── data
│   ├── RTTS
│   │   ├── annotations_json
│   │   ├── annotations_xml
│   │   ├── ImageSets
│   │   ├── JPEGImages

4 RUOD

RUOD is an underwater object detection dataset. You can download it from here and extract it into the data/RUOD/ folder. The complete dataset structure is as follows:

mmdetection
├── configs
├── data
│   ├── RUOD
│   │   ├── Environment_pic
│   │   ├── Environmet_ANN
│   │   ├── RUOD_ANN
│   │   ├── RUOD_pic

5 Brain Tumor

Brain Tumor is a 2D detection dataset from the medical domain. You can download it from here; be sure to select the COCO JSON format. Then extract it into the data/brain_tumor_v2/ folder. The complete dataset structure is as follows:

mmdetection
├── configs
├── data
│   ├── brain_tumor_v2
│   │   ├── test
│   │   ├── train
│   │   ├── valid

6 Cityscapes

Cityscapes is an urban street-scene dataset. You can download it from here or from opendatalab and extract it into the data/cityscapes/ folder. The complete dataset structure is as follows:

mmdetection
├── configs
├── data
│   ├── cityscapes
│   │   ├── annotations
│   │   ├── leftImg8bit
│   │   │   ├── train
│   │   │   ├── val
│   │   ├── gtFine
│   │   │   ├── train
│   │   │   ├── val

After downloading, use the cityscapes.py script to generate the JSON files we need:

python tools/dataset_converters/cityscapes.py data/cityscapes/

This generates 3 new JSON files in annotations. The complete dataset structure is as follows:

mmdetection
├── configs
├── data
│   ├── cityscapes
│   │   ├── annotations
│   │   │   ├── instancesonly_filtered_gtFine_train.json
│   │   │   ├── instancesonly_filtered_gtFine_val.json
│   │   │   ├── instancesonly_filtered_gtFine_test.json
│   │   ├── leftImg8bit
│   │   │   ├── train
│   │   │   ├── val
│   │   ├── gtFine
│   │   │   ├── train
│   │   │   ├── val

7 People in Painting

People in Painting is an oil-painting dataset. You can download it from here; be sure to select the COCO JSON format. Then extract it into the data/people_in_painting_v2/ folder. The complete dataset structure is as follows:

mmdetection
├── configs
├── data
│   ├── people_in_painting_v2
│   │   ├── test
│   │   ├── train
│   │   ├── valid

8 Referring Expression Comprehension

Fine-tuning for referring expression comprehension covers the same 4 datasets as before, all of which were already organized in the evaluation data preparation stage. The complete dataset structure is as follows:

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   │   ├── instances_train2014.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...
│   │   ├── train2014
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── mdetr_annotations
│   │   │   ├── final_refexp_val.json
│   │   │   ├── finetune_refcoco_testA.json
│   │   │   ├── finetune_refcoco_testB.json
│   │   │   ├── finetune_refcoco+_testA.json
│   │   │   ├── finetune_refcoco+_testB.json
│   │   │   ├── finetune_refcocog_test.json

Next, convert the data to the required ODVG format using the refcoco2odvg.py script:

python tools/dataset_converters/refcoco2odvg.py data/coco/mdetr_annotations

This generates 4 new JSON files in data/coco/mdetr_annotations. The dataset structure after conversion is as follows:

mmdetection
├── configs
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json
│   │   │   ├── instances_val2017.json
│   │   │   ├── instances_train2014.json
│   │   ├── train2017
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── val2017
│   │   │   ├── xxxx.jpg
│   │   │   ├── ...
│   │   ├── train2014
│   │   │   ├── xxx.jpg
│   │   │   ├── ...
│   │   ├── mdetr_annotations
│   │   │   ├── final_refexp_val.json
│   │   │   ├── finetune_refcoco_testA.json
│   │   │   ├── finetune_refcoco_testB.json
│   │   │   ├── finetune_refcoco+_testA.json
│   │   │   ├── finetune_refcoco+_testB.json
│   │   │   ├── finetune_refcocog_test.json
│   │   │   ├── finetune_refcoco_train_vg.json
│   │   │   ├── finetune_refcoco+_train_vg.json
│   │   │   ├── finetune_refcocog_train_vg.json
│   │   │   ├── finetune_grefcoco_train_vg.json

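Inference and fine-tuning

Additional dependencies need to be installed:

cd $MMDETROOT
pip install -r requirements/multimodal.txt
pip install emoji ddd-dataset
pip install git+https://github.com/lvis-dataset/lvis-api.git

Note that the third-party LVIS library does not currently support numpy 1.24, so make sure your numpy version meets this requirement. Installing numpy 1.23 is recommended.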
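If you want to catch an incompatible numpy early, a small check along the following lines may help. It is only a sketch: the helper name numpy_version_supported is hypothetical, and it simply encodes the statement above that versions from 1.24 onward are not supported by the LVIS library.

```python
def numpy_version_supported(version: str) -> bool:
    """Return True if the given numpy version string is below 1.24."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (1, 24)


# In a real environment you would pass numpy.__version__ here.
print(numpy_version_supported("1.23.5"))
print(numpy_version_supported("1.24.0"))
```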

MM Grounding DINO-T model weights download

For demonstration purposes, you can download the MM Grounding DINO-T weights to the current directory in advance:

wget https://download.openmmlab.com/mmdetection/v3.0/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth

The model weights and corresponding configs are listed below:

Model	Backbone	Style	COCO mAP	Pre-Train Data	Config	Download
GDINO-T	Swin-T	Zero-shot	46.7	O365
GDINO-T	Swin-T	Zero-shot	48.1	O365,GoldG
GDINO-T	Swin-T	Zero-shot	48.4	O365,GoldG,Cap4M	config	model
MM-GDINO-T	Swin-T	Zero-shot	48.5(+1.8)	O365	config
MM-GDINO-T	Swin-T	Zero-shot	50.4(+2.3)	O365,GoldG	config	model | log
MM-GDINO-T	Swin-T	Zero-shot	50.5(+2.1)	O365,GoldG,GRIT	config	model | log
MM-GDINO-T	Swin-T	Zero-shot	50.6(+2.2)	O365,GoldG,V3Det	config	model | log
MM-GDINO-T	Swin-T	Zero-shot	50.4(+2.0)	O365,GoldG,GRIT,V3Det	config	model | log
MM-GDINO-B	Swin-B	Zero-shot	52.5	O365,GoldG,V3Det	config	model | log
MM-GDINO-B*	Swin-B	-	59.5	O365,ALL	config	model | log
MM-GDINO-L	Swin-L	Zero-shot	53.0	O365V2,OpenImageV6,GoldG	config	model | log
MM-GDINO-L*	Swin-L	-	60.3	O365V2,OpenImageV6,ALL	config	model | log

* indicates that the model has not been fully trained yet. The final weights will be released in the future.
ALL: GoldG, V3Det, COCO2017, LVISV1, COCO2014, GRIT, RefCOCO, RefCOCO+, RefCOCOg, gRefCOCO.

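Inference

Before running inference, it is recommended to download these images to the current directory so you can try inference on different pictures.

MM Grounding DINO supports 4 inference modes: closed-set object detection, open-vocabulary object detection, phrase grounding, and referring expression comprehension. Each is described below.

(1) Closed-set object detection

Because MM Grounding DINO is a pretrained model, it can in principle be applied to any closed-set detection dataset. Commonly used ones such as coco/voc/cityscapes/objects365v1/lvis are currently supported. The following uses coco as an example: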

python demo/image_demo.py images/animals.png \
        configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
        --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
        --texts '$: coco'

The prediction is saved as outputs/vis/animals.png under the current directory.

Ostrich is not one of the 80 COCO classes, so it is not detected.

Note that objects365v1 and lvis have many categories. Feeding all class names into the network at once exceeds 256 tokens and makes predictions extremely poor. In this case, use the --chunked-size argument to predict in chunks; prediction also takes longer.

python demo/image_demo.py images/animals.png \
        configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
        --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
        --texts '$: lvis' --chunked-size 70 \
        --palette random

Different --chunked-size values lead to different prediction results; feel free to experiment.
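The idea behind chunked prediction can be sketched as splitting the class-name list into groups and building one text prompt per group. The snippet below is an illustration only: chunk_class_names is a hypothetical helper, the " . " separator is an assumption about the prompt format, and the actual MMDetection implementation may differ in detail.

```python
def chunk_class_names(names, chunked_size):
    """Split a list of class names into consecutive chunks of at most chunked_size."""
    if chunked_size <= 0:
        raise ValueError("chunked_size must be positive")
    return [names[i:i + chunked_size] for i in range(0, len(names), chunked_size)]


# Toy class list; a real LVIS run would have 1203 names.
names = ["zebra", "giraffe", "ostrich", "elephant", "lion"]
chunks = chunk_class_names(names, 2)

# One prompt per chunk, so each prompt stays well below the token limit.
prompts = [" . ".join(chunk) for chunk in chunks]
print(prompts)
```

A smaller chunk size means more forward passes and shorter prompts, which is one reason the setting trades prediction time against prompt length.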

(2) Open-vocabulary object detection

Open-vocabulary object detection means that arbitrary class names can be supplied at inference time:

python demo/image_demo.py images/animals.png \
        configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
        --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
        --texts 'zebra. giraffe' -c

(3) Phrase Grounding

In phrase grounding, the user provides a natural-language description and the model detects the bboxes corresponding to the noun phrases it mentions. There are two ways to use it.

This relies on the NLTK library. First, find where NLTK looks for its data by running:

import nltk

if __name__ == "__main__":
    print(nltk.find("."))

Download the NLTK data and place it in any of the paths reported above. Download link: https://gitee.com/qwererer2/nltk_data/tree/gh-pages. After extracting, rename packages to nltk_data and move nltk_data into any of those directories.

Create a new script and run the following code:

    from nltk.book import *

If the import completes without errors, the setup is working.

Usage 1: let NLTK extract the noun phrases automatically, then detect them

python demo/image_demo.py images/apples.jpg \
        configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
        --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
        --texts 'There are many apples here.'

The program automatically extracts many apples as the noun phrase and detects the corresponding objects. Different input descriptions can change the prediction considerably.

Usage 2: specify which parts of the sentence are noun phrases yourself, avoiding NLTK extraction errors

python demo/image_demo.py images/fruit.jpg \
        configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
        --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
        --texts 'The picture contains watermelon, flower, and a white bottle.' \
        --tokens-positive "[[[21,31]], [[45,59]]]" --pred-score-thr 0.12

21,31 corresponds to the noun phrase watermelon, and 45,59 corresponds to the noun phrase a white bottle.
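The character offsets passed to --tokens-positive can be computed rather than counted by hand. The following sketch assumes each phrase occurs exactly once in the sentence and uses a hypothetical helper named phrase_spans; offsets are 0-based with an exclusive end, which matches the 21,31 and 45,59 values above.

```python
def phrase_spans(caption, phrases):
    """Return one [[start, end]] entry per phrase, using character offsets."""
    spans = []
    for phrase in phrases:
        start = caption.find(phrase)
        if start < 0:
            raise ValueError(f"phrase not found in caption: {phrase!r}")
        spans.append([[start, start + len(phrase)]])
    return spans


caption = 'The picture contains watermelon, flower, and a white bottle.'
spans = phrase_spans(caption, ['watermelon', 'a white bottle'])
print(spans)  # [[[21, 31]], [[45, 59]]]
```

The printed list can be pasted as the value of --tokens-positive after wrapping it in quotes.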

(4) Referring expression comprehension

In referring expression comprehension, the user provides a natural-language description and the model interprets the referring expression as a whole; no noun phrase extraction is needed.

python demo/image_demo.py images/apples.jpg \
        configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
        --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth \
        --texts 'red apple.' \
        --tokens-positive -1

Evaluation

All evaluation scripts we provide share the same interface: prepare the data in advance, then run the relevant config.

(1) Zero-Shot COCO2017 val

# Single GPU
python tools/test.py configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
        grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth

# 8 GPUs
./tools/dist_test.sh configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py \
        grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth 8

(2) Zero-Shot ODinW13

# Single GPU
python tools/test.py configs/mm_grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw13.py \
        grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth

# 8 GPUs
./tools/dist_test.sh configs/mm_grounding_dino/odinw/grounding_dino_swin-t_pretrain_odinw13.py \
        grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth 8

Visualizing results on evaluation datasets

To make it easier to visualize and analyze model predictions, visualization of predictions on evaluation datasets is supported. Taking referring expression comprehension as an example:

python tools/test.py configs/mm_grounding_dino/refcoco/grounding_dino_swin-t_pretrain_zeroshot_refexp.py \
        grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth --work-dir refcoco_result --show-dir save_path

During inference the model saves the visualizations to refcoco_result/{current timestamp}/save_path. For other evaluation datasets, just replace the config file.

Visualization results for several datasets follow. In each, the left image is the ground truth and the right image is the prediction.

[Figure: COCO2017 val results]
[Figure: Flickr30k Entities results]
[Figure: DOD results]
[Figure: RefCOCO val results]
[Figure: RefCOCO testA results]
[Figure: gRefCOCO val results]

Model training

To reproduce our results, once the datasets are prepared you can train directly with the following commands:

# Single node, 8 GPUs, obj365v1 dataset only
./tools/dist_train.sh configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365.py 8
# Single node, 8 GPUs, obj365v1/goldg/grit/v3det datasets; other dataset combinations are similar
./tools/dist_train.sh configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det.py 8

For multi-node training, see train.md. The MM-Grounding-DINO T model is trained by default on 32 3090Ti GPUs. If your total batch size is not 32x4=128, you need to scale the learning rate linearly by hand.
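Linear scaling means the learning rate changes in proportion to the total batch size. A minimal sketch follows; the function name scale_lr is hypothetical, and the base learning rate of 0.0004 is a placeholder chosen for the example, so read the real value from the config you are training with.

```python
def scale_lr(base_lr, total_batch_size, reference_batch_size=128):
    """Scale the learning rate linearly with the total batch size."""
    if total_batch_size <= 0 or reference_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * total_batch_size / reference_batch_size


# Example: 8 GPUs with 4 images each gives a total batch size of 32.
gpus, images_per_gpu = 8, 4
new_lr = scale_lr(0.0004, gpus * images_per_gpu)  # 0.0004 is a placeholder
print(new_lr)
```

With a quarter of the reference batch size, the learning rate in this example also drops to a quarter of its base value.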

Custom pretraining format

To unify the pretraining format across datasets, we follow the format designed by Open-GroundingDino. There are 2 formats.

(1) Object detection format OD

{"filename": "obj365_train_000000734304.jpg",
 "height": 512,
 "width": 769,
 "detection": {
    "instances": [
      {"bbox": [109.4768676992, 346.0190429696, 135.1918335098, 365.3641967616], "label": 2, "category": "chair"},
      {"bbox": [58.612365705900004, 323.2281494016, 242.6005859067, 451.4166870016], "label": 8, "category": "car"}
    ]
  }
}

The value of label must match the corresponding label_map. Each item in the instances list corresponds to one bbox (in x1y1x2y2 format).
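Because COCO stores boxes as x, y, width, height while the OD format expects x1y1x2y2, a conversion step is needed when building records by hand. The sketch below is an illustration, not the code of coco2odvg.py: coco_box_to_xyxy and make_od_record are hypothetical helpers, and the label map is assumed to map string indices to class names, as in the cat example later in this article.

```python
def coco_box_to_xyxy(box):
    """Convert a COCO [x, y, w, h] box to [x1, y1, x2, y2]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def make_od_record(filename, height, width, coco_instances, label_map):
    """Build one OD-format record from (coco_box, category_name) pairs."""
    name_to_label = {name: int(index) for index, name in label_map.items()}
    instances = [
        {
            "bbox": coco_box_to_xyxy(box),
            "label": name_to_label[category],
            "category": category,
        }
        for box, category in coco_instances
    ]
    return {
        "filename": filename,
        "height": height,
        "width": width,
        "detection": {"instances": instances},
    }


# Toy input (hypothetical file name and box).
label_map = {"0": "cat"}
record = make_od_record("cat_001.jpg", 480, 640, [([10, 20, 30, 40], "cat")], label_map)
print(record["detection"]["instances"][0]["bbox"])
```

Looking labels up through the label map keeps the label value and the category name consistent, which is the requirement stated above.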

(2) Phrase grounding format VG

{"filename": "2405116.jpg",
 "height": 375,
 "width": 500,
 "grounding": {
    "caption": "Two surfers walking down the shore. sand on the beach.",
    "regions": [
      {"bbox": [206, 156, 282, 248], "phrase": "Two surfers", "tokens_positive": [[0, 3], [4, 11]]},
      {"bbox": [303, 338, 443, 343], "phrase": "sand", "tokens_positive": [[36, 40]]},
      {"bbox": [[327, 223, 421, 282], [300, 200, 400, 210]], "phrase": "beach", "tokens_positive": [[48, 53]]}
    ]
  }
}

tokens_positive gives the character positions of the current phrase within the caption.
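A quick way to sanity-check a VG record is to slice the caption with each tokens_positive span and compare the result with the phrase. This sketch uses the example record above; the helper name phrase_from_spans is hypothetical, and joining multiple spans with a single space is an assumption that holds for this example.

```python
def phrase_from_spans(caption, tokens_positive):
    """Rebuild a phrase by slicing the caption with [start, end) character spans."""
    return " ".join(caption[start:end] for start, end in tokens_positive)


caption = "Two surfers walking down the shore. sand on the beach."
regions = [
    {"phrase": "Two surfers", "tokens_positive": [[0, 3], [4, 11]]},
    {"phrase": "sand", "tokens_positive": [[36, 40]]},
    {"phrase": "beach", "tokens_positive": [[48, 53]]},
]

recovered = [phrase_from_spans(caption, r["tokens_positive"]) for r in regions]
print(recovered)  # ['Two surfers', 'sand', 'beach']
```

If a recovered string differs from its phrase, the offsets are likely off by one or were computed on a different version of the caption.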

Fine-tuning on a custom dataset: an example

To help users fine-tune on their own datasets, we provide a fine-tuning example based on the simple cat dataset.

1 Data preparation

cd mmdetection
wget https://download.openmmlab.com/mmyolo/data/cat_dataset.zip
unzip cat_dataset.zip -d data/cat/

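The cat dataset is a single-class dataset of 144 images, already converted to COCO format.

2 Config preparation

Because the cat dataset is simple and small, we train for 20 epochs on 8 GPUs, scale the learning rate accordingly, and train only the vision model while leaving the language model untrained.

The detailed config can be found in grounding_dino_swin-t_finetune_8xb4_20e_cat.

3 Visualization and zero-shot evaluation

Because MM Grounding DINO is an open detection model, it can detect and be evaluated on the cat dataset even without having been trained on it.

Visualization of a single image: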

cd mmdetection
python demo/image_demo.py data/cat/images/IMG_20211205_120756.jpg configs/mm_grounding_dino/grounding_dino_swin-t_finetune_8xb4_20e_cat.py --weights grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth --texts cat.

Zero-shot evaluation results on the test set:

python tools/test.py configs/mm_grounding_dino/grounding_dino_swin-t_finetune_8xb4_20e_cat.py grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.881
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 1.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.929
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.881
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.913
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.913
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.913
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.913

4 Model training

./tools/dist_train.sh configs/mm_grounding_dino/grounding_dino_swin-t_finetune_8xb4_20e_cat.py 8 --work-dir cat_work_dir

The best-performing checkpoint is saved. The best result is reached at epoch 16, with the following metrics:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.901
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 1.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.930
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.901
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.967
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.967
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.967
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.967

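After fine-tuning, performance on the cat dataset improves from 88.1 to 90.1. Because the dataset is small, the evaluation metrics fluctuate considerably.

Iterative pseudo-label generation and refinement pipeline via model self-training

For users who want to build a dataset from scratch, or who want to use the model's inference ability to bootstrap pseudo-labels and refine them iteratively to improve performance, we provide a dedicated pipeline.

Because we define two data formats, each is demonstrated separately.

1 Object detection format

We again use the cat dataset from above, and assume that only a set of images and predefined categories is available, with no annotations.

Generate the initial ODVG-format file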

import os
import cv2
import json
import jsonlines

data_root = 'data/cat'
images_path = os.path.join(data_root, 'images')
out_path = os.path.join(data_root, 'cat_train_od.json')
metas = []
for files in os.listdir(images_path):
    img = cv2.imread(os.path.join(images_path, files))
    height, width, _ = img.shape
    metas.append({"filename": files, "height": height, "width": width})

with jsonlines.open(out_path, mode='w') as writer:
    writer.write_all(metas)

# Generate the label map; there is only one class, so a single cat entry is enough
label_map_path = os.path.join(data_root, 'cat_label_map.json')
with open(label_map_path, 'w') as f:
    json.dump({'0': 'cat'}, f)

This generates two files, cat_train_od.json and cat_label_map.json, under data/cat.

Run inference with the pretrained model and save the results

We provide a ready-to-use config. For other datasets, you can adapt this config.

python tools/test.py configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_pseudo-labeling_cat.py \
        grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth

This generates a new file, cat_train_od_v1.json, under data/cat. You can open it to check manually or visualize it with the script:

python tools/analysis_tools/browse_grounding_raw.py data/cat/ cat_train_od_v1.json images --label-map-file cat_label_map.json -o your_output_dir --not-show

The visualizations are written to your_output_dir.

Continue training to improve performance

Once you have pseudo-labels, you can mix in some pretraining data and continue pretraining to improve the model's performance on the current dataset. Then rerun step 2 (inference) to obtain more accurate pseudo-labels, and iterate in this way.

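2 Phrase Grounding format

Generate the initial ODVG-format file

The bootstrapping flow for Phrase Grounding requires that each image's caption and pre-segmented phrases be provided up front. Taking flickr30k entities images as an example, a typical generated file looks like this: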

[
  {
    "filename": "3028766968.jpg",
    "height": 375,
    "width": 500,
    "grounding": {
      "caption": "Man with a black shirt on sit behind a desk sorting threw a giant stack of people work with a smirk on his face .",
      "regions": [
        {"bbox": [0, 0, 1, 1], "phrase": "a giant stack of people", "tokens_positive": [[58, 81]]},
        {"bbox": [0, 0, 1, 1], "phrase": "a black shirt", "tokens_positive": [[9, 22]]},
        {"bbox": [0, 0, 1, 1], "phrase": "a desk", "tokens_positive": [[37, 43]]},
        {"bbox": [0, 0, 1, 1], "phrase": "his face", "tokens_positive": [[103, 111]]},
        {"bbox": [0, 0, 1, 1], "phrase": "Man", "tokens_positive": [[0, 3]]}
      ]
    }
  },
  {
    "filename": "6944134083.jpg",
    "height": 319,
    "width": 500,
    "grounding": {
      "caption": "Two men are competing in a horse race .",
      "regions": [
        {"bbox": [0, 0, 1, 1], "phrase": "Two men", "tokens_positive": [[0, 7]]}
      ]
    }
  }
]

Initially, bbox must be set to [0, 0, 1, 1] so that the program runs correctly; the bbox values themselves are not used. The file itself is stored with one record per line:

{"filename": "3028766968.jpg", "height": 375, "width": 500, "grounding": {"caption": "Man with a black shirt on sit behind a desk sorting threw a giant stack of people work with a smirk on his face .", "regions": [{"bbox": [0, 0, 1, 1], "phrase": "a giant stack of people", "tokens_positive": [[58, 81]]}, {"bbox": [0, 0, 1, 1], "phrase": "a black shirt", "tokens_positive": [[9, 22]]}, {"bbox": [0, 0, 1, 1], "phrase": "a desk", "tokens_positive": [[37, 43]]}, {"bbox": [0, 0, 1, 1], "phrase": "his face", "tokens_positive": [[103, 111]]}, {"bbox": [0, 0, 1, 1], "phrase": "Man", "tokens_positive": [[0, 3]]}]}}
{"filename": "6944134083.jpg", "height": 319, "width": 500, "grounding": {"caption": "Two men are competing in a horse race .", "regions": [{"bbox": [0, 0, 1, 1], "phrase": "Two men", "tokens_positive": [[0, 7]]}]}}

You can copy the text above directly, paste it into a file named flickr_simple_train_vg.json, and place that file in the data/flickr30k_entities dataset directory you prepared beforehand. See the data preparation document for details.
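For your own captions, the tokens_positive spans do not need to be written by hand. Judging from the example records, each span appears to be a pair of character offsets into the caption, with the start inclusive and the end exclusive ("Man" maps to [0, 3]). The sketch below builds an initial record under that assumption. The helper name `build_vg_record` is hypothetical, and `str.find` only locates the first occurrence of a phrase, so captions that repeat a phrase would need extra handling.

```python
import json


def build_vg_record(filename, height, width, caption, phrases):
    """Build one initial record with placeholder boxes and character spans."""
    regions = []
    for phrase in phrases:
        start = caption.find(phrase)  # first occurrence only
        if start < 0:
            raise ValueError(f"phrase not found in caption: {phrase!r}")
        regions.append({
            "bbox": [0, 0, 1, 1],  # placeholder required by the pipeline
            "phrase": phrase,
            "tokens_positive": [[start, start + len(phrase)]],
        })
    return {
        "filename": filename,
        "height": height,
        "width": width,
        "grounding": {"caption": caption, "regions": regions},
    }


record = build_vg_record(
    "6944134083.jpg", 319, 500,
    "Two men are competing in a horse race .",
    ["Two men"],
)
# One record per line, matching the JSON Lines layout of the file.
print(json.dumps(record))
```

Writing one such line per image yields a file in the same layout as flickr_simple_train_vg.json.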

Run inference with the pretrained model and save the results

We provide a ready-to-use config. If you are working with a different dataset, you can adapt this config.

python tools/test.py configs/mm_grounding_dino/grounding_dino_swin-t_pretrain_pseudo-labeling_flickr30k.py \
    grounding_dino_swin-t_pretrain_obj365_goldg_grit9m_v3det_20231204_095047-b448804b.pth

This creates a new file, flickr_simple_train_vg_v1.json, in the data/flickr30k_entities directory. You can open it manually to check the results, or visualize them with the script below:

python tools/analysis_tools/browse_grounding_raw.py data/flickr30k_entities/ flickr_simple_train_vg_v1.json flickr30k_images -o your_output_dir --not-show

The visualization results are written to the your_output_dir directory, as shown in the figure below:

Continue training to improve performance
