## Preparing Data for YOLO-World

### Overview
For pre-training YOLO-World, we adopt several datasets, as listed in the table below:

| Data | Samples | Type | Boxes |
| :--- | :---: | :---: | :---: |
| Objects365v1 | 609k | detection | 9,621k |
| GQA | 621k | grounding | 3,681k |
| Flickr | 149k | grounding | 641k |
| CC3M-Lite | 245k | image-text | 821k |
### Dataset Directory
We put all data into the `data` directory, such as:

```text
├── coco
│   ├── annotations
│   ├── lvis
│   ├── train2017
│   ├── val2017
├── flickr
│   ├── annotations
│   └── images
├── mixed_grounding
│   ├── annotations
│   ├── images
├── objects365v1
│   ├── annotations
│   ├── train
│   ├── val
```

**NOTE:** We strongly suggest that you check the directories or paths in the dataset part of the config file, especially the values of `ann_file`, `data_root`, and `data_prefix`.
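For reference, a dataset entry in an MMYOLO/MMDetection-style config refers to these values roughly as in the sketch below. The dataset type and relative paths are illustrative and should be adjusted to your local layout.

```python
# Sketch only: check these paths against your actual data directory.
coco_val_dataset = dict(
    type='LVISV1Dataset',
    data_root='data/coco/',  # root directory of the dataset
    ann_file='lvis/lvis_v1_minival_inserted_image_name.json',  # relative to data_root
    data_prefix=dict(img=''))  # image folder prefix, relative to data_root
```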
We provide the annotations of the pre-training data in the table below:

| Data | Images | Annotation File |
| :--- | :---: | :---: |
| Objects365v1 | Objects365 train | `objects365_train.json` |
| MixedGrounding | GQA | `final_mixed_train_no_coco.json` |
| Flickr30k | Flickr30k | `final_flickr_separateGT_train.json` |
| LVIS-minival | COCO val2017 | `lvis_v1_minival_inserted_image_name.json` |
Acknowledgement: We sincerely thank GLIP and mdetr for providing the annotation files for pre-training.
### Dataset Class
For training YOLO-World, we mainly adopt two kinds of dataset classes:
#### 1. `MultiModalDataset`
`MultiModalDataset` is a simple wrapper around a pre-defined dataset class, such as Objects365 or COCO, which adds the category texts to the dataset instance for formatting the input texts.
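As a rough illustration, wrapping Objects365 with `MultiModalDataset` in the config looks like the sketch below. It follows the style of the released pre-training configs, but the text JSON path and the pipeline are assumptions that may differ in your setup.

```python
# A sketch, not the exact released config: verify the dataset type,
# paths, and text JSON path against your own config file.
train_pipeline = []  # placeholder; use the pipeline from your base config

obj365v1_train_dataset = dict(
    type='MultiModalDataset',
    dataset=dict(
        type='Objects365V1Dataset',
        data_root='data/objects365v1/',
        ann_file='annotations/objects365_train.json',
        data_prefix=dict(img='train/')),
    class_text_path='data/texts/obj365v1_class_texts.json',  # category texts (assumed path)
    pipeline=train_pipeline)
```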
##### Text JSON
The JSON file is formatted as follows:

```json
[
    ["A_1", "A_2"],
    ["B"],
    ["C_1", "C_2", "C_3"],
    ...
]
```

We have provided the text JSON files for LVIS, COCO, and Objects365.
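If you need a text JSON for a custom dataset, it can be generated with a few lines of Python. The category names and output filename below are purely illustrative.

```python
import json

# Each entry lists the text(s) for one category; multiple strings in an
# entry serve as alternative names for the same class. Names are examples.
class_texts = [
    ["person"],
    ["bicycle", "bike"],
    ["car", "automobile"],
]

with open("custom_class_texts.json", "w") as f:
    json.dump(class_texts, f)
```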
#### 2. `YOLOv5MixedGroundingDataset`
`YOLOv5MixedGroundingDataset` extends the COCO dataset and supports loading texts/captions from the JSON file. It is designed for MixedGrounding or Flickr30K, which provide text tokens for each object.
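A minimal config sketch for this dataset class is shown below, assuming MMYOLO/MMDetection-style dict configs; the image prefix and the pipeline are assumptions to verify against your own setup.

```python
# Sketch only: confirm paths and pipeline names in your own config.
train_pipeline = []  # placeholder; use the pipeline from your base config

mg_train_dataset = dict(
    type='YOLOv5MixedGroundingDataset',
    data_root='data/mixed_grounding/',
    ann_file='annotations/final_mixed_train_no_coco.json',
    data_prefix=dict(img='images/'),  # assumed image folder
    pipeline=train_pipeline)
```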


