RoboRefIt

Yuhao Lu, Yixuan Fan, Beixing Deng, Fangfu Liu, Yali Li, Shengjin Wang
Department of Electronic Engineering, Tsinghua University

Abstract

Multi-modal learning of vision and language is beneficial for improving robotic interaction and perception in complex environments. Visual grounding is an emerging vision-and-language task that aims to locate the unique target in an image referred to by a natural-language expression. Some recent studies on voice-interactive robot manipulation have attempted to integrate the visual grounding task into the robot operation engine. However, existing visual grounding datasets are difficult to use as a practical test bed for human-robot interaction because of their low relevance to robotics. In the VL-Grasp work, we contribute a new and challenging visual grounding dataset for robotic perception and reasoning in indoor environments, called RoboRefIt. RoboRefIt collects 10,872 real-world RGB and depth images from cluttered daily-life scenes and generates 50,758 referring expressions in the form of robot language instructions. Moreover, nearly half of the images involve ambiguous object recognition. We hope that RoboRefIt provides a distinctive training bed of visual grounding tasks for robot interactive grasping.

Overview

The RoboRefIt dataset is proposed in the VL-Grasp work. It is a new and challenging visual grounding dataset, specially designed for the robot interactive grasp task. Details of data collection, data annotation, and dataset comparison can be found in the VL-Grasp paper. This homepage mainly serves as a supplement to the paper, introducing the dataset statistics, datasheets, data format, and so on.

Figure 1

Dataset Statistics for the RoboRefIt

We present statistics on the main data components of the RoboRefIt dataset. First, in terms of text content, the average sentence length is 9.5 words, and the detailed distribution of sentence lengths is shown in Figure 2. Figure 3 reflects the frequency of words or phrases in the expressions by their displayed size. Second, in terms of object selection, the RoboRefIt dataset contains 66 categories of objects, and the distribution of the number of instances per category is shown in Figure 4. Figure 5 shows the cumulative distribution curves of the sizes of the bounding boxes and the masks, where a large number of objects have a pixel area between 10^3 and 10^4. Third, in terms of scene collection, the objects of RoboRefIt are distributed across six daily-life scenes, as shown in Figure 6. For example, 24,521 objects are collected from the table scene, and the tables themselves are also diverse. A minimal sketch for reproducing some of these statistics is given after the figures below.

Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
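For readers who want to recompute some of these statistics, the following is a minimal Python sketch. It assumes the directory layout and JSON fields described in the Data Format section at the end of this page; the annotation file name ("train.json") and the assumption that each file holds a list of annotation records are hypothetical and may differ from the actual release.

    import json
    import numpy as np
    from PIL import Image

    # Hypothetical annotation file name; adjust to the actual release.
    with open("final_dataset/train/train.json", "r") as f:
        records = json.load(f)  # assumed to be a list of annotation dicts

    # Sentence length distribution (Figure 2).
    lengths = [len(r["text"].split()) for r in records]
    print("average sentence length:", np.mean(lengths))

    # Pixel area of each target mask (Figure 5); non-zero pixels are assumed
    # to belong to the target object.
    areas = []
    for r in records:
        mask = np.array(Image.open(r["mask_path"].replace("\\", "/")))
        areas.append(int((mask > 0).sum()))
    print("median mask area (pixels):", int(np.median(areas)))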

Comparison of train-testA-testB Splits

RoboRefIt is split into a train set, a testA set, and a testB set. The train set contains 7,929 images and 36,915 sentences. The testA set contains 1,859 images and 8,523 sentences, and the testB set contains 1,083 images and 5,320 sentences. The difference between the testA set and the testB set lies in how closely their scene distributions correlate with that of the train set; the testB set is more difficult than the testA set. As shown in Figure 7, the distribution of the train set is long-tailed and the distribution curve of testA follows a similar trend, whereas testB contains clustered samples from the tail categories. As shown in Figure 8, the testA set is similar to the train set in the proportion of instances across the six scenes, while the testB set has a different distribution. For instance, testB contains novel scenes, such as drawer scenes, new table scenes, and new sofa scenes. A small sketch for recomputing these per-split distributions is given after the figures below.

Figure 7
Figure 8
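The category and scene distributions compared in Figures 7 and 8 can be recomputed from the annotations. The sketch below counts instances per category and per scene for each split; the annotation file names are hypothetical and should be adjusted to the actual release.

    import json
    from collections import Counter

    # Hypothetical annotation file names; the released files may be named differently.
    splits = {
        "train": "final_dataset/train/train.json",
        "testA": "final_dataset/testA/testA.json",
        "testB": "final_dataset/testB/testB.json",
    }

    for name, path in splits.items():
        with open(path, "r") as f:
            records = json.load(f)
        per_class = Counter(r["class"] for r in records)
        per_scene = Counter(r["scene"] for r in records)
        print(name, "categories:", len(per_class), "scenes:", dict(per_scene))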

Datasheets for the RoboRefIt

The RoboRefIt dataset comprises RGB images, depth images, expressions, masks, and bounding boxes. Each RGB image corresponds to one depth image and to multiple expressions, masks, and bounding boxes. Each expression describes one object in the image and corresponds to one mask and one bounding box.

Table 1

The RoboRefIt dataset comprises 10,872 RGB-D images and 50,758 language expressions. The target objects in the images cover 150 instances from 66 categories. To make the statistics clearer, Table 1 lists the number of instances per category. An instance here refers to a target object that was actually placed in the scenes during image collection, and each category may contain multiple instances. See the Object list for a clearer overview of these objects.

Visualization Results of Visual Grounding

We benchmark the RoboRefIt dataset on the referring expression comprehension (REC) and referring expression segmentation (RES) tasks. Figure 9 shows some visualization results, arranged in four rows: referring expressions, images with bounding boxes (red for the ground truth, blue for the prediction), ground-truth masks, and predicted masks. A sketch for producing such box overlays is given after the figure below.

Figure 9
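As an illustration of how such box overlays can be produced, the following sketch draws the ground-truth box in red and a predicted box in blue on one testA image. The annotation file name and the predicted box are placeholders; in practice the prediction would come from a trained REC model, and the bounding box is assumed to be stored as [x1, y1, x2, y2].

    import json
    from PIL import Image, ImageDraw

    # Hypothetical annotation file name; adjust to the actual release.
    with open("final_dataset/testA/testA.json", "r") as f:
        record = json.load(f)[0]

    image = Image.open(record["rgb_path"].replace("\\", "/")).convert("RGB")
    draw = ImageDraw.Draw(image)

    x1, y1, x2, y2 = record["bbox"]  # assumed corner order: upper-left, lower-right
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)  # ground truth
    # Placeholder prediction; replace with the output of a REC model.
    draw.rectangle([x1 - 5, y1 - 5, x2 + 5, y2 + 5], outline="blue", width=3)

    image.save("rec_visualization.png")
    print(record["text"])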

Data Format

Download the RoboRefIt dataset and unzip the file into RoboRefIt/data/final_dataset/. The archive contains the train set, the testA set, and the testB set.

Each JSON file describes the full information needed to load the instances of its set. The JSON files contain:

  1. "num", the number of the instance.
  2. "text", the referring expression of the instance to describe the instance.
  3. "bbox", the vertex coordinates of the upper left corner and the vertex coordinates of the lower right corner of the bounding box of the instance.
  4. "rgb_path", the path of related RGB image for the text. For example, "final_dataset\\train\\image\\0000000.png".
  5. "depth_path", the path of related depth image for the text. For example, "final_dataset\\train\\depth\\0000000.png".
  6. "mask_path", the path of related mask image for the text. For example, "final_dataset\\train\\mask\\0000000.png".
  7. "scene", the scene type of the image.
  8. "class", the object type of the instance.

Moreover, there is a "depth_colormap" file, which stores the corresponding color map of the depth image.
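As a concrete example of how these fields fit together, the following minimal sketch loads one training sample. The annotation file name ("train.json"), the data root, and the assumption that each JSON file holds a list of records are not specified above and are used here only for illustration.

    import json
    import numpy as np
    from PIL import Image

    ROOT = "RoboRefIt/data"  # directory into which the archive was unzipped

    # Hypothetical annotation file name; adjust to the actual release.
    with open(ROOT + "/final_dataset/train/train.json", "r") as f:
        records = json.load(f)

    sample = records[0]
    rgb = np.array(Image.open(ROOT + "/" + sample["rgb_path"].replace("\\", "/")))
    depth = np.array(Image.open(ROOT + "/" + sample["depth_path"].replace("\\", "/")))
    mask = np.array(Image.open(ROOT + "/" + sample["mask_path"].replace("\\", "/")))

    print(sample["text"])                    # referring expression
    print(sample["bbox"])                    # upper-left and lower-right corners
    print(sample["scene"], sample["class"])  # scene type and object category
    print(rgb.shape, depth.shape, mask.shape)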