7.3. Output Data Specification

This section defines the inference (predict) output data format specification for each task.

7.3.1. Type of inference output data

There are two types of inference output data.

7.3.1.1. Tensor (np.array)

Tensor-type output is a post-processed multi-dimensional array.
The output does not include metadata. When you handle this type of output in an application, you need additional information such as the class names and the input image size before resizing.

7.3.1.2. JSON

JSON-type output includes metadata and is ready to use in applications.
For example, predicted bounding boxes are already resized to fit the raw input image size.

7.3.2. Tasks

Currently, there are only 3 image tasks. More image tasks, audio tasks, and others will be added.

  • Image

    • Classification: IMAGE.CLASSIFICATION

    • Object Detection: IMAGE.OBJECT_DETECTION

    • Semantic Segmentation: IMAGE.SEMANTIC_SEGMENTATION


7.4. Format Specification

This section defines the output data format for each task.

7.4.1. IMAGE.CLASSIFICATION

7.4.1.1. Tensor

The output shape is [batch size, number of classes].
For each image, each value is in the range (0, 1), and all values add up to 1.
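For illustration, the following is a minimal sketch (not part of the specification) of how an application might pick the top class from this tensor. The example output array and the class_names list are assumptions; the class names must come from external metadata, since the tensor itself carries none.

# Minimal sketch: interpreting the classification tensor output.
# `output` and `class_names` below are hypothetical example values.
import numpy as np

output = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.1, 0.8]])  # shape: [batch size, number of classes]
class_names = ["airplane", "automobile", "bird"]  # assumed, from external metadata

predicted_ids = np.argmax(output, axis=1)  # best class id per image
for i, class_id in enumerate(predicted_ids):
    print("image %d: %s (probability %.3f)"
          % (i, class_names[class_id], output[i, class_id]))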

7.4.1.2. JSON

“results” contains one prediction result per image in the batch.

{
    "classes": [
        {
            "id": 0,
            "name": "airplane"
        },
        {
            "id": 1,
            "name": "automobile"
        },
        ...
    ],
    "date": "2018-06-08T11:43:38.759582+00:00",
    "results": [
        {
            "file_path": "/home/lmnet/tests/fixtures/datasets/lm_things_on_a_table/Data_multi/30298aabba35f9bde39cf4a08af11bd9.jpg",
            "prediction": [
                {
                    "class": {
                        "id": 0,
                        "name": "airplane"
                    },
                    "probability": "0.08696663"
                },
                {
                    "class": {
                        "id": 1,
                        "name": "automobile"
                    },
                    "probability": "0.049272012"
                },
                ...
            ]
        },
        {
            "file_path": "/home/lmnet/tests/fixtures/datasets/lm_things_on_a_table/Data_multi/30298aabba35f9bde39cf4a08af11bd9.jpg",
            "prediction": [
                {
                    "class": {
                        "id": 0,
                        "name": "airplane"
                    },
                    "probability": "0.08696663"
                },
                {
                    "class": {
                        "id": 1,
                        "name": "automobile"
                    },
                    "probability": "0.049272012"
                },
                ...
            ]
        }
    ],
    "task": "IMAGE.CLASSIFICATION",
    "version": 0.2
}
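For illustration, the following is a minimal sketch (not part of the specification) of reading the top prediction for each image from this JSON. The file name output.json is hypothetical. Note that “probability” is serialized as a string and needs to be converted to a number.

# Minimal sketch: reading the top classification result from the JSON output.
import json

with open("output.json") as f:  # hypothetical file name
    data = json.load(f)

for result in data["results"]:
    # "probability" is a string in the JSON, so convert it to float.
    top = max(result["prediction"], key=lambda p: float(p["probability"]))
    print(result["file_path"], top["class"]["name"], float(top["probability"]))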

7.4.2. IMAGE.OBJECT_DETECTION

7.4.2.1. Tensor

The shape is [batch size, number of predicted boxes, 6], where the last dimension is (x(left), y(top), w, h, class_id, score).
For example, with the YOLOv2 network, the post-process FormatYOLOV2 converts the bare convolutional network output into this shape. All other object detection post-processes (e.g. NMS) then take this shape as input and produce output in the same shape.
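For illustration, the following is a minimal sketch (not part of the specification) of filtering this tensor by a score threshold. The example output array and the threshold value are assumptions.

# Minimal sketch: filtering the detection tensor by score.
# `output` below is a hypothetical example array.
import numpy as np

# shape: [batch size, number of predicted boxes, 6]
# last axis: (x(left), y(top), w, h, class_id, score)
output = np.array([[[300.0, 155.9, 23.9, 24.2, 3.0, 0.10],
                    [443.9, 443.9, 23.9, 24.2, 3.0, 0.75]]])

score_threshold = 0.5  # assumed threshold
for image_boxes in output:
    kept = image_boxes[image_boxes[:, 5] >= score_threshold]
    for x, y, w, h, class_id, score in kept:
        print("class %d at (%.1f, %.1f, %.1f, %.1f) score %.2f"
              % (int(class_id), x, y, w, h, score))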

7.4.2.2. JSON

“results” contains one prediction result per image in the batch.
The “box” order is (x(left), y(top), width, height), and the box coordinates are fitted to the raw input image size.

{
    "classes": [
        {
            "id": 0,
            "name": "hand"
        },
        {
            "id": 1,
            "name": "salad"
        },
        {
            "id": 2,
            "name": "steak"
        },
        ...
    ],
    "date": "2018-06-08T12:07:41.531325+00:00",
    "results": [
        {
            "file_path": "/home/lmnet/tests/fixtures/datasets/lm_things_on_a_table/Data_multi/e2c58de56674f95d2edd4c3b4de7c39f.jpg",
            "prediction": [
                {
                    "box": [
                        299.9669580459595,
                        155.85418617725372,
                        23.924002647399902,
                        24.240950345993042
                    ],
                    "class": {
                        "id": 3,
                        "name": "whiskey"
                    },
                    "score": "0.10005419701337814"
                },
                {
                    "box": [
                        443.9669580459595,
                        443.8541861772537,
                        23.924002647399902,
                        24.240950345993042
                    ],
                    "class": {
                        "id": 3,
                        "name": "whiskey"
                    },
                    "score": "0.10005419701337814"
                }
            ]
        },
        {
            "file_path": "/home/lmnet/tests/fixtures/datasets/lm_things_on_a_table/Data_multi/a3e53b8217b60e7754b8031f929034f3.jpg",
            "prediction": [
                {
                    "box": [
                        299.9669580459595,
                        155.85418617725372,
                        23.924002647399902,
                        24.240950345993042
                    ],
                    "class": {
                        "id": 3,
                        "name": "whiskey"
                    },
                    "score": "0.10005419701337814"
                },
                {
                    "box": [
                        443.9669580459595,
                        443.8541861772537,
                        23.924002647399902,
                        24.240950345993042
                    ],
                    "class": {
                        "id": 3,
                        "name": "whiskey"
                    },
                    "score": "0.10005419701337814"
                }
            ]
        }
    ],
    "task": "IMAGE.OBJECT_DETECTION",
    "version": 0.2
}
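For illustration, the following is a minimal sketch (not part of the specification) of converting each “box” from (x(left), y(top), width, height) into corner coordinates, e.g. for drawing. The file name output.json is hypothetical.

# Minimal sketch: converting JSON boxes to corner coordinates.
import json

with open("output.json") as f:  # hypothetical file name
    data = json.load(f)

for result in data["results"]:
    for prediction in result["prediction"]:
        x, y, w, h = prediction["box"]
        left, top, right, bottom = x, y, x + w, y + h
        print(result["file_path"], prediction["class"]["name"],
              (left, top, right, bottom), float(prediction["score"]))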

7.4.3. IMAGE.SEMANTIC_SEGMENTATION

7.4.3.1. Tensor

The shape is [batch size, height, width, number of classes]. The height and width are the network input image size. For each image, each value is in the range (0, 1), and for each pixel, the values across classes add up to 1.
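For illustration, the following is a minimal sketch (not part of the specification) of reducing this tensor to a per-pixel class-id map with argmax. The sizes and the random output array are assumptions used only to make the example self-contained.

# Minimal sketch: per-pixel class ids from the segmentation tensor.
import numpy as np

batch_size, height, width, num_classes = 1, 360, 480, 12  # assumed sizes
output = np.random.rand(batch_size, height, width, num_classes)
output = output / output.sum(axis=-1, keepdims=True)  # per-pixel values sum to 1

label_map = np.argmax(output, axis=-1)  # shape: [batch size, height, width]
print(label_map.shape)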

7.4.3.2. JSON

“results” contains one prediction result per image in the batch.
“prediction” contains the result for each class.
“mask” is a base64-encoded gray-scale PNG image with values in the range (0, 255), obtained by scaling the (0, 1) values by 255. The mask size is fitted to the raw input image size.

{
    "classes": [
        {
            "id": 0,
            "name": "sky"
        },
        {
            "id": 1,
            "name": "building"
        },
        ...
    ],
    "date": "2018-06-08T12:33:33.681521+00:00",
    "results": [
        {
            "file_path": "/home/lmnet/tests/fixtures/camvid/0001TP_006870.png",
            "prediction": [
                {
                    "class": {
                        "id": 0,
                        "name": "sky"
                    },
                    "mask": "iVBORw0KGgoAAAANSUhEUgAAAeAAAAFoCAAAAACqXHf8AAAlhElEQVR4nO2deWCcVbnwfzOZ7Jkkk31PG9Is3dNSSlsKlCKlLAoKCCgo6MUFXBC9blf9rgvq5yfqvaJXVC6owEWwQsuVUqGF0qaFLknXpGnTptn3fZns3x/ZJ7O8885531l6fn/Nct73PMkz57znPOdZDEV0ZvM+WQ1WBsHIGIQPYEygGeLbgHTqIunDNRY6Z1"
               },
               ...
            ]
        }
    ],
    "task": "IMAGE.SEMANTIC_SEGMENTATION",
    "version": 0.2
}
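For illustration, the following is a minimal sketch (not part of the specification) of decoding a “mask” back into a gray-scale image. The file name output.json is hypothetical, and Pillow is assumed to be available.

# Minimal sketch: decoding the base64 "mask" into a gray-scale image.
import base64
import io
import json

from PIL import Image

with open("output.json") as f:  # hypothetical file name
    data = json.load(f)

for result in data["results"]:
    for prediction in result["prediction"]:
        png_bytes = base64.b64decode(prediction["mask"])
        mask = Image.open(io.BytesIO(png_bytes))  # gray-scale PNG, values 0-255
        print(prediction["class"]["name"], mask.size)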