In this task, I will focus on three different levels of searching for objects within videos:
- Level 1: Find similar objects with no properties: "A truck".
- Level 2: Find an object with a color property: "The red truck".
- Level 3: Find this person.
Eventually, I locate all frames containing the target object X, draw bounding boxes around it, and export these frames as JPG files.
The structure of the output folders is as follows:
- Video 1
  - Object X
    - Frame 15.jpg
    - Frame 32.jpg
    - Frame 120.jpg
  - Object X
- Video 2
  - Object X
- Video 3
  - Object X
    - Frame 215.jpg
  - Object X
I recommend creating an Anaconda environment:

```bash
conda create --name [environment-name] python=3.9
```

Then, install the Python requirements:

```bash
pip install -r requirements.txt
```

Finally, to reproduce the results, you first have to download the provided example videos here. Then, activate the [environment-name] environment and, from the project root, run:

```bash
python demo.py
```
At this level, I employ the YOLOv8 model to detect all objects in the video, and subsequently extract and draw bounding boxes exclusively around objects classified as `truck`.
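A minimal sketch of this step is shown below, assuming the Ultralytics YOLOv8 package and a hypothetical input file `video1.mp4`; the actual script in this repository may differ in its details.

```python
import os

import cv2
from ultralytics import YOLO

model = YOLO("yolov8l-seg.pt")  # the large segmentation model used throughout this project

out_dir = os.path.join("Video 1", "Object X")
os.makedirs(out_dir, exist_ok=True)

# Stream predictions frame by frame; in the COCO label set, class 7 is "truck".
for frame_idx, result in enumerate(model.predict(source="video1.mp4", classes=[7], stream=True)):
    if len(result.boxes) == 0:
        continue  # no truck detected in this frame
    annotated = result.plot()  # draw bounding boxes (and masks) onto the frame
    cv2.imwrite(os.path.join(out_dir, f"Frame {frame_idx}.jpg"), annotated)
```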
Moving to the next stage, I commence by replicating the procedures of Level 1, utilizing the YOLOv8 model to extract `truck` objects. Note that I employ the large segmentation YOLOv8 model (`yolov8l-seg.pt`) for all three levels. This choice is made not only to enhance prediction accuracy thanks to its larger size, but also because it can generate masks for the detected objects, as sketched below.
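As an illustration, the mask for a given detection can be pulled out of a segmentation result roughly as follows; `object_mask` is a hypothetical helper written against the Ultralytics result object, not necessarily the code in this repository.

```python
import cv2
import numpy as np

def object_mask(result, index, frame_shape):
    """Return a binary uint8 mask, sized to the frame, for the detection
    at position `index` in a YOLOv8-seg prediction result."""
    m = result.masks.data[index].cpu().numpy()           # (h, w) float mask in [0, 1]
    m = cv2.resize(m, (frame_shape[1], frame_shape[0]))  # match the frame's width/height
    return (m > 0.5).astype(np.uint8) * 255
```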
In the event that the background contains elements of a similar color to the object, I further improve accuracy by extracting the detected object based on its mask and applying a color detection algorithm as follows:
To determine whether the object's pixel values fall within the red color range, I check whether the blue and green channel values lie in the range (0, 50) and the red channel values lie in the range (120, 255). This yields the following red mask:
Eventually, I can determine whether the detected truck is red by calculating the ratio of red pixels to the object's total pixels and comparing it against a specific threshold.
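Combining the mask and the color range above, a minimal sketch of this red-ratio check might look as follows; the 0.3 threshold is an illustrative assumption, not necessarily the value used in this repository.

```python
import cv2

def is_red(frame, mask, ratio_threshold=0.3):
    """frame: BGR image (OpenCV channel order); mask: binary uint8 mask of the object.
    ratio_threshold is an assumed, illustrative value."""
    # Red range from above: blue and green in (0, 50), red in (120, 255).
    red_mask = cv2.inRange(frame, (0, 0, 120), (50, 50, 255))
    red_on_object = cv2.bitwise_and(red_mask, mask)  # keep only the object's own pixels
    object_pixels = cv2.countNonZero(mask)
    if object_pixels == 0:
        return False
    return cv2.countNonZero(red_on_object) / object_pixels >= ratio_threshold
```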
At this final stage, I incorporate the YOLOv8 model together with the Detector-Free Local Feature Matching with Transformers model (LoFTR for short); you can find its paper here.
- The first task follows similar procedures to those of Level 1, but it focuses on the `human` class.
- The next step is to identify the similarities between the target person (the input for this task) and each detected person. LoFTR identifies and extracts keypoints from the given image and the detected human, then establishes mappings between pairs of keypoints and provides confidence scores for these pairs; the following example gives a deeper understanding:
- Subsequently, I check whether the number of confidence scores greater than 0.5 satisfies a particular threshold (I use a threshold of 65 in my code); see the sketch after this list. Eventually, I employ the YOLOv8 model to track the ID of the detected human. If the model loses track of the person, the process starts over.
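Below is a minimal sketch of this matching check, assuming the kornia implementation of LoFTR and two grayscale person crops already prepared as tensors; the repository may load and invoke LoFTR differently.

```python
import torch
import kornia.feature as KF

matcher = KF.LoFTR(pretrained="outdoor")  # kornia's pretrained LoFTR weights

def matches_target(target_gray, detected_gray, conf_threshold=0.5, min_matches=65):
    """target_gray / detected_gray: float tensors of shape (1, 1, H, W) in [0, 1].
    Returns True when enough keypoint pairs exceed the confidence threshold."""
    with torch.no_grad():
        out = matcher({"image0": target_gray, "image1": detected_gray})
    # Count keypoint correspondences whose confidence exceeds the threshold.
    strong_matches = (out["confidence"] > conf_threshold).sum().item()
    return strong_matches >= min_matches
```

For the tracking part, the Ultralytics package exposes a tracking mode (e.g. `model.track(...)`) that assigns persistent IDs to detections across frames, which corresponds to the ID-tracking behaviour described above.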
In this section, I provide an overview of the results from the provided examples, which you can access and download from here. Furthermore, please access the result frames for each video and level via the following link.
At the final level, you may want to see the full video result via this link.