Real Time Object Detection with Distance and Voice Alert

Astitva Ghanmode
13 min read · Jun 12, 2021

It is a known fact that the estimated number of visually impaired people in the world is about 285 million, roughly 20% of the Indian population. They face regular and constant challenges in navigation, especially when they are on their own, and are often dependent on someone else for even their basic day-to-day needs. Building a technological solution for them is therefore both challenging and much needed. Our attempt is an integrated machine learning system that allows blind users to identify and classify common real-time objects, generates voice feedback, and calculates an approximate distance, producing a warning depending on whether the person is very close to or far away from the object. The same system can be used as an obstacle detection mechanism.

A Detailed Overview of the System

Technical Setup

Initially, we tried to implement the system on a very popular IoT hardware module, the Raspberry Pi, but amid the lockdown period we couldn't manage it. We would certainly like to work on that as well, but for the time being we have implemented it via a third-party application. Hence,

  1. The system is set up so that an Android application (assuming you are implementing it on an Android device) captures real-time frames and sends them to a laptop-based networked server where all the computation takes place.
  2. The laptop-based server uses a pre-trained SSD detection model trained on the COCO dataset. It runs inference on each frame and reports the detected output class along with an accuracy (confidence) score.
  3. After detection, a voice module converts the class of the object into a spoken message, which is then sent to the blind user for assistance.
  4. Along with object detection, we use an alert system in which an approximate distance is calculated. Depending on whether the person is very close to the object or at a safer distance, the system generates voice output together with the distance value. A minimal sketch of this pipeline follows the list.
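
As a rough illustration, here is a minimal sketch of that pipeline. It assumes the third-party app exposes the phone camera as an HTTP video stream (the URL below is a placeholder), and detect_objects() and speak() are stand-ins for the detection and voice modules described later in this post.

```python
import cv2

# Hypothetical stream URL exposed by the third-party camera app on the phone;
# replace with whatever URL your app actually provides.
STREAM_URL = "http://192.168.0.101:8080/video"

def detect_objects(frame):
    """Placeholder for the SSD/COCO detector described later in this post."""
    return []  # list of (class_name, score, apx_distance) tuples

def speak(class_name, apx_distance):
    """Placeholder for the pyttsx3 voice module described later in this post."""
    print(class_name, apx_distance)

cap = cv2.VideoCapture(STREAM_URL)          # connect the laptop server to the phone stream
while cap.isOpened():
    ok, frame = cap.read()                  # 1. grab a real-time frame from the phone
    if not ok:
        break
    for class_name, score, apx_distance in detect_objects(frame):   # 2. detect objects
        speak(class_name, apx_distance)     # 3 & 4. class and distance voice feedback
cap.release()
```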

A notable point is that it uses a ResNet-based architectural approach for feature extraction, which makes the overall system quite interesting. Isn't it?

ResNet Architecture

IMPLEMENTATION

Efficient implementation of this model depends on Python compatibility and on getting past library installation hurdles. To be honest, this was one of the most challenging phases I faced while building this project. Thanks to Stack Overflow and the Python Unofficial Binary Releases for hosting pre-built wheel files, which you can download according to your system's compatibility.

TENSORFLOW APIs

We have implemented the system using the TensorFlow APIs. The advantage of using APIs is that they provide a set of common operations, so we don't have to write the whole program from scratch; they are convenient, efficient and save time. The TensorFlow Object Detection API is a framework built for creating deep learning networks that solve object detection problems. It ships with pre-trained models, which they refer to as the Model Zoo. This includes models trained on the COCO dataset, the KITTI dataset, and the Open Images Dataset. Here, we are primarily focused on the COCO dataset.

TensorFlow Object Detection API depends on the following libraries:

  • Protobuf 3.0.0
  • Python-tk
  • Pillow 1.0
  • lxml
  • tf-slim
  • slim
  • Jupyter notebook
  • Matplotlib
  • Tensorflow (1.15.0)
  • Cython
  • contextlib2
  • cocoapi

Make sure you've installed all the required libraries according to your system's requirements. We'll need a few more libraries for voice generation, which we'll come to later in this blog. You can visit the TensorFlow documentation for detailed installation instructions.

GETTING READY WITH THE SYSTEM

  1. TensorFlow Installation

A typical user can simply install it from the Anaconda Prompt with the following commands (a minimal example is sketched below the figure).

TensorFlow Installation
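
For reference, a minimal install in an Anaconda Prompt might look like the following; the TensorFlow version is pinned to the 1.15.0 release listed above, and the second line covers the other pip-installable dependencies (adjust to your system):

```
pip install tensorflow==1.15.0
pip install pillow lxml jupyter matplotlib cython contextlib2 tf-slim
```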

2. Now you need to download the TensorFlow models repository from the link below; a one-line clone command is sketched after the link.

TensorFlow Repository Link
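
If you have git installed, the download is a single clone of the official repository:

```
git clone https://github.com/tensorflow/models.git
```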

3. PROTOBUF COMPILATION

Next, you have to convert the .proto files to .py files.

This is done via Protobuf compilation. For this we'll need the Google Protobuf releases. Head over to the link and download the protobuf version that matches your system's compatibility; I prefer win_64, which suits my system. After downloading and extracting it, make sure to add its path to the environment variables, otherwise you'll keep getting errors like 'protoc is not recognised as an internal or external command'. So, make sure it is done!
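
For reference, the compilation is typically run from the models/research directory of the repository downloaded in the previous step; a minimal invocation looks like this (the path to protoc.exe depends on where you extracted the release):

```
# run from the models/research directory of the TensorFlow models repository
protoc object_detection/protos/*.proto --python_out=.
```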

MODELS

Now, a bunch of pre-trained models ship with TensorFlow, and you can use any one of them. They are pretty good, and you can pick one depending on your system specifications. For faster detection you can go with SSD, and for better accuracy you can go with Mask R-CNN, but most systems show smooth performance with SSD MobileNet detection. So, I'll elaborate on the SSD algorithm. You can check the other models here:
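
Whichever model you download, the standard TensorFlow 1.x pattern for loading its frozen inference graph looks like this; the folder name below corresponds to the SSD MobileNet v1 COCO checkpoint and should be replaced with whichever model you chose:

```python
import tensorflow as tf

# Path to the frozen graph inside the extracted Model Zoo download
# (the folder name depends on the checkpoint you downloaded).
PATH_TO_FROZEN_GRAPH = 'ssd_mobilenet_v1_coco_2018_01_28/frozen_inference_graph.pb'

detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as fid:
        od_graph_def.ParseFromString(fid.read())
        tf.import_graph_def(od_graph_def, name='')
```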

SSD ARCHITECTURE

THE SSD ARCHITECTURE

SSD has two components: a backbone model and an SSD head.

The backbone model is basically a pre-trained image classification network used as a feature extractor. This is typically a network such as ResNet trained on ImageNet, from which the final fully connected classification layer has been removed.

The SSD head is just one or more convolutional layers added to this backbone; their outputs are interpreted as the bounding boxes and classes of objects in the spatial locations of the final layers' activations. We are hence left with a deep neural network that can extract semantic meaning from the input image while preserving its spatial structure, albeit at a lower resolution.

For an input image, the backbone produces 256 7x7 feature maps in ResNet-34. SSD divides the image using a grid and makes each grid cell responsible for detecting objects in that region of the image. Detecting objects basically means predicting the class and location of an object within that region.

Anchor box

Multiple anchor/prior boxes can be assigned to each grid cell in SSD. These anchor boxes are pre-defined, and each one is responsible for a particular size and shape within a grid cell. During training, SSD uses a matching phase to pair each ground-truth object's bounding box with the most appropriate anchor box in the image. The anchor box with the highest degree of overlap with an object is responsible for predicting that object's class and location, and this property is used both for training the network and, once the network has been trained, for predicting the detected objects and their locations. Practically, each anchor box is specified by an aspect ratio and a zoom level. We know that not all objects are square in shape; some are shorter, some longer and some wider, by varying degrees. The SSD architecture allows pre-defined aspect ratios of the anchor boxes to account for this. The different aspect ratios can be specified using the ratios parameter of the anchor boxes associated with each grid cell at each zoom/scale level.

ZOOM LEVEL

The anchor boxes do not have to be the same size as the grid cell; the user might be interested in finding both smaller and larger objects within a grid cell. The zooms parameter specifies how much the anchor boxes need to be scaled up or down with respect to each grid cell. An illustrative sketch of how ratios and zooms combine is shown below.
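
To make the ratios and zooms parameters concrete, here is a purely illustrative sketch, not tied to any particular SSD implementation, of how they combine into a set of anchor box shapes for a single grid cell:

```python
import math

# Illustrative only: how aspect ratios and zoom levels combine into anchor
# box shapes for a single grid cell (normalised to a cell size of 1.0).
ratios = [0.5, 1.0, 2.0]   # width:height aspect ratios
zooms = [0.7, 1.0, 1.3]    # scale relative to the grid cell

anchors = []
for z in zooms:
    for r in ratios:
        w = z * math.sqrt(r)   # wider boxes for r > 1
        h = z / math.sqrt(r)   # taller boxes for r < 1
        anchors.append((round(w, 2), round(h, 2)))

print(anchors)  # 9 anchor shapes per grid cell
```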

MOBILENET

The MobileNet model is built on depthwise separable convolutions, a form of factorized convolution that splits a standard convolution into a depthwise convolution and a 1 × 1 convolution, also called a pointwise convolution. For MobileNets, the depthwise convolution applies a single filter to each input channel, and the pointwise convolution then applies a 1 × 1 convolution to combine the outputs of the depthwise convolution. A standard convolution both filters and combines the inputs into a new set of outputs in a single step; the depthwise separable convolution splits this into two layers, a separate layer for filtering and a separate layer for combining. This factorization has the effect of drastically reducing both the computation and the model size.
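
As a small illustration of this factorization (not the actual MobileNet code), a depthwise convolution followed by a 1 × 1 pointwise convolution can be written with Keras layers like this:

```python
import tensorflow as tf

# Sketch of the MobileNet factorization: a depthwise convolution filters each
# input channel separately, then a 1x1 pointwise convolution combines the
# filtered channels. The input shape is just for illustration.
inputs = tf.keras.Input(shape=(224, 224, 32))

depthwise = tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding='same',
                                            activation='relu')(inputs)
pointwise = tf.keras.layers.Conv2D(filters=64, kernel_size=1,
                                   activation='relu')(depthwise)

model = tf.keras.Model(inputs, pointwise)
model.summary()  # compare the parameter count with a standard 3x3, 32->64 Conv2D
```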

MOBILENET ARCHITECTURE

DEPTH ESTIMATION

Depth estimation, or depth extraction, refers to the techniques and algorithms that aim to obtain a representation of the spatial structure of a scene. In simpler words, it is used to estimate the distance between two objects. Our prototype is intended to assist blind people by issuing warnings about obstacles coming their way. To do this, we need to find how far away the obstacle is from the person in any real-time situation. After an object is detected, a rectangular box is generated around it.

Distance Approximations

If that object occupies most of the frame then, subject to some constraints, the approximate distance of the object from the person is calculated. The following code is used to recognise objects and to return distance and location information.

Here, we have established a TensorFlow session comprising the crucial tensors for detection; a minimal sketch of this setup is shown below. For further analysis we then iterate through the boxes. The boxes come back as an array inside an array, so for the iteration we define the conditions described in the next paragraphs.
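
A minimal sketch of such a session, assuming the detection_graph loaded earlier; the tensor names are the standard ones exported by the Model Zoo graphs, and 'test.jpg' merely stands in for a frame received from the phone:

```python
import numpy as np
import tensorflow as tf
import cv2

# Assumes `detection_graph` was loaded from the frozen graph as shown earlier.
with detection_graph.as_default():
    with tf.Session(graph=detection_graph) as sess:
        image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
        detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
        detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')
        detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')
        num_detections = detection_graph.get_tensor_by_name('num_detections:0')

        frame = cv2.imread('test.jpg')                  # or a frame from the phone stream
        image_expanded = np.expand_dims(frame, axis=0)  # the model expects a batch dimension

        (boxes, scores, classes, num) = sess.run(
            [detection_boxes, detection_scores, detection_classes, num_detections],
            feed_dict={image_tensor: image_expanded})
```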

The index of a box in the boxes array is represented by i. The score of the box is looked up by the same index, which is also used to access its class. Next, the width of the detected object is measured, expressed as a fraction of the frame width in pixels.

We get the centre along each axis by taking the start and end coordinates of that axis and averaging them (start plus end, divided by two). In this way the centre of our detected rectangle is calculated, and finally a dot is drawn at the centre. The default threshold for drawing boxes is a score of 0.5: if scores[0][i] >= 0.5 (i.e. 50 percent or more), we assume that the object is detected.

Here, mid_x is the centre along the x-axis and mid_y is the centre along the y-axis. If apx_distance < 0.5 and mid_x lies between 0.3 and 0.7, we conclude that the object is too close to the person. With this code, the relative distance of the object from the person is calculated after the object has been detected; if the object is too close, a warning is issued to the person through the voice generation module. A minimal sketch of the loop follows.
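
A minimal sketch of the loop, assuming the boxes, scores and frame from the previous snippet. Note that the width-to-distance expression below is the approximation commonly used in similar tutorials and is an assumption on my part, not necessarily the exact formula in the original code:

```python
for i, box in enumerate(boxes[0]):
    if scores[0][i] >= 0.5:                 # only boxes with >= 50% confidence
        # box = [ymin, xmin, ymax, xmax], all normalised to the range 0..1
        mid_x = (box[1] + box[3]) / 2       # centre of the box on the x axis
        mid_y = (box[0] + box[2]) / 2       # centre of the box on the y axis

        # Width-based approximation (assumed): the wider the box, the closer the object.
        apx_distance = round((1 - (box[3] - box[1])) ** 4, 1)

        cv2.putText(frame, '{}'.format(apx_distance),
                    (int(mid_x * frame.shape[1]), int(mid_y * frame.shape[0])),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)

        if apx_distance < 0.5 and 0.3 < mid_x < 0.7:
            print('WARNING: object very close to the frame')
```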

VOICE GENERATION MODULE

After an object is detected, it is most important to inform the person about its presence in their path. For the voice generation module, pyttsx3 plays an important role. pyttsx3 is a Python library that converts text to speech, and it works with both Python 2 and 3. To get a reference to a pyttsx3.Engine instance, the application invokes the factory function pyttsx3.init().
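
A minimal pyttsx3 example looks like this (the rate setting is optional and only shown for illustration):

```python
import pyttsx3

engine = pyttsx3.init()          # factory function: returns an Engine instance
engine.setProperty('rate', 150)  # optional: speaking rate in words per minute
engine.say("Object detected")    # queue a phrase
engine.runAndWait()              # block while the queued speech is played
```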

The algorithm works as follows: whenever an object is detected, its approximate distance is calculated, and with the help of the cv2 library and the cv2.putText() function the text is displayed on screen. To identify text hidden in an image, we use Python-tesseract for character recognition. OCR detects the text content in images and encodes it in a form that is easily understood by the computer; this detection is done by scanning and analysing the image, so text embedded in images is recognised and "read" using Python-tesseract. The recognised text is then passed to the pyttsx3.Engine instance obtained from pyttsx3.init(). During construction, the engine initialises a pyttsx3.driver.DriverProxy object responsible for loading a speech engine driver from the pyttsx3.drivers module. After construction, the application uses the engine object to register and unregister event callbacks; produce and stop speech; get and set speech engine properties; and start and stop event loops.
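
For the OCR part, a minimal Python-tesseract sketch looks like the following; it assumes the Tesseract engine itself is installed on the system, and 'sign.jpg' is a placeholder file name:

```python
import cv2
import pytesseract

# Read an image and extract any embedded text with Tesseract OCR.
image = cv2.imread('sign.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # Tesseract usually does better on grayscale
text = pytesseract.image_to_string(gray)
print(text)
```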

Audio commands are generated as output. If the object is too close, the system states "Warning: the object (class of object) is very close to you. Stay alert!". Otherwise, if the object is at a safer distance, a voice message says "The object is at a safer distance". This is achieved with the help of libraries such as pytorch, pyttsx3, pytesseract and engine.io. A sketch of this logic follows.
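
Putting the distance check and the two messages together, a sketch might look like this; announce() is a hypothetical helper of mine, not the exact code from the project:

```python
import pyttsx3

engine = pyttsx3.init()

def announce(class_name, apx_distance):
    # Hypothetical helper tying the distance check to the messages described above.
    if apx_distance < 0.5:
        engine.say("Warning: the {} is very close to you. Stay alert!".format(class_name))
    else:
        engine.say("The {} is at a safer distance".format(class_name))
    engine.runAndWait()

announce("cup", 0.2)      # triggers the warning message
announce("remote", 0.8)   # reports a safer distance
```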

PyTorch is primarily a machine learning library, and here it is applied to the audio domain. It helps in loading voice files in standard formats such as mp3, and it can regulate the sample rate of the audio. Thus, it can be used to manipulate properties of sound such as frequency, wavelength and waveform. The many options available for audio processing can also be seen by looking through the functions of PyTorch's torchaudio package.
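
When audio files are handled through PyTorch, the torchaudio package provides the loading and resampling utilities; a minimal sketch, with a placeholder file name:

```python
import torchaudio

# Load an audio file (the path is a placeholder) and resample it to 16 kHz.
waveform, sample_rate = torchaudio.load("alert.wav")
resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
resampled = resampler(waveform)

print(waveform.shape, sample_rate, resampled.shape)
```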

TESTING

A third-party app provides ease and freedom in app development. It brings efficiency and also helps deliver output quickly. Using a third-party app allows you to divide the work into parts and focus on the core part of the app or system, a strategy that helps in developing good quality software. We can pass the features of the third-party app on to our system.

1) First, we capture real-time images from the rear camera [4] of the blind person's mobile handset; a connection is established between the mobile phone and the laptop, and those images are sent from the phone to the laptop.

2) This connection is made through a third-party app installed on the person's mobile phone. All real-time images captured by the phone's rear camera are first passed to the third-party app on the phone and then sent to the laptop, where they are processed further.

3) The system on the laptop tests each image using the APIs and the SSD algorithm, and it reports the confidence of the detection. We reached 98% accuracy for certain classes such as books, cups and remotes.

4) After testing the images, the laptop-based system generates an output, and its prediction is translated into voice with the voice module and sent to the blind person with the help of wireless audio support tools.

OUTPUTS

Following is a video of a simple test of the model we created. Your proposed model is expected to work with the proposed libraries and framework in a similar fashion.

Below are the objects on which it was tested, along with the results, which were further analysed with the help of the matplotlib library.

THE ACCURACY OF CUP IS 99%

FINAL DISTANCE IS 0.2 UNITS FROM FRAME AND A WARNING IS GENERATED W.R.T. DISTANCE AS IT IS TOO CLOSE, AND A VOICE OUTPUT IS HEARD SAYING ITS CLASS NAME IS CUP. IT STATES (WARNING - CUP VERY CLOSE TO THE FRAME).

THE FINAL ACCURACY OF REMOTE IS 98%

FINAL DISTANCE IS 0.8 UNITS FROM FRAME AND NO DISTANCE-BASED WARNING IS GENERATED AS IT IS AT A SAFER DISTANCE; THE CLASS RECOGNITION VOICE CAN BE HEARD AND THE CLASS IS REMOTE.

THE FINAL ACCURACY OF BED IS 98%

FINAL DISTANCE IS 0.9 UNITS FROM FRAME AND NO DISTANCE BASED WARNING AS IT IS AT A SAFER DISTANCE RATHER CLASS RECOGNITION VOICE IS GENERATED AND THE NAME OF THE OBJECT CAN BE HEARD AS IT IS A BED.

THE FINAL ACCURACY OF CHAIR IS 96%

FINAL DISTANCE IS 0.7 UNITS FROM FRAME AND NO DISTANCE-BASED WARNING IS GENERATED AS IT IS AT A SAFER DISTANCE; THE CLASS RECOGNITION VOICE CAN BE HEARD AND ITS CLASS IS CHAIR. IT CAN DETECT MULTIPLE OBJECTS IN THE FRAME.

THE FINAL ACCURACY OF TV IS 96%

FINAL DISTANCE IS 0.8 UNITS FROM FRAME AND NO DISTANCE BASED WARNING AS IT IS AT A SAFER DISTANCE RATHER CLASS RECOGNITION VOICE IS GENERATED AND THE NAME OF THE OBJECT CAN BE HEARD AS IT IS A TV.

CONSOLE VIEW

IT CALCULATES THE DISTANCE AND SHOWS THE OUTPUT ON THE CONSOLE, AND THE BLIND PERSON CAN HEAR THESE DISTANCE VALUES ALONG WITH THE OBJECT-BASED WARNING.

PERFORMANCE ANALYSIS

We have plotted a graph of the performance metric (accuracy) for each object over time. It can be seen as follows:

X-axis: Time frame (in sec.)

Y-axis: Accuracy metric (in percentage)

  1. CUP
  2. REMOTE
  3. BED
  4. CHAIR
  5. TV

The proposed system successfully detects 90 object classes, labels them and shows the detection accuracy. The model also calculates the distance from the object to the camera and gives voice feedback as the person carrying the camera approaches the object. The dataset was tested on two different models, SSD MobileNet V1 and SSD Inception V2; however, the SSD MobileNet V1 model showed less latency and was faster at detecting objects.
