
magicplan-AI, a journey into the wonderful world of Deep Learning (Part 2/3)
In part 1, we saw how today’s Deep Learning tools and data ecosystems make it easy to build an early prototype and assess the feasibility of a common Deep Learning task. That said, it is one thing to have a workable prototype that shows the potential of the approach; it is another to reach a level of detection reliable enough to put the feature in the hands of millions of users.
Iterating on data and model to reach acceptable performance
Introducing relevant performance metrics
Improving / augmenting the training dataset
Playing with the loss function
Correctly training a model in Deep Learning is as much about having the right dataset as it is about applying the right correction when an error is found during training.
Fortunately for us, a large body of literature is available on choosing the right loss function. Even better, for the particular case of object detection, Facebook’s Detectron project identified a key improvement in the way the loss function is applied, called Focal Loss, which was very easy for us to implement.
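The idea behind Focal Loss is to down-weight the many easy, well-classified examples so that training focuses on the hard ones. A minimal sketch of the binary form (the `alpha` and `gamma` values below are the defaults from the original paper, not our production settings):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive class, y: ground-truth label (0 or 1).
    The (1 - p_t)**gamma factor shrinks the loss of confident, correct predictions.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy, well-classified example contributes far less than a hard one:
easy = focal_loss(0.9, 1)  # confident and correct -> tiny loss
hard = focal_loss(0.1, 1)  # confident and wrong  -> large loss
```

With `gamma=0` and `alpha=0.5` the expression degenerates to (half of) the ordinary cross-entropy, which is what makes the modulating factor easy to bolt onto an existing detector.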
As a result, by combining the training-database quality improvements with the introduction of a better-suited loss function, we were able to significantly improve the F1 score, as illustrated below.
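For reference, the F1 score is the harmonic mean of precision and recall, computed from true-positive, false-positive and false-negative counts. A small sketch (the counts are illustrative, not our actual numbers):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # fraction of detections that are correct
    recall = tp / (tp + fn)     # fraction of real objects that are found
    return 2 * precision * recall / (precision + recall)

# e.g. 80 true positives, 10 false positives, 30 false negatives:
# precision ~ 0.889, recall ~ 0.727, F1 = 0.8
score = f1_score(80, 10, 30)
```

Because it penalizes whichever of precision or recall is weaker, F1 is a convenient single number for tracking detection progress across training-set fixes.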

F1 score evolution according to Training set fixes
Exploring different architectures
Academic research has been quite active in the Object Detection field and several architectures are available for Deep Learning for Object Detection. They can be grouped along two axes:
A — the type of feature extractor that processes the input image:
multiple architectures exist, from lightweight (MobileNet) to heavy (Inception, VGG, ResNet)…
the bigger the feature extractor in terms of parameters, the better the descriptors, but the more memory and time inference requires.
B — the number of steps to do the full detection:
a direct, single-pass approach (YOLO, SSD), where one network detects the bounding boxes and classifies them at the same time,
a two-step approach (Faster R-CNN, R-FCN), where a first network detects rough bounding-box candidates and a second performs classification and bounding-box fine-tuning.
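Whatever the family, candidate boxes are scored against the ground truth (and against each other, for duplicate suppression) through intersection-over-union (IoU). A minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # partial overlap -> 1/7
```

A detection is typically counted as a true positive when its IoU with a ground-truth box exceeds a threshold (0.5 is the classic COCO/PASCAL convention).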
As expected, the more complex the architecture, the better the performance (see graphic below).

Object Detection Architectures and performances (COCO dataset) — Source
However, we discovered quite early that, even for the inference task (running the model to perform object detection on an image, as opposed to the training task, which requires far more resources), not all architectures fit the constraints of running on a mobile device.
Two reasons for that:
contrary to modern GPU boards with more than 10 GB of RAM, even the latest iPhone X has only 3 GB of RAM,
the real-time constraint means we cannot afford an object detection that lasts more than 1.0 second without creating a really annoying lag in the user experience.
Some architectures do not fit in memory on the device. Others do, BUT they take several seconds to process one object detection, which is not acceptable in magicplan’s real-time capture scenario.
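A simple way to screen an architecture against the real-time budget is to average its wall-clock inference time over a few runs. In the sketch below, `infer_fn` is a hypothetical placeholder for any model’s prediction call:

```python
import time

def measure_latency(infer_fn, image, warmup=3, runs=10):
    """Average wall-clock latency of an inference callable, in seconds."""
    # Warm-up runs let caches, JIT compilation, etc. settle before timing.
    for _ in range(warmup):
        infer_fn(image)
    start = time.perf_counter()
    for _ in range(runs):
        infer_fn(image)
    return (time.perf_counter() - start) / runs

# Reject any architecture whose average latency exceeds the 1.0 s budget:
latency = measure_latency(lambda img: img, None)  # trivial stand-in model
within_budget = latency <= 1.0
```

Timing on the target device itself matters: the same model can be an order of magnitude slower on a phone than on a desktop GPU.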

Evaluation of several architectures
Lessons learnt
Contrary to the first “quick & easy” stage, being able to play with all the options in the “off-the-shelf” models requires several conditions:
a good methodology to objectively measure the progress,
a good understanding of what makes a good training convergence and what can break it,
a good understanding of the underlying neural net architectures and nodes.
In our case, this could not have happened without two full-time PhDs in artificial intelligence / deep learning on the team who master these challenges.
Coming next
At this point, we have a best-in-class model that does a good job at object detection but is too big to run on any modern smartphone. In the last part, we will describe in more detail the work required to move from a PC-based solution to one embedded on a smartphone.
Sam Miller
RevOps Manager
Inside magicplan