Text Detection in Scene Images

We develop algorithms that detect text embedded in scenes, segment it from the background, and adjust it so that an OCR engine can read it more easily.

What is Scene Text

Since text conveys semantic information, reading the text in an image plays an important role in understanding the image content. The image can be a scanned document, where text is the dominant element, a video frame with overlaid captions, or an image in which text is naturally embedded.

We can divide the text that appears in images (and video frames) into two broad categories: overlaid text and scene text. The former includes text that is superimposed on the image, such as timestamps, captions and titles. This text is deliberately added to the image.
In contrast, scene text is inherently part of the scene, for example hotel or shop placards, road signs, street names and posters. Because it occurs naturally, such text can appear under a wide range of conditions, depending on several factors related to the scene and to the acquisition process. This generally makes its detection and reading a very challenging task.


Scene text detection and reading play an important role in several applications, such as indexing multimedia archives, recognizing signs in driver-assistance systems, providing scene information to visually impaired people, and identifying vehicles by reading their license plates. With the widespread diffusion of low-priced digital cameras and of mobile phones equipped with good-quality cameras, text extraction from camera-captured scenes has gained renewed attention in computer vision research.

Our Approach

The first step is an intensity normalization that enhances image details and the local contrast in shadowed regions. The normalization is achieved by computing the divisive local contrast.
Two thresholds are determined from the shape of the histogram of the normalized image. They are used to compute two binary maps, which should contain, respectively, text with positive contrast and text with negative contrast, if present.
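
As a rough illustration of these first two steps, the following Python sketch normalizes a tiny grayscale image and derives the two binary maps. The page does not give the exact formula, so the form (I - m) / (m + eps) with m the local mean, the square neighbourhood, the function names and the fixed threshold values are all assumptions. In particular, the histogram-based choice of the thresholds is not reproduced here.

```python
def box_mean(img, radius):
    """Mean over each pixel's square neighbourhood, clipped at the image borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out


def divisive_contrast(img, radius=1, eps=1.0):
    """Assumed form of divisive local contrast: (I - m) / (m + eps)."""
    mean = box_mean(img, radius)
    return [[(v - m) / (m + eps) for v, m in zip(row, mrow)]
            for row, mrow in zip(img, mean)]


def binary_maps(norm, t_low, t_high):
    """Return (bright, dark) maps for positive and negative contrast."""
    bright = [[int(v > t_high) for v in row] for row in norm]
    dark = [[int(v < t_low) for v in row] for row in norm]
    return bright, dark


# Toy input: a uniform background with one much darker pixel in the centre.
image = [[100] * 5 for _ in range(5)]
image[2][2] = 10

norm = divisive_contrast(image)
# Hypothetical fixed thresholds; the described method derives them from the histogram.
bright, dark = binary_maps(norm, t_low=-0.5, t_high=0.5)
```

In this toy case only the dark centre pixel is set in the negative-contrast map, and the positive-contrast map stays empty.
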
The connected components of these bitmaps are analyzed separately: their shape features (area, elongation, convexity, ...) and the corresponding gradient in the input image are analyzed by a cascade of attribute filters that marks likely non-text components as non-interesting. A few thresholds are read from a "prior knowledge" file related to the scenario (text on athletes' bibs, book covers, text in city scenes, ...).
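
The idea of a filter cascade over connected components can be sketched in Python as follows. This is only an illustration: the 4-connectivity, the three features used (area, elongation, and bounding-box fill ratio as a crude stand-in for convexity), the threshold names and their values are assumptions, and the gradient-based test is left out. The `PRIOR` dictionary plays the role of a hypothetical "prior knowledge" file.

```python
from collections import deque


def connected_components(bitmap):
    """Return the 4-connected components of a binary map as lists of (y, x)."""
    h, w = len(bitmap), len(bitmap[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if not bitmap[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and bitmap[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            comps.append(pixels)
    return comps


def features(pixels):
    """Simple shape features computed from the bounding box of a component."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {"area": len(pixels),
            "elongation": max(width, height) / min(width, height),
            "fill": len(pixels) / (width * height)}


# Hypothetical scenario-specific thresholds.
PRIOR = {"min_area": 3, "max_elongation": 5.0, "min_fill": 0.3}

# Each stage is (name, test); a component stops at the first test it fails.
CASCADE = [
    ("area", lambda f, p: f["area"] >= p["min_area"]),
    ("elongation", lambda f, p: f["elongation"] <= p["max_elongation"]),
    ("fill", lambda f, p: f["fill"] >= p["min_fill"]),
]


def classify(bitmap, prior=PRIOR):
    """Mark components as interesting or not; nothing is deleted."""
    results = []
    for pixels in connected_components(bitmap):
        f = features(pixels)
        rejected_by = next((name for name, test in CASCADE if not test(f, prior)), None)
        results.append({"pixels": pixels, "interesting": rejected_by is None,
                        "rejected_by": rejected_by})
    return results


bitmap = [
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
]
results = classify(bitmap)
```

Rejected components are only flagged, not discarded, so that a later stage can still reconsider them.
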
To extract text lines, the surviving components, possibly after being split, are recursively clustered according to proximity, alignment and size similarity, until a termination criterion is satisfied. Only clusters that potentially contain a single text line are considered.
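
A simplified way to picture this grouping is a single-pass clustering of component bounding boxes, sketched below in Python. It is not the recursive scheme described above: the pairwise compatibility test, the union-find merging, the `(x, y, width, height)` box format and all tolerance values are assumptions made for illustration, and component splitting is not modelled.

```python
def compatible(a, b, max_gap=1.0, max_offset=0.5, max_ratio=2.0):
    """Check proximity, alignment and size similarity of two boxes (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    gap = max(bx - (ax + aw), ax - (bx + bw), 0)        # horizontal distance
    offset = abs((ay + ah / 2) - (by + bh / 2))          # vertical misalignment
    ratio = max(ah, bh) / min(ah, bh)                    # height similarity
    return (gap <= max_gap * max(ah, bh)
            and offset <= max_offset * min(ah, bh)
            and ratio <= max_ratio)


def cluster_boxes(boxes):
    """Group box indices so that compatible boxes end up in the same cluster."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if compatible(boxes[i], boxes[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


boxes = [
    (0, 0, 10, 20),    # three characters on one line
    (12, 1, 10, 20),
    (24, 0, 10, 19),
    (100, 0, 10, 20),  # same height, but far to the right
    (12, 60, 10, 20),  # close horizontally, but on a lower line
]
clusters = cluster_boxes(boxes)
```

Here the first three boxes form one candidate line, while the distant box and the box on the lower line each remain alone.
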
Once a cluster is accepted as a candidate text line, all components inside its region that were previously marked as non-interesting are reconsidered for possible restoration before the text recognition phase.
The OCR engine acts as the last filter: non-text clusters are rejected on the basis of the confidence of the engine output. If a scenario comes with a dictionary of expected terms, a string matching criterion can be applied to label the content of the scene with high confidence.
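
This last stage can be illustrated with a short Python sketch. The OCR output is represented here as hypothetical `(text, confidence)` pairs, since the page does not name the engine or its interface. The confidence threshold, the similarity cutoff and the use of `difflib.get_close_matches` from the Python standard library as the string matching criterion are assumptions as well.

```python
from difflib import get_close_matches


def label_clusters(ocr_results, dictionary=None, min_conf=0.6, cutoff=0.8):
    """Drop low-confidence OCR outputs and match the rest against a dictionary.

    ocr_results: list of (text, confidence) pairs, confidence in [0, 1].
    Returns a list of (recognized_text, dictionary_term_or_None).
    """
    labels = []
    for text, conf in ocr_results:
        if conf < min_conf:
            continue  # treated as a non-text cluster
        term = None
        if dictionary:
            matches = get_close_matches(text.upper(), dictionary, n=1, cutoff=cutoff)
            term = matches[0] if matches else None
        labels.append((text, term))
    return labels


# Hypothetical engine output for three clusters, and a hypothetical dictionary
# of street names expected in the scenario.
ocr_results = [("VIA R0MA", 0.82), ("#|/%", 0.15), ("PIAZZA DUOMO", 0.91)]
street_names = ["VIA ROMA", "PIAZZA DUOMO", "VIA VERDI"]

labels = label_clusters(ocr_results, street_names)
```

The low-confidence cluster is dropped, and the misread "VIA R0MA" is still mapped to the dictionary term "VIA ROMA" because the two strings are sufficiently similar.
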


We carried out our experiments on:

  • a labeled database of 1003 book covers acquired by a CCD camera, where text on the same cover may vary in font, background and slope [doi]
  • a database of 249 171 video frames of athletic events where text of interest is on the athletes' bibs [doi]
  • a database of 134 pictures and 14 video clips, acquired with a mobile phone, of scenes including street plates [doi]

Reference publications:

S. Messelodi, C.M. Modena, L. Porzi, P. Chippendale: i-Street: Detection, Identification, Augmentation of Street Plates in a Touristic Mobile Application. Image Analysis and Processing, LNCS 9280, pp. 194-204, ICIAP 2015

S. Messelodi, C.M. Modena: Scene Text Recognition and Tracking to Identify Athletes in Sport Videos. Multimedia Tools and Applications, Special Issue on Automated Media Analysis and Production for Novel TV Services, Vol. 63, No. 2, pp. 521-545, 2013

S. Messelodi, C.M. Modena: Automatic Identification and Skew Estimation of Text Lines in Real Scene Images. Pattern Recognition, Vol. 32, No. 5, pp. 789-808, 1999

S. Messelodi, C.M. Modena: Context Driven Text Segmentation and Recognition. Pattern Recognition Letters, Vol. 17, No. 1, pp. 47-56, 1996