Rodney M. Goodman
B.Sc., Ph.D., C.Eng., SMIEEE, FIEE.
Keyword Spotting for Cursive Document Retrieval
Authors: Trish Keaton, Rodney Goodman
Abstract: We present one of the first attempts
at automatic retrieval of documents in the noisy environment
of unconstrained, multiple author, handwritten forms. The documents
were written in cursive script for which conventional OCR and text
retrieval engines are not adequate. We focus on a visual word spotting
indexing scheme for scanned documents housed in the Archives of
the Indies in Seville, Spain. The framework presented utilizes pattern
recognition, learning, and information fusion methods, and is motivated
by human word-spotting studies.
Motivation & Aims
The goal of this research is to develop a visual word spotting and
indexing scheme for the archival and retrieval of scanned historical
documents housed in the Archives of the Indies in Seville, Spain.
These documents were written in cursive script by multiple authors
and are hundreds of years old (many date back to Columbus's era).
Scholars have a pressing, ongoing need to search and explore the
contents of such archives. However, conventional
OCR and text retrieval engines are inadequate for such tasks. Existing
OCR systems often rely upon the ability to cleanly segment the words
prior to recognition. The documents in our database exhibit many
problems which would certainly cause such systems to fail. We must
contend with noise introduced by the photocopying and scanning processes,
as well as stray marks, underlines, and overlapping words. Under
these conditions perfect segmentation would be impossible. We have
developed an alternative strategy for the indexing and retrieval
of such documents based on learning a set of keyword signatures
of particular words of interest.
Our approach applies many standard image processing techniques in
the preprocessing of the documents, and the extraction of the spatial
characteristics of the words. In addition, we attempt to characterize
words via signatures motivated by human word-spotting experiments.
The recognition strategy is based upon probabilistic signature matching,
in which we view the entire word globally, rather than segmenting
and recognizing the individual letters of the word. We investigate
the ability to use such signatures, together with advanced encoding
schemes and learning, to facilitate the spotting of keywords in
handwritten cursive documents.
Approach
Focus-of-attention: We avoid page segmentation problems by incorporating
a focus-of-attention module, to identify candidate locations prior
to performing the word-level matching. This step involves normalized
cross-correlation of the document image with a set of keyword prototypes
(templates) which have been extracted from a training set of documents.
A set of candidate locations is extracted, with the different locations
ranked by correlation strength. The locations of the top correlation
peaks are then passed along to the preprocessing stage.
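
The focus-of-attention step can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the greedy peak-suppression rule, and the parameters `k` and `min_distance` are assumptions made for illustration, and the explicit loops favor readability over speed. It computes zero-mean normalized cross-correlation between a grayscale page and one keyword template, then returns the strongest peaks ranked by correlation strength.

```python
import numpy as np


def normalized_cross_correlation(image, template):
    """Zero-mean NCC of `template` at every fully overlapping position in `image`.

    Hypothetical helper for illustration; both inputs are 2-D grayscale arrays.
    """
    image = np.asarray(image, dtype=float)
    template = np.asarray(template, dtype=float)
    th, tw = template.shape
    ih, iw = image.shape

    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())

    ncc = np.zeros((ih - th + 1, iw - tw + 1))
    for r in range(ncc.shape[0]):
        for c in range(ncc.shape[1]):
            patch = image[r:r + th, c:c + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p ** 2).sum()) * t_norm
            if denom > 0:  # flat patches or templates keep a score of 0
                ncc[r, c] = (p * t).sum() / denom
    return ncc


def top_candidates(ncc, k=5, min_distance=5):
    """Return up to k (row, col, score) peaks, strongest first.

    Assumption: after each peak is taken, a square neighborhood of radius
    `min_distance` is suppressed so nearby positions are not reported twice.
    """
    scores = ncc.copy()
    peaks = []
    for _ in range(k):
        r, c = np.unravel_index(np.argmax(scores), scores.shape)
        if not np.isfinite(scores[r, c]):
            break  # everything left has been suppressed
        peaks.append((int(r), int(c), float(scores[r, c])))
        r0, c0 = max(0, r - min_distance), max(0, c - min_distance)
        scores[r0:r + min_distance + 1, c0:c + min_distance + 1] = -np.inf
    return peaks
```

With a set of keyword prototypes, one would call `normalized_cross_correlation` once per template and merge the ranked candidates before passing them to preprocessing. How the original system combines peaks across templates is not stated in the text, so that step is left out here.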