Automated text simplification to increase access to Web information for people with cognitive disabilities.
Jim Martin and Clayton Lewis*
Department of Computer Science and
Institute of Cognitive Science
University of Colorado, Boulder
*Coleman Institute for Cognitive Disabilities
Students: Assad Jarrahian and Kirill Kireyev
Support from Google
Opportunity. More than 21 million Americans
have some form of cognitive disability: a cognitive impairment that can
limit their economic, social, and individual activities. Causes include
Down syndrome and other developmental disabilities, traumatic brain
injury, stroke, Alzheimer disease and other forms of dementia, and some
severe mental illnesses. Most of these people are among the roughly 22%
of Americans who score in the lowest range (level 1 of 5 levels)
on national literacy assessments; see http://nces.ed.gov//naal/resources/execsumm.asp#litskills
This project will explore the extension of leading ML-based NLP methods
to offer simplified text to people with cognitive disabilities and
others with low literacy. We believe that the technology base exists
for automatically transforming text to reduce the size and complexity
of vocabulary, and to reduce syntactic complexity that is problematic
for some readers with cognitive limitations, such as sentence embedding
and passive constructions. Further, since there are large individual
differences among people with the same or different forms of
disability, we want to explore the potential for individualizing the
text transformations. For example, typical vocabulary development
differs considerably between people with Down syndrome and people with
Williams syndrome, both chromosomal abnormalities producing significant
cognitive impairment (Fowler, 1998).
This work will contribute to Google’s mission of making the
world’s information universally accessible and useful. Today,
while general guidelines exist for the Web that call for attention to
cognitive accessibility, the effort devoted to this, and the quantity
of materials that have actually been produced, are quite limited. The
state of the art is expensive manual reworking of text by skilled
editors. Automatic text simplification will make it much easier and
less expensive to produce cognitively accessible materials.
Technical approach. The same general approach
being successfully applied in Google’s machine translation work
is applicable in principle to text transformations of other kinds. The
approach is to build two statistical models, one of the relationship
between source and target languages, and one of the regularities of the
target language, and tune the application of these models to produce
best-fitting output as measured by a quality estimate, BLEU in the MT
case. The active area of text summarization is a near neighbor to our
application, and evaluations parallel to those in MT, using the related
ROUGE quality estimate, are being carried out at the Document
Understanding Conferences (DUC) competitions, though to date most work
shows a good deal more reliance on linguistic analysis of various kinds
than Google’s statistical MT work (see http://www-nlpir.nist.gov/projects/duc/pubs.html; but also Knight and Marcu, 2002).
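As a rough illustration of the two-model idea described above, the sketch below scores candidate outputs by combining a source-target model with a target-language model and picks the best one. It is a toy under stated assumptions, not a description of Google's system: the candidate sentences, the probability tables, and the names `CHANNEL`, `LANGUAGE`, `score`, and `best_output` are all hypothetical, and the numbers are invented.

```python
import math

# Hypothetical toy tables; every probability here is invented for illustration.
# Source-target model: P(source | candidate), how well a candidate preserves
# the content of the source sentence.
CHANNEL = {
    "the form must be sent by you": 0.30,
    "you must send the form": 0.25,
    "send form you": 0.20,
}
# Target-language model: P(candidate), how typical the candidate is of the
# target language (here, simplified English).
LANGUAGE = {
    "the form must be sent by you": 0.02,
    "you must send the form": 0.10,
    "send form you": 0.001,
}


def score(candidate):
    """Combined log score: log P(source | candidate) + log P(candidate)."""
    return math.log(CHANNEL[candidate]) + math.log(LANGUAGE[candidate])


def best_output(candidates):
    """Return the candidate with the highest combined score."""
    return max(candidates, key=score)


if __name__ == "__main__":
    print(best_output(list(CHANNEL)))
```

In a real system both tables would be estimated from corpora and the candidate set would be searched rather than listed; the point here is only that neither model alone picks the output.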
This statistical modeling work requires large corpora of two kinds, a
corpus of parallel texts (pairs representing the same sentences in
Chinese and English, for example, for MT) and a corpus of sentences in
the target language. While such corpora are available for MT, creating
a parallel corpus for text summarization requires ingenuity (see, e.g.,
Marcu, 1999).
The corpus problem is still more challenging for text simplification,
where the target output has to have characteristics different from those
of the language as a whole, having a restricted vocabulary and
excluding some syntactic constructions. This means that existing
language models for English can’t be used. Through cooperation with
TheArcLink, a nonprofit organization that produces simplified
descriptions of Medicaid programs by manual editing, we will have
access to a small parallel corpus, which also includes a corpus of
simplified English as defined operationally by that organization. But
this corpus will be too small (only a few hundred texts) to
drive statistical model creation adequately. Corpus limitations have
held previous text simplification work (e.g., Carroll et al., 1999; PSET
Project, http://osiris.sunderland.ac.uk/~pset) to non-statistical approaches.
We want to explore three approaches to breaking through this corpus
limitation. First, we will create a corpus of simplified English by
selection: screening large volumes of text, and retaining only those
passages that meet our vocabulary and syntax restrictions. A subproblem
will be developing a screener for syntax that requires minimal
analysis. We will then build a language model for the screened text.
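A screener of the kind described in the first approach might look like the sketch below. It is only an illustration under assumptions: the vocabulary list, the surface cues for embedding and passives, the length limit, and the names `is_simple` and `screen` are hypothetical choices, not restrictions the project has settled on.

```python
import re

# Hypothetical restricted vocabulary; a real list would be far larger.
ALLOWED_WORDS = {
    "you", "can", "get", "help", "with", "your", "rent", "the", "office",
    "a", "form", "was", "mailed", "by", "who", "person", "called",
}
# Assumed surface cues for sentence embedding (relative pronouns).
EMBEDDING_CUES = {"who", "whom", "whose", "which", "that"}
# Assumed surface cue for passives: a form of "be" followed by a word ending
# in "-ed". This misses irregular participles and flags some adjectives.
PASSIVE_CUE = re.compile(r"\b(?:is|are|was|were|be|been|being)\s+\w+ed\b")


def tokenize(sentence):
    """Lowercase the sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def is_simple(sentence, max_words=12):
    """Accept a sentence only if it passes the vocabulary and syntax checks."""
    tokens = tokenize(sentence)
    if not tokens or len(tokens) > max_words:
        return False
    if any(t not in ALLOWED_WORDS for t in tokens):
        return False  # out-of-vocabulary word
    if any(t in EMBEDDING_CUES for t in tokens):
        return False  # likely embedded clause
    return PASSIVE_CUE.search(sentence.lower()) is None


def screen(sentences):
    """Keep only the sentences that meet all restrictions."""
    return [s for s in sentences if is_simple(s)]
```

The checks use no parser, in keeping with the goal of minimal analysis; the cost is false positives and negatives that a real screener would need to measure.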
Second, we will examine the behavior of a full-English language model
when it is artificially constrained to eliminate designated vocabulary
items. On its face, this method would only deal with vocabulary
restrictions, not syntactic simplification. But it may in fact be
possible to obtain some degree of syntactic simplification by blocking
the use of selected vocabulary, such as relative pronouns and past
participles (used in passives).
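The second approach can be sketched with a toy bigram model in which designated words are blocked and the remaining probability mass is renormalized. This is a minimal illustration, assuming a bigram model and a three-sentence corpus; the function names and the choice of renormalization are assumptions, and a full-English model would behave in ways this toy cannot show.

```python
from collections import Counter, defaultdict

# Hypothetical miniature "full English" corpus.
CORPUS = [
    "the man who called left",
    "the man left",
    "the man called",
]


def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev, blocked=frozenset()):
    """Next-word distribution after `prev`, with blocked words removed
    and the remaining probabilities renormalized to sum to 1."""
    allowed = {w: c for w, c in counts.get(prev, {}).items() if w not in blocked}
    total = sum(allowed.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in allowed.items()}


if __name__ == "__main__":
    counts = train_bigrams(CORPUS)
    print(next_word_probs(counts, "man"))
    print(next_word_probs(counts, "man", blocked={"who"}))
```

Blocking the relative pronoun removes the only path into the embedded clause in this toy corpus, which is the kind of indirect syntactic effect the paragraph above speculates about.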
Our third approach addresses the lack of a parallel corpus for text
simplification. We will experiment with the use of
statistically-derived vector models of semantics to represent the
meaning of a to-be-simplified text in a high-dimensional space. It is a
well-defined geometric problem to find the collection of words in a
restricted vocabulary whose meaning is the best approximation to the
meaning of a target text. Given a collection of words (the restricted
vocabulary), the corresponding vectors define a subspace S of the
overall space. Given a new text, the best approximation of that text
using only words in the restricted vocabulary can be calculated by
projecting the vector representing the new text onto S. Since this
projection may not correspond to the meaning of any actual collection
of words (texts contain only whole-number multiples of words, and so do
not cover the whole subspace), we will use search to find the best
approximating text.
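The projection and the follow-up search can be sketched as below. Several things here are assumptions made for illustration: that a text's vector is the sum of its word vectors, that Euclidean distance measures closeness of meaning, that a greedy search is adequate, and the three-dimensional toy vectors themselves. The names `project` and `greedy_bag` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: build an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project(target, vectors):
    """Orthogonal projection of `target` onto the subspace S spanned by `vectors`."""
    proj = [0.0] * len(target)
    for b in orthonormal_basis(vectors):
        coeff = dot(target, b)
        proj = [p + coeff * bi for p, bi in zip(proj, b)]
    return proj


def greedy_bag(target, word_vectors, max_words=20):
    """Greedily add whole words whose summed vectors approach the projection."""
    goal = project(target, list(word_vectors.values()))
    total = [0.0] * len(target)
    bag = []
    for _ in range(max_words):
        best = min(word_vectors,
                   key=lambda w: distance(add(total, word_vectors[w]), goal))
        candidate = add(total, word_vectors[best])
        if distance(candidate, goal) >= distance(total, goal):
            break  # no single word brings us closer
        total = candidate
        bag.append(best)
    return bag


# Hypothetical restricted vocabulary with invented 3-dimensional vectors.
WORD_VECTORS = {"pay": [1.0, 0.0, 0.0], "home": [0.0, 1.0, 0.0]}
TEXT_VECTOR = [2.0, 1.0, 3.0]  # invented vector for a to-be-simplified text

if __name__ == "__main__":
    print(project(TEXT_VECTOR, list(WORD_VECTORS.values())))
    print(greedy_bag(TEXT_VECTOR, WORD_VECTORS))
```

The third component of the text vector lies outside S and is simply lost in projection, which corresponds to meaning that the restricted vocabulary cannot express.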
The result of this projection or search process is a bag of words that
must then be rendered as prose. We will treat this problem as a further
search, this time using our language model for simplified English. The
objective is to find the highest-likelihood string of words, given the
statistics in the model, that includes the items in the bag.
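A minimal sketch of this second search follows, assuming a bigram language model with add-one smoothing and an exhaustive search over orderings. Both are illustrative assumptions: exhaustive search grows factorially with the size of the bag and would need to be replaced by a heuristic search for real texts, and the sketch uses exactly the words in the bag without adding function words. The corpus and the names `train`, `log_prob`, and `best_ordering` are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

# Hypothetical miniature corpus of simplified English.
SIMPLE_CORPUS = [
    "you can get help",
    "you can get a form",
    "the office can help you",
]


def train(sentences):
    """Collect bigram counts, preceding-word counts, and vocabulary size."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            bigrams[(prev, nxt)] += 1
            unigrams[prev] += 1
    return bigrams, unigrams, len(vocab)


def log_prob(words, model):
    """Log-likelihood of a word sequence under the add-one-smoothed bigram model."""
    bigrams, unigrams, vocab_size = model
    tokens = ["<s>"] + list(words) + ["</s>"]
    return sum(
        math.log((bigrams[(prev, nxt)] + 1) / (unigrams[prev] + vocab_size))
        for prev, nxt in zip(tokens, tokens[1:])
    )


def best_ordering(bag, model):
    """Try every distinct ordering of the bag and return the most likely one."""
    orderings = sorted(set(permutations(bag)))  # sorted for determinism
    return max(orderings, key=lambda order: log_prob(order, model))


if __name__ == "__main__":
    model = train(SIMPLE_CORPUS)
    print(" ".join(best_ordering(["help", "get", "can", "you"], model)))
```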
We can benefit greatly in this work from access to Google’s
informational and computational resources, and from the opportunity to
interact with Google’s language researchers, who may be able to
help us devise further approaches to pursue.
References.
Carroll, J., Minnen, G., Pearce, D., Canning, Y., Devlin, S., and Tait,
J. (1999) Simplifying text for language-impaired readers. In Proc. EACL’99, 269-270.
Fowler, A.E. (1998) Language in mental retardation. In Burack, Hodapp, and Zigler (Eds.) Handbook of Mental Retardation and Development. Cambridge, UK: Cambridge University Press, 290-333.
Knight, K. and Marcu, D. (2002) Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139, 91-107.
Marcu, D. (1999) The automatic construction of large-scale corpora for summarization research. Proc. SIGIR, 137-144.