Automated text simplification to increase access to Web information for people with cognitive disabilities.


Jim Martin and Clayton Lewis*
Department of Computer Science and
Institute of Cognitive Science
University of Colorado, Boulder

*Coleman Institute for Cognitive Disabilities

Students: Assad Jarrahian and Kirill Kireyev

Support from Google


Opportunity. More than 21 million Americans have some form of cognitive disability: a cognitive impairment that can limit their economic, social, and individual activities. Causes include Down syndrome and other developmental disabilities, traumatic brain injury, stroke, Alzheimer's disease and other forms of dementia, and some severe mental illnesses. Most of these people are among the roughly 22% of Americans who score in the lowest range (level 1 of 5 levels) on national literacy assessments; see http://nces.ed.gov//naal/resources/execsumm.asp#litskills

This project will explore the extension of leading ML-based NLP methods to offer simplified text to people with cognitive disabilities and others with low literacy. We believe that the technology base exists for automatically transforming text to reduce the size and complexity of vocabulary, and to reduce syntactic complexity that is problematic for some readers with cognitive limitations, such as sentence embedding and passive constructions. Further, since there are large individual differences among people with the same or different forms of disability, we want to explore the potential for individualizing the text transformations. For example, typical vocabulary development for people with Down syndrome and Williams syndrome, both chromosomal abnormalities producing significant cognitive impairment, is quite different (Fowler, 1998).

This work will contribute to Google’s mission of making the world’s information universally accessible and useful. Today, while general guidelines exist for the Web that call for attention to cognitive accessibility, the effort devoted to this, and the quantity of materials that have actually been produced, are quite limited. The state of the art is expensive manual reworking of text by skilled editors. Automatic text simplification will make it much easier and less expensive to produce cognitively accessible materials.

Technical approach. The general approach being applied successfully in Google’s machine translation work is applicable in principle to text transformations of other kinds. The approach is to build two statistical models, one of the relationship between source and target languages and one of the regularities of the target language, and to tune the application of these models to produce best-fitting output as measured by a quality estimate (BLEU in the MT case). Text summarization, an active area, is a near neighbor to our application. Evaluations parallel to those in MT, using the related ROUGE quality estimate, are being carried out at the Document Understanding Conference (DUC) competitions, though to date most work shows a good deal more reliance on linguistic analysis of various kinds than Google’s statistical MT work does (see http://www-nlpir.nist.gov/projects/duc/pubs.html; but also Knight and Marcu, 2002).
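The two-model idea can be illustrated with a minimal sketch. It assumes a noisy-channel style combination, in which each candidate output is scored by the product of a transformation probability and a target-language probability. All probabilities, sentences, and names below (CHANNEL, TARGET_LM, best_output) are hypothetical and invented for illustration; they are not drawn from any real system.

```python
import math

# Hypothetical toy numbers, invented purely for illustration.
# CHANNEL[(source, candidate)] stands in for P(source | candidate):
# how well the candidate accounts for the source text.
SOURCE = "the form was submitted by the applicant"
CHANNEL = {
    (SOURCE, "the applicant sent the form"): 0.20,
    (SOURCE, "the form was sent"): 0.05,
    (SOURCE, "applicant the form sent"): 0.20,
}
# TARGET_LM[candidate] stands in for P(candidate) under the target-language model.
TARGET_LM = {
    "the applicant sent the form": 0.010,
    "the form was sent": 0.012,
    "applicant the form sent": 0.0001,
}

def score(source, candidate):
    """Log of P(source | candidate) * P(candidate)."""
    return math.log(CHANNEL[(source, candidate)]) + math.log(TARGET_LM[candidate])

def best_output(source, candidates):
    """Return the candidate with the highest combined score."""
    return max(candidates, key=lambda c: score(source, c))

print(best_output(SOURCE, list(TARGET_LM)))
```

In this toy case the first model alone cannot separate the fluent candidate from the scrambled one; the target-language model does that work.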

This statistical modeling work requires large corpora of two kinds: a corpus of parallel texts (for MT, pairs representing the same sentences in Chinese and English, for example) and a corpus of sentences in the target language. While such corpora are available for MT, creating a parallel corpus for text summarization requires ingenuity (see, e.g., Marcu, 1999).

The corpus problem is still more challenging for text simplification, where the target output has to have characteristics different from those of the language as a whole: a restricted vocabulary and the exclusion of some syntactic constructions. This means that existing language models for English can’t be used. Through cooperation with TheArcLink, a nonprofit organization that produces simplified descriptions of Medicaid programs by manual editing, we will have access to a small parallel corpus, which also includes a corpus of simplified English as defined operationally by that organization. But this corpus, only a few hundred texts, will be too small to drive statistical model creation adequately. Corpus limitations have held previous text simplification work (e.g., Carroll et al., 1999; PSET Project, http://osiris.sunderland.ac.uk/~pset) to non-statistical approaches.

We want to explore three approaches to breaking through this corpus limitation. First, we will create a corpus of simplified English by selection: screening large volumes of text, and retaining only those passages that meet our vocabulary and syntax restrictions. A subproblem will be developing a screener for syntax that requires minimal analysis. We will then build a language model for the screened text.
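A minimal sketch of screening by selection follows. It assumes that the vocabulary restriction is a simple word list and that the syntax screener can be approximated by a length cap plus a list of marker words that often signal embedding. The word lists, the length cap, and the function names are all hypothetical; a real screener would need more careful criteria.

```python
import re

# Hypothetical restricted vocabulary and screening settings.
ALLOWED_VOCAB = {"the", "dog", "ran", "home", "she", "saw", "a", "cat", "it", "was", "big"}
BLOCKED_MARKERS = {"which", "whom", "whose", "that"}  # crude proxy for sentence embedding
MAX_WORDS = 8  # crude proxy for syntactic complexity

def tokenize(sentence):
    """Lowercase the sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def passes_screen(sentence):
    """True if the sentence meets the vocabulary and (approximate) syntax restrictions."""
    tokens = tokenize(sentence)
    if not tokens or len(tokens) > MAX_WORDS:
        return False
    if any(t in BLOCKED_MARKERS for t in tokens):
        return False
    return all(t in ALLOWED_VOCAB for t in tokens)

def screen_corpus(sentences):
    """Keep only the sentences that pass the screen."""
    return [s for s in sentences if passes_screen(s)]

sample = [
    "The dog ran home.",
    "She saw a cat that was big.",
    "The committee deliberated.",
]
print(screen_corpus(sample))
```

The retained sentences would then serve as training data for the simplified-English language model.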
 
Second, we will examine the behavior of a full-English language model when it is artificially constrained to eliminate designated vocabulary items. On its face, this method would only deal with vocabulary restrictions, not syntactic simplification. But it may in fact be possible to obtain some degree of syntactic simplification by blocking the use of selected vocabulary, such as relative pronouns and past participles (used in passives).
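One way to picture the constrained model is with a bigram model whose next-word distribution is masked and renormalized. The sketch below makes that assumption; the tiny corpus, the blocked list, and the function names are hypothetical, and a real full-English model would be far larger and smoothed.

```python
from collections import Counter, defaultdict

def count_bigrams(sentences):
    """Count next-word occurrences for each preceding word."""
    counts = defaultdict(Counter)
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def constrained_next_probs(counts, prev, blocked):
    """Next-word distribution after prev, with blocked words removed and the rest renormalized."""
    allowed = {w: c for w, c in counts[prev].items() if w not in blocked}
    total = sum(allowed.values())
    if total == 0:
        return {}
    return {w: c / total for w, c in allowed.items()}

corpus = ["the man who left", "the man left", "the man slept"]
counts = count_bigrams(corpus)
# Blocking the relative pronoun removes the continuation that opens an embedded clause.
print(constrained_next_probs(counts, "man", blocked={"who"}))
```

Here blocking a single vocabulary item shifts all of the probability mass to the continuations without a relative clause, which is the effect on syntax that this approach hopes to exploit.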

Our third approach addresses the lack of a parallel corpus for text simplification. We will experiment with the use of statistically-derived vector models of semantics to represent the meaning of a to-be-simplified text in a high-dimensional space. It is a well-defined geometric problem to find the collection of words in a restricted vocabulary whose meaning is the best approximation to the meaning of a target text. Given a collection of words (the restricted vocabulary), the corresponding vectors define a subspace S of the overall space. Given a new text, the best approximation of that text using only words in the restricted vocabulary can be calculated by projecting the vector representing the new text onto S. This projection may not correspond to the meaning of any actual collection of words, because a text contains each word a whole number of times and such combinations do not cover the whole subspace. We will therefore use search to find the best approximating text.
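The projection and the follow-up search can be sketched as below. The sketch assumes that a text vector is the sum of its word vectors, that closeness is measured by cosine similarity, and that a greedy search is an acceptable stand-in for whatever search is eventually used. The three-dimensional vectors and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: build an orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        n = norm(w)
        if n > tol:
            basis.append([wi / n for wi in w])
    return basis

def project(target, vectors):
    """Orthogonal projection of target onto the subspace S spanned by vectors."""
    proj = [0.0] * len(target)
    for b in orthonormal_basis(vectors):
        coeff = dot(target, b)
        proj = [p + coeff * bi for p, bi in zip(proj, b)]
    return proj

def cosine(u, v):
    nu, nv = norm(u), norm(v)
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def greedy_bag(target, vocab, max_words=5):
    """Greedily add whole words while the summed vector moves closer to target."""
    bag, total, best = [], [0.0] * len(target), 0.0
    for _ in range(max_words):
        choice = None
        for word, vec in vocab.items():
            sim = cosine([t + x for t, x in zip(total, vec)], target)
            if sim > best + 1e-12:
                best, choice = sim, word
        if choice is None:  # no word improves the approximation
            break
        bag.append(choice)
        total = [t + x for t, x in zip(total, vocab[choice])]
    return bag

# Hypothetical restricted vocabulary and text vector.
vocab = {"doctor": [1.0, 0.0, 0.0], "money": [0.0, 1.0, 0.0]}
text_vector = [2.0, 1.0, 3.0]
projected = project(text_vector, list(vocab.values()))
print(projected, greedy_bag(projected, vocab, max_words=3))
```

The third component of the text vector lies outside the restricted vocabulary's subspace and is lost in the projection; the greedy step then recovers whole-number word counts that approximate what remains.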

The result of this projection or search process is a bag of words that must then be rendered as prose. We will treat this problem as a further search, this time using our language model for simplified English. The objective is to find the highest-likelihood string of words, given the statistics in the model, that includes the items in the bag.
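For a very small bag, the rendering step can be sketched as an exhaustive search over orderings scored by a bigram model. The sketch assumes add-one smoothing and considers only permutations of the bag itself, without inserting extra words; the training sentences and function names are hypothetical. Exhaustive search grows factorially with bag size, so a realistic system would need a heuristic search instead.

```python
import math
from collections import Counter
from itertools import permutations

def train_bigram_model(sentences):
    """Collect context counts, bigram counts, and vocabulary size."""
    contexts, bigrams, vocab = Counter(), Counter(), {"<s>", "</s>"}
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)

def log_likelihood(words, model):
    """Add-one smoothed bigram log-likelihood of a word sequence."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + list(words) + ["</s>"]
    return sum(
        math.log((bigrams[(prev, nxt)] + 1) / (contexts[prev] + vocab_size))
        for prev, nxt in zip(tokens, tokens[1:])
    )

def best_ordering(bag, model):
    """Return the ordering of the bag with the highest likelihood under the model."""
    candidates = sorted(set(permutations(bag)))  # sorted for deterministic tie-breaking
    return list(max(candidates, key=lambda order: log_likelihood(order, model)))

# Hypothetical simplified-English training sentences.
simple_corpus = ["the doctor helps you", "you pay the doctor", "the doctor helps"]
model = train_bigram_model(simple_corpus)
print(" ".join(best_ordering(["helps", "doctor", "the"], model)))
```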

We can benefit greatly in this work from access to Google’s informational and computational resources, and from the opportunity to interact with Google’s language researchers, who may be able to help us devise further approaches to pursue.

References.

Carroll, J., Minnen, G., Pearce, D., Canning, Y., Devlin, S., and Tait, J. (1999) Simplifying text for language-impaired readers. In Proc. EACL’99, 269-270.
Fowler, A.E. (1998) Language in mental retardation. In Burack, Hodapp, and Zigler (Eds.) Handbook of Mental Retardation and Development. Cambridge, UK: Cambridge University Press, 290-333.
Knight, K. and Marcu, D. (2002) Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139, 91-107.
Marcu, D. (1999) The automatic construction of large-scale corpora for summarization research. In Proc. SIGIR’99, 137-144.