CHAPTER 3

REVIEW OF THE COMPUTING LITERATURE

3.1. Introduction

In this chapter the computing literature relating to the detection and correction of spelling errors is surveyed.

3.2. Detection v. Correction

There is a distinction that needs to be made between programs that do error detection and those that do error correction. Error detection involves the isolation of misspelt words (the misspelling may even take the form of another word) or tokens in a text or program. Error correction involves taking a misspelling that has been detected and indicating a correction, or likely corrections.

Some computer programs are primarily concerned with detection of errors, and rely on the user to make the correction ('self-correction'), for example: UNIX TYPO (McMahon et al., 1978); UNIX/IBM SPELL (Peterson, 1980a); and SPELLWWB (Macdonald, Frase et al., 1982). Conversely, a correction program can be used to correct an error once it is detected, or on request: e.g. SPELLTELL (Macdonald, Frase et al., 1982). Most spelling correction programs, however, carry out both error detection and correction.

3.2.1. Applications

Spelling error correction programs have been used for a variety of applications:
- English prose, e.g. the 'writers' workbench' (Macdonald, Frase et al., 1982);
- children's interactions with the Plato educational system (Tenczar and Golden, 1962);
- text input by computer users, e.g. DEC-10 SPELL (Peterson, 1980a);
- user interface applications, especially computer mail systems (Durham et al., 1983);
- large databases in Hebrew and English (Mor and Fraenkel, 1982);
- chemical abstracts (Pollock and Zamora, 1984);
- names of passengers in airline reservation systems (Davidson, 1962);
- retrieval of names from genealogical databases (Munnecke, 1980);
- spelling errors made in writing LISP programs, including misspelt keywords (Teitelman, 1978);
- errors made in writing SOLO programs (Lewis, 1980).

Of most relevance here are those concerned with the correction of spelling errors in English text, although other applications will also be considered where of interest.

3.2.2. Modes of Use - Interactive Programs

There are a number of different modes of use of programs. A complete file of text or code may be processed by the program, and information about errors then provided to the user: the UNIX SPELL program takes a file of text and outputs all words that it does not recognise; Pollock and Zamora (1984) test a program to detect and correct errors in chemical abstracts; Atwell's correction program looks for errors in a large corpus of texts (Atwell, 1983).
Alternatively, a program may be used interactively, permitting the user to check or correct a word at the time he writes it or as he proof-reads it: the Interlisp spelling corrector can intervene when the programmer makes an error, and can make a correction (Teitelman, 1978); Durham, Lamb and Saxe are concerned with the correction of errors made as the user interacts with the computer; Peterson's spelling corrector (Peterson, 1980b) is designed to work interactively. The 'writers' workbench' detector, SPELLWWB, takes a file of text and looks for errors, then permits the user to see each error in context and to use the corrector, SPELLTELL, to correct it interactively (Macdonald, Frase et al., 1982). The 'best way' of working - interactively or otherwise - will vary for different applications and for different users.

Different modes of operation are also offered. The Interlisp corrector operates in two modes: cautious and trusting. If it is able to correct a mistake, in trusting mode it will make the correction and continue as though no error had occurred. In cautious mode it will ask for approval before making the correction. Durham, Lamb and Saxe (Durham et al., 1983) discuss the difficulties of automatic correction of errors, particularly in the context of accidental acceptance of irreversible commands.
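The cautious/trusting distinction can be sketched as follows (a minimal illustration only, not Interlisp's actual implementation; the `confirm` callback and the `trusting` flag are invented for the example):

```python
def apply_correction(word, correction, trusting, confirm):
    """Apply a proposed correction in either trusting or cautious mode.

    In trusting mode the correction is made silently, as though no error
    had occurred; in cautious mode the user-supplied confirm() callback
    must approve the change first.
    """
    if trusting:
        return correction            # correct and carry on
    if confirm(word, correction):    # cautious: ask for approval
        return correction
    return word                      # approval withheld: leave the word alone

# Example: a cautious session in which the user approves every suggestion.
fixed = apply_correction("recieve", "receive", trusting=False,
                         confirm=lambda w, c: True)
```

The interesting design question, discussed below, is when silent (trusting) correction is safe and when, as with irreversible commands, it is not.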
Peterson (1980b) suggests asking the user whether or not he wishes to replace an error with the correction offered, with the additional option of 'replace and remember' for future automatic replacement should the error recur. In the correction of commands in programming languages, automatic correction may be convenient for correcting mistypings, but dangerous when giving irreversible commands. In text, it may be desirable to correct all occurrences of a repeated misspelling, but less desirable when a correct word is categorised as an error, or when an error is miscorrected. When several different corrections are offered the user will need to be consulted.

3.2.3. Use of context: syntax and semantics

In considering computer methods of error detection and correction, an obvious starting point would be "How do we, as humans, detect and correct spelling errors?" If a word is misspelt, it may not 'look right' to us. We detect an error:

    the boy closd the door

A word may have been misspelt in the form of another word and not 'make sense' to us:

    the girl bounced the bull

Given errors in isolation, we can detect when a word is misspelt in the first way, but cannot detect when it is misspelt as another word. We need the context surrounding the word, at least, to recognise the latter sort of error, though context may also be useful for corrections of the former kind. The difficulty of the task of proof-reading also demonstrates the strength of context: we often misread a word, in context, as we think it should be, and may fail even to detect an error.
Thus, in some cases, errors may be easier to detect out of context.

One source of information which may help us to detect and correct errors in context is the grammatical structure of the sentence, or program code, in which an error is embedded: the syntax. Error detection in programming languages uses knowledge of syntax to detect and correct errors: the Interlisp correction program (Teitelman, 1978) attempts to correct mistakes by using the current context of the computation plus information about what the user had been doing. Programming commands are part of a well-defined syntax and restricted vocabulary. Text, in 'natural language', is not so constrained. The study of syntax in natural language is a large area of research (Winograd, 1983). Research into 'ill-formed input' in natural language may give us some information about misspelt words in context (Fass and Wilks, 1983). It is difficult to discover what 'part of speech' a misspelling might be labelled with, though some work has been done on this problem by Atwell (1983).

However, two problems, in particular, may arise:

1. What happens when the surrounding words, used to provide information about the misspelt word, are also misspelt?

    I poot owt teh kat and karyd inn tha milke.

2. How do natural language programs cope with text that not only has misspelt words in it but that may also be 'ungrammatical' (or, at least, for which the grammar cannot be defined easily)? This can be the case with the children of interest here.
Additionally, errors that are misspelt 'as other words' may be of the same syntactic category as the intended word. Even if an error is detected, and we know that the correction should have a particular syntactic type, e.g. noun, this still may not greatly reduce the number of possibilities for the correction.

When considering words in context, the meaning of the surrounding words and the general domain will also provide some useful information. Semantic analysis of the sentence or text may aid detection and correction of spelling errors. Fass (1983) uses semantic information in spelling error correction. Again, as with syntactic information, it seems that the information provided here is not specific enough to aid correction. Research investigating the use of context to detect and correct spelling errors is not far enough advanced to allow formalisation of this process in a computer program, though current researchers are starting to work on this problem (Mitton, 1984a). In the programs described in this thesis no context information is used to aid correction. The problem of detection will be further discussed in chapter 5.

One method of using knowledge of the subject matter of the text would be the use of dictionaries specific to the topic area, to constrain the possibilities for correction. Pollock and Zamora use dictionaries specific to the type of documents they are concerned with. Peterson also recommends using dictionaries of 'document-specific' words.
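The idea can be sketched as consulting a small topic-specific dictionary before a general one when proposing corrections (a hypothetical illustration: the word lists are invented, and the crude length-and-first-letter `candidates` heuristic merely stands in for a real matching strategy):

```python
def candidates(misspelling, words):
    """A deliberately crude stand-in for a real matching strategy:
    keep words of similar length sharing the first letter."""
    return [w for w in words
            if abs(len(w) - len(misspelling)) <= 1 and w[0] == misspelling[0]]

def suggest(misspelling, topic_dict, general_dict):
    """Prefer corrections drawn from the topic-specific dictionary,
    falling back to the general dictionary only if it yields nothing."""
    return (candidates(misspelling, topic_dict)
            or candidates(misspelling, general_dict))

# In a chemistry text, 'benzin' is more plausibly a chemical term
# than a common word of similar shape.
suggest("benzin", ["benzene", "butane"], ["because", "beside"])
# -> ['benzene', 'butane']
```

The topic dictionary thus constrains the options offered before the (much larger) general vocabulary is consulted.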
Use of dictionaries in error detection will be discussed below.

3.3. Error Detection

Different methods of spelling error detection, and issues specifically related to them, are discussed in this chapter.

3.3.1. Digram and trigram frequencies

The UNIX TYPO program uses statistics of the frequencies of digrams and trigrams in English to detect probable errors. Typically, in a large sample of text, only 550 digrams (70% of possible digrams) and 5,000 trigrams (25% of possible trigrams) actually occur (Peterson, 1980a). The frequency of digrams and trigrams in the input text is computed, and a list is made of distinct tokens or words in the text. For each token an 'index of peculiarity' is computed. This index is a statistical measure of the probability (over all trigrams in the token) that each trigram was produced by the same source as the rest of the text. TYPO outputs a list of tokens, sorted by index of peculiarity, highest index at the front of the list. Any token matching a list of 2,500 common words is omitted and assumed correct. The user still has the task of deciding which of the words are really misspellings. This task is made more difficult by the lack of context.

3.3.2. Dictionary look-up

The most common method of error detection is 'dictionary look-up'. Each input word or token is compared with a dictionary of legitimate words or a list of acceptable tokens. If a match is found then the word is considered to be correct. If no match is found the word is either incorrect or not in the dictionary.
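The basic look-up scheme can be sketched as follows (a minimal illustration; the tokenisation rule and the tiny word list are invented for the example):

```python
import re

def detect_errors(text, dictionary):
    """Return the distinct tokens in text that do not match the dictionary.

    A token is taken here, crudely, to be a maximal sequence of letters;
    matching is case-insensitive.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    return sorted({t for t in tokens if t.lower() not in dictionary})

words = {"the", "boy", "closed", "door"}
detect_errors("The boy closd the door", words)  # -> ['closd']
```

Note that this detects 'closd' but would pass 'bull' for 'ball' unchecked: look-up alone cannot catch words misspelt as other words.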
The UNIX SPELL program attempts to match each word in a text file with words stored in a large dictionary. The input words for which no matches are found are output as a list. Peterson's program also attempts to match input words and dictionary words, assuming an error if the match fails. Each word in a SOLO command (Lewis, 1980) is compared with a list of procedure names, and a list of node and relation names.

Authors of error detection programs using dictionary look-up are concerned particularly with issues of dictionary size, representation, and search strategies (see Peterson, 1980b, for a review).

3.3.3. Dictionary size

The larger the dictionary, the more likely it is that correctly spelt words will be matched, reducing the chance of correct words being rejected as errors. Pollock and Zamora use a 40,000-word dictionary to process some 25 million words (including repeats) (Pollock and Zamora, 1984). SPELLWWB (Macdonald, Frase et al., 1982) uses the UNIX 30,000-word dictionary. If the dictionary is too large, however, it may contain unusual and archaic words that could match spelling errors, causing them to go undetected. Smaller dictionaries may constrain the vocabulary to include most words required by the user, but will increase the possibility of correct words not matching. The SOLO program checker (Lewis, 1980) uses a small table of stored words, but in this case it is desirable that the user is limited to a restricted vocabulary.
Peterson (1980a), quoting statistics from the Brown corpus, states that over half of the tokens in normal English occur in a vocabulary of only 134 words. He also states that the total number of distinct tokens in a document tends to be moderately small - 1,500 in a 10,000-word document - and that documents often use words of particular interest to their subject area:

    "for each specific document there exists a (small) table of words which occur frequently in that document (but not in normal English)" (Peterson, 1980a, p. 681)

He suggests a three-table structure of words to be searched:
- firstly, a small table of the most common English words (100-200 words);
- secondly, a table of (other) words already used in the document, constructed dynamically (1,000-2,000 words);
- finally, a large list of the remaining words in the main dictionary (10,000-100,000 words).

The approach of providing a dictionary containing vocabulary specific to the topic of interest, in addition to a small general dictionary, is taken in this study. It will be further discussed in chapter 5.

Efficiency of error detection depends upon the size of the dictionary, the representation, and the search strategy adopted. It also depends on whether the dictionary is stored in-core, on a secondary disc, or in a separate part of virtual memory requiring longer access and search times.

3.3.4. Representation and search

Input forms may be represented as strings of characters and matched to dictionary forms represented in the same way (SPELLWWB (Macdonald, Frase et al., 1982); Damerau, 1964). Other representations may also be used.
Davidson (1962) converts input forms to abbreviations to match a list of names also stored as abbreviations. In order to speed up dictionary access, hash coding methods of representing words are used by Mor and Fraenkel (1982) and in the DEC-10 SPELL program (Peterson, 1980b). Mor and Fraenkel discuss the effectiveness of hash coding and make a case for its use. In the DEC-10 SPELL program words are initially coded by their first two letters and word length, and each hash table entry is a pointer to a chain of words with the same initial two letters and word length.

Tenczar and Golden (1962) use no 'string representation' of words. Instead, words are mapped onto a fixed number of bits (the number of bits depending upon the word length of the computer, not the length of the word represented). Tenczar and Golden state that they use human criteria to set the bit representation of the word. The features of the word that are represented are length, first character, letter content, letter order (based on letter digraphs) and syllabic pronunciation. These features are ordered (subjectively) by importance. The coded input word is compared with a stored word and if no bits are in conflict the words are taken to be the same. Words that look alike are stored near each other in the dictionary. The dictionary is searched using 'binary chop' search. Details of the coding algorithm are not given, though the authors indicate that a scheme using 41 bits to record information has been shown to perform satisfactorily. Galli and Yamada (1968) verify the spelling of words by comparison with a 56,000-word dictionary, consisting of words, stems and affixes plus 2,000 common misspellings. The problems of affix analysis will be considered below.

Peterson (1980b) uses different representations for each of his three dictionaries. The dictionary of commonly used words is a modified 'trie'[1] structure, where the root is an array of all possible first letters. Branches go from each of these first-letter nodes to each 'next letter' node. The structure initially resembles a tree, with third-level nodes holding third letters, etc. However, all common endings are represented by the same nodes, so the tree becomes more like a graph. Information is stored in each letter node to indicate whether or not the letter is 'the end of a word'. Input words are matched, letter by letter, down the trie. If no match is found, i.e. the end of the input word is not reached at the same time as a letter with an 'end-of-word' marker, the document-specific dictionary is checked. This dictionary is stored as a hash table, coded by first letter, last letter and word length. If the word is not found here, an attempt is made to match it in the very large disc-stored general dictionary. Here, words are represented as strings, but stored in blocks in direct access files. The words are sorted within the blocks, and blocks are indexed by the first and last word in each block.
The correct block for an input word is found by binary search, and the word found by systematic search of the block itself. If the word is found, it is taken to be correct (and is temporarily moved to the document-specific dictionary). If the word is not found it is taken to be a misspelling (Peterson, 1980b).

Two problems that have not yet been considered are:
1. What is a potential word?
2. How does a spelling detection and correction program cope with affixes?

3.3.5. What is a word?

In designing programs, decisions must be made about what is to be included as a misspelling, and what is not. How should the program deal with proper names, abbreviations and split or concatenated words? Should such characters as hyphens, apostrophes and digits be dealt with in the same way as other characters? Also, how should the problem of case be dealt with: should upper and lower case letters be considered as equivalent, or as distinct?

[1] Peterson's terminology: not quite a tree.

Peterson (1980a) addresses the problem of definition of potential words (tokens). He defines a word as:

    "a sequence of letters, separated by delimiters"

He includes blanks and special characters, such as commas, full stops and colons, as delimiters, but points out the difficulty of interpreting numbers, hyphens and apostrophes. He suggests that the inclusion of numbers as letters should be left as an option for the user. He also recommends treating hyphens as delimiters, since in normal use they create compound words, but not permitting 'end-of-line' hyphenation of words.
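This basic definition can be sketched as a tokeniser that treats anything other than a letter - including, as recommended above, hyphens - as a delimiter (a minimal illustration; the treatment of digits and apostrophes is deliberately left out):

```python
import re

def tokenise(text):
    """Split text into tokens: maximal sequences of letters.

    Everything else (blanks, punctuation, digits, hyphens) acts as a
    delimiter, following the basic 'sequence of letters' definition.
    """
    return re.findall(r"[A-Za-z]+", text)

tokenise("A well-known example, isn't it?")
# -> ['A', 'well', 'known', 'example', 'isn', 't', 'it']
```

The example output shows why the remaining special characters need more careful rules: splitting "isn't" at the apostrophe produces spurious tokens.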
Apostrophes are considered as letters when they are used as indications of omitted letters (I'd, won't), and as delimiters at the beginning or end of a token.

As Peterson states, the case problem is more complex, and more thought is required when deciding how to deal with it in designing spelling correctors. Some proper names may be partly or entirely spelt in upper case, and might be considered misspelt if in lower case. When a word is capitalised at the start of a sentence or speech we will still want it to match lower-case versions elsewhere in the text: in this situation we may wish to ignore case altogether. However, we do not want to permit arbitrary (perhaps mistyped) capitalisation occurring in the middle of words. In programming languages, the case of letters may carry more information: special decisions will have to be made to account for this.

Two words may be concatenated to make one, or a single word split into two. The detection algorithm may need to indicate this as an error. The SOLO program corrector (Lewis, 1980) looks for spaces missing in the input word: it matches the target command to the leftmost characters of the input word and then looks for a match for the rest of the character string. Pollock and Zamora's text correction program incorporates a special routine that checks for concatenation of 'function' words (i.e. non-content words in text, such as 'of', 'their', 'to'). Most programs do not specifically look for concatenated or split words.

3.3.6. Dealing with affixes

One method of reducing dictionary size is to remove all words with affixes (prefixes and suffixes) and store instead the root word and a list of affixes. Peterson (1980a) discusses affix removal. He suggests two approaches. The simplest involves examining the input token for affixes, matched against a list of common affixes. These are then removed from the input token and the stripped root matched with the dictionary. If a match is found the word is taken to be correct. Misspellings where correctly spelt affixes are incorrectly attached to correctly spelt words, e.g. carryed, babys, would not be detected. Peterson's solution is to flag each word in the dictionary with its legal affixes. Where affixes are stripped and a root word found, a flag is checked to see whether the particular affix is legal for that root. The system of flags and interpretation that Peterson describes (Peterson, 1980b) is that used in the DEC-10 SPELL program. Galli and Yamada (1968) claim that suffix analysis is not cost-effective. Durham, Lamb and Saxe (Durham et al., 1983) suggest that affix analysis is not necessary in user-interface applications. Atwell, however, uses suffixes associated with particular sets of tags to provide information about text words not found in the dictionary (Atwell, 1983).

3.4. Error Correction

Having detected an error, by whatever means, the task of the correction program is to offer a correction, or possible corrections, for the error.
Methods of error correction are mostly based on some form of pattern matching of a word against a dictionary or list. Exceptions are those programs working at the sentence or phrase level, using syntax or semantics (Fass, 1983; Atwell, 1983), the difficulties of which have already been discussed.

Pattern matching in error correction is generally carried out in one of two ways. Either some representation of the misspelling is compared with a number of words in the dictionary, and the closest match (by some criterion) is taken to be the correction; or the misspelling is transformed in some way to produce a new word, and a check made to see if this new word is in the dictionary. Pollock refers to the former as a relative strategy and the latter as an absolute strategy (Pollock, 1982).

3.4.1. Previous errors and dictionaries of common misspellings

A simple method of error correction is to check whether the misspelling is in a list of previous spelling errors made. The Interlisp correction program (Teitelman, 1978) uses this as a 'first-pass' method: if it fails to produce a correction it then tries other methods. A similar method is that of comparing the misspelling with those in a dictionary of common misspellings. This may also be used as a 'first-pass' method (Pollock and Zamora, 1984), or, in the case of the SPELLTELL correction program (Macdonald, Frase et al., 1982), the dictionary of 800 common misspellings is the main dictionary, against which 'relative' matches are made.
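A first-pass look-up over a table of common misspellings can be sketched as follows (a minimal illustration; the table entries and the `fallback` hook are invented):

```python
# A (tiny, invented) table of common misspellings and their corrections.
COMMON_MISSPELLINGS = {
    "recieve": "receive",
    "teh": "the",
    "seperate": "separate",
}

def correct(misspelling, fallback=lambda w: None):
    """First pass: look the error up in the table of common misspellings.
    If it is not there, hand over to some other correction method."""
    if misspelling in COMMON_MISSPELLINGS:
        return COMMON_MISSPELLINGS[misspelling]
    return fallback(misspelling)

correct("teh")        # -> 'the'
correct("mispeling")  # -> None (no entry, and the default fallback finds nothing)
```

The appeal of the method is that it is cheap and catches the most frequent errors exactly; its obvious limit is that it can only correct errors that have been seen before.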
Techniques using relative strategies include:
- finding matching substrings in the misspelling and dictionary words;
- representing misspellings as abbreviations and matching them against a dictionary of abbreviations;
- assigning scores to matching features of two strings to give a measure of closeness;
- using a similarity key to represent features of a misspelling, as a code, and finding similarly coded dictionary words.

3.4.2. Matching substrings

To correct an error using the SPELLTELL program (Macdonald, Frase et al., 1982), the user types in the part of the word that she knows to be correct. The program displays all the words in its dictionary in which the string occurs. The user then has to select the correct word from those offered: every string that matches is offered. For a user who is a reasonably good speller, and who is only unsure of the spelling of some small part of a word, this may be useful. However, should any part of this 'correct string' be incorrect, the match will fail and the correction will not be found.

Alberga (1967) discusses an algorithm proposed by Baskin and Selfridge. The longest matching substrings of two strings are found and paired, then the longest matching substrings of the remaining elements are found and paired, and so on until no further matches can be found. The string with the greatest percentage matched to the misspelling would be the correction.

The SOLO spelling corrector (Lewis, 1980) also uses a 'matching substrings' algorithm.
An initial check is made to see that there is not a difference of two letters or more in the lengths of the strings to be compared. The strings are then matched, letter by letter, left to right, then right to left. The number of matches is counted, and this number divided by the length of the larger string. If the resulting value is greater than 70% then a correction is assumed and reported to the user. If, however, more than one possible correction is found, no match is assumed.

3.4.3. Abbreviations

Davidson's program for finding names in an airline reservation system uses four-letter abbreviations to represent surnames. Names are abbreviated by:
1. deleting all vowels;
2. deleting all occurrences of H, W and Y (except where they are the initial letter);
3. deleting all but one of each set of contiguously replicated letters.

Blanks are used to fill the right end of the abbreviation if necessary. A comparison is made, with the list of abbreviated names, for an exact match. If no exact match is found, scores are assigned to each pair of possible matches, where the score is one for each letter of the largest matching substring in the pair of abbreviations. If there are multiple matches the user can be asked to select manually.

Damerau (1964) describes a program by Blair that also systematically abbreviates dictionary items and words to be recognised. The strings to be compared are each converted to a four-letter abbreviation.
A value is assigned to each letter in the string, proportional to the desirability of deleting that letter. The 'desirability' weighting is determined by the letter itself and its position in the word, and is an approximation of the probability of that letter, in that position, being an error. Those letters with the highest 'probability of error' are systematically deleted from the string until four letters remain. The four-letter abbreviations are then compared. If the abbreviation for an error matches more than one four-letter abbreviation from the dictionary then the process is repeated using longer abbreviations.

3.4.4. Measures of string similarity

Correction algorithms proposed by Faulk (1967) and Morrison are described briefly in Alberga (1967). Faulk defines three measures of similarity between strings of arbitrary elements (sentences of words in this case). The measures are:
1. material similarity - the extent to which strings are composed of matched elements;
2. ordinal similarity - the extent to which the matched elements are arranged in the same order;
3. positional similarity - the extent to which the matched elements are located in corresponding positions.

Morrison, investigating the string matching problem in the context of C.A.I., proposed a number of simple approximations for matching strings:
(a) the number of elements matched until the first mismatch;
(b) the number of matched elements regardless of order;
(c) the number of matches in increasing order;
(d) the number of unmatched elements in each string.
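Two of Morrison's approximations, (a) and (b), can be sketched as follows (a minimal illustration over strings of characters; the function names are invented):

```python
from collections import Counter

def prefix_matches(s, t):
    """(a) The number of elements matched until the first mismatch."""
    count = 0
    for a, b in zip(s, t):
        if a != b:
            break
        count += 1
    return count

def unordered_matches(s, t):
    """(b) The number of matched elements regardless of order."""
    common = Counter(s) & Counter(t)   # multiset intersection
    return sum(common.values())

prefix_matches("grils", "girls")     # -> 1 ('g' matches, then 'r' vs 'i')
unordered_matches("grils", "girls")  # -> 5 (same letters, different order)
```

The example makes the trade-off visible: measure (a) is very sensitive to an early transposition, while measure (b) ignores order entirely.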
A measure of closeness is used by Teitelman's Interlisp spelling corrector to find the closest match from a list to the error token. The `closeness' is defined as being inversely proportional to the number of disagreements between two words and proportional to the length of the larger word. Characters in agreement are those that, when the strings are scanned left to right, are either the same or on the same teletype key, or lower case for upper case (or vice versa). Transposed letters, in conjunction with no other letter disagreements, and doubled letters are not counted as disagreements. A criterion is set for closeness and if no word is sufficiently close an error is recorded; otherwise a correction is automatically made (or suggested to the user if approval is required). The corrector is restricted to Interlisp and uses lists for checking possible errors, the list used depending upon the type of the word being checked. Corrections of words selected as misspellings are moved to the front of the list so that, if an error is repeated, fewer comparisons will need to be made. Difficulty is encountered in setting a reasonable criterion of closeness for short words.

The Soundex system (Munnecke, 1980) uses a similarity key to represent names in a genealogical database. The representation of the names is based on pronunciation rather than spelling.
Each name is given a four-character Soundex code:
- the first letter of the name (the surname) is kept;
- all occurrences of vowels, and of w, h and y, are deleted from the remainder of the name;
- the rest of the letters are assigned to numbered groups:
  1 = b,f,p,v   2 = c,g,j,k,s,x,z   3 = d,t   4 = l   5 = m,n   6 = r
- if adjacent characters fall in the same numbered group, all but the first of that group are deleted;
- the name is then coded as the first letter followed by digits representing the remaining letters, in order;
- only the first 3 digits are kept, and if there are fewer than 3, zeros are added.
All matching names are found, from the coded list of names, and offered to the user.

In the Plato spelling correction program (Tenczar and Golden, 1962) features of words such as length, first character, and letter order are represented by a bit code. If the coded input word is compared with the dictionary and no match is found, then the word is taken to be either a misspelling or another word. The word with the least number of bits in conflict with the misspelling is found, and the number of bits in conflict is noted. If this number is less than some set value, k, then the word is assumed to be a misspelling. If, however, the misspelling differs, by more than k bits, from the closest dictionary word, it is assumed to be a different word, i.e. it is not in the dictionary.

3.4.5. Editing rules

Damerau (Damerau, 1964) indicates that 80% of all spelling errors are the result of:
1. transposition of 2 letters;
2. one letter extra;
3. one letter missing;
4. one letter wrong.
These rules form the basis of a number of spelling correction programs.
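The candidate-generation idea shared by these programs can be sketched as follows: generate every string reachable from the misspelling by reversing one of Damerau's four operations, and keep those that appear in the dictionary. This is an illustrative sketch, not the code of any of the programs cited; the dictionary here is a toy word set.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def single_edit_candidates(word, dictionary):
    """All dictionary words reachable from `word` by one Damerau operation."""
    candidates = set()
    # one letter extra: delete each letter in turn (N variants)
    for i in range(len(word)):
        candidates.add(word[:i] + word[i+1:])
    # transposition: swap each adjacent pair (N-1 variants)
    for i in range(len(word) - 1):
        candidates.add(word[:i] + word[i+1] + word[i] + word[i+2:])
    # one letter missing: insert each letter at each position (26(N+1) variants)
    for i in range(len(word) + 1):
        for c in ALPHABET:
            candidates.add(word[:i] + c + word[i:])
    # one letter wrong: substitute each letter in turn (25N variants)
    for i in range(len(word)):
        for c in ALPHABET:
            if c != word[i]:
                candidates.add(word[:i] + c + word[i+1:])
    return sorted(candidates & dictionary)
```

With `dictionary = {"tall", "salt", "sell", "hall", "ball", "call", "sally", "stall"}`, the call `single_edit_candidates("sall", dictionary)` returns all eight words, each one edit away from the misspelling.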
The basic algorithm that they use is to construct a list of all the words (taken from the dictionary) which can produce the error by application of one of these rules. The list of candidate corrections is produced by multiple searches of the dictionary:
- testing for one extra letter in the misspelling involves deleting each letter from it, in turn, and searching for the new word in the dictionary: a maximum of N operations for a word of length N;
- to test for transposition errors each adjacent pair of letters in the misspelling is swopped, and each new word produced is searched for: a maximum of N-1 operations;
- to test for one letter missing, or one letter wrong, involves a greater amount of searching:
  * for a letter omitted, a maximum of 26(N+1) possible insertions (26 possible characters, each of which may be inserted in any of N+1 places);
  * for a letter wrong, a maximum of 25N possible changes.
Any matches found are added to the list of candidates.

Peterson (Peterson, 1980a) suggests two strategies:
1. a `match-any' character is inserted in all positions in the word (to test for one letter missing) and substituted for all letters in turn (to test for a wrong letter), and each time a match is searched for;
2. each potential character is substituted (or inserted) in each character position, and the search carried out normally.

The basic algorithm, using one of the two strategies suggested, has been used by the DEC-10 SPELL program (the first spelling corrector written as an applications program, 1971); by Durham, Lamb and Saxe (Durham et al, 1983); by Mor & Fraenkel (Mor and Fraenkel, 1982); and by Peterson (Peterson, 1980b).

3.4.6.
Combined approaches

To further constrain search, digram and trigram frequencies can be used in conjunction with these edit rules. Omond (Omond, 1977) uses digram frequencies to determine the order in which to apply the edit rules, and for which characters. Information about the frequency of occurrence of each letter pair is used to select the letter most likely in error in the misspelling. The letter is replaced by the one most likely to occur in that position. The dictionary is searched for the resulting word. If a match is found it is accepted as the correction. Otherwise, the next most likely substitution is made. All substitutions are tried, then deletions and insertions, in order of letter likelihood using digram frequencies, until a successful match is made. However, the first successful match may not be the correct one.

Zamora, Pollock & Zamora (Zamora, Pollock and Zamora, 1981) also use digram and trigram frequencies for spelling error detection, but conclude that they are insufficiently precise to be practical. Digrams are also applied by Cornew (Cornew, 1968) to single substitution errors.

Pollock & Zamora (Pollock and Zamora, 1984) also use the edit operations in their spelling corrector. A list of `most similar words' is produced by comparison of the misspelling and the dictionary word, using two similarity keys. Pollock (Pollock, 1982) defines similarity keys as:

"techniques for generating compact representations of strings that preserve their fundamental properties but not minor details." p.
284

The selection of the actual correction is then determined by `error reversal', a measure of `distance' between two strings. The two keys that they use are, firstly, the skeleton key and, secondly, the omission key. The skeleton key is coded as the first letter of the word, followed by the remaining unique consonants in order, then the unique vowels in order. The omission key consists of all the unique consonants in the word, ordered by those `least likely to be omitted', with the least likely first (see Pollock & Zamora, 1984, for the frequency ordering); all unique vowels in the word are appended in order.

The list of `similar' words found is taken, and an attempt is made to transform the misspelling into one of these dictionary words by reversing one of the 4 basic edit operations or rules. If the attempt is successful the word is taken as a plausible correction. If there is more than one plausible correction the particular edit operation involved in each is considered. The focus is on `the most likely reconstruction, given the error' rather than `the most commonly applied edit operation'. Precedence is given to the operations in the following order: omission and transposition (equally); insertion; substitution. If more than one plausible correction still exists, then database frequencies of these corrections are considered and the most likely correction selected (Galli and Yamada, 1968; Morgan, 1970).

3.4.7.
String to string repair

In the programs surveyed above, the four basic spelling error operations are only applied to misspellings containing a single error: transformations are created by a single application of one operation. This transformation does not, however, indicate a unique correction for each misspelling: a number of possible corrections can be produced from the same misspelling by a single transformation. For example, the misspelling `sall' can be transformed, by a single operation, into any of the following:

ball call fall sail sell saul sale salt tall sally stall all hall wall

Referring to these edit operations, Pollock (Pollock and Zamora, 1984) states that:

"their usefulness lies in their correspondence to real world error-creating operations and their ability to interconvert any pair of strings." p. 361

By successive application of two or more error operations (or edit operations) any string can be transformed into any other string. For any pair of error and correction, there will be a variety of `sets of operations' that can be used to correct one to the other. When testing to see which words can be interconverted in one operation, the maximum number of edits to be tried will vary, depending on the operation. To test for a transposition error, each pair of letters is swopped in turn, requiring N-1 applications of the transposition operator (see previous section). The maximum number of operations required to test for transposition or insertion errors (i.e. by deleting each letter of the word, in turn, and checking for a match) is small and dependent upon word length.
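The test for interconversion in one operation can also be framed pairwise, checking two given strings directly rather than enumerating variants. A sketch of this idea (my own illustration, not code from any of the surveyed programs):

```python
def one_operation_apart(a, b):
    """True if a single Damerau operation (transposition, insertion,
    deletion or substitution of one letter) converts string a into b."""
    if a == b:
        return False
    la, lb = len(a), len(b)
    if abs(la - lb) > 1:
        return False
    if la == lb:
        # substitution: exactly one position differs, or
        # transposition: exactly two adjacent positions are swapped
        diffs = [i for i in range(la) if a[i] != b[i]]
        if len(diffs) == 1:
            return True
        return (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and a[diffs[0]] == b[diffs[1]]
                and a[diffs[1]] == b[diffs[0]])
    # insertion/deletion: the shorter string must equal the longer
    # string with one letter removed
    s, t = (a, b) if la < lb else (b, a)
    for i in range(len(t)):
        if t[:i] + t[i+1:] == s:
            return True
    return False
```

On the example above, `one_operation_apart("sall", "tall")` and `one_operation_apart("sall", "sally")` both hold, while `one_operation_apart("sall", "seal")` does not.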
For substitution and omission errors, a far greater maximum number of operations is required to test for possible corrections. Greater time is taken by string-distance measurement techniques involving substitution and omission errors, especially where there are a large number of alternatives to be examined.

When more than one error is involved, the maximum number of operations necessary to test for all possible corrections increases sharply. The number of plausible corrections found also increases dramatically, and the likelihood of the `correct' one being found decreases. Pollock & Zamora conclude the following from this:

"Correcting multiple errors through these error reversal techniques produces a plethora of possible corrections with low plausibilities - a poor prognosis unless the parent vocabulary is extremely small. This suggests that the error correction technique should not be too powerful or it will generate too many false corrections." (Pollock and Zamora, 1984), p. 361

However, what Pollock & Zamora do not consider is the possibility of controlling the application of the edit operations in such a way that `more likely error' edits are considered first. This is done, using digram frequency information, for single errors only, by Omond (Omond, 1977). The possibility of weighting the operators when applying them to particular letters has also been considered (Morgan, 1970).

There exists a substantial literature on `string-to-string' repair theory, in which repair between strings (i.e. transformations involving edit operations) involving multiple operations is considered.
Wagner & Fischer (Wagner and Fischer, 1974) define a general notion of "distance" between two strings and present an algorithm for computing this distance. They state its relevance to spelling correction. If a cost is assigned to each edit operation, then the cost of a sequence of operations is the sum of the costs of each operation in the sequence. There may be a number of different sequences, each transforming A to B, each having different costs. The edit distance from string A to string B is defined as the cost of the sequence with minimum cost. They suggest that the cost functions could be set to depend on the particular characters affected by an edit operation: these could be used in spelling correction.

Following Wagner & Fischer, Backhouse (Backhouse, 1979) presents a mathematical model of `error repair', where a repair is the sequence of operations used to transform one string into another: to correct an error. He includes transpositions in his edit operations, whereas Wagner & Fischer do not. He states that the philosophy behind his approach is to try to model the way the programmer would correct his own syntax errors. He also considers, in chapter 5 of his book (Backhouse, 1979), the repair of spelling errors, regular languages and context-free languages. His approach is "to consider all possible ways of repairing an input string and to define in each case the concept of a `best' repair".

The best repair is a sequence of edit operations that will transform one string to another at minimum cost.
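With unit costs for insertion, deletion and substitution, the Wagner & Fischer distance reduces to the familiar dynamic program sketched below (a minimal illustration; their formulation allows arbitrary per-character cost functions, which this fixes at 1):

```python
def edit_distance(a, b):
    """Minimum total cost of insertions, deletions and substitutions
    transforming string a into string b, with every operation costing 1."""
    # d[i][j] = distance between the prefixes a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i          # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i-1] == b[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,       # deletion
                          d[i][j-1] + 1,       # insertion
                          d[i-1][j-1] + sub)   # substitution or match
    return d[len(a)][len(b)]
```

Backhouse's addition of the transposition operation would require one further term in the minimisation, comparing against `d[i-2][j-2]` when adjacent characters are swapped.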
The transformations that may be made between two strings (i.e. the edit operations that can be applied) are represented as a graph. If each operation is assigned a cost, the `least-cost path' through the graph can be found, and will be the best repair, or the minimum-cost repair. Detail of his algorithm is given in chapter 7. It appears, then, that there are methods that may be used for the application of multiple operations, despite Pollock's misgivings.

3.5. Summary

Computer programs that detect and correct spelling errors are available. They can be interactive or they may provide automatic detection and correction of errors.

Detection and correction of words misspelt as other words is difficult: research on the use of syntax and semantics has not progressed far enough to provide a solution to this problem. Detection methods currently used include digram and trigram analysis and dictionary look-up. The latter is likely to be of most use here.

Dictionary size (or length of the word token list) affects the speed and efficiency of error detection and correction. Larger dictionaries are more likely to include the intended word, but are also more likely to provide incorrect matches to misspellings or to provide other words as corrections. Shorter dictionaries are usually quicker to search but are less likely to contain the intended word.

Words in the dictionary may be represented as strings of letters (including abbreviations) or can be coded in other ways (for instance, hash coding).
For the purposes of error detection, a `word' must be defined: digits, apostrophes, hyphens and other special characters may or may not be seen as part of a word. Instead of representing complete words in a dictionary, root words and their affixes could be stored. This would permit a larger vocabulary to be represented in a smaller dictionary, but requires reliable morphological analysis and reconstruction of words.

Methods of error correction include: looking at past misspellings; using measures of string similarity; editing rules; and string-to-string repair. Some approaches to spelling error correction use combinations of more than one of these methods. Of the methods, string-to-string repair is of greatest potential interest. The relative usefulness of these correction methods, in the context of their use with children with spelling disabilities, is discussed in chapter 5.