;[6{   131 CHAPTER 7  DETAIL OF THE EDITCOST AND PHONCODE PROGRAMS  7.1. Introduction [48{ [6{In  this  chapter  the  editcost  and  phoncode  spelling  correction  programs [48{ [6{  are described in detail. [48{ [6{  7.2. Calculating the minimum cost repair: the editcost program [48{ [6{  7.2.1. General overview [48{ [6{A shortlist of dictionary words is selected from the session dictionary as [48{ [6{  possible  candidates  for  the  misspelling.  If  all  words  in  the  dictionary [48{ [6{  were  considered  then  a  very  large  number  of  comparisons  using  the [48{ [6{  costing  algorithm  would  have  to  be  made.  The  object  of  shortlisting  is  to [48{ [6{  reduce  this  number  whilst  retaining  the  correction  of  the  misspelling  in [48{ [6{  the shortlist.  The  selection  depends  upon  the first  two  characters  of  the [48{ [6{  misspelling and its length. [48{ [6{The  misspelling  (termed  the  'inpw')  is  compared  with  each  dictionary [48{ [6{  word (termed  the 'dictw')  on the shortlist.  The  cost of  editing the  inpw [48{ [6{  to  match  the  dictw  is  calculated  (see  section  7.2.5).  Those  dictws  with [48{ [6{  lowest  edit  cost  are  saved.  The  four  dictws  with  lowest  cost  are [48{ [6{  offered to  the  user as  options  for  the correction  of  the  misspelling  (the [48{ [6{  inpw). [6{   132 [48{ [6{  7.2.2. Shortlisting candidates from the dictionary: 'shortlist' [48{ [6{  From  the  misspellings  made  by  the  observed  group  (see  chapter  4)  and [48{ [6{  the group in study 1 (see chapter 6), the following were noted:  - the  initial  two  letters  of  the  word  and  misspelling  and  the  frequency  of  each,  in  cases  where  the  first  letter  of  a  word  was misspelt; - the  range  of  differences  in  length  between  misspellings  and  corrections. Initial Characters [48{ [6{There  were  'first  letter'  errors  found  in  approximately  8%  of  misspell- [48{ [6{  ings  made  by  children  in  the  study  1.  For  these  cases,  therefore,  it [48{ [6{  could  not  be  assumed  that  the  first  letter  was  spelt  correctly.  First [48{ [6{  letter  confusions  found  were  used  to  construct  a  table  of  'alternative [48{ [6{  first  letters'.  For  any  inpw  only  those  words  from  the  dictionary  with [48{ [6{  the  same  initial  letter,  or  an  alternative  found  from  the  table,  are [48{ [6{  considered as candidate misspellings. [48{ [6{  In  some  cases  the  'alternative  first  letter'  is  extended  to  consider  the [48{ [6{  first two characters  of  the inpw  and dictw.  In  doing so  the  alternatives [48{ [6{  for  the  inpw  are  made  more  specific  and  the  shortlist  further  reduced [48{ [6{  (though the risk  of omission of  the correction  is increased).  For  example, [48{ [6{  if the  misspelling  is 'rite'  for  'write' then  all the  words in  the  dictionary [48{ [6{  with  initial  letters  'r'  or  'wr'  would  be  included  in  the  shortlist  whilst [48{ [6{  those others beginning with 'w' would not. [48{ [6{  The  dictionary  is  indexed  by  'first  character',  where  this  may  actually [48{ [6{  be  indicated  by  the  initial  two  characters  of  the  word,  or  by  the  first [48{ [6{  character only. In cases where two letters are used they are represented [48{ [6{  by  a  single  character  in  the  range  A-Z.  These  'special  cases'  of  first [48{ [6{  character are shown in figure 7-1 [48{ [6{  The  first  two  characters  of  the  inpw  are  read.  If  they  match  a [48{ [6{  special case  the  first  characters  are  then  represented  by  the  appropriate [48{ [6{  letter;  otherwise  they  are  represented  by  the  first  letter  of  inpw. [48{ [6{  Alternatives for the first character are found by  table look-up. They  are [48{ [6{  given in figure 7-1. All words in the section of the dictionary indexed by [48{ [6{  the  first  character,  and  by  the  alternatives  for  it,  are  included  in  the [6{   133 [48{ [6{  initial represented alternatives letter(s) by a a auoei b b bdp c c ckgsqCK ch C Ccjsk d d db e e eaiy f f fPTv g g gcjG gn G GKgn h h hHWo ho H Hho i i iuea j j jCg k k kKcq kn K KGnkc l l lL m m mn n n nGKNm o o oaHWO p p pbPN pn N Npn ph P Ppfv ps S Sps q q qkc r r rR s s sSc t t tT th T TfvtP u u uoy v v vTf w w wWO wo O Oow wr R Rwr wh W WwhOR x x xe y y yui z z zs  Figure 7-1: First character alternatives for shortlisting  shortlist  (subject  to  length  constraints).  For  example,  if  the  inpw  were [48{ [6{  'foto'  the  first  character  would  be  'f'  and  the  alternatives  for  it  would [48{ [6{  be  P(=ph),  T(=th)  and  v;  so  all  words  in  the  dictionary  with  initial  letters [48{ [6{  'ph', 'th', 'v' and 'f' would be considered for the shortlist of candidates.  Length constraints [48{ [6{Any  dictionary  word  considered  to  be  too  long  or  too  short  to  be  the [48{ [6{  correction  of  the  inpw  is  omitted  from  the  alternative  words  on  the [6{   134 [48{ [6{  shortlist.  The  range  of  length  permitted  was  determined  by  comparison [48{ [6{  of misspelling  and  corrections  lengths for  the  observed  group and  study  1 [48{ [6{  group. The dictw is shortlisted if:  - the  difference  between  the  length  of  the  inpw  (=inl)  and  the  length  of  the  dictw  (=dwl)  is  less  than  4,  when  both  inl  and  dwl are less than 10, or  - the  difference  between  inl  and  dwl  is  less  than  or  equal  to  (dwl/3)+1, when inl or dwl is 10 or greater. [48{ [6{  Those candidates on  the shortlist, to be  passed to the costing  algorithm [48{ [6{  are,  therefore,  those  words  from  the  session  dictionary  satisfying  two [48{ [6{  conditions: 1. with  the  same  initial  letter,  or  with  alternative  initial  letters,  as that of the input word;  2. with length within the range specified above. [48{ [6{  7.2.3. The editcost algorithm [48{ [6{  At  the  format  level  of  classification,  errors  are  described  in  terms  of [48{ [6{  the  editing  operation  that  is  applied  to  the  misspelling  to  get  the [48{ [6{  correct word (see  chapter  4).  For example,  to correct  'kat' to  'cat',  'k' [48{ [6{  must be changed to  'c': the error is  reversed to produce the  correction [48{ [6{  by the application of the edit operation 'change' to 'k' to produce 'c'. [48{ [6{  The  basic  editing  correction  method  is  used  in  a  number  of  spelling [48{ [6{  correction programs,  as  described in  chapter 3.  Whilst  in most  cases  this [48{ [6{  method  has  been  used  for  single  error  misspellings,  Backhouse  describes [48{ [6{  how  it  would  be  used  for  correction  of  multiple  error  misspellings:  for [48{ [6{  detail of his  algorithm and  Pascal implementation  see chapter  5  (Backhouse, [48{ [6{  1979). His method will be described here, in less detail. [48{ [6{  The  task  is  to  find  the  best  repair  for  transforming  a  string  E  (the [48{ [6{  error) into string C (the correction). The edit  operations that are used  to [48{ [6{  do the transformation are: - insert a character - delete a character - change a character [6{   135 [48{ [6{1[48{ [6{  - transpose two adjacent characters  Additionally,  leaving  a  letter  unchanged  can  be  considered  an  operation [48{ [6{  with no effect. [48{ [6{The edit operations will be indicated as follows:  b -> 0 delete 'b' from the error string 0 -> d insert 'd' in the error string b -> d change 'b' to 'd' ei -> ie transpose 'ei' to 'ie' a -> a leave 'a' unchanged There are  various ways  in which any  misspelling can  be edited  to form  the [48{ [6{  correction.  For  example,  some  of  the  ways  to  transform  'ricev'  to [48{ [6{  'receive' are: 1. r -> r 2. r -> r 3. r -> r 4. r -> 0 i -> e i -> 0 i -> e i -> 0 c -> c 0 -> e c -> c c -> 0 e -> e c -> c e -> e e -> 0 0 -> i 0 -> e v -> i v -> 0 v -> v 0 -> i 0 -> v 0 -> r 0 -> e ev -> ve 0 -> e 0 -> e 0 -> c 0 -> e 0 -> i 0 -> v 0 -> e In  the  fourth  example  all  the  characters  in  the  E  string  are  deleted  and [48{ [6{  all those in the C string are inserted. [48{ [6{  These  different  transformations  may  be  represented  as  a  graph,  where [48{ [6{  each node corresponds to a position in E string and a position in C string, [48{ [6{  and each arc corresponds to an edit operation. [48{ [6{  The  graph  in  figure  7-2  represents  the  tranformations  1.  to  4.  above. [48{ [6{  It can be seen that  there are many possible  paths through the graph.  In [48{ [6{  fact,  any  one  word  can  be  transformed  into  any  other  word  by  deletion [48{ [6{  of  all  the  error  string  characters  and  insertion  of  all  the  correction [48{ [6{  string  characters.  If  we  assign  costs  to  each  edit  operation  then  the [48{ [6{    [48{ [6{1[48{ [6{Transposition  can  only  be  made  between  adjacent  letters:  repeated  transpositions  are  not permitted. [6{   136 [48{ [6{C string r e c e i v e A. . . . . . . . E r . . . . . . . . s i t . . . . . . . . r i c n . . . . . . . . g e . . . . . . . . v . . . . . . . .B . . . . . . . . . . . ok change delete insert transpose  Figure 7-2: Graph representing edit operation transformations  cost  of  any  path  through  the  graph  (i.e.  the  cost  of  any  particular [48{ [6{  transformation  or  sequence  of  edit  operations)  is  the  sum  of  all  opera- [48{ [6{  tions  on  that  path.  The  'best  repair'  is  taken  to  be  the  path  through [48{ [6{  the graph, from  A to  B, with  least total  cost. For  example, if  unit  cost [48{ [6{  were assigned  to all  edit operations,  and zero  cost to  ok(no change),  then [48{ [6{  the  least  cost  path  in  the  above  example  would  be  path  1.  with  a  cost [48{ [6{  of 3 units. To calculate the least cost path [48{ [6{ Consider  two  strings  E  and  C,  with  lengths  m  and  n  respectively.  The [48{ [6{  graph  representing  the  transformations  between  these  strings  will  have [48{ [6{  dimensions  (0  to  m)  by  (0  to  n)  where  the  first  letters  of  each  string [48{ [6{  E(1) and C(1) are associated with the node (1,1) (see figure 7-3). [48{ [6{  The least  cost  path  from  the  node  (0,0)  to  any  point  in  the  graph  can [48{ [6{  be calculated.  For points (0,1),  (0,2)...(0,n) the least  cost will be the  cost [48{ [6{  of  inserting  letters  C(1),  C(2),  C(3)...C(n).  The  least  cost  path  from  (0,0) [48{ [6{  to (0,1) will be: [6{   137 [48{ [6{  ----> j C string | | r e c e i v e | V (0,0) (0,1) (0,2) (0,j) (0,n) i . . . . . . . . r (1,0) (1,1) (1,2) (1,j) (1,n) E . . . . . . . . i s t . . . . . . . . r c i (i,0) (i,1) (i,2) (i,j) (i,n) n . . . . . . . . g e . . . . . . . . v (m,0) (m,1) (m,2) (m,j) (m,n) . . . . . . . . Figure 7-3: Costgraph showing the node labelling  cost(ins(C(1))) [48{ [6{which  is  the  cost  of  inserting  the  character  that  is  in  position  1  in  the [48{ [6{  C string (inserted in the E string). [48{ [6{  Similarly the least cost path to (0,2) will be: [48{ [6{  cost(ins(C(1))) + cost(ins(C(2))) [48{ [6{  This  will  be  referred  to  as  the  minimum  cost  path  to  the  node  (0,2),  or [48{ [6{  mincost(0,2). Thus: [48{ [6{mincost(0,1) = cost(ins(C(1)))  mincost(0,2) = cost(ins(C(1))) + cost(ins(C(2))) = mincost(0,1) + cost(ins(C(2))) [48{ [6{  So for any node (0,j) in the path (0,0) to (0,n): [48{ [6{  mincost(0,j) = mincost(0,j-1) + cost(ins(C(j))) [48{ [6{  For any node (i,0) in the path (0,0) to (m,0): [48{ [6{  mincost(i,0) = mincost(i-1,0) + cost(del(E(i))) [6{   138 [48{ [6{  where  cost(del(E(i)))  is  the  cost  of  deleting  the  ith  character  of  the  E [48{ [6{  string, the error string. [48{ [6{Considering the node (1,1) there are four possible paths to this node: [48{ [6{  1. delete(E(1)) and insert(C(1))  insert(C(1)) and delete(E(1))  change E(1) to C(1)  if E(1) and C(1) are the same leave them unchanged [48{ [6{  The respective costs for each of these will be: [48{ [6{  (i) mincost(0,1) + cost(del(E(1)))  mincost(1,0) + cost(ins(C(1)))  mincost(0,0) + cost(change(E(1),C(1)))  mincost(0,0) + cost(ok(E(1),C(1))) [48{ [6{  The mincost(1,1) will be the mininum of these four costs. [48{ [6{  The  least  cost  for  any  node  (1,j)  in  the  path  (1,0)  to  (1,n)  is, [48{ [6{  therefore: [48{ [6{mincost(1,j) = minimum of (mincost(0,j) + cost(del(E(1))), mincost(1,j-1) + cost(ins(C(j))), mincost(0,j-1) + cost(change(E(1),C(j))), mincost(0,j-1) + cost(ok(E(1),C(j)))). [48{ [6{  and similarly for any node (j,1) in the path (0,1) to (m,1) [48{ [6{  mincost(i,1) = minimum of (mincost(i-1,1) + cost(del(E(i))), mincost(i,0) + cost(ins(C(1))), mincost(i-1,0) + cost(change(E(i),C(1))), mincost(i-1,0) + cost(ok(E(i),C(1)))). [48{ [6{  For  other  nodes  in  the  graph,  where  i>=2  and  j>=2,  the  cost  of [48{ [6{  transposition must also be considered: [48{ [6{  mincost(i,j), where 1>=2 and j>=2, = minimum of (mincost(i-1,j) + cost(del(E(i))), [6{   139 [48{ [6{mincost(i,j-1) + cost(ins(C(j))), mincost(i-1,j-1) + cost(change(E(i),C(j))), mincost(i-1,j-1) + cost(ok(E(i),C(j))), mincost(i-2,j-2) + cost(transp(E(i-1),E(i)))). [48{ [6{  The  cost  of  the  least  cost  path  through  the  graph  (i.e.  the  minimum [48{ [6{  cost repair  of  the  two strings)  is  the minimum  cost  path from  node  (0,0) [48{ [6{  to node (m,n), mincost(m,n): [48{ [6{  mincost(m,n) = minimum of (mincost(m-1,n) + cost(del(E(m))), mincost(m,n-1) + cost(ins(C(n))), mincost(m-1,n-1) + cost(change(E(m),C(n))), mincost(m-1,n-1) + cost(ok(E(m),C(n))), mincost(m-2,n-2) + cost(transp(E(m-1),E(m)))). [48{ [6{  Backhouse  discusses  the  use  of  the  algorithm  for  spelling  correction, [48{ [6{  assigning  unit  cost  to  each  edit  (Backhouse,  1979).  Here,  however,  dif- [48{ [6{  ferent  costs  have  been  assigned  to  each  edit  operation,  and  also  varying [48{ [6{  costs  are  assigned  according  to  the  particular  characters  involved  in  the [48{ [6{  edit.  Weightings  are  assigned  according  to  the  particular  edit  operation [48{ [6{  and  character,  and  from  these  weightings  the  costs  are  calculated. [48{ [6{  (Details  of  the  calculations  and  weightings  are  given  below).  Thus  for  any [48{ [6{  error  the  minimum  cost  of  transforming  it  to  match  any  word  in  the [48{ [6{  dictionary  can  be  calculated.  For  any  word  input  by  the  user,  those [48{ [6{  words  from  the  shortlist  with  lowest  'minimum  cost  repair'  are  selected, [48{ [6{  and  offered  as  options  for  correction  to  the  user.  Three  examples  of [48{ [6{  the editcost program in use are given in figure 7-4. [48{ [6{  7.2.4. Relation of the children's errors [48{ [6{  A major  objective of  the first  study was  to  collect data  on the  errors [48{ [6{  made  by  children  from  the  Reading  Unit,  when  writing  compositions. [48{ [6{  Details  of  this  study  are  given  in  chapter  6.  Depending  upon  the [48{ [6{  frequency  of  particular  errors  made  by  the  children,  weightings  were [48{ [6{  assigned  to  edit  operations  in  the  editcost  program.  Those  errors  made [48{ [6{  most  frequently  were  assigned  highest  weighting,  and  therefore  lowest [48{ [6{  cost,  which  influenced  the  selection  of  the  minimum  cost  repair  of  a [48{ [6{  misspelling. [48{ [6{For  each  child,  across  sessions,  the  frequency  of  incorrect  words  was [6{   140 [48{ [6{w:check What word do you want to check? w:berayd Wait a minute while I check it I have nearly finished checking it It could be buried bed bad board w:check What word do you want to check? w:houes Wait a minute while I check it I have nearly finished checking it It could be house hours horses hour w:check What word do you want to check? w:wen Wait a minute while I check it It could be when went win we Figure 7-4: Example of the spelling corrector in use  noted,  together  with  the  misspelling  and  correction.  For  each  misspelling, [48{ [6{  all  errors  within  the  misspelt  word  were  treated  separately,  and  were [48{ [6{  classified  according  to  whether  they  involved  deletion,  insertion,  trans- [48{ [6{  position,  or  the  changing  of  a  character.  The  transformation  of  spelling [48{ [6{  to  correction  involving  the  minimum  number  of  edit  operations  was  chosen [48{ [6{  in  most  cases,  using  the  subjective  judgement  of  the  author  to  decide [48{ [6{  the most likely error in  cases where a number of interpretations might  be [48{ [6{  made. [6{   141 [48{ [6{  These  judgements  were  based  on  previous  observation  of  the  children's [48{ [6{  errors,  and  reported  'frequent  errors'  of  other  researchers.  For  example, [48{ [6{  a preferance in  certain cases for  phonetic over non-phonetic  substitutions. [48{ [6{  Three  examples  of  the  choices  made  will  be  given  here.  In  the  case  of [48{ [6{  'panes'  misspelt  as  'pains'  two  possible  transformations,  each  involving  two [48{ [6{  edit operations, are: 1. p -> p 2. p -> p a -> a a -> a i -> n i -> 0 n -> e n -> n s -> s 0 -> e s -> s The second is  chosen as the  interpretation: the confusion  of 'ai' for  'a_e' [48{ [6{  was  considered  a  more  likely  error  than  the  confusion  of  'i'  and  'n'  and [48{ [6{  of 'n' and 'e'. [48{ [6{Similarly, in 'wigule' for 'wiggled' the second example is chosen:  1. w -> w 2. w -> w i -> i i -> i g -> g g -> g u -> g 0 -> g l -> l u -> 0 e -> e l -> l 0 -> d e -> e 0 -> d In this  case it  seemed more  likely that  the  second 'g'  had been  ommitted [48{ [6{  and  that  the  'u'  was  part  of  the  'ule'  grapheme,  rather  than  that  'u' [48{ [6{  and  'g'  had  been  confused.  In  some  cases,  where  the  transformation  was [48{ [6{  not clear, both sets of errors were counted:  'lifet' for 'left' 1. l -> l 2. l -> l i -> e i -> 0 f -> f fe -> ef e -> 0 t -> t t -> t [48{ [6{A frequency count  was made of  each operation involved  in an error,  and [48{ [6{  the  particular  letter(s)  involved.  Totals  for  errors,  according  to  the  edit [48{ [6{  operation involved, are given in figure 7-5. [48{ [6{  It can be  seen that  the most  frequent errors  were those  requiring  the [48{ [6{  insert operation for correction i.e.  letters were omitted from the  correct [6{   142 [48{ [6{  1. INSERT. 159 (43%) 2. DELETE. 78 (21%) 3. TRANSPOSE. 24 (7%) 4. CHANGE. 109 (29%) 5. TOTAL. 370 Operations: insert - insertion of a letter into error to get correction. delete - deletion of a letter from error to give correction. change - change of letter in error into corresponding letter in correction, error/correction. transpose - transposition of letters AB in error to BA in correction.  Figure 7-5: Study 1: frequency of error types  word. The change operation accounted for the next highest proportion of [48{ [6{  errors, with deletions  having slightly  lower percentage  occurrence. A  much [48{ [6{  smaller  percentage  of  errors  made  involved  transpositions.  These  findings [48{ [6{  are  in  accord  with  those  of  Masters  (Masters,  1927)  who  also  found [48{ [6{  insertions most frequent and transpositions the least frequent. [48{ [6{  Frequencies of  errors  for  specific characters,  for  all  children,  are  given [48{ [6{  in  figure  7-6.  The  number  of  children  making  each  error  (maximum  7)  is [48{ [6{  also given.  It should be  noted that, as  well as  considering the  application [48{ [6{  of  edit  operations  to  individual  characters  (a  to  z),  a  number  of  special [48{ [6{  cases  were  also  considered  e.g.  each  'l'  of  'll'  (double  'l'  in  error  or  in [48{ [6{  correction);  silent  initial  'k'  or  'w'  ('know'  or  'write');  final  'e'  ('cane'); [48{ [6{  silent second 'h' ('when'); the 'c' or 'k' of 'ck' ('back'). [48{ [6{  The most frequent errors were those involving vowels, 'e' in particular is [48{ [6{  involved in some 100 of the 370 errors. [48{ [6{  7.2.5. Detail of the editcost program [48{ [6{  The word  to  be  checked,  the  inpw,  is  compared with  each  word  on  the [48{ [6{  shortlist,  dictw,  in  turn.  For  each  pair,  inpw  and  dictw,  the  string  to [48{ [6{  string  repair  graph  is  constructed  and  all  cost  paths  through  the  graph [48{ [6{  are calculated,  using  the  algorithm  described  above.  Costs  are  determined [48{ [6{  by  weightings  assigned  by  the  program  at  the  start  of  each  session  (see [48{ [6{  below for details  of the weightings  and calculations).  The five words  with [6{   143 [48{ [6{  Insert Delete Change Transpose letter freq. letter freq. letters freq. letters freq. e 20/7 a 15/6 e/a 8/5 ae/ea 5/4 fnl e 20/6 fnl e 14/6 o/a 8/4 hg/gh 2/2 i 14/6 i 10/4 u/o 6/4 se/es 3/1 a 12/5 e 7/5 e/u 4/4 de/ed 2/1 r 12/5 o 5/3 a/e 4/4 ed/de 1/1 w 5/5 u 4/3 a/o 4/3 el/le 1/1 dbl l 7/4 r 4/3 s/c 3/3 le/el 1/1 scd h 7/4 scd h 4/2 m/n 3/3 er/re 1/1 g 7/3 g 2/2 d/b 3/2 en/ne 1/1 u 5/3 t 2/1 c/k 2/2 ol/lo 1/1 l 4/3 y 1/1 i/e 2/2 ye/ey 1/1 d 4/3 dbl r 1/1 a/u 2/2 uo/ou 1/1 o 4/3 dbl p 1/1 i/y 2/2 nh/hn 1/1 y 3/3 dbl g 1/1 t/p 2/2 fe/ef 1/1 t 5/2 dbl o 1/1 u/e 2/2 oh/ho 1/1 n 5/2 h 1/1 h/i 2/2 th/ht 1/1 c 3/2 kn k 1/1 k/c 3/1 dbl s 2/2 n 1/1 f/v 3/1 f 2/2 l 1/1 o/u 3/1 dbl t 2/2 dbl n 1/1 c/g 2/1 ck c 2/2 d 1/1 g/c 2/1 s 2/2 s/k 2/1 dbl r 2/2 o/e 1/1 dbl e 1/1 u/w 1/1 dbl d 1/1 t/r 1/1 Frequency of errors: frequencies given as the number of errors made/number of the students making the error. (Only 25 most frequent given for insert and change) Abbreviations: fnl = final (fnl e in make) dbl = double (dbl l in bell) scd = second (scd h in when) kn = silent k (knit) ck = c or k of pair ck Figure 7-6: Study 1: frequency of types of spelling errors  minimum  repair  costs  are  always  saved. When  all  the  dictws  on  the [48{ [6{  shortlist have been  compared with  the inpw,  the four  with minimum  repair [48{ [6{  cost  are  offered  to  the  user.  Examples  of  the  costs  of  the  five  saved [48{ [6{  dictws for  each  of four  inpws  are given  in  figure 7-7,  together  with  the [48{ [6{  intended word. [48{ [6{Costs  were  assigned  according  to  the  particular  character  and  edit [48{ [6{  operation  involved.  As  well  as  considering  the  characters  'a'  to  'z',  in  a [48{ [6{  number  of  cases  the  position  of  the  character,  and  adjacent  characters, [48{ [6{  were  also  considered.  These  cases  are  referred  to  as  'special  cases',  and [48{ [6{  were  assigned  weightings  independent  of  those  assigned  to  the  same [6{   144 [48{ [6{  inpw=reck wreck Options were wreck cost= 0.362 rock cost= 0.883 reach cost= 1.211 rocks cost= 1.433 recall cost= 1.573 inpw=roack rock Options were rock cost= 0.466 rocks cost= 1.016 wreck cost= 1.319 reach cost= 1.733 road cost= 1.766 inpw=kuver cover Options were cover cost= 1.082 curve cost= 1.677 keeper cost= 1.733 corner cost= 2.293 keep cost= 2.316 inpw=bilt built Options were built cost= 0.407 belt cost= 0.550 bit cost= 0.764 belts cost= 1.100 build cost= 1.290  Figure 7-7: Example of candidates and mininum editcost  characters  in  other  contexts.  For  example,  'e'  in  the  final  position  in  a [48{ [6{  string  (error  or  correction)  had  a  different  weighting  from  'e'  in  other [48{ [6{  positions. A list of  special cases is  given in figure  7-8. When  considering [48{ [6{  any character  in  the E  or  C strings,  it  was first  tested to  see if  it  was [48{ [6{  one of these  'special cases'. If  it was, it  was recoded with  an upper  case [48{ [6{  character, as shown in the  figure 7-8. For example, the 'k' in 'know'  was [48{ [6{  recoded as 'B'. [48{ [6{Weightings assigned were represented as array values  in the range 2.5  to [48{ [6{  10.  Four  arrays  of  weightings  were  set  up,  one  for  each  edit  operation [48{ [6{  (no  weighting  was  recorded  for  the  ok  edit). A  list  of  the  actual [48{ [6{  weightings  assigned  is  given  in  figure  7-9.  The  weighting  for  deleting  or [48{ [6{  inserting  any  character  is  stored  in  the  one-dimensional  arrays  deletew- [48{ [6{  tarray  and  insertwtarray,  respectively,  indexed  by  the  character  itself  (or [6{   145 [48{ [6{  Representation Letter(adjacent letter) A final e B k(n) at start of word C w(r) or w(h) D h in second position E t(t) F r(r) G o(o) H e(e) I p(p) J b(b) K k(k) L m(m) M l(l) N s(s) O f(f) P n(n) Q d(d) R z(z) S c(k) T (c)k U g(h) Figure 7-8: Representation of 'special cases'  character  representation  in  special  cases).  The  changewtarray  and  trans- [48{ [6{  posewtarray are  two-dimensional. Weightings  are given  in the  figure for  a [48{ [6{  number of  change  and  transpose combinations.  The  cost  of  changing  pairs [48{ [6{  of characters  is  considered separately  in  a number  of  cases: e.g.  the  cost [48{ [6{  of changing  'f'  to 'th',  or  'ff' to  'gh'  are indexed  separately.  (see  figure [48{ [6{  7-9). [48{ [6{For  each  pair  of  nodes  to  be  compared  the  costgraph  is  constructed. [48{ [6{  At each node  the pair  of characters  E(i) and  C(j) is  considered:  they  are [48{ [6{  tested  first  to  see  if  they  are  'special  cases'.  Weightings  for  each  edit [48{ [6{  operation  at  the  node  (i,j)  are  found.  Insert  weightings  and  delete [48{ [6{  weightings are the values of the array elements indexed by the characters. [48{ [6{  If  there  is  a  match  for  a  transposition,  that  is  if  E(i-1)=C(j)  and [48{ [6{  E(i)=C(j-1),  then  the  transpose  edit  weighting  is  given  by  the  tranposew- [48{ [6{  tarray  element  (E(i-1),E(i)).  If  E(i)=C(j),  that  is  if  the  characters  match, [48{ [6{  then  the  edit  cost  is  zero,  otherwise  the  change  edit  weighting  is  found. [48{ [6{  Characters preceding  and  following  E(i) and  C(j)  are  noted.  Adjacent  pairs [48{ [6{  of  characters,  E(i),E(i+1)/E(i-1),E(i)/C(j),C(j+1)/C(j-1),C(j),  are  compared  with [48{ [6{  the  characters  pairs  listed  under  'changewts'  (see  figure  7-9)  and  if  a [48{ [6{  match  is  found  weightings  are  assigned  accordingly:  otherwise  the  value [6{   146 [48{ [6{  insertwt deletewt changewt transwt a 9.0 a 6.0 a/e 7.0 ae 9.0 b 4.0 b 2.5 a/o 7.0 au 8.0 c 5.0 c 2.5 a/u 5.0 de 8.0 d 7.0 d 3.5 b/d 4.0 ea 9.0 e 10.0 e 5.0 c/k 5.0 ei 9.0 f 5.0 f 2.5 e/a 8.0 el 9.0 g 7.0 g 4.0 e/i 4.0 es 9.0 h 8.0 h 3.5 e/u 7.0 er 9.0 i 9.0 i 5.0 h/i 5.0 ed 8.0 j 4.0 j 2.5 i/e 5.0 ey 8.0 k 4.0 k 2.5 i/y 6.0 ef 8.0 l 7.0 l 3.5 k/c 4.0 fe 8.0 m 4.0 m 2.5 m/n 6.0 gh 8.0 n 5.0 n 3.5 n/m 4.0 hg 8.0 o 7.0 o 5.0 o/u 4.0 hn 8.0 p 4.0 p 2.5 p/b 6.0 ht 8.0 q 4.0 q 2.5 s/c 6.0 ie 9.0 r 9.0 r 4.0 t/p 5.0 le 9.0 s 5.0 s 3.5 u/a 4.0 nh 8.0 t 5.0 t 3.5 u/e 5.0 re 9.0 u 7.0 u 5.0 u/o 7.0 se 9.0 v 4.0 v 2.5 y/i 4.0 th 8.0 w 8.0 w 2.5 f/gh 4.0 ua 8.0 x 4.0 x 2.5 f/th 6.0 ye 8.0 y 7.0 y 4.0 f/ph 6.0 all others=7.0 z 4.0 z 2.5 g/ch 4.0 A 10.0 A 6.0 j/ch 4.0 B 8.0 B 3.5 v/th 4.0 C 8.0 C 2.5 w/gh 4.0 D 8.0 D 4.0 y/gh 4.0 E 7.0 E 3.5 y/ie 5.0 F 9.0 F 4.0 ch/j 4.0 G 7.0 G 5.0 ch/t 4.0 H 10.0 H 5.0 ff/gh 4.0 I 6.0 I 3.5 gh/ff 3.5 J 6.0 J 3.0 oo/ue 4.5 K 7.0 K 4.0 ow/ue 4.5 L 6.0 L 3.0 th/f 4.5 M 8.0 M 3.5 all others=3.0 N 7.0 N 3.5 O 6.0 O 3.0 P 7.0 P 4.0 Q 7.0 Q 3.5 R 6.0 R 3.0 S 7.0 S 3.0 T 6.0 T 3.0 U 4.0 U 4.0 Figure 7-9: Weightings assigned to edit functions [6{   147 [48{ [6{  stored in the  changewtarray element  (E(i),C(j)) is  the change  edit  weighting [48{ [6{  for the  current  node.  The  weightings are  used to  calculate the  cost  for [48{ [6{  each edit operation at each node, where [48{ [6{  cost = 0.05 + (2.5/weighting) [48{ [6{  The maximum  cost  of  a single  edit  is  1.05  (weighting=2.5)  and  the  minimum [48{ [6{  cost  is  0.3  (weighting=10).  Using  the  algorithm  described  above,  the [48{ [6{  minimum cost of each node  is calculated, for all  nodes, and the least  cost [48{ [6{  path through the graph determined. [48{ [6{  The  weightings  were  chosen  by  considering  those  errors  most  frequently [48{ [6{  made  by  the  children  in  Study  1,  and  by  experimenting  with  different [48{ [6{  costs  and  weighting  functions,  testing  them  on  a  subset  of  the  errors. [48{ [6{  They  were  set  up  such  that  2  very  frequent  errors,  that  might  often [48{ [6{  occur together, would  have lower cost  than a single  (less likely error),  and [48{ [6{  hence  the  matched  dictw  involving  2  edit  operations  could  be  chosen  in [48{ [6{  preference to  another dictw  involving only  1 edit  operation.  For  example, [48{ [6{  if  the  misspelling  is  "wud"  it  could  be  edited  to  "would"  by  inserting  "o" [48{ [6{  and  "l"  (2  operations,  cost=.814).  It  could  be  edited  to  match  "mud"  with [48{ [6{  1  operation,  change  "w"  to  "m"  (cost=.838).  The  match  to  "would"  is [48{ [6{  cheaper.  In  theory,  3  of  the  'cheapest'  edits  (minimum  cost  0.3,  weight- [48{ [6{  ing  10)  would  be  less  costly  than  1  of  the  most  expensive  (maximum  cost [48{ [6{  1.05,  weighting=2.5):  it  is  very  difficult,  however,  to  imagine  cases  when [48{ [6{  this  would  happen.  Though  the  transposition  errors  were  the  least [48{ [6{  frequent  of  the  error types  they  were  given  high  weightings:  this  was  to [48{ [6{  increase  the  likelihood  of  any  error  being  classed  as  a  transposition,  if  it [48{ [6{  could be considered as such. [48{ [6{  An  example  will  be  given  to  illustrate  the  method  of  calculation  of  the [48{ [6{  minimum  cost  repair.  For  the  four  paths  given  in  figure  7-2,  the  graph [48{ [6{  showing  the  weightings,  four  of  the  possible  transformations,  and  the [48{ [6{  costs  for  each  transformation  are  given.  The  weightings  for  the  trans- [48{ [6{  formations are given in figure 7-10. [48{ [6{  The transformations are: [6{   148 [48{ [6{ricev -> receive C string r e c e i v e . . . . . . . . r E 4 ok . . . . . . . . i s 5 5 5 t . . . . . . . . r c i 2.5 10 ok n . . . . 10 . 9 . . . g e 5 ok . . . . . 9 . . 7 . v 2.5 3 ok . . . . . . . . 9 10 5 10 9 4 10 Figure 7-10: Graph showing example weightings  1. r -> r 2. r -> r 3. r -> r 4. r -> 0 i -> e i -> 0 i -> e i -> 0 c -> c 0 -> e c -> c c -> 0 e -> e c -> c e -> e e -> 0 0 -> i 0 -> e v -> i v -> 0 v -> v 0 -> i 0 -> v 0 -> r 0 -> e ev -> ve 0 -> e 0 -> e 0 -> c 0 -> e 0 -> i 0 -> v 0 -> e The costs for each transformation will be:  1. 0 + .55 + 0 + 0 + .327 + 0 + .3 = 1.177  2. 0 + .55 + .3 + 0 + .3 + .327 + .407 = 1.884  3. 0 + .55 + 0 + 0 + .883 + .675 + .3 = 2.408  4. .675 + .55 + 1.05 + .55 + 1.05  + .327 + .3 + .55 + .3 +  .327 +  .675 + .3 = 6.654 [48{ [6{The  performance  of  the  editcost  algorithm,  together  with  possible  im- [48{ [6{  provements and extensions, is discussed in chapter 8. [6{   149 [48{ [6{  7.3. Phonemic coding of words: the phoncode program [48{ [6{  7.3.1. General overview [48{ [6{When  phoneme-grapheme  correspondence  rules  are  used  to  generate [48{ [6{  spellings,  misspellings  may  occur  when  incorrect  rules  are  used,  or  when [48{ [6{  the  rules  are  inappropriate  (for  example  an  irregularly  spelt  word).  If  a [48{ [6{  word is misspelt in this way, it might be corrected as follows:  1. isolate the graphemes in the word  2. infer  which  phoneme-grapheme  correspondence  rules  may  have  been used to produce the graphemes  3. select  a  set  of  phonemes  that  could  have  generated  these  graphemes 4. use the set of phonemes to generate the correct spelling [48{ [6{  In  the  first  three  steps,  the  phoneme-grapheme  correspondence  rule  is [48{ [6{  being  used  in  reverse.  This  is  not  equivalent,  however,  to  using  the [48{ [6{  grapheme-phoneme  rules  as  in  reading.  This  method  of  correction  is  the [48{ [6{  basis of the phoncode program, to be described here. [48{ [6{  7.3.2. Related work [48{ [6{Research  on  grapheme  to  phoneme  correspondence  rules  is  of  interest [48{ [6{  here,  in  that  it  gives  some  guide  to  pronunciation  of  correctly  spelt [48{ [6{  words,  and  also  for  regular  non-words.  As  discussed  in  chapter  2, [48{ [6{  grapheme-phoneme  rules  can  be  specified  more  easily  than  phoneme- [48{ [6{  grapheme  rules.  Work  by  Venezky  is  of  particular  relevance  here  (Venezky, [48{ [6{  1966).  The  grapheme-phoneme  rules  are  of  limited  use,  however,  since [48{ [6{  they do not provide information  about all possible phonemes that might  be [48{ [6{  generated;  or  which  alternative  graphemes  could  be  used  in  place  of  the [48{ [6{  given one, in order to produce the same phoneme. [48{ [6{  Work  on  computer  text-to-speech  production  systems  is  also  of [48{ [6{  relevance (Ellovitz  et al,  1976), (Ciarcia,  1982), (Allen,  1981), but  has  similar [48{ [6{  limitations.  The  focus  is  on  specifying  grapheme-phoneme  rules  in  certain [48{ [6{  graphemic contexts, and most difficulties are encountered in these  systems [48{ [6{  with pronunciations of irregularly spelt words. [6{   150 [48{ [6{  Research  on  phoneme-grapheme  correspondence  is  of  more  relevance [48{ [6{  here,  in  particular  the  studies  by  Hanna  et  al  (Hanna  et  al,  1966),  and  by [48{ [6{  Simon  and  Simon  (Simon  and  Simon,  1973).  These  studies  are  discussed  in [48{ [6{  Chapter  2.  Information  about  the  correct  graphemes  for  representing [48{ [6{  phonemes  in  specific  words  was  obtained  in  these  studies.  The  phoneme- [48{ [6{  grapheme  correspondences  that  were  generated,  used  in  reverse,  would [48{ [6{  provide  some  of  the  information  required  for  the  phoncode  program. [48{ [6{  However,  as  with  Venezky's  work,  and  the  text-to-speech  research, [48{ [6{  legitimate  correspondences  (phoneme-grapheme  and  grapheme-phoneme)  in [48{ [6{  specific  contexts  were  being  studied:  the  aim  was  to  produce  correct [48{ [6{  pronunciation  or  spelling,  using  context-specific  rules.  As  the  concern  in [48{ [6{  this  thesis  is  misspellings,  the  correspondence  rules  to  be  used  must  be [48{ [6{  largely  context-free.  A  misspelling  might  well  be  a  'correct'  phoneme- [48{ [6{  grapheme  correspondence  used  in  an  'incorrect'  context.  Therefore,  infor- [48{ [6{  mation from these studies is useful, but not sufficient. [48{ [6{  7.3.3. Design of the phoncode program [48{ [6{  The  phoncode  program  takes  an  input  word  (inpw)  and  selects  those [48{ [6{  words  from  the  dictionary  that  can  be  considered  phonemically  equivalent [48{ [6{  to  it.  The  dictionary  words  matched  are  offered  to  the  user  as  possible [48{ [6{  corrections for the inpw. See figure 7-11 for some examples.  :wear Is it one of these? wear where were :turtul Is it one of these? turtle :thiar Is it one of these? their there :somb Is it one of these? some sum Figure 7-11: Examples of use of the phoncode program [6{   151 [48{ [6{  The  phoncode  program  works  by  segmenting  the  inpw  into  graphemes; [48{ [6{  finding  phonemes  that  these  graphemes  might  represent  (in  any  context); [48{ [6{  using a set  of these  phonemes to  search for  legitimate words  represented [48{ [6{  by matching phonemes. Decisions were  made concerning selection of the  set [48{ [6{  of  graphemes  (into  which  words  were  segmented),  the  set  of  phonemes [48{ [6{  (used  to  represent  all  words),  and  the  grapheme-phoneme  correspondences. [48{ [6{  Information from four sources was used in making these decisions:  - errors made by children in both studies;  - related work by Hanna et al (Hanna et al, 1966);  - related work by Simon and Simon (Simon and Simon, 1973);  - additional examples generated by the author. [48{ [6{  7.3.4. Defining the set of phonemes [48{ [6{  In  the  Hanna  study  words  were  phonemically  coded  using  a  62-phoneme [48{ [6{  classification  scheme,  based  on  the  Merriam-Webster  dictionary  pronun- [48{ [6{  ciation code (Hanna  et al, 1966).  This scheme  used 32  vowel phonemes  and [48{ [6{  30  consonant  phonemes.  Later  they  reduced  the  number  of  vowel [48{ [6{  phonemes  in  the  classification  scheme,  combining  the  weakest  vowel  forms [48{ [6{  2[48{ [6{to  the  schwa ,  and  combining  several  other  categories  where  distinctions [48{ [6{  between  phonemes  were  not  clear,  e.g.  reducing  the  two  categories  /A/ [48{ [6{  and  /A1/  (ale  and  chaotic)  to  a  single  category,  /A/.  The  total  number  of [48{ [6{  vowel phoneme categories  was reduced to  22. Ellovitz et  al. (Ellovitz et  al, [48{ [6{  1976)  adopted  a  scheme  of  41  phonemes,  16  vowel  phonemes  and  25 [48{ [6{  consonant  phonemes,  in  their  text-to-speech  program.  The  phonemes  were [48{ [6{3[48{ [6{  used  in  converting  english  text  to  IPA  representation  ,  which  is  then [48{ [6{  passed to the Votrax synthesizer (Ellovitz et  al, 1976). Morris-Wilson, in  his [48{ [6{  phonemic transcription textbook, used a representation of 44 phonemes,  24 [48{ [6{  consonantal and 20  vowel phonemes,  taken from  'Gimson's Set'  (see  (Morris- [48{ [6{  Wilson, 1984) for details and references). [48{ [6{    [48{ [6{2[48{ [6{the vowel sound occurring in unstressed syllables e.g. about, vowel, mother [48{ [6{  3[48{ [6{International Phonetic Alphabet [6{   152 [48{ [6{  The  set  used  in  this  study  comprised  46  phonemes  in  all,  20  vowel [48{ [6{  phonemes  and  26  consonantal  phonemes.  Figures  7-12  and  7-13  show  the [48{ [6{4[48{ [6{  phonemes used, with examples . Morris- IPA Hanna Notation here Examples Wilson et al (text) (code) late, day eI eI A/A1 eI 10 air, care e A2 eE 12 bat, add A3/A4/A6 ae 13 car, aunt : a A5 a: 15 about, silent A7/E4/I4 E 57 O4/U4 beat, keep i: i E/E1 i 50 here, ear I I E2 IE 52 end, let e E3 e 53 maker, urn : E5/U2 e: 55 ice, high aI aI I aI 90 ill, bit I I I3 I 93 boat, know U o O/O1 EU 150 port, saw : O2 o: 152 pot, soft O3/O5 O 153 food, rude u: u O6 u 156 foot, book U O7 U 157 cube, unite - ju U/U1 ju 200 up, son ^ ^ U3 ^ 203 oil, boy I I OI oI 158 out, cow aU a OU aU 159 honest - - H9 - - late - - E9 - -  Figure 7-12: Representation of vowel phonemes  The  20  vowel  phonemes  were  equivalent  to  20  of  the  vowel  phonemes [48{ [6{  used  by  Hanna,  combining  the  /E5/  and  /U2/  categories  to  match  com- [48{ [6{  pletely.  Of  the  vowel  phonemes,  19  of  the  20  used  by  Morris-Wilson  were [48{ [6{  used  here.  The  /U2/  category  was  omitted  (combined  with  the  /o:/ [48{ [6{  category), and  a  separate  vowel phoneme  /ju/  was  added:  Morris-Wilson  has [48{ [6{  a  consonant  category  /j/,  but  no  independent  vowel  category.  Two [48{ [6{  consonantal  categories  were  added  to  the  Morris-Wilson  set  to  make  the [48{ [6{  set  here:  /ks/  and  /kw/  are  added,  as  they  were  considered  to  be  single [48{ [6{  phonemes  in  certain  contexts.  This  set  is  equivalent  to  the  Hanna  set [48{ [6{  with  category  pairs  /L/  and  /L1/,  /M/  and  /M1/,  /N/  and  /N1/,  /W/  and [48{ [6{    [48{ [6{4[48{ [6{The  code  column  indicates  the  integer  code  used  in  the  actual  program:  the  text  notation will generally be used in this thesis for ease of reading. [6{   153 [48{ [6{  Examples Morris- IPA Hanna Notation here Wilson et al (text) (code) bad, rub b b B b 20 bad, day d d D d 40 fat, rough f f F f 60 go, big g g G g 70 hit, behind h h H h 80 gin, joke dz dz J dz 100 keep, cock k k K k 110 loud, kill l l L,L1 l 120 mad, jam m m M,M1 m 130 man, no n n N,N1 n 140 pit, top p p P p 160 run, bread r r R r 170 sit, loss s s S s 180 trap, step t t T t 190 very, love v v V v 210 wash, when w w W,HW w 220 yellow, yet j j Y y 230 zoo, beds z z Z z 240 chair, lunch t t CH ch 31 ethics, accent - - KS ks 111 quick, aqua - kw KW kw 112 sing, along NG ng 142 sugar, bush SH sh 181 theatre, thank T1 th 191 that, with T2 tv 192 garage,pleasure ZH zh 241 honest - - H9 - -  Figure 7-13: Representation of consonant phonemes  /WH/  each  being  considered  as  one  category.  The  basis  for  combination  of [48{ [6{  phoneme  categories,  considered  distinct  by  Hanna  et  al,  was  that  in  many [48{ [6{  contexts  the  pronunciations  would  be  considered  indistinguishable  to  a [48{ [6{  non-linguist,  both  when  pronounced  by  the  author,  or  by  the  children [48{ [6{  from the reading unit. [48{ [6{  7.3.5. Phoneme-grapheme correspondences [48{ [6{  When  considering  a  word  and  its  phonemic  representation,  how  do  we [48{ [6{5[48{ [6{  decide  which  character  or  characters  represent  each  phoneme?  In  cases [48{ [6{  such as the following the relationship is clear: [48{ [6{    [48{ [6{5[48{ [6{Note that the convention here will be grapheme = /phoneme/. [6{   154 [48{ [6{  cat = /k/ /ae/ /t/ dog = /d/ /O/ /g/ biting = /b/ /aI/ /t/ /I/ /ng/ c = /k/ d = /d/ b = /b/ a = /ae/ o = /O/ i = /aI/ t = /t/ g = /g/ t = /t/ i = /I/ ng = /ng/ In  some  other  words  it  may  be  a  little  more  difficult  to  decide,  but  an [48{ [6{  interpretation can be given: abbey = /ae/ /b/ /i/ nation = /n/ /eI/ /sh/ /E/ /n/ science = /s/ /aI/ /E/ /n/ /s/ a = /ae/ n = /n/ sc = /s/ bb = /b/ a = /eI/ i = /aI/ ey = /i/ ti = /sh/ e = /E/ o = /E/ n = /n/ n = /n/ ce = /s/ With  the  following  examples,  however,  it  is  much  more  difficult  to  decide [48{ [6{  which graphemes correspond to each phoneme:  lamb = /l/ /ae/ /m/ l = /l/ l = /l/ a = /ae/ a = /ae/ m = /m/ mb = /m/ b = ? receive = /r/ /E/ /s/ /i/ /v/ r = /r/ r = /r/ r = /r/ e = /E/ e = /E/ e = /E/ c = /s/ ce = /s/ ce = /s/ e = /i/ i = /i/ i+e = /i/ i = ? ve = /v/ v = /v/ v = /v/ e = ? vague = /v/ /eI/ /g/ v = /v/ v = /v/ v = /v/ a = /eI/ a = /eI/ a = /eI/ g = /g/ gu = /g/ gue = /g/ ue = ? e = ? [48{ [6{ It is necessary to specify the full set of graphemes such that each  may [6{   155 [48{ [6{  be  considered  to  correspond  to  a  single  phoneme  (or  a  number  of [48{ [6{  alternative single phonemes). Hanna et al. (Hanna et  al, 1966) and Simon  and [48{ [6{  Simon  (Simon  and  Simon,  1973)  each  give  a  set  of  phoneme-grapheme [48{ [6{  correspondences,  the  latter's  set  being  a  slight  modification  of  the [48{ [6{  former's.  Hanna  et  al  list  107  different  graphemes,  and  Simon  and  Simon [48{ [6{  list  104.  Yannakoudakis  and  Fawthrop  (Yannakoudakis  and  Fawthrop, [48{ [6{  1983b) divide words into vowel and consonant elements, using 267 elements. [48{ [6{  A  number  of  specific  problems  were  encountered  when  attempting  to [48{ [6{  segment  words  into  graphemes,  some  of  the  problems  being  illustrated  in [48{ [6{  the examples above. [48{ [6{For  example,  consider  'silent  letters',  that  is,  single  characters  which [48{ [6{  apparently do not represent a  phoneme. The  final 'e' is  a special case  of [48{ [6{  this  in  words  where  it  is  used  to  modify  the  pronunciation  of  the [48{ [6{  preceding  vowel  (often  referred  to  as  the  'magic  e'  in  teaching).  It [48{ [6{  modifies the vowel, changing a soft vowel into a hard one: can = /k/ /ae/ /n/ cane = /k/ /eI/ /n/ fin = /f/ /I/ /n/ fine = /f/ /aI/ /n/ cut = /k/ /^/ /t/ cute = /k/ /ju/ /t/ The  'e'  may  be  treated  as  a  separate  silent  grapheme,  or  it  may  be [48{ [6{  considered  part  of  the  prececeding  vowel  grapheme,  represented  in  'cane' [48{ [6{  as  'a_e'.  Hanna  et  al  (Hanna  et  al,  1966)  dealt  with  this  problem  by [48{ [6{  inventing a separate /E9/ phoneme category to represent the grapheme  'e'. [48{ [6{  If  it  is  represented  in  this  way,  it  is  no  longer  linked  to  the  preceding [48{ [6{  vowel.  However,  in  some  cases,  where  the  'e'  is  not  final  but  still [48{ [6{  modifies the preceding vowel, the 'e' is sounded: cases = /k/ /eI/ /s/ /I/ /z/ The  'e'  corresponds  to  /I/.  The  graphemic  construction  in  which  the  'e' [48{ [6{  occurs  will  be  described  as  a  'vc+e'  grapheme  (vowel,  consonant  +  'e').  In [48{ [6{  some  cases  the  vowel  and  'e'  will  be  taken  to  correspond  to  a  single [48{ [6{  phoneme,  'a_e'  in  'cane';  in  others  they  represent  two  phonemes  and  are [48{ [6{  treated as separate graphemes. [48{ [6{  Other silent letters  cause difficulties,  for example the  'b' in 'lamb',  the [48{ [6{  'k'  in  'know',  'g'  and  'h'  in  'high',  'u'  in  'guard'.  Similarly  in  cases  of [48{ [6{  double  letters  only  one  is  pronounced:  'cotton',  'tell',  'pass'.  The  silent [48{ [6{  letter  may  be  given  no  corresponding  phoneme,  or  may  be  assigned  to  an [6{   156 [48{ [6{  adjacent  grapheme  with  a  corresponding  phoneme.  In  the  former  case  a [48{ [6{  letter  that  might  be  silent  in  some  contexts  could  be  placed  anywhere  in [48{ [6{  a  word  and  not  affect  its  pronunciation:  if  a  phoneme  /b0/  represented [48{ [6{  a silent 'b' then 'at'  would be a  possible misspelling of  'bat', or 'brubn'  a [48{ [6{  misspelling  of  'run'.  If  the  silent  letter  is  treated  as  part  of  another [48{ [6{  grapheme,  then  the  cases  in  which  it  would  be  considered  a  legitimate [48{ [6{  phonetic  misspelling  are  less  arbitrary.  If  'k'  is  silent  when  followed  by [48{ [6{  'n',  in  the  initial  position  of  a  word,  the  grapheme  'kn'  could  be [48{ [6{  considered  as  a  possible  correspondence  to  /n/,  producing  the  following [48{ [6{  'phonetic equivalents': knot = /n/ /O/ /t/ not = /n/ /O/ /t/ kn = /n/ n = /n/ o = /O/ o = /O/ t = /t/ t = /t/ [48{ [6{ In  discussing  the  silent  letter  graphemes,  Hanna  et  al  (Hanna  et  al, [48{ [6{  1966) state that: [48{ [6{"Considerable  disagreement  occurs  among  linguists  regarding  how  such graphemes should be classified." p. 14 [48{ [6{  They were all considered in this study to be part of a grapheme that  had [48{ [6{  a corresponding phoneme:  examples of  graphemes incorporating these  'silent [48{ [6{  graphemes' are 'ce' since 'gh' ghost 'wh' what 'wr' write 'kn' know 'bt' debt A  complete  list  of  the  grapheme-phoneme  correspondences  used  in  this [48{ [6{  thesis is given in appendix B. [48{ [6{  The  schwa  /E/  also  presents  problems,  particularly  in  cases  where  it  is [48{ [6{  followed by /n/: opening = /EU/ /p/ /E/ /n/ /I/ /ng/ The  'en'  is  represented  by  two  phonemes,  /E/  and  /n/.  However,  a [48{ [6{  possible  misspelling  of  'opening',  where  the  'e'  is  omitted,  is  'opning'.  To [48{ [6{  cater  for  this  and  similar  cases,  'en'  was  treated  as  a  grapheme [48{ [6{  representing a single phoneme, as was 'on' and 'an'. [48{ [6{  In  some  dialects  the  'r'  in  'er'  in  not  pronounced:  the  'er'  corresponds [48{ [6{  to a single phoneme, /e:/ or /E/. In  other dialects the 'r' is sounded  and [6{   157 [48{ [6{  the  'er'  corresponds  to  two  phonemes,  /E/  and  /r/.  To  allow  for  both [48{ [6{  cases  'er'  was  taken  as  a  grapheme  representing  a  single  phoneme,  or [48{ [6{  could  be  split  to  represent  two  phonemes.  Other  graphemes  involving  'r' [48{ [6{  were treated in the same way: ayor oar air aer are ar ear re er ier ir our oor ore or ure ur Each corresponds  to  a  single vowel  phoneme,  or  to  a  single  vowel  and  /r/. [48{ [6{  Some  also  correspond  to  the  single  phoneme  /r/,  or  may  be  split  further [48{ [6{  into  more  than  two  phonemes,  e.g.  'ier'  in  'carrier'.  A  large  number  of [48{ [6{  these correspondences were derived from the children's errors. [48{ [6{  The phoneme /sh/ is spelt in a variety of different ways: 'ssi' mission 'ss' reassure 'ti' nation 'si' mansion 'ch' machine 'sh' show 'ci' special All these were represented as individual graphemes. [48{ [6{  A  large  number  of  graphemes  correspond  to  a  single  phoneme  in  some [48{ [6{  cases,  but  to  more  than  one  phoneme  in  others.  Because  of  this  they [48{ [6{  must be taken  to correspond  to single  phonemes but also  able to be  split [48{ [6{  into  smaller  graphemes,  corresponding  to  other  phonemes.  For  example, [48{ [6{  the grapheme 'ough': cough = /k/ /O/ /f/ though = /tv/ /EU/ In  'cough'  it  corresponds  to  two  phonemes;  in  'though'  only  a  single  one. [48{ [6{  Doubled  characters,  however,  and  a  number  of  other  graphemes  occur  only [48{ [6{  as single phonemes in the majority of cases:  'igh' high = /h/ /aI/ 'll' will = /w/ /I/ /l/ 'ch' chip = /ch/ /I/ /p/ 'ph' phone = /f/ /EU/ /n/ These graphemes,  corresponding  to one  single phoneme,  will be  referred  to [48{ [6{  as 'tied' graphemes. If  a tied grapheme occurs in  a word or a  misspelling [48{ [6{  the  corresponding  phonemes  were  found,  and  the  grapheme  was  segmented [48{ [6{  no  further.  Graphemes  that  are  not  tied  were  further  segmented  (or [48{ [6{  split)  to  find  constituent  graphemes  and  corresponding  phonemes.  Tied [48{ [6{  graphemes are indicated by a "1"  preceding them in the  grapheme-phoneme [6{   158 [48{ [6{  correspondence  table  (figure  B-1,  appendix  B),  and  segmentable  graphemes [48{ [6{  by a 0. All 'vc+e' graphemes are segmentable, and are indicated by *. [48{ [6{  7.3.6. Phonemic coding of the dictionary [48{ [6{  The  inpw  was  segmented  into  all  possible  graphemes,  and  phoneme  cor- [48{ [6{  respondences  were  found  for  each  grapheme.  All  words  in  the  dictionary [48{ [6{  were  coded  phonemically.  The  set  of  phonemes  representing  the  inpw [48{ [6{  were  compared  with  those  representing  the  dictionary  words:  if  matches [48{ [6{  were found they were taken to be "possible phonemic equivalents". [48{ [6{  Each  word  in  the  dictionary  was  hand  coded  by  the  author  in  conjunc- [48{ [6{  tion  with  a  linguist  specializing  in  speech  synthesis.  The  general  guidelines [48{ [6{  suggested  in  the  text  "English  Phonemic  Transcription"  (Morris-Wilson, [48{ [6{  1984)  were  used  in  coding.  'Strong  forms'  were  used  in  most  cases  (see [48{ [6{  p  76-82,  (Morris-Wilson,  1984))  though  the  'weak  form'  alternatives  were [48{ [6{  also used in a few cases. [48{ [6{For  a  number  of  the  words,  the  ambiguity  of  possible  alternative [48{ [6{  pronunciations  meant  that  more  than  one  phonemic  coding  of  a  word  was [48{ [6{  included  in  the  coded  dictionary.  In  particular,  'er'  at  the  end  of  a [48{ [6{  word  was  coded  both  as  /E/  and  as  /E/  /r/  to  allow  for  variations  in [48{ [6{  dialect. 'er'  occurring  elsewhere  in  a  word  was  coded  as  /e:/  when [48{ [6{  representing  'r-less'  dialects. Other  'vowel  +  r'  graphemes  were  also [48{ [6{  coded with and without the 'r'. board = /b/ /o:/ /r/ /d/ & = /b/ /o:/ /d/ car = /k/ /a:/ & = /k/ /a:/ /r/ error = /e/ /r/ /E/ /r/ & = /e/ /r/ /E/ [48{ [6{ The 'd'  in  'procedure',  'produce'  and  'soldier' may  be  pronounced  /d/  or [48{ [6{  /dz/, so both are represented: produce = /p/ /r/ /O/ /d/ /ju/ /s/ & = /p/ /r/ /O/ /dz/ /u/ /s/ [48{ [6{ Unstressed  vowels  were  coded  as  schwa,  /E/.  In  some  words  both  the [48{ [6{  strong and the weak forms of vowels were coded: [6{   159 [48{ [6{  around = /eI/ /r/ /aU/ /n/ /d/ around = /E/ /r/ /aU/ /n/ /d/ In  general  when  there  was  uncertainty  about  the  vowel  phoneme,  the  /E/ [48{ [6{  was used and matched to all single vowel graphemes. [48{ [6{  In some  words, the  schwa  was omitted,  and  the following  consonant  only [48{ [6{  was coded. The schwa is unpronounced in some dialects.  difference = /d/ /I/ /f/ /E/ /r/ /E/ /n/ /s/ & = /d/ /I/ /f/ /r/ /E/ /n/ /s/ edinburgh = /e/ /d/ /I/ /n/ /b/ /r/ /E/ & = /e/ /d/ /I/ /n/ /b/ /E/ /r/ /E/ factory = /f/ /ae/ /k/ /t/ /r/ /I/ & = /f/ /ae/ /k/ /t/ /E/ /r/ /I/ [48{ [6{ The  graphemes  'el'  'le  '  and  'en'  also  presented  some  problems.  In [48{ [6{  cases where they were clearly  pronounced with a  schwa the vowel  phoneme [48{ [6{  was coded separately, e.g. kitchen = /k/ /I/ /ch/ /E/ /n/ pixel = /p/ /I/ /ks/ /E/ /l/ In  other  cases  the  representation  was  not  clear  and  the  word  was  coded [48{ [6{  with and without the schwa: jewels = /dz/ /ju/ /w/ /E/ /l/ /z/ & = /dz/ /ju/ /w/ /l/ /z/ open = /EU/ /p/ /n/ & = /EU/ /p/ /E/ /n/ simple = /s/ /I/ /m/ /p/ /l/ & = /s/ /I/ /m/ /p/ /E/ /l/ In coding some other words the schwa was omitted:  people = /p/ /i/ /p/ /l/ little = /l/ /I/ /t/ /l/ [48{ [6{ The graphemes 'lm' 'ld' and 'lk' also present coding difficulties:  calmly = /k/ /a:/ /m/ /l/ /I/ milk = /m/ /I/ /l/ /k/ could = /k/ /u/ /d/ shoulder = /sh/ /EU/ /l/ /d/ /E/ & = /sh/ /EU/ /l/ /d/ /E/ /r/ & = /sh/ /EU/ /d/ /E/ & = /sh/ /EU/ /d/ /E/ /r/ talk = /t/ /o:/ /k/ & = /t/ /o:/ /l/ /k/ The  coding  here  is  not  consistent.  The  difficulties  that  it  presents  are [6{   160 [48{ [6{  discussed  in  chapter  8.  Other  ambiguities  and  problems  with  the  coding [48{ [6{  are also discussed in that chapter. [48{ [6{  7.3.7. Detail of the phoncode program [48{ [6{  The  phonetic  coding  program  takes  the  input  word,  inpw,  and  segments [48{ [6{  it  into  all  combinations  of  graphemes.  The  phonemes  that  can  be [48{ [6{  represented  by  each  grapheme  are  found  by  table  look-up.  All  sequences [48{ [6{  of  phonemes  that  may  be  considered  to  represent  the  inpw  are  compared [48{ [6{  with  the  dictionary.  The  dictionary  is  represented  as  a  tree,  where  each [48{ [6{  node represents a phoneme, and also stores information about  actual words [48{ [6{  (see figure  7-19).  If  a sequence  of  phonemes (representing  inpw)  matches [48{ [6{  a  path  in  the  tree,  and  the  final  node  in  the  path  also  contains  a  word [48{ [6{  (stored  as  a  string),  then  that  word  is  offered  as  a  phonetic  match  to [48{ [6{  inpw. [48{ [6{7.3.8. Segmenting the word [48{ [6{  .  Each  grapheme  in  the  grapheme-phoneme  table  is  compared  with  the [48{ [6{  inpw  string.  If  there  is  a  match,  the  phonemes  corresponding  to  the [48{ [6{  grapheme,  and  the  position  of  the  grapheme  in  the  word,  are  noted.  If [48{ [6{  the  character  is  marked  as  a  'tied'  grapheme  the  characters  in  it  are [48{ [6{  segmented  no  further:  these  characters  in  the  word  are  no  longer  acces- [48{ [6{  sible. If the  grapheme  is not  tied it  may be  segmented further.  The  next [48{ [6{  grapheme  is  compared  with  those  remaining  in  the  inpw  string  that  are [48{ [6{  still accessible.  Again,  if  the  match  succeeds,  corresponding  phonemes  and [48{ [6{  position in the  string are  noted: tied phonemes  are marked as  inaccessible. [48{ [6{  The  process  is  continued  until  all  graphemes  in  the  table  have  been [48{ [6{  compared, or until there are no accessible characters left in the string. [48{ [6{  For  example,  if  the  inpw  were  'caught',  the  grapheme  'augh'  would [48{ [6{  match  to  characters  2  to  5  in  the  inpw  string.  The  phonemes  cor- [48{ [6{  responding  to  this  grapheme  are  /o:/  and  /a:/.  The  grapheme  'augh'  is  not [48{ [6{  tied,  and  may  be  further  split.  The  next  matching  grapheme  in  the  table [48{ [6{  is 'gh', with  phonemes /f/ and  /g/. This grapheme is  marked as tied, so  no [48{ [6{  further  segmentation  of  the  grapheme  'gh'  is  required.  These  matches [48{ [6{  may be represented in the following format: [6{   161 [48{ [6{  grapheme => alternative phonemes augh => /o:/ /a:/ gh => /f/ /g/ The remaining matches found are: au => /EU/ /aU/ /o:/ /a:/ /o/ c => /k/ /s/ /sh/ t => /t/ /ch/ /sh/ /d/ /th/ [48{ [6{ The  set  of  phonemes  representing  the  inpw  may  also  be  represented  as [48{ [6{  a finite state grammar, as in figure 7-14. /o/ /EU/ /d/ /k/ /aU/ /f/ /t/ 1 /s/ 2 3 4 5 6 /ch/ end /sh/ /o:/ /g/ /sh/ /a:/ /th/ /o:/ /a:/  Figure 7-14: Finite State Grammar representation of 'caught'  If  n  is  the  length  of  the  inpw,  there  will  be  (n+1)  nodes,  the  (n+1)th [48{ [6{  being  the  end  node.  Each  node  indicates  the  position  of  a  character  in [48{ [6{  the inpw:  arcs  leaving  the  nodes  represent  phonemes  corresponding  to  the [48{ [6{  graphemes  commencing  with  that  character,  in  that  position.  Some  nodes [48{ [6{  have  no  arc  connected  to  them:  these  represent  the  characters  occurring [48{ [6{  in  tied  graphemes.  In  figure  7-14  the  character  'h'  in  the  string  'caught' [48{ [6{  corresponds  to  node  5.  The  grapheme  'gh'  is  marked  as  tied,  represented [48{ [6{  by  phonemes  /f/  and  /g/.  The  arcs  leaving  the  'g'  node,  node  4,  are [48{ [6{  labelled /f/ and  /g/. No  arc goes  to or  from the  'h' node,  node 5:  it  has [48{ [6{  no corresponding phoneme in this case. [48{ [6{  The  representation  of  the  phonemes  corresponding  to  a  word  may  also [48{ [6{  be given in a slightly different form, as in figure 7-15. [48{ [6{  In this format, the nodes  are numbered, and  the phonemes labelling  arcs [48{ [6{  from  the  node  are  given,  as  before.  The  arcs  are  replaced  by  explicit [48{ [6{  values  of  the  'next  node',  from  each  node.  In  the  finite  state  grammar [48{ [6{  representation,  a  sequence  of  phonemes  representing  the  string  is  found [48{ [6{  by traversing  the arcs  in the  direction  indicated from  node  1 to  the  end [48{ [6{  node,  selecting  any  one  arc  from  each  node,  and  moving  to  each  node [6{   162 [48{ [6{  inpw='caught' node = 1 phon = /k/ , next = 2 phon = /s/ , next = 2 phon = /sh/, next = 2 node = 2 phon = /o:/, next = 6 phon = /a:/, next = 6 phon = /O/ , next = 4 phon = /EU/, next = 4 phon = /aU/, next = 4 phon = /o:/, next = 4 phon = /a:/, next = 4 node = 3 node = 4 phon = /f/ , next = 6 phon = /g/ , next = 6 node = 5 node = 6 phon = /d/ , next = end phon = /t/ , next = end phon = /ch/, next = end phon = /sh/, next = end phon = /th/, next = end node = end  Figure 7-15: Alternative representation of 'caught'  connected  by  the  arcs,  in  turn.  The  labels  of  each  arc  traversed  are  the [48{ [6{  phonemes  representing  the  string.  All  possible  sequences  are  found  by [48{ [6{  following  each  arc  from  a  node,  in  order,  and  backtracking  until  all  paths [48{ [6{  between node  1  and the  end node  have been  traversed. In  the  alternative [48{ [6{  representation,  a  sequence  of  phonemes  is  found  by  starting  at  node  1, [48{ [6{  selecting  any  phoneme  from  the  list  for  that  node,  and  moving  to  the [48{ [6{  'next node' as indicated. Further phonemes are  selected, and moves to  the [48{ [6{  'next node' made, until the end node is encountered. All possible sequences [48{ [6{  are  found  by  selection  of  each  phoneme  from  each  node  list  in  turn, [48{ [6{  backtracking to cover all possible combinations. [48{ [6{  We might consider listing all  possible sequences of phonemes  representing [48{ [6{  the  inpw:  however,  even  for  this  simple  example,  there  are  180  possible [48{ [6{  sequences,  or  paths.  This  is  calculated  as  follows.  If  the  number  of [48{ [6{  possible  paths  between  nodes  1  and  2  is  rewritten  as  paths(1,2),  and  the [48{ [6{  path  from  node  1  to  end  via  nodes  2,4,  and  6  is  written  as  [1,2,4,6,end] [48{ [6{  : [6{   163 [48{ [6{  paths(1,2) = 3 paths(2,4) = 5 paths(2,6) = 2 paths(4,6) = 2 paths(6,end) = 5 total paths(1,end) = (3 x 5 x 2 x 5) [1,2,4,6,end] + (3 x 2 x 5) [1,2,6,end] = 150 + 30 = 180 possible paths [48{ [6{ The set of graphemes  and corresponding phonemes,  used in the  phoncode [48{ [6{  program is given in figure  B-1 in appendix B.  A "1" preceding a  grapheme [48{ [6{  indicates  that  it  is  tied;  a  "0"  indicates  that  it  may  be  segmented [48{ [6{  further; a  "*"  represents  the 'vc+e'  construction.  So,  in the  word  'cane' [48{ [6{  the  'a_e'  is  represented  in  the  grammar  by  '*a'.  Similarly,  in  'raise', [48{ [6{  'ai_e'  is  represented  by  '*ai'.  The  *  marked  graphemes  are  matched  if [48{ [6{  the  vowel  or  vowel  digraph  matches,  and  the  next  but  one  character [48{ [6{  after  it  is  'e'.  There  are  two  sets  of  representation  for  the  'vc  +  e' [48{ [6{  grapheme.  In  the  first,  the  segmentation  is  'v'  and  'c+e':  the  'e'  is  not [48{ [6{  represented  by  any  separate  phoneme.  In  the  finite  state  grammar [48{ [6{  representation,  the  arc  representing  the  consonant  preceding  the  'e' [48{ [6{  passes  directly  to  the  node  corresponding  to  the  character  following  the [48{ [6{  'e'. For example, the grapheme-phoneme correspondences for 'rime' are:  i_e => /i/ /aI/ /I/ /e/ /E/ m => /m/ r => /r/ /e:/ /E/ In  the  second  representation  of  'vc+e',  the  structure  is  split  'v'  and  'c' [48{ [6{  and  'e',  the  vowel  or  vowel  digraph,  the  consonant,  and  the  'e'  all  being [48{ [6{  represented independently: e => /i/ /IE/ /e/ /e:/ /E/ /I/ /eI/ /eE/ /ae/ i => /e:/ /E/ /aI/ /I/ /y/ /i/ /e/ /^/ m => /m/ r => /r/ /e:/ /E/ The corresponding finite state grammars for each are shown in figure 7-16 [48{ [6{  and figure 7-17. [48{ [6{These  two  sets  of  phoneme  representation  for  'rime'  cannot  be  con- [48{ [6{  sidered  in  one  finite  state  grammar  form:  the  arcs  between  nodes  2  and [48{ [6{  3  in  the  first  case  (  with  4  passed  over)  are  not  alternatives  for  the [48{ [6{  arcs  between  nodes  2  and  3  in  the  second  case.  There  is  no  legitimate [6{   164 [48{ [6{/i/ /r/ /aI/ /m/ 1 /e:/ 2 /I/ 3 4 end /E/ /e/ /E/ Figure 7-16: Finite state grammar representation of 'rime': v, c+e /e:/ /i/ /E/ /IE/ /aI/ /e/ /r/ /I/ /e:/ 1 /e:/ 2 /y/ 3 /m/ 4 /E/ end /E/ /i/ /I/ /e/ /eI/ /^/ /eE/ /ae/  Figure 7-17: Finite state grammar representation of 'rime': v, c, e  path  following  the  arcs  between  nodes  2  and  3  in  the  second  case,  along [48{ [6{  arc labelled  /aI/,  and then  following  arcs between  nodes  3 and  the  end  in [48{ [6{  the first case, along arc labelled /m/. [48{ [6{  The  two  sets  of  representation  for  'vc+e'  graphemes  can,  however,  be [48{ [6{  combined  in  the  'alternative  representation'  format.  In  figure  7-18  the [48{ [6{  full  representation  of  'rime'  is  given.  To  indicate  the  form  in  which  'e' [48{ [6{  is  passed  over,  the  'next  node'  of  the  vowel  has  a  value  of  the [48{ [6{  corresponding  (following)  consonant  node,  plus  100.  If  a  vowel  phoneme  is [48{ [6{  selected,  with  a  nextnode  value  greater  than  100,  the  node  for  the [48{ [6{  consonant  is  taken  (i.e.  the  next  node  less  100)  and  a  phoneme  value  is [48{ [6{  obtained: the  value  of the  consonant  nextnode  is increased  by  1,  and  this [48{ [6{  new value becomes the following node, i.e. the 'e' node is passed over. In [48{ [6{  this example if at node  2 the phoneme /^/  were selected (next=103),  then [48{ [6{  at  node  3  the  phoneme  /m/,  next=end,  would  be  chosen,  and  any  arcs [48{ [6{  from node 4 would be skipped over. [48{ [6{  Further examples of 'parses' of input words are given in appendix B. [6{   165 [48{ [6{  inpw = 'rime' node = 1 phon = /r/ next = 2 phon = /e:/ next = 2 phon = /E/ next = 2 node = 2 phon = /i/ next = 3 phon = /aI/ next = 3 phon = /I/ next = 3 phon = /e/ next = 3 phon = /E/ next = 3 phon = /e:/ next = 3 phon = /^/ next = 3 phon = /y/ next = 3 phon = /E/ next = 103 phon = /aI/ next = 103 phon = /I/ next = 103 phon = /i/ next = 103 phon = /e/ next = 103 node = 3 phon = /m/ next = 4 phon = /m/ next = end node = 4 phon = /i/ next = end phon = /IE/ next = end phon = /e/ next = end phon = /e:/ next = end phon = /E/ next = end phon = /I/ next = end phon = /eI/ next = end phon = /eE/ next = end phon = /ae/ next = end node = end  Figure 7-18: Representation of 'vc + e' graphemes [48{ [6{  7.3.9. Representing the dictionary [48{ [6{  Given  a  file  of  words  coded  phonemically,  a  dictionary  is  constructed [48{ [6{  representing  the  set  of  words  as  a  tree  where  each  node  stores  a [48{ [6{  phoneme value. The phoneme  representation used is  the integer one,  given [48{ [6{  in  figures  7-12  and  7-13.  Examples  of  words  and  their  phonemic [48{ [6{  representation are: he /h/ /i/ 80 50 help /h/ /e/ /l/ /p/ 80 53 120 160 helped /h/ /e/ /l/ /p/ /d/ 80 53 120 160 40 nine /n/ /ai/ /n/ 140 90 140 air /eE/ /r/ 12 170 These are represented in tree form in figure 7-19. [6{   166 [48{ [6{start of tree > 12 . . / >"air" > 170 / / . > 80 . . / >"he" > 50 . / . > 53 / . / > 140 / . / > 120 / . / >"help" > 90 / . / > 160 / . . >"nine" >"helped" > 140 / / . > 40 / / . Key to nodes: 1 2 3 4 1 = integer representing phoneme 2 = pointers to alternative phoneme at same level 3 = pointer to next phoneme in word 4 = pointer to word Figure 7-19: Tree representation of part of the phoncode dictionary  The dictionary  tree  to  represent these  five  words  would  be  built  up  as [48{ [6{  follows. The  first phoneme  in each of  the words 'he,  'help' and '  helped' [48{ [6{  is  the  same,  80,  and  so  would  all  be  stored  in  the  same  node.  The [48{ [6{  second  phoneme  is  50  in  'he'  and  53  in  'help'  and  'helped'.  The  right [48{ [6{  offspring of the  node storing  80 will  store 50:  the left offspring  of  this [48{ [6{  node  (50)  stores  the  'alternative  phoneme',  53.  For  other  words  with [48{ [6{  first phoneme  80  and  an  alternative  second  phoneme,  e.g.  'hill'  80  93  120, [48{ [6{  the  second  phoneme  is  added  in  the  left  offspring  of  the  node  storing [48{ [6{  53. The word 'he'  is represented by two  phonemes, both now stored:  the [48{ [6{  string 'he'  is  also  stored in  the  node  with the  last  phoneme in  the  word, [48{ [6{  50.  The third  phoneme  in 'help',  120,  is stored  in  the right  offspring  of [48{ [6{  that  with  53.  This  also  matches  the  next  phoneme  in  'helped'  and  does [48{ [6{  not  need  to  be  represented  twice.  The  right  offspring  of  this  node [48{ [6{  (storing 120) stores the fourth  phoneme, 160. This is the last phoneme  in [48{ [6{  'help', so  the string  'help'  is also  stored  in this  node.  The  final  phoneme [48{ [6{  in  'helped'  is  added  as  the  right  offspring,  storing  40  and  the  string [48{ [6{  'helped'. [48{ [6{The  word  'nine'  has  first  phoneme  140,  and  is  stored  as  the  left [6{   167 [48{ [6{  offspring  (alternative  first  phoneme)  to  the  node  containing  80. The [48{ [6{  remaining  phonemes  in  the  word  'nine'  are  stored  in  the  right  offspring [48{ [6{  of this  node  (storing  140).  The  string  'nine' is  stored  also with  the  final [48{ [6{  phoneme. [48{ [6{The  word  'air'  is  coded  12  170.  It  does  not  match  80  or  140,  the [48{ [6{  alternative  initial  phonemes.  It  must  be  added  as  another  alternative [48{ [6{  initial  phoneme,  as  a  left  offspring.  However,  the  left  offsprings  are [48{ [6{  ordered  in  increasing  order.  The  node  storing  12  becomes  the  root  node [48{ [6{  and the parent of the node storing 80, this latter node becoming its  left [48{ [6{  offspring:  the new  node is  inserted in  the tree.  The  second phoneme  is [48{ [6{  added as the  right offspring  of this  new node  and the string  'air'  stored [48{ [6{  also. [48{ [6{Further  words  may  be  added.  The  initial  phoneme  is  compared  with  each [48{ [6{  left offspring node, starting  with the current  root node.  If no match  is [48{ [6{  found  the  first  phoneme  of  the  word  is  inserted  in  the  correct  position [48{ [6{  in  the  ordering;  the  remaining  phonemes  in  the  word  are  added  as  right [48{ [6{  descendants.  If  a  match  is  found  the  next  phoneme  (the  second)  is [48{ [6{  compared  firstly  with  the  right  offspring  of  the  matched  phoneme  node, [48{ [6{  and  then  with  its  offsprings,  and  left  descendants  of  this  offspring. [48{ [6{  Again,  if  no  match  is  found,  a  new  node  is  inserted  in  the  correct [48{ [6{  position,  and  the  remaining  phonemes  added;  otherwise,  if  a  match  is [48{ [6{  found, the  next  phoneme is  compared  with firstly  the  right  offspring  and [48{ [6{  then  with  its  left  descendants.  The  process  is  repeated  until  all [48{ [6{  phonemes  in  the  word  are  matched,  or  have  been  added/inserted.  The [48{ [6{  word string  is  stored  in  the  same  node  as  the last  phoneme  in  the  word. [48{ [6{  In  this  way  a  file  of  phonemically  coded  words  may  be  stored  as  a  tree [48{ [6{  structured dictionary. [48{ [6{7.3.10. Matching the word and dictionary [48{ [6{  The  comparison  part  of  the  phoncode  program  takes  the  dictionary  of [48{ [6{  phonemically  coded  words,  represented  as  a  tree,  and  an  inpw,  parsed  to [48{ [6{  show  all  sequences  of  phonemes  that  could  have  produced  it.  The [48{ [6{  alternative  representation  of  the  parsed  inpw  (as  in  figure  7-15)  will  be [48{ [6{  used  here  to  describe  the  matching  process:  it  will  be  referred  to  as [48{ [6{  the inpw network. In the program the integer phonemic coding is used, [6{   168 [48{ [6{  Starting at the  root  of the  dictionary tree,  and the  first node  of  the [48{ [6{  inpw  network,  the  first  phoneme  listed  at  the  inpw  node  1  is  compared [48{ [6{  with the  dictionary  root  node phoneme.  The  phonemes  listed  at  the  inpw [48{ [6{  nodes are in  increasing order,  as are the  left descendant dictionary  nodes. [48{ [6{  If  the  compared  phonemes  match,  then  the  next  inpw  node  is  found [48{ [6{  (numbered  as  'next')  and  the  first  phoneme  of  this  new  inpw  node  is [48{ [6{  compared  with  the  right  offspring  phoneme  of  the  matched  dictionary [48{ [6{  node.  If  the  match  succeeds  then  the  comparison  process  is  repeated. [48{ [6{  If  the  match  fails,  the  inpw  node  phoneme  is  compared  with  the  left [48{ [6{  offspring  dictionary  node  phoneme.  Taking  account  of  the  fact  that [48{ [6{  dictionary  and  inpw  nodes  are  ordered,  whenever  a  match  fails  the  next [48{ [6{  inpw  node  or  dictionary  word  node  is  tried.  For  example,  when  an  inpw [48{ [6{  phoneme 80 fails to match a dictionary node phoneme 60, and also fails  to [48{ [6{  match  its  left  offspring  90,  it  is  pointless  to  compare  it  with  the  left [48{ [6{  descendants of the  node storing 90,  as the  phonemes will  be greater  than [48{ [6{  90.  Instead  the next  inpw  phoneme is  compared  with the  dictionary  node [48{ [6{  storing 90.  If  this inpw  phoneme is  120  then the  match would  again  fail, [48{ [6{  and  the  next left  offspring  of  the  dictionary  would  be  compared  instead. [48{ [6{  If  at  any  point  there  are  no  more  phonemes  listed  at  the  inpw  node,  or [48{ [6{  no  more  left  offspring  phonemes  to  compare,  then  the  program  back- [48{ [6{  tracks to the last pair of nodes matched,  fails that match, and  continues. [48{ [6{  If  matching  succeeds  such  that  the  end  node  of  the  inpw  network  is [48{ [6{  reached then  any  word  strings  stored  in  the  last  matched  dictionary  node [48{ [6{  are  saved  as  'possible  phonemic  equivalents'  for  the  inpw. All  paths [48{ [6{  through  the  inpw  network  are  tried  and  the  dictionary  is  searched [48{ [6{  exhaustively. All possible matches are found. [48{ [6{  In  appendix  B  a  set  of  examples  is  given:  the  phonetic  coding  of  a [48{ [6{  small example dictionary is shown; examples of  inpws, parsed to produce  all [48{ [6{  sets of  phonemes, are  given.  Successful matches  for each  inpw, from  the [48{ [6{  example dictionary, are given. [48{ [6{  The  editcost  and  phoncode  programs  described  in  this  chapter  were [48{ [6{  tested  on  a  corpus  of  errors  produced  by  children  with  spelling  dis- [48{ [6{  abilities. Additionally, the  editcost program  was used  and tested  by a  small [48{ [6{  group  of  children  (see  Study  2  in  chapter  6).  The  performance  of  each [48{ [6{  of the programs is discussed in chapter 8. [6{   I TABLE OF CONTENTS  7. Detail of the editcost and phoncode programs  131 [48{ [6{  7.1. Introduction 131  7.2. Calculating  the  minimum  cost  repair:  the  edit- 131  cost program 7.2.1. General overview 131  7.2.2. Shortlisting candidates from the diction- 132  ary: 'shortlist' 7.2.3. The editcost algorithm 134  7.2.4. Relation of the children's errors 139  7.2.5. Detail of the editcost program 142  7.3. Phonemic coding of words: the phoncode program 149  7.3.1. General overview 149  7.3.2. Related work 149  7.3.3. Design of the phoncode program 150  7.3.4. Defining the set of phonemes 151  7.3.5. Phoneme-grapheme correspondences 153  7.3.6. Phonemic coding of the dictionary 158  7.3.7. Detail of the phoncode program 160  7.3.8. Segmenting the word 160  7.3.9. Representing the dictionary 165  7.3.10. Matching the word and dictionary 167 [6{   II LIST OF FIGURES  Figure 7-1: First character alternatives for 133  shortlisting  Figure 7-2: Graph  representing  edit  operation 136  transformations  Figure 7-3: Costgraph showing the node labelling 137  Figure 7-4: Example  of  the  spelling  corrector  in 140  use Figure 7-5: Study 1: frequency of error types 142  Figure 7-6: Study  1:  frequency  of  types  of  spell- 143  ing errors  Figure 7-7: Example  of  candidates  and  mininum 144  editcost  Figure 7-8: Representation of 'special cases' 145  Figure 7-9: Weightings assigned to edit functions 146  Figure 7-10: Graph showing example weightings 148  Figure 7-11: Examples  of  use  of  the  phoncode 150  program  Figure 7-12: Representation of vowel phonemes 152  Figure 7-13: Representation of consonant phonemes 153  Figure 7-14: Finite  State  Grammar  representation 161  of 'caught'  Figure 7-15: Alternative representation of 162  'caught'  Figure 7-16: Finite  state  grammar  representation 164  of 'rime': v, c+e  Figure 7-17: Finite  state  grammar  representation 164  of 'rime': v, c, e  Figure 7-18: Representation of 'vc + e' graphemes 165  Figure 7-19: Tree  representation  of  part  of  the 166  phoncode dictionary