@part(discuss, root "thesis.mss")
@chapter(Performance)
@label(perform)

@section(Introduction)
@label(perform-intro)

In this chapter the performance of the editcost and phoncode programs, in correcting the errors made by the children, is assessed. If the programs are able to provide correction of the errors, then this provides evidence that:
@begin(alphabetize)
there are regularities in the children's errors;

information relating to these regularities can be used by the programs to reconstruct the correction from the error.
@end(alphabetize)
Where there is failure to correct an error, this can be attributed to one or more of the following:
@begin(alphabetize)
the errors are not sufficiently regular;

the programs do not have sufficient information about the regularities of the errors, i.e. the grammar or weightings are incomplete or incorrect;

the algorithm fails: sufficient information about existing regularities may be supplied to the program, but there is still a failure to reconstruct the correction.
@end(alphabetize)
Relating to these possible sources of error, the following question is also considered:
@begin(itemize)
Is a human judge able to perceive regularities in the errors, and would he/she then be able to provide corrections?
@end(itemize)
The editcost and phoncode programs are each considered in relation to the following questions:
@begin(enumerate)
Does the program succeed in correcting the errors made?

If there is a failure, is it due to:
@begin(alphabetize)
the errors being irregular,

the program data being insufficient or incorrect,

the methods of analysis being unsuitable?
@end(alphabetize)

When the programs succeed, what does this tell us about
@begin(alphabetize)
the individual children;

the methods of correction?
@end(alphabetize)
@end(enumerate)

@newpage
@section(Performance of the editcost program)
@label(perform-editc)

The performance of the editcost program was initially assessed on two sets of data:
@begin(enumerate)
The words used with the editcost program in study 2 (S2);

The complete set of errors made in studies 1 and 2 (S1, S2).
@end(enumerate)

@subsection(Testing editcost in use - Study 2)
@label(peditc-inuse)

The editcost program was used in study 2 (S2), as described in chapter @ref(assumptions). Each child used the program whenever he wished to check the spelling of a word (the input word). In some cases the word that was checked was correctly spelt; in other cases it was misspelt. The input word was compared with the set of words shortlisted from the dictionary. The dictionary consisted of the words in the generaldict, plus the topic dictionary words for the particular session. The four words with the lowest minimum repair cost were found and offered as possible corrections. Whenever a word was checked, the outcome could be categorised in one of three ways:
@begin(romanize)
the correction for the input word was both in the dictionary and offered as a possible correction;

the correction was in the dictionary, but was not offered as a possible correction;

the correction was not in the dictionary.
@end(romanize)
The frequency of occurrence of each category (i, ii, iii), for each group of children taking part in S2, is given in figure @ref(pedit-one). Group 1 comprises FR, DV, TE and DR; group 2, DI, MA, GR and ST.
@begin(figure)
@begin(verbatim)
                  i              ii             iii          iv
              correction     correction     correction
             in dictionary  in dictionary   not in the     total
              and offered    not offered    dictionary

Group 1           229             27             55         311
Group 2            77              3             39         119
Both groups       306             30             94         430
@end(verbatim)
@caption(Editcost in use: outcomes of checking)
@tag(pedit-one)
@end(figure)
These results can be re-expressed as percentages.
@begin(verbatim)
Percentage correction offered, overall: i/total
     group 1     group 2     both groups
      73.6%       64.7%        71.2%

Percentage correction offered, when in the dictionary: i/(i+ii)
     group 1     group 2     both groups
      89.5%       96.3%        91.1%

and from this, percentage correction @b(not) offered when in the
dictionary: ii/(i+ii)
     group 1     group 2     both groups
      10.5%        3.7%         8.9%

Percentage of corrections not in the dictionary: iii/iv
     group 1     group 2     both groups
      17.7%       32.8%        21.9%
@end(verbatim)
From these results it can be seen that the program was able to offer the correction for a large percentage (>90%) of the words checked, assuming that they were in the dictionary. The correction algorithm was more successful for group 2 than for group 1 (96% vs. 89% of corrections offered). However, group 2 attempted to check the spelling of a larger percentage of words that were not in the dictionary. These results may also be considered for individual children, as in figures @ref(pedit-offered) and @ref(pedit-notoff).
@begin(figure)
@begin(verbatim)
             i         i/iv       i/(i+ii)          iv
           number    % of the   % of those in   total number
          corrected    total    the dictionary     checked
Group 1
   FR        69       71.9%         92%              96
   DV        50       82%          87.7%             61
   TE        72       74.2%        90%               97
   DR        38       67.7%        86.4%             57
Group 2
   DI        10       66.7%        100%              15
   ST        20       62.5%        100%              32
   MA        27       65.9%        96.4%             41
   GR        20       64.5%        90.9%             31

Total       306                                     430
@end(verbatim)
@caption(Editcost in use: individual results - correction offered)
@tag(pedit-offered)
@end(figure)
@begin(figure)
@begin(verbatim)
                ii                    iii                 iv
          correction in         correction not          total
          the dictionary      in the dictionary         number
         freq   % of total    freq   % of total         checked
Group 1
   FR      6       6.3%        21      21.9%              96
   DV      7      11.5%         4       6.6%              61
   TE      8       8.2%        17      17.5%              97
   DR      6      10.5%        13      22.8%              57
Group 2
   DI      -       0%           5      33.3%              15
   ST      -       0%          12      37.5%              32
   MA      1       2.4%        13      31.7%              41
   GR      2       6.5%         9      29%                31

Total     30                   94                        430
@end(verbatim)
@caption(Editcost in use: individual results - correction not offered)
@tag(pedit-notoff)
@end(figure)
Results for groups 1 and 2 were compared using the Mann-Whitney U test (one-tailed).
This test was also used to assess differences in performance of groups 1 and 2 in the first study, and for all other group comparisons in this chapter. From figure @ref(pedit-offered) it can be seen that group 2 showed a higher percentage correction of words in the dictionary than group 1 (p<0.05). The number of corrections offered, taken as a percentage of the total number of words checked, was higher for group 1 (p<0.02). Within groups there is little difference in the percentage of errors corrected (the range being <10%). Between groups there is less than 15% difference between the highest (100% for ST and DI) and the lowest (86.4% for DR). For the majority of cases where the correction was not offered (c. 75%), the correction was not in the dictionary (see figure @ref(pedit-notoff)). The exception to this was the errors made by DV; however, he showed the highest correction success rate overall. Group 1 had a significantly greater percentage of errors that were in the dictionary and not corrected (p<0.02) than group 2; group 2 showed a greater percentage of errors for which the correction was not in the dictionary (p<0.02).

The possible corrections were ordered by cost, the lowest cost being offered first. The intended word, if it was included in the possible corrections, could be the first word offered (off(1)) or in the second, third or fourth position (off(2/3/4)). The corrections offered were categorised according to whether they were off(1) or off(2/3/4). For each group the percentage of first words offered was:
@begin(verbatim,group)
              Group 1     Group 2     Both groups
off(1)         76.4%       97.4%        81.7%
off(2/3/4)     23.6%        2.6%        18.3%
@end(verbatim)
For group 1 the intended correction was the first word offered in three-quarters of cases. For group 2 it was off(1) in more than 97% of cases. Overall, in more than four-fifths of cases the intended correction was offered as the possible correction with the least cost repair.
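The costing and ranking just described can be sketched as follows. This is a minimal illustration in Python, not the thesis's actual program: it uses uniform unit costs, whereas the editcost program used weightings derived from the observed frequencies of the children's edit operations, and the function and parameter names here are invented for the example.

```python
# Sketch of a minimum repair cost (a weighted edit distance, computed
# by dynamic programming), and of offering the four lowest-cost
# dictionary words as possible corrections. Uniform weights are
# illustrative only; the real program weighted operations by observed
# error frequencies.

def edit_cost(error, word, ins=1.0, dele=1.0, sub=1.0):
    """Minimum cost of repairing `error` into `word`."""
    m, n = len(error), len(word)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if error[i - 1] == word[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,       # delete a letter
                          d[i][j - 1] + ins,        # insert a letter
                          d[i - 1][j - 1] + cost)   # match or substitute
    return d[m][n]

def offer_corrections(error, shortlist, k=4):
    """The k shortlisted words with the lowest minimum repair cost,
    lowest cost (the off(1) position) first."""
    return sorted(shortlist, key=lambda w: edit_cost(error, w))[:k]
```

A call such as offer_corrections("hoted", shortlist) returns the four candidate spellings with the least repair cost under these weights, the first of which corresponds to the off(1) position.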
On a number of occasions, if a word was checked and the correction not offered, the child was encouraged to re-check it with a different spelling: "one closer to the correct word". These rechecks are included in the above categories, according to their outcomes. For group 1, 21 of the 27 words (in category ii) were rechecked with a different spelling. For 19 of these, the correction was found and offered. For group 2, for all 3 words in category ii, the correction was offered when rechecked. So, for the combined groups, of the 30 words for which the correction was not offered, 24 were rechecked with different spellings; 22 of these rechecked words produced the required spelling. When the required word was not in the dictionary, the investigator could be asked to add it. The initial spelling could then be rechecked. Twenty-six words were added and rechecked, 11 from group 1 and 15 from group 2. With the exception of 1 word from group 2, the corrections were offered for all added and rechecked words. The words that were not corrected successfully by the algorithm are discussed in more detail at the end of this section, in subsection @ref(pedit-disc12).

@subsection(Testing on the corpus containing Study 1 and Study 2 errors)
@label(pedit-test12)

The editcost program was tested on the corpus of errors made by the children in both studies. These errors included those checked with the editcost program (S2), those made when writing (S2), and those made when typing (S1 and S2). Chapter @ref(assumptions) gives details of the two studies. The dictionary used was set up specifically for testing. Whilst the dictionary that had been used in each of the S2 sessions contained 750 to 1000 words, the testing dictionary contained more than 2000 words. It comprised the general dictionary, plus all the topic dictionaries and all the corrections of errors (with duplicates removed).
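The assembly of such a testing dictionary can be sketched as follows; a minimal Python illustration with hypothetical word lists, showing the merge with duplicates removed.

```python
# Sketch of building the testing dictionary: the general dictionary,
# plus all topic dictionaries, plus all corrections of errors, with
# duplicates removed. Function name and data are illustrative.

def build_testing_dictionary(general, topic_dicts, corrections):
    words = set(general)              # general dictionary words
    for topic in topic_dicts:         # one word list per topic
        words.update(topic)
    words.update(corrections)         # corrections of the errors made
    return sorted(words)              # the set removes duplicates
```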
Each error was checked using the editcost program, and the five dictionary words with the lowest minimum cost repair were recorded. The reason for recording the fifth word was to test whether the performance would be substantially improved if it were included in the possible corrections. In only two cases in S1, and 15 cases in S2, was the correction the fifth option. This represents 2% of the total number of errors. In these results a correction offered as the fifth option is not counted as a success. For each child the following information was recorded:
@begin(alphabetize)
the number of errors for which the correction was offered;

the percentage of errors for which the correction was offered;

the number of corrections that were offered as first option (off(1));

the percentage of corrections that were offered as first option;

the total number of errors made.
@end(alphabetize)
The results of testing the errors made in S1 are given in figure @ref(pedit-ps1).
@begin(figure)
@begin(verbatim)
            a.          b.          c.         d.         e.
          number        %         number    % of a.      total
         corrected   corrected    off(1)     off(1)     number
Group 1
   GQ       15        100%          13       86.6%        15
   JM       38        100%          33       86.8%        38
   MW       29         82.9%        24       82.8%        35
Group 2
   LB       24         92.3%        22       91.7%        26
   NM       30         90.9%        28       93.3%        33
   CM       46         76.7%        34       73.9%        60
   SS       18         64.3%        14       77.8%        28

Group 1     82         93.2%        70       85.4%        88
total
Group 2    118         80.3%        98       83.1%       147
total
Both       200         85.1%       168       84%         235
groups
@end(verbatim)
@caption(Editcost tested on Study 1 errors)
@tag(pedit-ps1)
@end(figure)
The program offered the correction for 85% of errors, over both groups. 93.2% of errors made by group 1 were corrected, whilst 80.3% of corrections were offered for group 2 (the difference is not significant). It was least successful for CM and SS, offering only 64% of corrections in the case of SS (the reasons for this failure are discussed in section @ref(pres-indiv)). It was most successful for GQ and JM, providing 100% correction. In 84% of cases where the correction was offered it was the first option, i.e. it had the lowest edit cost.
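The recording and tallying just described can be sketched as follows; a minimal Python illustration with a hypothetical data layout (each error paired with the ranked candidate list recorded for it), not the actual testing procedure.

```python
# Sketch of tallying the test results: for each error the five
# lowest-cost candidates were recorded; a correction counts as
# offered only if it is among the first four, and as off(1) if it
# has the lowest cost of all. Data layout is hypothetical.

def tally(results):
    """results: list of (correction, candidates) pairs, where
    `candidates` is the recorded list, lowest cost first."""
    offered = sum(1 for corr, cands in results if corr in cands[:4])
    off1 = sum(1 for corr, cands in results if cands and cands[0] == corr)
    total = len(results)
    return {"offered": offered,                      # column a.
            "pct_offered": 100.0 * offered / total,  # column b.
            "off1": off1,                            # column c.
            "total": total}                          # column e.
```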
Note that the program weightings were based on the frequency of errors made by this group, and therefore a high percentage of corrections offered was to be expected. The results of testing the errors made in S2 are given in figure @ref(pedit-ps2).
@begin(figure)
@begin(verbatim)
            a.          b.          c.         d.         e.
          number        %         number    % of a.      total
         corrected   corrected    off(1)     off(1)     number
Group 1
   FR      103         83.7%        81       78.6%       123
   DV       65         66.3%        39       60%          98
   TE      106         80.9%        66       62.3%       131
   DR       55         63.2%        41       74.5%        87
Group 2
   GR       45         81.2%        34       61.8%        55
   DI       21         95.5%        19       86.4%        22
   MA       39         92.9%        33       78.6%        42
   ST       39         92.9%        33       78.6%        42

Group 1    329         74.9%       227       69%         439
total
Group 2    144         89.4%       119       82.6%       161
total
Both       473         78.8%       346       73.2%       600
groups
@end(verbatim)
@caption(Editcost tested on Study 2 errors)
@tag(pedit-ps2)
@end(figure)
The same information is given as for S1. Corrections were offered for nearly 79% of errors, over both groups. The first option offered was the correction in 73% of errors overall.

@subsection(Errors which the editcost program failed to correct)
@label(pedit-disc12)

The errors for which the editcost program did not offer corrections will now be considered, and reasons for this failure discussed. The sets of errors on which the program failed are given in figures @ref(errors-useS2) (use by S2), @ref(errors-editS1) (testing on S1 errors), @ref(errors-editS21) (testing on S2, group 1 errors) and @ref(errors-editS22) (testing on S2, group 2 errors).
@newpage
@begin(fullpagefigure)
@begin(verbatim)
FR                       DV
eyes irs                 brown blounm
eyes ias                 hair hear
head hard                of ove
saw sore                 buried beray
saw sour                 of ovre
about ubout              dalgleish dugle

TE
magazine magen           gold goib
                         DR
through thr              strachan stacking
through thro             strachan cracking
called golld             instructions inchuns
bunny bune               instructions chuns
any ene                  turtle trener
conservative cunjnc      turtle turend
conservative sevter

GR                       MA
computer ucnputer        perq pirck
computer unconputer

DI and ST - no uncorrected errors
@end(verbatim)
@caption(Using editcost - Study 2: errors for which correction not offered)
@tag(errors-useS2)

@begin(verbatim)
MW                       NM
won one                  paw po
threw through            change caing
a are                    we wer
change gh

CM
wrote nrote              called colde
wrote krote              commercial commrs

LB
university ynusty        quarry qorie
can came                 fool full
draw droy

SS
that ther                new neea
recall tecall            had hat
change calde             draw john
night nairt              make mosea
the whe                  make msea
haunted hoted            time the
through thro             get cedt
came cane                television tahgfring
hear haes                tv talhfi
horror horey

GQ and JM - no uncorrected errors
@end(verbatim)
@caption(Testing editcost - Study 1: errors for which correction not offered)
@tag(errors-editS1)
@end(fullpagefigure)
@begin(fullpagefigure)
@begin(verbatim)
Group 1
FR                       TE
hair hera, hare          weight wait
eyes irs, ias            gold goib
head hard, herd          through thro, thr
silver isilver           came gam
saw sore, sour           called golld
where warh               come conn
of fo                    seen cn, cen
piece pees, peces,       motor moterdf
  peesc                  dangerously bandrie
about ubout              bunny bune
showed sods, shodes,     conservative cunjnc, sevter
  sodes                  plastic plasek
wyse wizes               of over
put pit                  work wrk
any ane

DV
would wob                eyes liss, isse
we wie                   brown bloum, blounm
had thad                 hair hear
just tust                bye bi
walked workt             island illing
took tike                of over, ove,
dark barck                 ov, ovre
buried beray

DR
ghost goss               have haft, half
dalgleish dugle          light like
have uve                 the then
for of                   check pellrs
goodbye boodbye          packed park, par, part,
magazine magen             parck, pakt
interview intovue        won win
soldiers soildde         strachan stacking,
plastic plaiked            cracking
stairs stared, stare     nicholas nickris, nickis
down dame                turtle turned, trener,
pictures pieces            turend
could cood               stadium stamun
bit bid                  brazil brasur
talk tock                picture ping
drill drule              instructions inchins, incruns,
more mor                   chuns, inchuns
photo fot, front         robot romdt, rodet,
white withe                rodert, roder, romdert
dead beb                 dog bog
@end(verbatim)
@caption(Testing editcost - Study 2, group 1: errors for which correction not offered)
@tag(errors-editS21)
@end(fullpagefigure)
@begin(figure)
@begin(verbatim)
Group 2
GR                       DI
called could             specifications spec
straight strat           try trie

MA
who how                  perq purk, pirck
uses yous                alternatives alteration
a and                    put pit

ST
any ena                  tune chune
computer unconputer,     procedures prgrame
  ucnputer               so sow
@end(verbatim)
@caption(Testing editcost - Study 2, group 2: errors for which correction not offered)
@tag(errors-editS22)
@end(figure)
An error for which the editcost program does not offer the correction will be referred to as a @b(non-corrected error). The set of non-corrected errors resulting from the use of the program in S2 is a subset of those resulting from testing all the S2 errors, and so this subset will not be considered separately. Non-correction of an error indicates the inability of the program to reconstruct the correction from the error. This could be due to:
@begin(enumerate)
errors being so irregular that the correction cannot be inferred;

program data being incomplete or incorrect, that is:
@begin(alphabetize)
omission of the correction in the shortlisting process;

the weightings used being inappropriate;

the costing function being inappropriate;
@end(alphabetize)

the description of the errors in terms of format (and hence analysis in terms of edit operations) being inadequate.
@end(enumerate)
The second and third of these possible causes of failure will be considered first.
Inclusion of a dictionary word in the shortlist, for consideration by the costing algorithm, was dependent upon the length and first character(s) of the word. In a number of cases the desired correction was omitted from the shortlist. Non-correction of the misspelling is attributable to a failure in shortlisting for:
@begin(verbatim)
 9 out of  35 non-corrected errors in S1
24 out of 127 non-corrected errors in S2
33 out of 162 non-corrected errors in total
@end(verbatim)
If further alternatives were permitted for first-letter confusions, more corrections could be included in the shortlist. For example, the alternatives a for u (=a/u), e/i, g/b, wh/ho, t/ch and h/th would reduce the omissions from the shortlist by 6. Additionally, if a difference of 4 characters were permitted between word and error, for words of less than 10 characters, then a further 5 words would be shortlisted. However, the program does succeed in providing the correction for 85.1% of S1 errors and for 78.8% of S2 errors when tested, and for 91.1% of S2 errors checked (for which the correction is available) when the program is in use (see subsection @ref(peditc-inuse)): more than 80% of errors tested overall. For a large number of errors, therefore, it seems that their description in terms of format, the assignment of weightings and the calculation of costs are sufficient to enable reconstruction. It may be that some of the spellings are so bizarre that they conform to no apparent pattern: the correction will not be recognizable from the error. To test this, the set of non-corrected errors (for S1 and S2) was given to an independent judge for correction. The judge was asked to write what he thought would be the correction for each misspelling alongside it; to mark with a tick any word that he thought was spelt correctly (i.e. words misspelt as other words would be marked); and to mark with a cross any word for which he could suggest no correction.
Having corrected or marked all the words presented, the judge was then told that, in fact, all the words were misspellings. He was then asked to write alongside each ticked word (the apparently correct words) what he thought the spelling could be, knowing that it was not the word given. The judge's corrections were then compared with the intended corrections, and all discrepancies noted. If the judge had succeeded in correcting all the errors where the editcost program failed, this would suggest that improvements to the program were needed. On the other hand, if the judge failed to correct the majority of errors (i.e. they were unrecognizable), then this would indicate too little consistency, or a lack of identifiable pattern, in the errors made. That more than 80% of errors were successfully corrected by the program indicates that there is an identifiable pattern in the majority of errors. It might be argued that the judge could fail to correct the errors because of unfamiliarity with the vocabulary used by the children in the two studies. This was overcome by using the same judge who had already seen all sets of error-correction pairs (see subsection @ref(pphon-disc12)). This meant that the judge had seen all the errors before, with their corrections, though in a different order (errors were presented in a random order). He was also reminded of the topics dealt with in the children's writing. Despite this, he failed to recognize a substantial number of errors, though he did indicate that his previous experience had slightly influenced the corrections offered.
Outcomes of the comparison of the judge's corrections with the intended corrections are classed as follows:
@begin(enumerate)
the correction provided by the judge was the intended word (=C);

no correction could be suggested (=NC);

the wrong correction was suggested (=WC);

the misspelling was taken as the correct spelling of another word initially, but was later reconsidered and classed in one of the above categories (=IC, INC, IWC).
@end(enumerate)
A summary of the results is given, for each group, in figure @ref(boog-editsum). The total frequencies for the categories C, NC and WC are given. Included are those errors initially thought to be correct (the frequencies of which are given in brackets, for each category).
@begin(figure)
@begin(verbatim)
                            No          Wrong
            Correction   Correction   Correction     Total
                C            NC           WC
Study 1
  Group 1      2(1)         2(0)         2(2)         6(3)
  Group 2      5(2)         9(1)        15(1)        29(4)
  S1 total     7(3)        11(1)        17(3)        35(7)
Study 2
  Group 1     29(6)        38(11)       43(7)       110(24)
  Group 2     12(2)         3(3)         2(0)        17(5)
  S2 total    41(8)        41(14)       45(7)       127(29)

Total         48(11)       52(15)       62(10)      162(36)
@end(verbatim)
@caption(Comparison of judge's corrections with intended corrections - summary)
@tag(boog-editsum)
@end(figure)
Results are also given, for each child, in figures @ref(boog-edit1) and @ref(boog-edit2).
@begin(fullpagefigure)
@begin(verbatim)
           C     IC     NC    INC     WC    IWC    Total
Group 1
  GQ       -      -      -     -       -     -       0
  JM       -      -      -     -       -     -       0
  MW       1      1      2     -       -     2       6
Group 2
  LB       -      1      -     -       1     -       2
  NM       -      -      1     -       2     -       3
  CM       3      -      3     -       7     1      14
  SS       -      1      4     1       4     -      10

Group 1    1      1      2     0       0     2       6
total
Group 2    3      2      8     1      14     1      29
total
Both       4      3     10     1      14     3      35
groups
@end(verbatim)
@caption(Comparison of judge's corrections with intended corrections - Study 1)
@tag(boog-edit1)

@begin(verbatim)
           C     IC     NC    INC     WC    IWC    Total
Group 1
  FR       7      2      3     -       4     4      20
  DV       7      1      5     5      12     3      33
  TE       6      1     10     1       7     -      25
  DR       3      2      9     5      13     -      32
Group 2
  GR       5      2      -     2       1     -      10
  DI       1      -      -     -       -     -       1
  MA       2      -      -     1       -     -       3
  ST       2      -      -     -       1     -       3

Group 1   23      6     27    11      36     7     110
total
Group 2   10      2      0     3       2     0      17
total
Both      33      8     27    14      38     7     127
groups
@end(verbatim)
@caption(Comparison of judge's corrections with intended corrections - Study 2)
@tag(boog-edit2)
@end(fullpagefigure)
The judge corrected 29.6% of the non-corrected errors. He failed to offer a correction for 32.1% of the errors and offered alternatives for 38.3%. Of those corrected, 11 had initially been believed to be alternative words, spelt correctly, and were left uncorrected. At the first attempt, therefore, only 22.8% of errors were successfully corrected. Overall the judge failed to correct 70.4% of errors. Thus for 70.4% of the errors that the program failed to correct, the human judge also failed to identify the correction, despite knowing that all the words were errors and having previously seen the error/correction pairs. Additionally, of the 33 words that the program failed to shortlist, 20 presented difficulty to the judge. The judge experienced particular difficulty with the errors made by CM and SS (S1, group 2) and by DV, TE and DR (S2, group 1): he failed to correct between 72% and 90% of them. This suggests that these errors were in some way unrecognizable.
Summarising the results for the editcost program overall:
@begin(itemize)
the program succeeded in correcting
@begin(alphabetize)
85.1% of the errors made in Study 1, when tested;

78.8% of the errors made in Study 2, when tested;

91.1% of the errors made in Study 2 (for which the correction was available) when the program was in use;

80.6% of the errors tested (a + b) overall.
@end(alphabetize)

of those it failed to correct (162 errors)
@begin(romanize)
48 were corrected by the judge (therefore attributable to failure on the part of the program), accounting for 5.7% of errors overall;

114 were not corrected by the judge (therefore attributable to insufficient regularities shown in the errors), accounting for 13.7% overall.
@end(romanize)
@end(itemize)

@newpage
@section(Performance of the phoncode program)
@label(perform-phonc)

@subsection(Testing on Study 1 and Study 2 errors)
@label(pphon-test12)

The performance of the phoncode program was assessed on the sets of errors made in S1 and S2. The same testing dictionary was used for testing both the phoncode and editcost programs. The dictionary was coded phonemically for testing with the phoncode program (see chapter @ref(detail), section @ref(dict-phon)). Each error was input to the phoncode program. Words offered by the program as `phonetic equivalents' were recorded, and it was noted whether or not the correction for the error was included in these words. The following information was obtained:
@begin(alphabetize)
the number and percentage of errors for which the correction is included in the words offered by the program;

the number and percentage of errors for which the correction is not offered;

the total number of errors made.
@end(alphabetize)
The results of testing the errors in S1 are given in figure @ref(pphon-ps1).
@begin(figure)
@begin(verbatim)
              a.              b.              c.
          correction      correction not    total
          included in     included in      number of
         words offered   words offered      errors
          freq     %      freq     %
Group 1
   GQ      10    66.7%      5    33.3%        15
   JM      31    81.6%      7    18.4%        38
   MW      21    60.0%     14    40.0%        35
Group 2
   LB      23    88.5%      3    11.5%        26
   NM      21    63.7%     12    36.3%        33
   CM      30    50.0%     30    50.0%        60
   SS       8    28.6%     20    71.4%        28

Group 1    62    70.5%     26    29.5%        88
Group 2    82    55.8%     65    44.2%       147
1 & 2     144    61.3%     91    38.7%       235
@end(verbatim)
@caption(Phoncode tested on Study 1 errors)
@tag(pphon-ps1)
@end(figure)
The percentage of errors for which the correction is included in the words offered, for all children, is 61.3%. The overall percentage for group 1 is higher than that for group 2, though the difference is not statistically significant. The lowest percentage offered is 28.6% for SS (more than 20% lower than for any other child). CM is next lowest with 50% corrected. MW, NM and GQ all fall in the 60 to 67% range. The highest percentage corrections are for LB and JM, with 88.5% and 81.6% respectively. Information is given for each child, and for each group of children. The results of testing the errors made in S2 are given in figure @ref(pphon-ps2). The same information is provided for this group.
@begin(figure)
@begin(verbatim)
              a.              b.              c.
          correction      correction not    total
          included in     included in      number of
         words offered   words offered      errors
          freq     %      freq     %
Group 1
   FR      78    63.4%     45    36.6%       123
   DV      43    43.8%     55    56.2%        98
   TE      71    54.2%     60    45.8%       131
   DR      30    34.5%     57    65.5%        87
Group 2
   GR      35    63.6%     20    36.4%        55
   DI      16    72.7%      6    27.3%        22
   MA      33    78.6%      9    21.4%        42
   ST      29    69.0%     13    31.0%        42

Group 1   222    50.5%    217    49.5%       439
Group 2   113    70.2%     48    29.8%       161
1 & 2     335    55.9%    265    44.1%       600
@end(verbatim)
@caption(Phoncode tested on Study 2 errors)
@tag(pphon-ps2)
@end(figure)
The overall percentage correction for both groups is 55.9%. The group 2 children all have higher percentage corrections than the group 1 children: the group 2 total is 70.2%, while that for group 1 is 50.5% (p<0.02).
The percentage corrected, for all children, ranges from 34.5% to 78.6%, distributed fairly evenly through the whole range.

@newpage
@subsection(Errors which the phoncode program failed to correct)
@label(pphon-disc12)

Of the 835 misspellings made overall, the phoncode program failed to correct 356 (42.6%). This failure may be attributed to one or more of the following:
@begin(enumerate)
the misspellings and corrections were not "phonetically equivalent";

the program failed to find the "phonetically equivalent" correction for the misspelling, due to:
@begin(romanize)
the phoneme-grapheme grammar being incorrect or incomplete;

the segmentation algorithm being incorrect;

the words being incorrectly coded in the phonetically coded dictionary.
@end(romanize)
@end(enumerate)
In order to determine which of the misspellings might be considered phonetic and which non-phonetic, a judge was used to classify them. This was the same person who was later used to judge the errors that the editcost program failed to correct (see subsection @ref(pedit-disc12)). The judge was a male Scottish teacher, with a knowledge of linguistics. He was very familiar with the dialect used by the children in the two studies. The judge was given the complete set of misspellings and corrections, for both sets of children. He was asked to look at each misspelling/correction pair and to decide whether or not the two could be considered phonetically equivalent: if both were read aloud, would they be indistinguishable? After a practice on a set of 'misspellings' and 'corrections' taken from Cohen (1984), the definition was further refined to "both spellings being interpreted as the same word by a local native speaker, when read aloud; the pronunciation of misspellings to be determined by the common pronunciation of graphemes in different contexts". The judge, therefore, was permitted to consider the same misspelling as having more than one pronunciation.
Each error was marked by the judge as either phonetic or non-phonetic. The results of this classification and those of the phoncode program were compared, and the outcomes classified in the following categories:
@begin(alphabetize)
correction included in words offered and error judged to be 'phonetic' (C/Ph) = agreement;

correction included in words offered and error judged to be 'non-phonetic' (C/NPh) = disagreement;

correction not included in words offered and error judged to be 'phonetic' (NC/Ph) = disagreement;

correction not included in words offered and error judged to be 'non-phonetic' (NC/NPh) = agreement;

total number of errors.
@end(alphabetize)
Results of the comparison of the judge's classification and the program performance are given in figures @ref(phoncomp-ps1) and @ref(phoncomp-ps2).
@begin(figure)
@begin(verbatim)
            a         b         c         d          e
           C/Ph      C/NPh     NC/Ph     NC/NPh     total
          (% of     (% of     (% of     (% of     number of
          total)    total)    total)    total)     errors
Group 1
   GQ        6         4         0         5         15
           (40%)    (26.7%)    (0%)     (33.3%)
   JM       22         9         0         7         38
          (57.9%)   (23.7%)    (0%)     (18.4%)
   MW       16         5         1        13         35
          (45.7%)   (14.3%)   (2.9%)    (37.1%)
Group 2
   LB       14         9         0         3         26
          (53.9%)   (34.6%)    (0%)     (11.5%)
   NM       17         4         1        11         33
          (51.6%)   (12.1%)    (3%)     (33.3%)
   CM       15        15         4        26         60
           (25%)     (25%)     (6.7%)   (43.3%)
   SS        2         6         2        18         28
           (7.1%)   (21.5%)    (7.1%)   (64.3%)

Group 1     44        18         1        25         88
total      (50%)    (20.5%)    (1.1%)   (28.4%)
Group 2     48        34         7        58        147
total     (32.7%)   (23.1%)    (4.8%)   (39.4%)
Both        92        52         8        83        235
groups    (39.2%)   (22.1%)    (3.4%)   (35.3%)
@end(verbatim)
@caption(Comparison of errors corrected by the Phoncode program with those judged to be 'phonetic' - Study 1)
@tag(phoncomp-ps1)
@end(figure)
@begin(figure)
@begin(verbatim)
            a         b         c         d          e
           C/Ph      C/NPh     NC/Ph     NC/NPh     total
          (% of     (% of     (% of     (% of     number of
          total)    total)    total)    total)     errors
Group 1
   FR       52        26        12        33        123
          (42.3%)   (21.1%)    (9.8%)   (26.8%)
   DV       31        12         3        52         98
          (31.6%)   (12.2%)    (3.1%)   (53.1%)
   TE       42        29         7        53        131
          (32.1%)   (22.1%)    (5.3%)   (40.5%)
   DR       20        10         4        53         87
           (23%)    (11.5%)    (4.6%)   (60.9%)
Group 2
   GR       25        10         5        15         55
          (45.4%)   (18.2%)    (9.1%)   (27.3%)
   DI       14         2         1         5         22
          (63.6%)    (9.1%)    (4.6%)   (22.7%)
   MA       28         5         0         9         42
          (66.7%)   (11.9%)    (0%)     (21.4%)
   ST       20         9         3        10         42
          (47.6%)   (21.4%)    (7.2%)   (23.8%)

Group 1    145        77        26       191        439
total      (33%)    (17.5%)    (6%)     (43.5%)
Group 2     87        26         9        39        161
total      (54%)    (16.2%)    (5.6%)   (24.2%)
Both       232       103        35       230        600
groups    (38.7%)   (17.2%)    (5.8%)   (38.3%)
@end(verbatim)
@caption(Comparison of errors corrected by the Phoncode program with those judged to be 'phonetic' - Study 2)
@tag(phoncomp-ps2)
@end(figure)
For Study 1, group 1, the agreement between the program and the judge is 78.4% (= a + d = 50% + 28.4%), and for group 2 it is 72.1% (= 32.7% + 39.4%): that is, 74.5% (= 39.2% + 35.3%) overall. Groups 1 and 2 showed no significant differences when compared in any of the categories (a), (b), (c), (d). Most disagreement between judge and program occurred in the C/NPh category (22.1%): misspellings classed as non-phonetic by the judge were corrected by the program. Only 1.1% of group 1 errors and 4.8% of group 2 errors (3.4%, or 8 errors, overall) were classed as phonetic but not corrected. Of the errors made, 39.2% were both classed as phonetic (by the judge) and corrected (by the phoncode program). For Study 2, 77% agreement is shown between judge and program (group 1 - 76.5%; group 2 - 78.2%). Groups 1 and 2 differed in the frequency of errors classed in categories (a) and (d): group 2 had more errors classed as phonetic and corrected than group 1 (p<0.02) and fewer non-phonetic and non-corrected errors (p<0.05). No significant differences were shown between the two groups in the categories for which judge and program disagreed. Over the two groups 5.8% of errors were classed as phonetic but not corrected (group 1 - 6%; group 2 - 5.6%). In all, 38.7% of errors were classed as phonetic and corrected, with a further 17.2% corrected but classed as non-phonetic. The combined figures for both studies give 76.3% agreement between the program and the judge.
38.8% of errors were judged to be phonetic and were corrected, with a further 18.6% corrected (but judged to be non-phonetic). 37.5% were judged to be non-phonetic and were not corrected by the phoncode program. Only 5.1% were judged to be phonetic but not corrected. The reasons for the failure of the phoncode program are now considered. The program was not designed to correct non-phonetic errors; thus a large percentage of the misspellings (37.5%) were classed as non-phonetic and were not corrected. There were 43 misspellings, judged to be phonetic, which the program failed to correct (NC/Ph). These are listed in figure @ref(phonboth). @begin(figure) @begin(verbatim) Study 1 MW NM won one sounds souns SS CM get cedt picture picher easter eastr buttons butns buttons buttns castle castl Study 2 FR TE blood plood police plec treasure tresher seen cn diamonds dimens dangerously dangersly jewels jouls ireland irlnd using yoosing thatcher thacher magazine magzine work wrk magazine magzeen if ifh computer compyooter DV how howe goals gois put pit score scorre chemical cemikle picture pichur plans plandes DR GR took toog university univesty picture picher boxes boxs kitchen kitshen used yoosed has his alphabet alphapet ST put pit government goverment DI programmes progames designed designned three theree @end(verbatim) @caption(Errors judged to be phonetic, but not corrected: S1 and S2) @tag(phonboth) @end(figure) As stated above, the failure may be attributed to an incomplete or incorrect grammar; incorrect segmentation; or incorrect coding of the dictionary. The difficulties of segmentation and coding are discussed in chapter @ref(detail), subsections @ref(phon-graph) and @ref(dict-phon). Examples of segmentation@foot(A segmentation error is one where the misspelling is split into graphemes in such a way that it cannot be matched to the phoneme string representing the correction.)
errors are: @begin(verbatim,group) y = /y/ u = /ju/ d = /d/ d = /d/ oo = /u/ s = /z/ e = /I/ e = /I/ s = /z/ i = /I/ s = /z/ s = /z/ i = /I/ ng = /ng/ i = /aI/ i = /aI/ ng = /ng/ g = ? gn = /n/ nn = /n/ ed = /d/ ed = /d/ @end(verbatim) @begin(verbatim,group) k = /k/ k = /k/ s = /s/ s = /s/ i = /I/ i = /I/ c = /k/ c = /k/ t = /t/ tch = /ch/ o = /o:/ o_e = /o:/ sh = /sh/ e = /E/ rr = /r/ r = /r/ e = /E/ n = /n/ e = ? n = /n/ @end(verbatim) The phoneme-grapheme grammar failed to provide matches in a number of cases, though for some of them their classification as 'phonetic' errors might be disputed. Examples of these are: @begin(verbatim) get cedt blood plood put pit took toog has his alphabet alphapet @end(verbatim) Other classes of errors that presented difficulties include: @begin(alphabetize) omitted schwa, particularly before n and l @* e.g. buttons buttns police plec other omitted vowels @* e.g. boxes boxs chemical cemikle errors involving 'r' @* e.g. easter eastr picture picher consonant confusions, particularly involving 'd', 't', 'ch' @* e.g. get cedt picture picher consonant omissions, particularly d after n @* e.g. sounds souns diamonds dimens @end(alphabetize) The other set of misspellings that judge and program disagreed on were those judged as non-phonetic, but corrected by the phoncode program. A large number of these were vowel confusions accepted as equivalent by the phoncode grammar but rejected by the judge. 
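The notion of a segmentation failure can be illustrated with a toy matcher. The grammar fragment below is illustrative only (it is not the grammar used by the phoncode program): the correct spelling 'using' can be segmented to match its phoneme string, but the misspelling 'yoosing' cannot, because no grapheme covering the initial 'y' spells /ju/ in the fragment.

```python
# A toy grapheme-to-phoneme matcher illustrating a segmentation failure.
# The grammar fragment and phoneme symbols are illustrative only.
GRAMMAR = {                      # grapheme -> phonemes it may spell
    "u": {"/u/", "/ju/"}, "oo": {"/u/"}, "y": {"/y/", "/aI/"},
    "s": {"/s/", "/z/"}, "i": {"/I/", "/aI/"}, "n": {"/n/"},
    "ng": {"/ng/"}, "g": {"/g/"},
}

def matches(spelling, phonemes):
    """True if spelling can be split into graphemes spelling phonemes."""
    if not spelling and not phonemes:
        return True
    for size in (2, 1):          # try digraphs before single letters
        grapheme = spelling[:size]
        if phonemes and grapheme in GRAMMAR and phonemes[0] in GRAMMAR[grapheme]:
            if matches(spelling[size:], phonemes[1:]):
                return True
    return False

target = ["/ju/", "/z/", "/I/", "/ng/"]   # phoneme string for 'using'
print(matches("using", target))           # True: u/s/i/ng all match
print(matches("yoosing", target))         # False: nothing here spells /ju/
```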
Additionally, other classes of errors accepted by the program, but considered 'non-phonetic', were: @begin(itemize) errors involving 'r' (and vowel); final 'e' (omitted and added); transpositions, in particular 'ed/de' and 'es/se' and vowels; incorrectly doubled or singled consonants, in particular 'n' before 'g' or 't', and 'l' before 'k' or 'd'; errors involving 'h' (usually silent). @end(itemize) For some of these misspellings, the alteration of a grapheme from 'tied' to 'untied' would enable them to be corrected, and matched to their phonetic equivalent. For a number of others, in particular those involving omission of an unstressed vowel, the program would need to be altered to take them into account. Summarising the results for the phoncode program overall: @begin(itemize) the program succeeded in correcting @begin(alphabetize) 61.3% of the errors tested from Study 1; 55.9% of the errors tested from Study 2; 57.4% of errors tested overall. @end(alphabetize) of those it failed to correct (356 errors) @begin(romanize) 43 were judged to be phonetic (therefore attributable to failure on the part of the program), accounting for 5.1% of errors overall; 313 were judged to be non-phonetic (37.5% of errors overall); @end(romanize) additionally, 38.8% of misspellings were both judged to be phonetic and corrected by the phoncode program. @end(itemize) @newpage @section(Results for combined programs) @label(pres-combined) @comment[ - performance, and how phoncode improved results - words they both failed to get - best to use both?] The results of testing the performance of the two programs, on the sets of misspellings from the two studies, were combined. There was a large amount of overlap between the corrections. The results for each program and for the combined programs are given in figures @ref(editphon-ps1) and @ref(editphon-ps2).
@begin(figure) @begin(verbatim)
          a          b          c          d          e
       corrected  corrected  corrected  corrected   total
          by         by         by         by      number of
       editcost   phoncode   neither    combined    errors
Group 1
GQ        15         10         0          15         15
       (100%)     (66.7%)    (0%)       (100%)
JM        38         31         0          38         38
       (100%)     (81.6%)    (0%)       (100%)
MW        29         21         5          30         35
       (82.9%)    (60%)      (14.3%)    (85.7%)
Group 2
LB        24         23         0          26         26
       (92.3%)    (88.5%)    (0%)       (100%)
NM        30         21         2          31         33
       (90.9%)    (63.6%)    (6.1%)     (93.9%)
CM        46         30        10          50         60
       (76.7%)    (50%)      (16.7%)    (83.3%)
SS        18          8         9          19         28
       (64.3%)    (28.6%)    (32.1%)    (67.9%)
Group 1   82         62         5          83         88
total  (93.2%)    (70.5%)    (5.7%)     (94.3%)
Group 2  118         82        21         126        147
total  (80.3%)    (55.8%)    (14.3%)    (85.7%)
Both     200        144        26         209        235
groups (85.1%)    (61.3%)    (11.1%)    (88.9%)
@end(verbatim) @caption(Comparison of errors corrected by Editcost and by Phoncode programs - Study 1) @tag(editphon-ps1) @end(figure) For each child, for each group, the following information is given: @begin(alphabetize) the number and percentage of errors corrected by the editcost program; the number and percentage of errors corrected by the phoncode program; the number and percentage of errors corrected by neither program; the number and percentage of errors corrected by either of the two programs; the total number of errors made.
@end(alphabetize) @begin(figure) @begin(verbatim)
          a          b          c          d          e
       corrected  corrected  corrected  corrected   total
          by         by         by         by      number of
       editcost   phoncode   neither    combined    errors
Group 1
FR       103         78        10         113        123
       (83.7%)    (63.4%)    (8.1%)     (91.9%)
DV        65         43        22          76         98
       (66.3%)    (43.8%)    (22.4%)    (77.6%)
TE       106         71        13         118        131
       (80.9%)    (54.2%)    (9.9%)     (90.1%)
DR        55         30        26          61         87
       (63.2%)    (34.5%)    (29.9%)    (70.1%)
Group 2
GR        45         35         6          49         55
       (81.8%)    (63.6%)    (10.9%)    (89.1%)
DI        21         16         1          21         22
       (95.5%)    (72.7%)    (4.5%)     (95.5%)
MA        39         33         1          41         42
       (92.9%)    (78.6%)    (2.4%)     (97.6%)
ST        39         29         2          40         42
       (92.9%)    (69%)      (4.8%)     (95.2%)
Group 1  329        222        71         368        439
total  (74.9%)    (50.5%)    (16.2%)    (83.8%)
Group 2  144        113        10         151        161
total  (89.4%)    (70.2%)    (6.2%)     (93.8%)
Both     473        335        81         519        600
groups (78.8%)    (55.9%)    (13.5%)    (86.5%)
@end(verbatim) @caption(Comparison of errors corrected by Editcost and by Phoncode programs - Study 2) @tag(editphon-ps2) @end(figure) For Study 1, the percentage correction for the combined programs is 88.9%. Of the 35 errors that the editcost program failed to correct, 9 were corrected by the phoncode program. The remaining 26 that neither program corrected include some that were neither corrected by the judge (in testing editcost) nor judged to be phonetic. Group 1 show a higher percentage correction in all categories than group 2, though none of the differences are significant. By combining the two programs the number of errors corrected is increased for most children. GQ and JM are the exceptions, with 100% correction using the editcost program alone. The increases vary from one additional correction (MW, NM, SS), to two (LB), to four (CM). For Study 2, the combined programs correct 86.5% of misspellings. The phoncode program corrects 46 misspellings that the editcost program does not, leaving 81 misspellings not corrected by either program.
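Combining the two programs' results amounts to taking a set union over the errors each corrects. A sketch, using CM's counts from the Study 1 table (the error identifiers themselves are invented for illustration):

```python
# The combined result is a set union over the errors each program corrects.
# Counts follow CM's row in the Study 1 table; error ids are invented.
errors = {f"e{i}" for i in range(60)}            # CM made 60 errors
by_editcost = {f"e{i}" for i in range(46)}       # 46 corrected by editcost
by_phoncode = {f"e{i}" for i in range(20, 50)}   # 30 corrected by phoncode

combined = by_editcost | by_phoncode             # corrected by either program
neither = errors - combined                      # corrected by neither
print(len(combined), len(neither))               # 50 10, as in the table
```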
Group 2 show higher percentage corrections than group 1 for the individual programs (p<0.05 for editcost and p<0.02 for phoncode) but no significant differences for the combined programs. Improvements in the number of misspellings corrected vary from 0 (DI), 1 (ST) to 11 (DV), 12 (TE). The overall percentage correction by the combined programs is 87.2%. @comment[Of the 107 misspellings which were not corrected, XX were neither corrected by the judge nor considered to be phonetic.] @newpage @section(Results for individual children) @label(pres-indiv) In this section the performance of the spelling correction programs in relation to individual children is considered. The relationships between a number of measures were found by correlation of the rankings of individual children on performance measures. It was hypothesised that the children who made the most 'regular' errors, i.e. those who produced the fewest bizarre spellings, would also be those for whom the editcost and phoncode correctors would be most successful. Additionally, the errors that they made would be judged to be 'phonetic'. The children making the most 'regular' errors were those who were perceived as having the least difficulty. The children were ranked (roughly and subjectively, it should be noted) in terms of their spelling ability. This ranking was based on observation by the investigator and discussion with the Reading Unit teacher. For S1, the rough rankings, in order of decreasing ability, were: @verbatim( GQ; JM; LB; MW; NM; CM; SS ) For S2 the rough rankings were: @verbatim( MA and DI; ST; GR; FR; DV; DR; TE ) The hypotheses tested were: @begin(enumerate) success of correction by the editcost and the phoncode programs would correlate; children whose errors were judged to be phonetic would also show greatest success with the phoncode program; the children ranked as most able would be those for whom the programs were most successful and whose errors were judged to be phonetic.
@end(enumerate) For the children in each group, the relationships between the following measures were found using the Spearman Rank correlation coefficient. @begin(alphabetize) percentage correction by the editcost program (in testing); percentage of corrections that were off(1); percentage correction by the phoncode program; percentage of errors judged to be phonetic; percentage improvement of editcost results when both programs' results are combined. @end(alphabetize) The perceived rankings of the children's general spelling ability were not statistically correlated with these measures as they were considered to be too subjective and crude. They are, however, considered in relation to the results of these correlations. Measure b) was included to test whether there was any relationship between the degree of success of the editcost program (where off(1) indicated greatest success) and the other measures. Measure e) was included to further test the relationship between the editcost and phoncode programs' results. For all measures, percentages were of the total number of errors made by each child (except (b), which was a percentage of (a)). Significant correlations were found between a number of measures. These will be summarised and then discussed.
For Study 1 @begin(verbatim)
 - correlation between a) and c) = .88  ( p < 0.05 )
                       b) and d) = .76  ( p < 0.05 )
                       c) and d) = .75  ( p < 0.05 )
@end(verbatim) For Study 2 @begin(verbatim)
 - correlation between a) and c) = .93  ( p < 0.01 )
                       c) and d) = .98  ( p < 0.01 )
                       a) and d) = .97  ( p < 0.01 )

 - correlation between b) and e) = -.76 ( p < 0.05 )
                       b) and a) = .82  ( p < 0.05 )
                       b) and c) = .68  ( p < 0.05 )
                       b) and d) = .71  ( p < 0.05 )
                       e) and a) = -.72 ( p < 0.05 )
                       e) and c) = -.81 ( p < 0.05 )
                       e) and d) = -.74 ( p < 0.05 )
@end(verbatim) For the children in Study 1, success of the editcost and phoncode programs correlated; as did success of the phoncode program and the percentage of errors judged to be phonetic, and the percentage of corrections offered as the first editcost option and the percentage judged to be phonetic. Stronger correlations are shown for Study 2: performance of the phoncode and editcost programs and the percentage of errors judged phonetic all correlate. Additionally, the percentage of errors offered as first option correlated negatively with the percentage improvement made by the phoncode program when both programs' results were combined: both of these correlate (the latter, negatively) with the three strongly correlated measures above. Therefore, in general it can be said that for those children for whom the editcost program is successful, the phoncode program will also be successful. A large part of the failure of the editcost program can be attributed to unrecognisable errors; these children also make the fewest such errors. The correlation between phoncode performance and judgement of phonetic errors suggests that those children for whom the phoncode program is most successful make the fewest non-phonetic errors. These relations are shown most strongly in the Study 2 children; a strong direct correlation is also shown between performance of the editcost program and the percentage of phonetic errors.
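The Spearman coefficient used here can, in the absence of ties, be computed directly from the rank differences: rho = 1 - 6*sum(d^2)/(n(n^2-1)). A sketch with illustrative measure vectors for seven children (these are not the actual thesis data):

```python
def ranks(values):
    """Rank positions 1..n, smallest value gets rank 1 (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rho by the no-ties formula 1 - 6*sum(d^2)/(n*(n^2-1))."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Illustrative tie-free vectors for seven children (not the thesis data):
editcost_pct = [100, 98, 83, 92, 91, 77, 63]
phoncode_pct = [67, 82, 60, 88, 64, 50, 29]
print(round(spearman(editcost_pct, phoncode_pct), 2))   # 0.86
```

With tied values (such as two children both at 100%), average ranks would be needed; the thesis data would require that refinement.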
For these children, the correlations also suggest that those with the most errors offered as first options also make the most phonetic errors and the fewest non-phonetic errors. The negative correlation between measures e) and a) is to be expected: the more successful the editcost program is, the less scope there will be for improvement. The editcost program incorporates some information relating to phonetic equivalence of words (e.g. most likely substitutions are phonetically similar), hence the high correlation between measures a) and c) is also not surprising. Considering the individual performance rankings of the children, firstly for Study 1: group 1 were described as the more able students (see notes on children in appendix @ref(app-assum)), and group 2 as the "hopeless cases" (with LB as an addition to this group). From the performance rankings, GQ, JM and LB generally come out as the top group, with MW and NM as the middle group (except for the percentage of errors judged phonetic, where NM and GQ swap groups), and CM and SS as the least able, with the worst results for all measures. These rankings fit very well with the perceived abilities of the children. For Study 2, the performance rankings are even clearer: best ranked are DI, MA and ST, then FR and GR (where group 1 - "moderately able", and group 2 - "very bright", overlap), and finally TE, DV and DR. Again, there is a good fit between rankings and perceived abilities, with the exception of TE, who performs better than would be expected. In relation to the theoretical discussion of the stages of failure in the spelling process, various inferences may be made on the basis of these findings. Children who have the least difficulties are more likely to be failing at a later stage in the process than those who make a large number of bizarre and irregular spelling errors. If the former succeed at the 'selection of plausible graphemes' stage, but fail at the third stage, their errors will be phonetic.
They are more likely to be using correspondences from the phoncode grammar and hence their errors will be corrected by the phoncode program. Their errors are occurring in the selection of orthographically correct plausible graphemes: information relating to the format is used by the editcost program to correct these successfully. It is expected that both editcost and phoncode programs will successfully correct the errors made by these children. In terms of absolute success, it can be seen from the results that the editcost program is clearly more successful. It is designed to cope with both phonetic and non-phonetic errors: hence its higher rate of success. Where there is failure at the first or second stage, that is, the graphemes selected to represent the speech sounds are not plausible, we would expect the phoncode program to fail. We would also expect a lower rate of phonetic errors. The editcost program is able to 'pick up' some of these non-phonetic errors: some are too irregular, however, and cannot be fitted into any general description of errors. Those children perceived as 'better spellers' showed more regularity in their errors, made fewer non-phonetic errors and were more likely to have their errors corrected successfully by both the editcost and phoncode programs. They were considered to be failing to select the correct grapheme from the plausible graphemes generated. The children perceived as least able showed more irregular errors and more non-phonetic errors. The editcost program was more successful for them than the phoncode program. Neither were as successful with these children as with the better spellers. Their failings occur at the first or second stage in the spelling process; that is, in the segmenting of the word into phonemes, or in the selection of plausible graphemes to represent each phoneme. Inferences cannot be drawn from these results to judge at which of the first two stages the failing is occurring. 
It might be inferred from these findings that success in correction by the phoncode program implies that a phonological strategy is being used by the child. Conversely, success in correction by the editcost program could be taken to suggest that a visuo-orthographic strategy is being employed. If this argument is accepted, the implication would be that those children for whom both programs are successful used both phonological and visuo-orthographic strategies in spelling. Following from this, it could be argued that the children for whom the phoncode program is comparatively less successful use predominantly visuo-orthographic strategies. There are no clear conclusions that can be drawn from the evidence presented here, however, for two reasons: @begin(enumerate) the editcost program incorporates a certain amount of phonological information in relation to likely errors: therefore, the success of the editcost program and the failure of the phoncode program do not necessarily imply that a phonological strategy is not being used; it is very difficult to assess "comparatively less successful": whilst the rankings on editcost and phoncode performance correlate highly, the absolute differences between percentages appear to bear little relation to these rankings. @end(enumerate) One conclusion that may be drawn is that the more able children appear to use both strategies with more success than the less able children. @newpage @section(Testing the programs on independent data) @label(frith-testing) The editcost and phoncode programs were also tested on data from an external source. The data comprised a corpus of misspellings of thirty words produced by 202 ten-year-old children in a dictation test. The children were a random sample selected from a group of 15,000 children in English and Welsh schools. The data was made available to Roger Mitton (Birkbeck College, London) by Dr. Uta Frith (MRC Cognitive Development Unit, London).
A copy of the corpus of misspellings was provided for testing in this thesis. The number of misspellings in the corpus is 2482. Of these, 1364 are unique: the rest are the same misspelling made by more than one child. The set of unique misspellings will be referred to here as 'errors excluding repeats', whilst the full corpus will be referred to as 'errors including repeats'. The set of errors excluding repeats was used with the editcost and phoncode programs. The testing dictionary was that referred to elsewhere in this thesis (section @ref(pedit-test12)), with the addition of those of the thirty dictated words that were not already included. Results are given for each of the thirty words: figure @ref(frith-exclcop) shows the results for the errors, excluding repeats; figure @ref(frith-inclcop) shows correction of errors including repeats. @begin(figure) @begin(verbatim)
              a          b         c         d        e
           corrected  corrected corrected corrected  total
              by         by       by        by       number
           editcost   phoncode  either    either     of
Words      (off(1))   (number)            (%)        errors
often        26 (19)     15       26      89.7%       29
visited      41 (33)      8       41      91.1%       45
aunt         14 (9)       4       14      66.7%       21
magnificent  83 (78)     21       83      82.2%      101
house         8 (5)       3        8      88.9%        9
opposite     56 (44)     30       58      79.4%       73
gallery      51 (26)     18       51      81%         63
remember     37 (31)      9       37      90.2%       41
splendid     33 (29)      9       33      58.9%       56
purple       24 (18)     12       25      75.8%       33
curtains     39 (32)     24       39      79.6%       49
wrote        13 (8)       7       14      56%         25
poetry       62 (46)     24       63      78.8%       80
problem      35 (30)      8       35      83.3%       42
understand   24 (20)      5       24      82.8%       29
latest       32 (27)     10       32      71.1%       45
poems        28 (23)      9       29      74.4%       39
wanted       10 (5)       4       11      52.4%       21
laugh        18 (9)       9       23      62.2%       37
pretend      45 (37)     10       45      81.8%       55
really       29 (19)     15       29      70.7%       41
special      74 (53)     32       74      85.1%       87
refreshment  53 (48)     16       53      81.5%       65
there         5 (2)       4        5      71.4%        7
blue          5 (4)       3        5      71.4%        7
juice        18 (14)     23       32      69.6%       46
cake          9 (6)       2       11      84.6%       13
biscuits     63 (55)     16       63      80.8%       78
stomach      62 (44)     44       67      79.8%       84
contented    37 (33)      9       37      86%         43
Total      1034 (807)   403     1067      78.2%     1364
@end(verbatim) @caption(Testing of the editcost and
phoncode programs on independent data - excluding repeats) @tag(frith-exclcop) @end(figure) @begin(figure) @begin(verbatim)
              a          b         c         d        e
           corrected  corrected corrected corrected  total
              by         by       by        by       number
           editcost   phoncode  either    either     of
Words      (off(1))   (number)            (%)        errors
often        51 (42)     32       51      91.1%       56
visited      93 (78)     24       93      93.9%       99
aunt         71 (63)     43       71      87.7%       81
magnificent 136 (131)    59      136      88.3%      154
house        14 (11)      9       14      93.3%       15
opposite    125 (109)    97      132      88%        150
gallery     101 (70)     58      101      88.6%      114
remember     90 (84)     13       90      94.7%       95
splendid    102 (96)     61      102      81.6%      125
purple       41 (35)     27       42      84%         50
curtains     76 (66)     57       76      88.4%       86
wrote        63 (55)     60       64      78%         82
poetry       91 (73)     37       92      84.4%      109
problem      66 (60)     15       66      90.4%       73
understand   29 (25)      5       29      85.3%       34
latest       47 (42)     14       47      77%         61
poems        64 (58)     19       65      86.7%       75
wanted       28 (17)     16       29      70.8%       41
laugh        28 (18)     30       50      76.9%       65
pretend      81 (69)     26       81      88%         92
really      110 (99)     94      110      90.2%      122
special     108 (87)     52      108      89.3%      121
refreshment  77 (72)     28       77      86.5%       89
there        19 (12)     19       19      54.3%       35
blue         11 (10)      3       11      84.6%       13
juice        54 (50)     64       74      83.1%       89
cake         13 (10)      5       18      90%         20
biscuits    126 (113)    65      126      89.4%      141
stomach     108 (77)     86      116      84.7%      137
contented    52 (47)     16       52      89.7%       58
Total      2075 (1779) 1134     2142      86.3%     2482
@end(verbatim) @caption(Testing of the editcost and phoncode programs on independent data - including repeats) @tag(frith-inclcop) @end(figure) Results of testing are given in the following categories: @begin(alphabetize) the number of errors for which the correction was offered by the editcost program (and the number for which it was the first word offered); the number of errors for which the correction was offered by the phoncode program; the number of errors for which the correction was offered by either of the two programs; the percentage of the total number of errors for which the correction was offered, by either program (c/e); the total number of errors.
@end(alphabetize) For 78.2% of the unique errors, and for 86.3% of the total number of errors, the correction is offered by either the editcost or the phoncode program. Of the errors corrected by the editcost program, 85.7% are offered as the first option, i.e. the least-cost repair, representing 71.7% of the total number of errors. As with the children in the two studies, the editcost program was more successful than the phoncode program. Because a large number of the errors made by the children would probably not be classed as phonetic, this was to be expected. Some failure could be attributed to the program, however. The words that the correctors failed on are not analysed in detail, though the discussion of failure in relation to the two studies is of relevance (see section @ref(pphon-disc12)). The phoncode program provided little improvement over the editcost program, except for the words 'laugh' and 'juice'. The combined programs failed to achieve 70% correction on unique misspellings of 'aunt', 'wrote', 'wanted', 'laugh', 'juice' and 'splendid'. There is an improvement in performance when repeated errors are included: those errors that the programs succeed in correcting are those that are most often repeated (the exception being the misspellings of 'there'). For 7 of the 30 words, 90% or more of the misspellings were corrected. Mitton had previously tested two other spelling correction algorithms with this data @cite(mitton84). He found that 42% of errors (including repeats) would be included as candidates when classed as single edit misspellings (i.e. one edit operation required to correct the error). Depending upon the size and the content of the dictionary, there may be many other candidates. The errors were also coded using the soundex code. For 64% of errors the coding matched for error and correction. Again, many other candidates may also match. Combining the results of the two algorithms, the correction was found to be in the candidate list for 72.9% of errors.
For the editcost program alone, the percentage of errors corrected was 83.6%; for more than 85% of these, the first word offered was the correction (71.7% of the total). The correction programs, therefore, though designed for use with children with spelling difficulties, could also be used by other children. @newpage @section(Summary) @label(perform-summ) The results presented in this chapter show that the spelling correction programs, developed in this study, were successful in correcting the errors made by children with learning difficulties in spelling. The editcost program was the more successful of the two. As it was designed to deal with both phonetic and non-phonetic errors, whereas the phoncode program was designed to deal with phonetic errors, this was to be expected. The editcost program succeeded in offering corrections for more than 80% of the errors made in the two studies. The phoncode program succeeded in offering corrections for 57.4% of errors tested. Of those the phoncode program failed to correct (42.6%), 37.5% were judged not to be phonetic. In combination the two programs provided corrections for 87.2% of errors over both studies. The success of the programs is restricted by the requirement that the correction be in the dictionary: if it is not in the dictionary, it cannot be offered to the user. The programs were also tested on independent data and found to be successful: 78.2% of unique errors made were corrected by the combined programs; 86.3% of errors in the corpus (including repeats) were corrected. For 71.7% of the complete corpus, the intended correction was the first word offered by the editcost program. This compares favourably with other algorithms tested on the same data. The program would, therefore, be suitable for use by children with no specific difficulties. Evidence is provided that there are regularities in the errors made by children with spelling disabilities.
In testing the editcost program, 80.6% of errors were successfully corrected. Of the failures, those amounting to 13.7% of errors were attributed to there being insufficient regularity to enable correction, and those amounting to 5.7% of errors were attributed to failure on the part of the program. Thus, for 86.3% of errors there was sufficient regularity in the misspelling to permit correction of the error. In considering the results of the phoncode program, 57.4% of errors were corrected overall, and a further 5.1% were attributed to failure on the part of the program. 37.5% of errors, therefore, were assessed as being non-phonetic - that is, the phoneme-grapheme correspondences on which they were based did not conform to the grammar provided. However, 62.5% of misspellings did conform to the grammar. It is argued that there are regular phoneme-grapheme correspondences in the children's spellings, and that there are also additional regularities in the orthography (as indicated by the additional corrections made by the editcost program). That the programs succeed in correcting a large proportion of the errors made also demonstrates that these regularities can be used by the programs to reconstruct the corrections from the errors. The information incorporated in the programs, based on the description of the errors in terms of format, general classes of characters and rules, and phoneme-grapheme correspondences, enables successful debugging of the error to provide the correction. The description of errors in these terms is also, to a large extent, validated by the results.