1. INTRODUCTION. There seem to be two main concerns and goals behind the work on natural language using computers, converging at many points. One goal, that of more natural communication with computers, has its roots in the concern for application of the fruits of computational research, amplified by the recent race to build a new generation of computers, in order to make them even more user-oriented and more useful for our purposes. The other goal is theoretical investigation of human language, and some researchers hope that the computer might be a very useful tool in this task; this hope is expressed by Winograd (1983: 13) thus: 'The computer shares with the human mind the ability to manipulate symbols and carry out complex processes that include making decisions on the basis of stored knowledge. Unlike the human mind, the computer's workings are completely open to inspection and study, and we can experiment by building programs and knowledge bases to our specifications. Theoretical concepts of program and data can form the basis for building precise computational models of mental processing. We can try to explain the regularities among linguistic structures as a consequence of the computations underlying them.' After over thirty years of computational experience with natural language, there are good reasons for looking at work on automatic NLP in its own right, as an intellectual tradition. At the same time, however, it has become plain that the approach whereby computational parsing is an application of (preferably well-founded!) linguistic theory can be much more rewarding, both in its resulting descriptive power and in its theoretical significance (cf. Wilks and Sparck Jones, 1985). The 'revolution' in linguistics, of which Chomsky (1957, 1965) is generally considered to have been the initiator, has brought a different approach to language - the generative approach. Some linguists (cf.
Newmeyer, 1980) feel that we have learned more about language within the last thirty years than in all the ages taken together, which might indeed well be the case. At the same time, the 'Standard Theory' had many flaws in it and had to be first 'extended' and then 'revised' in order to account for the linguistic facts that it was found not to account for adequately. In addition, many researchers abandoned the typically 'Chomskyan' model altogether and started looking for a fresh one - Lexical Functional Grammar (Bresnan, 1978, 1982), Generalized Phrase Structure Grammar (Gazdar, 1982, 1983 and Gazdar et al., 1985) or Categorial Grammar (Steedman, 1985) may be seen as examples of this trend. Yet there is an even more serious flaw in Chomsky's paradigm which may call for a major revision of our approach to natural language: in the vast majority of cases, its findings are based mainly on the analyses of just one language, namely English. In many respects English, even though it may be an interesting language from the linguistic point of view, has many facets of its grammatical organisation that are strictly parochial in nature (idiosyncrasy in language is not such a strange phenomenon after all), and it obliterates many of the problems that are found in many other languages. One such feature of English is its relatively tight formal/structural organization, which can be seen as a reason for the fact that quite a large subset of its syntax can be described using the Immediate Constituent approach. Another weakness of 'standard models' of language description was their unbalanced distribution of the descriptive burden, most of which was carried by syntax.
Generative Semantics, the Lexicalist Hypothesis, and then LFG and the most recent approaches have tried to remedy this situation: the first by 'turning' the model around so as to equate semantics with Deep Structure, the second by relegating more responsibility to the lexicon (especially in respect of word formation), and the others by giving both the lexicon and a strong (interpretive) semantics a fair share of the responsibility (cf. Cann, 1985; Dowty, 1982). I shall discuss some notions that have to be taken into account in the parsing of natural languages, especially with respect to 'inflectional' languages, although it seems that, in the approach that will be taken, a unified treatment can be applied to both inflectional and configurational languages. For the sake of clarification: I shall use the term 'inflectional' for languages like Polish, and 'configurational' for languages like English, as these terms seem to reflect the characteristic ways in which the respective languages organize their grammars, as we shall see in later sections. 2. INFLECTION. Polish has a 'fusional' type of inflection, i.e., the word-to-morpheme ratio is one-to-many and the morphemes are fused into a single unanalysable morph; therefore Hockett's Word-and-Paradigm model characterises its morphology most adequately. The role of inflection is to cumulate grammatical functions (Jakobczyk: 385); thus in Polish czyta-l ('[he] read'), 'l' indicates tense, person, number, and gender. Inflectional categories are regular and systematic sets, given a priori. In English, a verbal inflection has to cover one function, namely tense. Thus we can say that an English verb consists of two constituents: STEM + TENSE affix, the only inflectional ending in English expressing tense, person and number simultaneously being -s (Jakobczyk, op. cit.).
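The fusional analysis just described can be sketched in code. The following is a minimal illustration, not part of the source analysis: the feature tables and their entries are my own assumptions, meant only to show how a single Polish morph cumulates several grammatical functions while the English ending -s covers fewer.

```python
# Illustrative sketch: fusional inflection as a mapping from a word form
# to a bundle of grammatical functions. Table contents are assumptions.

# czyta-l: the single morph '-l' cumulates tense, person, number and gender
POLISH_FORMS = {
    "czytal":  {"lexeme": "CZYTAC", "tense": "past", "person": 3,
                "number": "sing", "gender": "masc"},
    "czytala": {"lexeme": "CZYTAC", "tense": "past", "person": 3,
                "number": "sing", "gender": "fem"},
}

# English STEM + TENSE: '-s' expresses tense, person and number at once
ENGLISH_FORMS = {
    "reads": {"lexeme": "READ", "tense": "pres", "person": 3,
              "number": "sing"},
}

def analyse(form, table):
    """Return the fused feature bundle for a listed word form, or None."""
    return table.get(form)

print(analyse("czytala", POLISH_FORMS))
print(analyse("reads", ENGLISH_FORMS))
```

Note that the Polish bundle carries one more category (gender) than the English one, mirroring the constituent lists given in the text.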
A Polish one, however, consists of more constituents: STEM + TENSE + ASPECT + PERSON + NUMBER + GENDER. These constituents are, to be more precise, functions to be expressed either inflectionally or by other means, eg. a prefix denoting aspect. Although the details of inflection are not essential here, the objective being rather the finding of more general principles, some surface differences between English and Polish in terms of inflection might be in order to give the flavor of the task of practical implementation of the notion of inflection:
- gender distinction in the Polish past tense is marked inflectionally, eg. czytal [masc], czytala [fem] ('he/she read');
- the inflectional future tense occurs in Polish but not in English, eg. bedzie czytal ('he will read');
- the conjugations of the verb BE are the most similar, although in the case of the English conjugation some functions are performed by pronouns, not only inflectionally;
- the category of aspect is obligatory in Polish, but it is not marked inflectionally, eg. chodzil ('[he] went'), chadzal ('[he] used to go') (cf. App. 1, pp. 1-19 to 1-21);
- the categories of person, number and gender are expressed inflectionally in Polish (cf. App. 1, files PLEX4a, PLEX4b).
3. WORD, LEXEME, MORPHOSYNTACTIC CATEGORY. Words may be considered to be, besides sentences, the most important units in language. But it seems that, just as the sentence cannot be given a satisfactory one-sentence definition, but must rather be defined in terms of a device (grammar) generating sentences in a particular language, so the notion of the word cannot be defined without constructing a dictionary of that language. Thus sentences are objects generated by the grammar of a language, while words are objects listed in the dictionary to the left of each lexical entry. Krzeszowski (1981) gives a definition in which words are put in contrast with other units of linguistic analysis: '...
a word is the smallest significant unit of a given language, which is internally stable (in terms of the order of component morphemes) but potentially mobile (permutable with other words in the same sentence)' (p. 134). This definition makes it possible to distinguish between the word and the phrase (not the smallest significant unit), the word and the morpheme (not positionally mobile within a word), as well as the word and the phoneme (not significant). In this way the definition isolates lexicology from syntax, morphology and phonology. However, all these areas are mutually interrelated, and in actual analytic practice it is often difficult to make categorial decisions, especially between lexicology and syntax. In languages in which words cannot be easily isolated in texts, the problem of separating lexicology from grammar is especially acute. Moreover, we can consider as words not only compounds (blackboard, typewriter), but also phrases of various degrees of fixedness, such as red tape (= bureaucracy), kick the bucket (= die), etc. If one accepts the view that the lexicon has to deal with compounds and fixed expressions, one faces the formidable task of delimiting the upper bound of lexicology, separating it from syntax (cf. Becker's interesting exposition of the frequency with which these expressions are used in ordinary speech). The basic problem is to what extent constraints on collocations of particular items in syntactic constructions are subject to listing in a dictionary and to what extent they are statable in terms of rules (cf. the treatment of idioms in GPSG in Gazdar et al., 1985). This in turn is connected with a more general problem of what might be called the 'precision' of grammars. Early transformational models were crude in that they imposed no constraints on the co-occurrence of various content words (nouns, verbs, adjectives, adverbs) in syntactically well-formed preterminal strings.
Selection restrictions were introduced by Chomsky (1965); this, however, is problematic since it requires the introduction of an appallingly large number of theoretical concepts called semantic markers. 3.1. Lexeme, Word-Form, Grammatical Word. Generally, on one hand, a word is an element of the dictionary of a language, as a unit composed of elements/'units' of 'meaning'; on the other hand, a word is an element of the grammatical system of a language, as a basic unit of syntactic operations and as a point where all sorts of grammatical relations converge. In the first sense, the term lexeme is used, to denote 'word' as an abstract unit of the dictionary. Lexemes, when they enter into syntagmatic relations with other parts of the sentence, may be realized by more than one word form. For instance, the lexeme PALIC ('to smoke') may be realized as bedzie palil ('[he] will smoke'). Bedzie palil is a unit which has a syntactic function as a whole rather than as two individual units bedzie and palil. Also, in the dimension of paradigmatic relations, bedzie palil stands as a unit, in functional opposition to pali ('[he/she/it] smokes'), palil ('[he] smoked'), pal ('smoke!'), palic ('to smoke'), palac ('smoking'). It must be underlined that the element palil₁ in bedzie palil is not composed of the unit bedzie (i.e., < BYC + future tense >) and palil₂ (i.e., < PALIC + past tense >); palil₁ and palil₂ are two different functional units of the language, despite their identical phonological form; thus palil₂ < past tense > and palil₁ < future tense > are two different grammatical units of equal functional status. Being such, they enter certain syntactic operations, and are called grammatical words/forms. The term 'grammatical word', then, is used to denote a functional unit of a language. (There is a further distinction between the terms phonological word and orthographic word, but this is not essential for present purposes.)
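The distinction between the two homophonous units palil₁ and palil₂ can be made concrete by representing a grammatical word as a pair of lexeme and feature bundle. This is a sketch under my own encoding assumptions, not a formalism taken from the source:

```python
# Sketch: a grammatical word as (lexeme, features). Two units may share
# a phonological form yet remain distinct functional units.
from dataclasses import dataclass

@dataclass(frozen=True)
class GrammaticalWord:
    lexeme: str
    features: frozenset  # set of (attribute, value) pairs
    form: str            # phonological/orthographic shape

# palil(1) occurs inside the analytic future 'bedzie palil';
# palil(2) is the synthetic past form. Same shape, different features.
palil_1 = GrammaticalWord("PALIC", frozenset({("tense", "future")}), "palil")
palil_2 = GrammaticalWord("PALIC", frozenset({("tense", "past")}), "palil")

print(palil_1.form == palil_2.form)  # True: identical phonological form
print(palil_1 == palil_2)            # False: distinct grammatical units
```

The equality test captures the point in the text: functional identity depends on the whole (lexeme, features) pair, not on the surface form alone.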
Word-form is the smallest segment of text which fulfils at least one of the following conditions (after Grzegorczykowa et al., 1984):
1) it may occur as an autonomous utterance, eg. tak ('yes'), czyzby ('really'), jemu ('to him'), lampe ('the lamp' < ACCUSATIVE >);
2) it is in syntagmatic relation with other elements of the predicate, even though the order can be changed, eg.:
Komu mam to dac? Jemu. / Komu to mam dac? ('Who shall I give this to?' 'To him.')
Jaki on jest? Nudny. / On jaki jest? ('What is he like?' 'Boring.')
chcialbym / moze bym chcial ('I would like')
3) it is a continuous linear segment, i.e., it cannot be split into subsegments (even despite orthography), e.g. po polsku ('in Polish'), mniej wiecej ('more or less'), dzien dobry ('good day'), etc.
But it is not word forms but grammatical words, i.e., functional units defined by a complex of syntactic functions and grammatical categories, which are the object of grammatical description. Word-form is interesting for its relations to the grammatical word and as an elementary unit of the syntactic rules of linear order. Often, a word form is identified with a grammatical word - this happens where a grammatical word is represented by a single word form, eg.
PIES < dat, sing > => psu ('dog')
PISAC < ind, pres, 2nd person, sing > => piszesz ('[you] write'), etc.
A grammatical word represented by a single word form can be called a synthetic grammatical word (synthetic grammatical form). But, as seen above, a grammatical word/form can also be represented by more than one word form, as in
PISAC < fut, 3rd person > => bedzie pisal ('[he] will write')
Grammatical words/forms with this structure can be called analytic (or periphrastic) grammatical words/forms (after Grzegorczykowa et al., op. cit.). 3.2. Variants of a grammatical word. The same grammatical word may be represented by different word forms (or sequences of word forms) which are formal variants of a given grammatical word.
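The synthetic/analytic distinction amounts to asking how many word forms realize a single grammatical word. A minimal sketch, using the examples above (the feature labels and the table itself are my simplification):

```python
# Sketch: realization of grammatical words as sequences of word forms.
# Synthetic = one word form; analytic (periphrastic) = more than one.
REALIZE = {
    ("PIES",  ("dat", "sing")):          ["psu"],              # synthetic
    ("PISAC", ("ind", "pres", "2sg")):   ["piszesz"],          # synthetic
    ("PISAC", ("fut", "3sg", "masc")):   ["bedzie", "pisal"],  # analytic
}

def is_analytic(lexeme, feats):
    """True if the grammatical word is realized by several word forms."""
    return len(REALIZE[(lexeme, feats)]) > 1

print(is_analytic("PISAC", ("fut", "3sg", "masc")))  # True
print(is_analytic("PIES", ("dat", "sing")))          # False
```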
The distribution of the variants of a given grammatical word may be
1) identical - then we talk about facultative variants of the word, eg. INZYNIER => inzynierowie or inzynierzy ('engineers')
2) in complementary distribution - positional variants of a given grammatical word; here, a lexeme may occur in several variants, conditioned
a) morphologically: BEZ => bez or beze ('without')
b) phonologically, as in the enclitic and stressed forms of the pronouns JA, TY, ON ('I', 'you', 'he', respectively), eg. tobie < +stress > vs. ci < -stress > ('you')
c) syntactically, eg. ROK ('year') => latami or laty, where laty is used only in connection with the prepositions przy or przed + numeral: przed trzema laty ('three years ago')
Of particular interest for my purposes is the status of syntactically determined variants of a lexeme. Thus, for instance, the word forms of the lexeme DOBRY ('good'): dobry, dobra, dobre occur in mutually exclusive syntactic contexts, i.e., are in complementary distribution:
- dobry enters into syntactic relations only with masculine nouns (in fact, the picture is rather more complicated: the problem of agreement and government will be treated in later sections)
- dobra - only with feminine nouns
- dobre - with neuter singular and nonvirile plural nouns.
These expressions could then be treated as syntactically determined positional variants of one lexeme (DOBRY in this case). However, the form of each of these expressions also fulfils a specific semiotic function: it is an indicator of the gender of the noun with which the adjective DOBRY occurs in an expression. The linguistic function of these expressions is then complex. On one hand, they are an expression of the meaning which constitutes the content of the adjective DOBRY as a lexical unit; on the other hand, they fulfil an intratextual function, signalling the existence of a syntactic relationship (connection) between that lexical unit and the unit with which it co-occurs in a given expression.
Dobry, dobra, dobre are then three different functional units:
DOBRY + < masc > => dobry
DOBRY + < fem > => dobra
DOBRY + < neut > => dobre
The ability of a given adjectival form to co-occur with nouns of a specific gender is called the (syntactically dependent) grammatical gender of an adjective, recognizing grammatical gender as a morphological (inflectional) category of the adjective. To conclude, the functional and formal differences between dobry, dobra, dobre have a categorial character - they concern many dictionary units which belong to the class of adjectives. This fact is reflected in the syntax, since the rule is general. In contrast, phonological variants of a word, eg. jego, niego, which are non-categorial features of a lexical unit, are better expressed in the dictionary information about a given lexeme. At the grammatical level, then, a word is referred to by the lexeme to which it belongs and the place which it occupies in the paradigm. Thus czytal ('he was reading') would be represented as belonging to the lexeme CZYTAC ('read'), and as occupying the place in the paradigm defined by the terms Imperfect, Indicative, 3rd sing., Active, etc. - the terms being unordered properties of the word considered as a whole, and morphosyntactic, since they have a role both in morphology and in syntax: thus, for instance, it is a syntactic statement that certain prepositions govern certain cases (see App. 1, p. 25 for an illustration), and case also figures in morphological rules, eg. rules for case inflections. A feature of this representation is that the lexeme, eg. CZYTAC, is a different sort of primitive or elementary term from the morphosyntactic property, eg. Past Participle. This is reflected in the different roles which they are said to play in the derivation of word-forms: the lexeme as the source of the root, and the morphosyntactic properties (plus, in some cases, the inflectional class) as the features which select the operations which are applicable to it.
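The categorial character of this agreement lends itself to a small sketch: selecting the variant of DOBRY from the gender (and number) of the accompanying noun. The paradigm table below is a deliberate simplification of the fuller picture the text promises for later sections (case and number agreement are ignored except where the text mentions them):

```python
# Sketch of syntactically conditioned variants of the adjective DOBRY.
# The table encodes only the distribution stated in the text.
DOBRY = {
    ("masc", "sing"):      "dobry",
    ("fem", "sing"):       "dobra",
    ("neut", "sing"):      "dobre",
    ("nonvirile", "plur"): "dobre",
}

def agree(adj_paradigm, noun_gender, noun_number):
    """Pick the adjective variant signalling the noun's gender."""
    return adj_paradigm[(noun_gender, noun_number)]

print(agree(DOBRY, "fem", "sing"))   # dobra
print(agree(DOBRY, "neut", "sing"))  # dobre
```

The same lookup, read in reverse, illustrates the 'intratextual' function the text describes: the adjective form is an indicator of the gender of the noun it co-occurs with.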
However, the formal terms which are exemplified (lexeme, word, morphosyntactic category, and so on) can be applied to any language for which these techniques are felt appropriate. Only the substantive or descriptive terms will vary (Matthews, 1974: 67). 4. SENTENCE. In linguistic analysis, the sentence is considered one of the fundamental units of grammatical description. However, what one has direct access to in the analysis of language is actual utterances. Therefore the distinction introduced by de Saussure between langue and parole, or, later, by Chomsky, between 'competence' and 'performance', which reflects the fundamental difference between a system and its use, can be applied to establish both the difference and the relation between sentences and utterances. But each of the terms can be used to refer to different phenomena; I shall therefore try to disambiguate them in the first place. Thus the term 'utterance', in one sense, '...may refer to a piece of behaviour' (Lyons, 1977: 26), that is, to the process of uttering, the product of this behaviour being a speech act. In the second sense, 'utterance' may denote the product of that behaviour, or process, in the form of inscriptions, i.e., sequences of symbols in some physical medium. As for the term 'sentence', it is defined in Hurford & Heasley as '...a string of words put together by the grammatical rules of a language [...,] the ideal string of words behind various realizations in utterances and inscriptions' (1983: 16). Lyons draws a distinction between a more abstract and a more concrete sense of the term. Thus, on one hand, 'sentence' may denote the product of a bit of language behaviour, a subclass of utterance-inscriptions, the segments of speech (1977: 29; 1981: 196), and, being such, they can be uttered. He calls them 'text-sentences'. On the other hand, he uses the term 'sentence' to refer to 'abstract theoretical constructs, correlates of which are generated by the linguist's model of the language...'
(1977: 622). They are used by the linguist in order to account for the '...acknowledged grammaticality of certain potential utterances and the ungrammaticality of others' (Lyons, 1981: 196); Lyons calls them system-sentences. As such, they are the maximum units of grammatical description. Chomsky's grammar, then, generates system-sentences. In this work, I shall use the term 'sentence' in both the sense of system-sentences and that of text-sentences, but this type-token distinction should nevertheless be borne in mind throughout. I shall now proceed to discuss a few aspects of analysis at the sentence level, considering especially word order. Some different approaches will be looked at, in search of an optimal and, hopefully, the most adequate way of treating both inflectional languages and, to some extent, configurational languages. Undoubtedly, an adequate grammatical description is a prerequisite for work on the computational analysis of natural language. 4.1. Word Order and Grammaticality. From the point of view of the formal, structural well-formedness of complex expressions, the following requirements must be fulfilled: 1. formal (syntactic/relational) ability of co-occurrence; 2. linear order of the expressions (Topolinska, 1984). In Polish, grammatical incorrectness follows from failure to observe the rules of co-occurrence (distribution) of grammatical forms; when the rules of linear order have not been observed, the sentence is not ungrammatical; this is not the case in English. It seems that in fixed word order languages, the position of an expression in the linear sequence depends solely on its formal (syntactic) properties. In free word order languages, the position usually does not violate the rules of linear order, and the final ordering depends on the theme-rheme formation rules. Nevertheless, a certain number of linear order rules is subordinated to the rules of structural formation and cannot be formulated independently.
Thus two of the following sentences have the wrong linear order:
(a) * Sie Jana spodziewam.
(a') Spodziewam sie Jana. ('I'm expecting Jan')
(b) * Piotr na patrzy mnie. ('Peter' 'at' 'is looking' 'me')
(b') Piotr patrzy na mnie. ('Peter' 'is looking' 'at' 'me'; 'Peter is looking at me')
but they are interpretable semantically. Now, the first of the following sentences is grammatically incorrect:
(c) * Nasze idziemy do kina. ('our' 'go' [pl, 1st pers] 'to' 'cinema')
(c') My idziemy do kina. ('We are going to the cinema')
The second of the following, in turn, is ill-formed semantically:
(e) Kot pije mleko. ('The cat is drinking the milk')
(e') * Radosc pije odwage. ('the joy' 'is drinking' 'the courage')
These three sets of rules, i.e., semantic structure formation rules, formal (syntactic) structure formation rules and linear word order rules, are to some extent autonomous, since one can form sentences that violate just one set of rules. All three are, however, required to produce fully grammatical sentences. 4.2. Word Order and Configurationality. As has just been observed, Polish and English differ from each other in terms of the freedom with which constituents may be rearranged within phrases. This general difference has formed the basis of a traditional typological distinction between 'free-word-order' languages such as Polish or Latin and 'fixed-word-order' languages such as English. The theory of grammar should account for this variation in terms of specific parameters embedded within the deductive structure of the theory, so that the contrast between 'free word order' and 'fixed word order' can be traced to the options exercised by particular grammars, possibly at the points specified by Universal Grammar (Stowell: 1). Within the scientific tradition of generative grammar, it has generally been assumed that fixed constituent order is determined primarily by the formulae of the context-free rewrite rules of the categorial component of the base.
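The partial autonomy of the three rule sets can be sketched as a checker that reports which layer a sentence violates. The toy 'rules' below encode only the judgments given for examples (a), (c) and (e') above; everything else about the encoding is my own assumption:

```python
# Illustrative sketch: three partly autonomous rule layers. A sentence
# may violate exactly one of them and remain interpretable.
def check(words):
    """Return the rule layers violated by a (lowercased) word list."""
    violations = []
    if words and words[0] == "sie":           # the clitic 'sie' cannot be initial
        violations.append("linear order")
    if words[:2] == ["nasze", "idziemy"]:     # form co-occurrence mismatch
        violations.append("syntactic co-occurrence")
    if words[:2] == ["radosc", "pije"]:       # selectional (semantic) clash
        violations.append("semantic structure")
    return violations or ["grammatical"]

print(check(["sie", "jana", "spodziewam"]))   # ['linear order']
print(check(["spodziewam", "sie", "jana"]))   # ['grammatical']
```

Each example sentence trips exactly one layer, which is the sense in which the three rule sets are autonomous; full grammaticality requires passing all three.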
Each PS rule defines the internal structure of a particular type of constituent, specifying the set of constituents which it immediately dominates and the linear order in which these constituents appear. For instance, here is a simplified version of the PS rule for VP:
VP --> V (NP) (PRT) (NP) (PP) (PP) (S')
The categorial rule system is supposed to be responsible for constituent order within phrases - including sentences (S/S') and phrases projected from the major lexical categories (NP, VP, PP, etc.). The notation of CF rewrite rules assumed in most versions of Generative Grammar (both in the Aspects model and in the GB framework) requires that a fixed constituent order be established for each phrase type, since the system of grammatical relations is defined structurally. Consequently, free-word-order languages are treated within these theories as being identical to fixed-word-order languages at the level of syntax. Using GB terminology, the right (or LF) side of the grammar is also identical. The difference is in the left (or PF) side of the grammar. There, local optional transformations, called 'scrambling' rules, operate to derive the observed surface word orders. Thus in Aspects, Chomsky writes: 'In general, the rules of stylistic reordering are very different from the grammatical transformations, which are so much more deeply embedded in the grammatical system. It might, in fact, be argued that the former are not so much rules of grammar as rules of performance...' (1965: 127). The scrambling theory seems to lose some of its theoretical interest in the implementation, when one notices that few real predictions are made by it. Worse still, it practically treats as a matter of stylistic choice a phenomenon which seems, in many relevant respects, strictly grammatical (cf. Gorna, 1976: 202; Newmeyer, 1980). Moreover, it addresses the problem of word order without providing any explanation of its underlying mechanisms.
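The way such a PS rule fixes both membership and linear order can be sketched as a recognizer for the simplified VP rule quoted above. The greedy left-to-right matching strategy is my own assumption, chosen only to keep the sketch short:

```python
# Sketch: the simplified VP rewrite rule, with parenthesized constituents
# treated as optional slots. A '?' suffix marks an optional slot.
VP_RULE = ["V", "NP?", "PRT?", "NP?", "PP?", "PP?", "S'?"]

def matches_vp(cats):
    """True if the category sequence fits the rule in the stated order."""
    i = 0
    for slot in VP_RULE:
        optional = slot.endswith("?")
        cat = slot.rstrip("?")
        if i < len(cats) and cats[i] == cat:
            i += 1            # slot filled by the next category
        elif not optional:
            return False      # an obligatory slot went unfilled
    return i == len(cats)     # no categories left over

print(matches_vp(["V", "NP", "PP"]))   # True
print(matches_vp(["NP", "V"]))         # False: order is fixed by the rule
```

The second call shows the point made in the text: under CF rewrite rules, a permuted order is simply not admitted, which is why scrambling has to be relegated to a separate (PF) component.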
There is also a related problem, termed by linguists configurationality. It has been suggested that the explanation of configurationality and word-order-related phenomena may be found in the X-bar theory of the categorial component which, among other things, offers us two dimensions: category, and 'type' (or hierarchical depth), along which rules, and therefore languages, may vary (Hale: 3). The dimension of hierarchical depth is probably the central one in relation to the question of configurationality, since it permits phrase markers to be relatively 'flat' or relatively 'hierarchical'. Thus if we take the following set of rules:
(1) X" --> ...X'...
(2) X' --> ...X...,
two core linguistic types can be defined along this dimension:
(a) two-bar languages, i.e., those whose grammars utilize the endocentric PS rule schemata (1) and (2); such languages may be termed configurational;
(b) one-bar languages, whose sole endocentric rule schema is (2); these may be termed nonconfigurational.
The respective structures of the two language types would be as follows:
(a') for two-bar languages
        X"
       /  \
     A"    X'
          /  \
         X    B"
(b') for one-bar languages
        X'
      / | \
    A'  X  B'
(also cf. Cann, 1986: 28, 30 for a similar approach). Indeed, it seems to be the case for Polish that we only need to distinguish between two levels in the structure of its PS rules: the 'lexical' (X) level (i.e., the one where the 'head' of the phrase is), and the 'phrasal' (XP) level. The distribution of the dependent categories in the two types of languages will, as a consequence, also differ:
(a'') for two-bar languages it might be like this
            XP
           /  \
    SPEC(X')   X'
              /  \
        MOD(X')   X'
                 /  \
           ARG(X)    X
(b'') and like this for one-bar languages
               XP
        /    /   |    \
  SPEC(X) MOD(X) X  ARG(X)
(The linear order is arbitrary.)
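The contrast between the two structural types above reduces to tree depth, which a small sketch can make explicit. The tuple encoding of phrase markers is my own assumption:

```python
# Sketch: two-bar (hierarchical) vs one-bar (flat) phrase markers,
# following schemata (1) and (2). A node is (label, children).
two_bar = ("X''", [("A''", []),
                   ("X'", [("X", []), ("B''", [])])])
one_bar = ("X'", [("A'", []), ("X", []), ("B'", [])])

def depth(node):
    """Number of levels in a phrase marker."""
    label, children = node
    return 1 + max((depth(c) for c in children), default=0)

print(depth(two_bar))  # 3: configurational, room for sub-phrasal domains
print(depth(one_bar))  # 2: nonconfigurational, a single 'flat' domain
```

As the text goes on to argue, it is exactly this extra level that lets government (a head-to-sister relation) partition a two-bar structure into distinct sub-phrasal domains, and its absence that makes government inert in the flat case.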
A common characteristic of configurational languages - not a defining criterion, but a fairly consistent property nonetheless - is the relative 'tightness' of grammatical organization; in particular, a relatively straightforward and consistent relationship between theta-role assignment and structural position. In short, in configurational languages, grammatical principles are typically articulated in structural terms; thus theta-roles are assigned to structurally defined positions. In this regard, non-configurational languages are characterized by much greater 'looseness' of grammatical organization. Hale (op. cit.) suggests that the principle of government, which is behind theta-role assignment, being entirely derivative of sisterhood (since it is a relation that holds between the head of a category and its immediate sisters), can operate only in a structure like (a) above - thus it 'clicks on' in two-bar languages; in a 'flat' structure like (b), government as defined above cannot serve to partition a structure into distinct sub-phrasal domains - therefore this principle 'shuts down' in one-bar languages (op. cit., p. 3). In configurational languages, principles such as abstract case assignment and theta-role assignment are dependent upon government. In contrast, in non-configurational languages, configuration alone cannot account for differences in case and theta-role assignment. Therefore, if an NC language uses case, it is inherent case, not an assigned one. Thus in Polish, for example, inherent case is case associated with nominal expressions by virtue of the word-formation component alone. And with respect to the notion of theta-positions (i.e., positions to which a theta-role is assigned), all positions are theta-positions in NC languages. It has even been suggested that free constituent order may be due to the fact that the inclusion of a component of PS rules in a particular grammar is subject to parametric variation.
More specifically, one could suggest that free constituent order in languages like Latin follows from the fact that the grammars of these languages lack PS rules entirely. Although these languages exhibit certain limited restrictions on constituent order, Hale proposes that these follow from requirements imposed by interpretive rules - the languages in fact lack PS rules. It seems, however, that the configurational/nonconfigurational distinction should rather be attributed to the presence or absence of 'linear precedence' rules in particular grammars (see for example Gazdar et al., 1985 for an elaboration of the operation of LP rules in English). The role of these rules was also illustrated in the previous section. All of the theories of phrase structure cited above share one assumption: that restrictions on constituent order in configurational languages such as English are imposed by CF rewrite rules. However, a different approach may be taken: we could assume that the component of categorial rules simply does not exist in any language; all languages could then be treated as being essentially non-configurational from the perspective of the theory of phrase structure (Stowell, p. 2). In this approach, the base component is category-neutral, and certain apparent instances of category-specific properties of phrase structure are really due to the operation of a class of word-formation rules (op. cit., p. 3). But if there are no PS rules, then it is not clear how to account for the fact that constituent scrambling is normally constrained so as not to apply across clause boundaries, suggesting that even languages with entirely free word order maintain some constituent structure. In addition, certain restrictions on discontinuous NPs and the placement of the Aux receive a natural account in terms of constituent structure. A solution to these problems is offered by the theory of Japanese phrase structure developed by Farmer (1980, quoted by Stowell, op. cit.).
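The suggestion that the configurational/nonconfigurational distinction reduces to the presence or absence of LP rules can be sketched in the GPSG style: one immediate dominance (ID) rule, with ordering supplied (or not) by separate LP statements. The particular ID and LP rules below are my own toy examples, not taken from Gazdar et al.:

```python
# Sketch: ID/LP separation. The same dominance rule yields free or fixed
# order depending on whether linear precedence rules are in force.
from itertools import permutations

ID_RULE = {"VP": {"V", "NP", "PP"}}      # immediate dominance only, unordered
LP_RULES = [("V", "NP"), ("NP", "PP")]   # English-like precedence constraints

def admissible_orders(cat, use_lp):
    """All linearizations of the ID rule, filtered by LP rules if active."""
    orders = []
    for perm in permutations(sorted(ID_RULE[cat])):
        if use_lp and any(perm.index(a) > perm.index(b) for a, b in LP_RULES):
            continue
        orders.append(perm)
    return orders

print(len(admissible_orders("VP", use_lp=False)))  # 6: 'free word order'
print(len(admissible_orders("VP", use_lp=True)))   # 1: fixed V NP PP
```

Dropping the LP component leaves the dominance facts intact while freeing the order, which is exactly the behaviour wanted for a language like Polish on this view.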
He proposes that the base component of Japanese is category-neutral; in other words, the PS rules of the language may refer only to the primitives of X-bar theory, and may not make use of categorial features. First, lexical insertion is context-free, abstracting away from subcategorization requirements, since no structural position is reserved for any specific category by the base rules; this derives 'scrambling' as an automatic consequence. There is no redundancy with strict subcategorization, since it is impossible for PS rules to specify which categories may occur as complements in NP or VP. But phrases and sentences must still conform to the hierarchical structure defined by the category-neutral base, and the complements of each verb must appear within V' in order to satisfy strict subcategorization. Thus the integrity of clausal structure is maintained, and 'scrambling' across clause boundaries is ruled out. Finally, phrasal nodes have no intrinsic categorial features, and they acquire a categorial identity only after lexical insertion has placed a particular lexical category in the head position of a phrase. Hence all phrases are of necessity endocentric, and hierarchical structure is constant across categories (Stowell, op. cit.). But it is not clear whether this account, as its consequence, postulates that all phrases have a parallel structure - it seems that they do not (cf. Cann, 1986): each major phrase type has numerous idiosyncratic properties, suggesting that category-specific PS rules are required for each category, even given X-bar theory (it suffices to compare noun phrases with adjectival or adverbial phrases). It seems that certain aspects of constituent structure must be accounted for directly by rules which determine hierarchical PS configurations. One such required PS rule is one that orders the head term X of a phrase XP at the right/left boundary of a level. This holds constant across all the major categories: NP, AP, VP, PP.
However, category-neutral PS rules can define positions for the head and complements, i.e., the hierarchical structure of phrases; theta-role theory and case would then combine to account for the distribution and ordering of various types of arguments at each X-bar level; the rest could be dealt with in an extended component of word-formation rules - this is exactly the position that Stowell takes (op. cit.); a similar conclusion in many respects seems to follow from Cann (op. cit.), and this seems indeed a very attractive approach to me. Thus the properties of PS rules relating to hierarchical and linear ordering turn out to follow from the interaction of a number of distinct components of grammar. To illustrate the effects of the interaction of the hierarchical structure rules and other components of the grammar, let us have a look at a noun phrase in Polish, e.g. ta moja stara skrzynka z drewna ('this my old box of wood') (Appendix 2: fig. 2.21, p. 2-11). Instead of a tree diagram, we could construct something like a 'hierarchical' diagram or 'slot' diagram, i.e., one in which each category fills a certain slot in the structure of the phrase. Thus for our noun phrase, the diagram would look as follows:

   NP
     Nspec:  ta
     Nmod:   moja stara
     Nhead:  skrzynka
     Ncomp:  z drewna

If each constituent is assigned some value which would indicate the place where it belongs in the construction, and the semantic interpretation is compositional in the sense of Montague (e.g. 
as in Dowty, 1982 or Cann, 1986), it follows that the correct interpretation of the phrase would be made irrespective of the order of the constituents in the phrase (up to the level of ambiguity, that is); so the order might also well be like this:

   ta      stara   moja    z drewna   skrzynka
   Nspec   Nmod    Nmod    Ncomp      Nhead

This kind of interpretation of each constituent could be made possible irrespective of the position of the constituent even within a larger construction, up to the sentence level. A corresponding traditional tree-structure diagram would then have 'crossing' branches - and I do not see any reason why it should not have. There are several consequences of this phenomenon, and they do seem to be present in languages. Thus, for instance, in a non-configurational language, a verb must subcategorize for lexical case (i.e., it governs an NP in a particular case; e.g. for the verb czytac ('to read') in Polish, the subject noun must be in the Nominative and the object noun in the Accusative - see App. 2, diagrams 2.22 and 2.23). In contrast, in English, the adjacency condition would be responsible for imposing a fixed order of components, since almost no lexical case marking exists. Obviously, in a free-word-order language, it would be necessary for NPs to bear case markings when there are several NP complements. But this does not entail indiscriminate free constituent order in these languages. Thus prepositions and postpositions normally occur with just a single NP. Also the adverb bardzo ('very') in the phrase bardzo dobra skrzynka ('a very good box') would usually occur next to the adjective - possibly because adverbs have no case marking in Polish! To summarise, a non-configurational language differs from a configurational language in that morphological form (e.g. 
affixes), rather than position (i.e., configuration), indicates which words are syntactically connected to each other. In English, the grammatical, and hence semantic, relationships are indicated in surface form by the position of words, and changing these positions changes the relationships, and hence the meaning. In contrast, in Polish, the relationships between words are indicated by the affixes on the various nouns, and to change the relationships, one would have to change the affixes. It seems, then, that in these languages morphological form plays the same role that word order does in a configurational language (English, for example). This is why, I think, it is more appropriate to term Polish an inflectional language, in opposition to a configurational language like English. Obviously this is a tendency rather than an absolute, and certain elements of both types would be found in both languages (i.e., some information about grammatical relations can be obtained either through word order or morphology), but it is a tendency strong enough to make a clear distinction, especially when two given languages are at different ends along a given parameter. 4.3. Theme-Rheme Structure (Functional Sentence Perspective). I shall now discuss a few aspects of the Functional Sentence Perspective theory, developed by Prague School linguists such as Mathesius and Firbas, in order to see how it might account for certain word order phenomena in Polish. The semantic (communicative) structure of a sentence requires that the object or set of objects talked about, and the information given about it, be indicated. A sentence is then said to be composed of theme and rheme (or topic and comment); in other words, it has a theme-rheme structure, also called functional sentence perspective (Topolinska, 1984: 31). 
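Before examining theme-rheme structure proper, the inflectional/configurational contrast drawn in the previous section can be made concrete: in an inflectional language, grammatical relations can in principle be read off the case morphology of each word, regardless of position. The Python fragment below is only an illustrative sketch with a hypothetical mini-lexicon of three fully inflected forms; it is in no way a model of Polish morphology.

```python
# Illustrative sketch only: a hypothetical mini-lexicon mapping fully
# inflected Polish word forms to a crude tag (nominative noun,
# accusative noun, or verb).
LEXICON = {
    "Janek": "nom",
    "Piotra": "acc",
    "uderzyl": "verb",
}

def relations(words):
    """Recover grammatical relations from case morphology alone,
    ignoring word order entirely."""
    rels = {}
    for w in words:
        tag = LEXICON.get(w)
        if tag == "nom":
            rels["subject"] = w
        elif tag == "acc":
            rels["direct_object"] = w
        elif tag == "verb":
            rels["predicate"] = w
    return rels

# All permutations of 'Janek uderzyl Piotra' yield the same relations:
assert relations(["Janek", "uderzyl", "Piotra"]) == \
       relations(["Piotra", "uderzyl", "Janek"]) == \
       {"subject": "Janek", "direct_object": "Piotra",
        "predicate": "uderzyl"}
```

In a configurational language, the analogous function would have to consult positions rather than the lexicon, which is precisely the contrast at issue.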
There is a noticeable connection between the definition of theme and the semantico-syntactic definition of argument, and also between the definition of rheme and the definition of predicate. Namely, in elementary sentences composed of a one-argument predicate, and in derived sentences with multi-argument predicates used with one argument as a result of leaving the other argument positions unfilled, there is a one-to-one correspondence between the argument and theme, and the predicate and rheme, as in:

theme: argument / rheme: predicate
Jan / czyta. ('is reading')
Piotr / jest zonaty. ('is married')

A more complicated situation is found in elementary sentences with multi-argument predicates - any argument may be the theme. Most often, however, there is just one theme even where several arguments are present. Rarely, all arguments may jointly constitute the (complex) theme. Special constructions are used in Polish, like jezeli chodzi o ..., to ... and co do ..., to ... ('as concerns ..., ...'); more rarely the change of the order (permutation) of arguments is involved:

Jezeli chodzi o Piotra i Anne, to on sie z nia ozenil.
Co do Piotra i Anny, to on sie z nia ozenil.
('Peter and Ann got married')

The complex theme in the above sentences is represented by the expression Piotr i Anna. In sentences with multi-argument predicates which have one simple theme, any of the arguments, regardless of its form, may be chosen to serve as theme, depending on the communicative intention of the speaker. On the surface, then, theme does not always have to be represented by a noun in the nominative case, although in utterances which are not closely interrelated in a broader context the noun in the Nominative would usually fulfil the function of theme. Here are a few examples:

Piotr / ozenil sie z Anna. ('Peter got married to Ann')
Z Anna / ozenil sie Piotr.
Piotr / kocha Anne. ('Peter loves Ann')
Anne / kocha Piotr. 
(English translations for the second of each pair of sentences might be the so-called 'topicalized' construction: It was Ann that Peter married. It is Ann that Peter loves.) In all of the above sentences, theme is always represented by the first noun. Sentences with theme expressed by nouns in the nominative, occupying the initial position, are the most natural constructions, functioning independently of the context. They are regarded as unmarked, or neutral, with respect to the theme-rheme ordering of their elements; in other words, they have a neutral functional perspective. In turn, expressions with the same predicates but in which theme is represented by a non-nominative form of the noun are regarded as marked, and they usually function in a more general context. The choice of one argument as theme invariably means the shift of the rest of the arguments to the rheme position in the sentence. Rheme is then a complex structure resulting from the absorption by the predicate of some of the arguments; in other words, the predicate has the ability to order (or hierarchize) its arguments differently - predicates have the ability to create different theme-rheme structures, with unmarked (neutral) or marked orders. It follows, then, that predicate-argument constructions are not ordered in Polish; they are just differentiated according to their predicates. Thus the first argument (x) in ma(x,y) ('has(x,y)') will have a noun in the nominative, or non-nominative forms in its converses (cf. Topolinska, 1984), i.e., nalezy_do(x,y) ('belongs_to(x,y)'), for example. Here are some examples of sentence (i.e., theme-rheme) structures for the sentence 'Peter is getting married to Ann' (the original Prague School terminology has been preserved):

1. Piotr z Anna / sie zeni ('Peter' 'to Ann' / 'is getting married')

   S
     TLD
       M(od):   0
       T(im):   [pres]
       L(oc):   0
       D(ictum)
         Tm:  x = Piotr, y = Anna
         Rm:  g = sie_zeni

2a. 
Piotr / zeni sie z Anna ('Peter' / 'is getting married' 'to Ann') (the tree will be abbreviated now):

   D(ictum)
     Tm:  x = Piotr
     Rm:  g = zeni_sie, y = Anna

2b. Z Anna / zeni sie Piotr. ('to Ann' / 'Peter is getting married')

   D(ictum)
     Tm:  y = Anna
     Rm:  g = zeni_sie, x = Piotr

(slightly adapted from Topolinska, 1984: 36-37). The theme-rheme structure seems, then, to account neatly for certain phenomena of word order in Polish. It also appears that including the theme-rheme distinction can be beneficial in English too, especially within the context of the theory of communicative dynamism, also part of Prague School linguistics (also see Kay, 1975). 4.4. Grammatical Relations. Most descriptions of syntax have assumed that, perhaps in addition to semantic and pragmatic roles, there are also purely syntactic relations contracted between an NP and its predicate, which, however closely they may correlate with semantic or pragmatic relations, cannot be identified with them. These might be called grammatical relations (but the term grammatical here does not have the narrow sense of syntactic), and they can be called subject (Sub), direct object (DO) and indirect object (IO). It will be assumed, after Comrie (1981), that grammatical relations do exist, but unlike much recent work on grammatical relations (in particular, relational grammar), it will be argued that much of syntax can be understood only in relation to semantics and pragmatics (cf. also Bach 1982, for a similar view), or more specifically, that grammatical relations cannot be understood in their entirety unless they are related to semantic and pragmatic roles: '... at least many aspects of the nature of grammatical relations can be understood in terms of the interaction of semantic and pragmatic roles: for instance, many facets of subjecthood can be understood by regarding the prototype of subject as the intersection of agent and topic...' (Comrie, 1981: 60). 
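Comrie's remark can be caricatured in a few lines of Python; the representation of NPs and roles below is entirely my own invention, intended only to show the 'intersection of agent and topic' idea, not to model any actual theory of grammatical relations.

```python
def prototypical_subject(nps):
    """Naive rendering of Comrie's suggestion: the prototypical
    subject is the NP that is simultaneously agent and topic.
    Each NP is a dict with a surface 'form' and a set of 'roles'."""
    for np in nps:
        if {"agent", "topic"} <= np["roles"]:
            return np["form"]
    return None

# Hypothetical analysis of 'Janek uderzyl Piotra':
nps = [
    {"form": "Janek", "roles": {"agent", "topic"}},
    {"form": "Piotra", "roles": {"patient"}},
]
assert prototypical_subject(nps) == "Janek"
```

Real languages, of course, routinely dissociate the two roles (topical patients, focal agents), which is exactly why grammatical relations cannot simply be reduced to either.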
In much recent work on grammatical relations, it is taken for granted that certain grammatical roles exist as given by the general theory, and that the linguist looking at an individual language has to work out which NPs in this language evince these particular roles - of course oversimplifications are inevitable if certain pitfalls (e.g. problems with indirect objects in English) are to be avoided (Comrie, op. cit.). But there is a discrepancy between syntax and morphology, i.e., the morphology is arbitrary relative to the syntax (or vice versa), e.g. in the use of a certain case to exhibit a grammatical relation. Rather, the interaction of several parameters - semantic roles, pragmatic roles, grammatical relations and morphological cases - should be considered. If Polish and English are compared, it turns out that they may be regarded as being radically different types along a given parameter. Thus in English, there is a high correlation between grammatical relations and word order; indeed, word order is the basic carrier of grammatical relations, especially of subject and direct object, as can be seen in John loves Mary. Mary loves John. The position immediately before the verb is reserved for the subject, while the position immediately after the verb is reserved for the direct object, even in corresponding questions; changing the word order therefore changes the grammatical relations, and ultimately the meaning of the sentence. Also, the relations and valencies of verbs can be considered. Thus English has many verbs that can be used either transitively or intransitively. When used transitively, the subject will be an agent; when used intransitively, the verb will have a patient as its subject, as in John opened the door. The door opened. Next, in English, morphological marking of NPs plays a marginal role. To be sure, most pronouns have a nominative vs. accusative distinction, as in I saw him. He saw me. 
However, the existence of this case distinction does not provide for any greater freedom of word order: sentences like *Him saw I. *Me saw he. are simply ungrammatical in the modern language. Moreover, except in very straightforward examples like the above, the correlation between case and grammatical relation is rather weak; for instance, many speakers of English have the pattern illustrated below:

a) John and I saw Mary.
b) Me and John saw Mary.

Here, the difference between I and me is in turn conditioned by register ((b) is considered more colloquial than (a)), although there is no difference in grammatical relations (op. cit.: 70). Generally, in English, pragmatic roles play a very small role in the syntactic structure of sentences; hence the choice between alternative syntactic means of encoding the same semantic structure is often determined by pragmatic considerations, one of the principles being a preference to make the topic subject wherever possible, thus leading to a correlation between subject and topic. In contrast, in Polish, the basic marker of grammatical relations is not word order, but rather the morphology; thus the noun in the nominative is the subject and the noun in the accusative is the direct object in a majority of cases. Changing the word order does not affect the distribution of grammatical relations or semantic roles, and any of the logically possible permutations of major categories might be accepted - up to the level of ambiguity. But the permutations are by no means equivalent; in particular, they differ in terms of the pragmatic roles expressed. The basic principle in Polish, especially in non-affective use, is that the topic comes at the beginning of the sentence, and the focus at the end. Let us then look at some specific examples taken from Polish (I hope that the 'violence' of the examples will be excused):

(1) - Kto uderzyl Piotra? - Piotra uderzyl Janek.
    ('Who hit Peter?' - 'Jan hit Peter')
(2) - Kogo Janek uderzyl? 
- Janek uderzyl Piotra.
    ('Who did Jan hit?' - 'Jan hit Peter')
(3) - Pawel uderzyl Roberta. - A Janek? - Janek uderzyl Piotra.
    ('Pawel hit Robert' - 'What about Jan?' - 'Jan hit Peter')
(4) - Pawel uderzyl Roberta. - A Piotra? - Piotra uderzyl Janek.
    ('Pawel hit Robert' - 'What about Peter?' - 'Jan hit Peter')

Note that in examples (3) and (4), the distinction between the nominative (Janek) and the accusative (Piotra) is crucial to understanding whether the question is about the bully or the victim; this is not brought out in the English translations, which would, to carry the same amount of information, have to be more explicit, e.g. (3) and who did Jan hit?, (4) and who hit Peter? It seems, then, that Polish and English differ in that in English word order is determined by grammatical relations and is independent of pragmatic roles; in Polish, morphology is determined by and carries grammatical relations, while word order is determined by pragmatic roles (Comrie, op. cit.: 72; also cf. op. cit. for an account of differences between Russian and English in terms of the interaction of semantic roles with grammatical relations). Moreover, Polish has no syntactic equivalent of object-to-subject raising; thus a translation of It is easy to solve this problem. might be Latwo (jest) rozwiazac ten problem. Given the free word order of Polish, it is also possible to move the NP to the beginning of the sentence: Ten problem jest latwo rozwiazac. ('This problem is easy to solve') We observe, then, syntactically different means of encoding the same set of semantic roles. Thus the equivalent of the English passive would be, in Polish, the active with the word order DO-V-Sub, as in Piotra uderzyl Janek. ('Jan hit Peter / it was Peter who Jan hit') Piotr byl/zostal uderzony przez Janka. ('Peter was/has been hit by Jan') but in fact a subjectless construction with the verb in the 'impersonal form' is more usual: Piotra uderzono. 
('Peter has been hit') It seems, then, that the basic function of the Polish passive is not so much pragmatic as stylistic (as in Russian; cf. Comrie, op. cit.: 76). To summarise, in English, grammatical relations play a much greater role than in Polish. Firstly, the grammatical roles in English are more independent than in Polish, with a low correlation in English between grammatical roles and either semantic or pragmatic roles (or morphology, which is virtually non-existent). Secondly, there is a wider range of syntactic processes in English than in Polish where grammatical roles and changes in grammatical relations are relevant. In Polish, semantic and pragmatic roles (and even morphology) play a greater role than in English. There is, obviously, a difference of degree. Some grammatical relations may be crucial in Polish, for instance verb agreement with the subject. There is no way in which agreement can be reformulated as agreement with agent or topic:

(1) Anna uderzyla [fem] Pawla. ('Anna hit Paul')
(2) Pawel zostal uderzony [masc] przez Anne. ('Paul was hit by Anna')
(3) Pawla uderzyla [fem] Anna. ('Paul, Anna hit')

Also, only direct objects can become the subject of the passive construction. However, some NPs do not make a nominative-accusative distinction, or the morphology is ambivalent and both interpretations make sense - then preference is given to a SUB-V-DO interpretation, although this is a preference rather than an absolute. It seems, then, that the interaction of semantic, syntactic and morphological relations can characterize a part of the syntactic differences between languages with a different mode of organization better than word order does on its own, especially when the languages are similar in terms of their basic word order. This has obvious consequences for the building of parsers for the respective languages if their organization is to be adequately accounted for. 5. Processing Free-Word-Order Languages. 
There have been some attempts to account for freer word order, both in practical implementations, i.e., actual parsers, and in strictly theoretical terms. I shall examine briefly WEDNESDAY, a system which is lexically based; Johnson's unconstrained syntactic parser; and LFG, which, through its order-free composition, claims to have the potential to account for free word order. 5.1. WEDNESDAY. WEDNESDAY (Castelfranchi et al, 1982) is a lexically based system designed to cope with the freer word order of Italian. In the system, syntax is not a separate component, but is distributed throughout the lexicon. According to its authors, the parser can deal with, inter alia, flexible idioms that can vary in morphology, word order, a variety of syntactic constructions, semantic additions, and synonyms, whose recognition is governed by the individual lexical entries and takes place at the assembling level. Unfortunately, I have not been able to obtain more details about the system. 5.2. Johnson's Parser. Most attempts to describe this word order freedom take mechanisms designed to handle fairly rigid word order systems and modify them in order to account for the greater freedom of NC languages. Johnson (1985) produces an unconstrained system instead. In order to describe discontinuous constituents in non-configurational languages, he generalizes the notion of the location of a constituent to allow discontinuous locations; these discontinuous constituents can be described by a variant of DCGs. In this approach, discontinuous constituents are represented directly, in terms of a syntactic category and a discontinuous location in the utterance; e.g., a given constituent could then have the following location: {[0,1],[3,4]} - a manner reminiscent of vertices in a chart. 'Head-driven' aspects of Head Grammars (e.g. 
Pollard's), whose conception is that heads contain as lexical information a list of the items they subcategorize for, are adopted, but he claims that the use of this method is not limited to DCGs but can be incorporated in GPSG, LFG or GB. If this approach turns out to be practical, it has a very interesting consequence in that it would offer a unified account of parsing both configurational and non-configurational languages. Its major shortcoming, however, is the fact that it is too powerful and unconstrained. No language exhibits total scrambling, and therefore, in order for the parser to be interesting linguistically, it would have to have constraining mechanisms built on top. But it is generally known from the experience with ATNs that constraining a Turing machine is very difficult, if not altogether impossible. It would be more desirable if, instead, the parser's constraints were the result of the principles of its internal organization. One way of ensuring this could be to provide a sufficiently constrained grammar. 5.3. Order-Free Composition in LFG. In LFG, the nature of the mapping between c-structure and f-structure enables it to achieve many of the effects of discontinuous constituents, even though the PS component (the c-structure) does not allow discontinuous constituents as such. In particular, the information represented in one component of the f-structure may come from several constituents located throughout the sentence. So LFG is capable of describing the 'discontinuity' without using discontinuous constituents. There is, however, a subtle difference between the amount of 'discontinuity' allowed by LFG and the extent to which 'discontinuous' constituents actually occur in non-configurational languages. 
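Johnson's generalized locations, mentioned above, can be modelled as sets of intervals over string positions, much as vertices are used in a chart. The sketch below is my own rendering of the idea, not Johnson's implementation: two discontinuous locations combine only if their intervals do not overlap.

```python
def combine(loc1, loc2):
    """Combine two discontinuous locations, each a set of
    (start, end) intervals over inter-word string positions.
    Returns the merged location, or None if the two overlap."""
    for (a, b) in loc1:
        for (c, d) in loc2:
            if a < d and c < b:  # the intervals overlap
                return None
    return sorted(loc1 | loc2)

# A discontinuous constituent spanning words 0-1 and 3-4,
# i.e. the location {[0,1],[3,4]} of the text, combined with
# a head occupying the gap:
np_loc = {(0, 1), (3, 4)}
v_loc = {(1, 3)}
assert combine(np_loc, v_loc) == [(0, 1), (1, 3), (3, 4)]
```

Nothing in this representation itself forbids total scrambling, which is precisely the over-generation problem noted above; any constraints have to be imposed from outside.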
Invariably, problems with many discontinuous constituents would occur since, in LFG, the position of any item in the sentence's f-structure is determined solely by the f-equation annotations attached to it, and no new components in the f-structure can be created (Johnson, 1985: 15). 6. CONCLUDING REMARKS. Certain obvious conclusions as to the way sophisticated, powerful parsers should be built can be drawn from what has been said so far. As concerns the grammar, organising the lexicon in a certain way (e.g., along the lines suggested in previous sections) has the advantage of making PS rules to a large extent 'category-neutral' in the sense of Farmer and Stowell (op. cit.), and, in effect, of allowing an efficient implementation of this type of rule for free-word-order languages. Moreover, both free- and fixed-word-order languages could be treated in a unified way, with the rules of linear precedence operating more strictly in languages like English. This also seems to be a solution for parsing partly ungrammatical or elliptical sentences. Obviously, in this approach, the descriptive burden on the lexicon is considerable. So is the role of semantics: a strong model-theoretic semantic component in the Montague tradition is envisaged. Another consideration in building a sophisticated parser seems very important, as we have seen in the theoretical discussion of linguistic description, and it is, in the words of Lyons, that '...utterances are produced in particular contexts and cannot be understood (even within the limits imposed [...] on the interpretation of the term 'understanding' [...]) without a knowledge of the relevant contextual features [...]'; '...we must not lose sight of the relationship between utterances and particular contexts...' (1968: 419). This is largely reflected in the way humans process language: inferential processes are brought into play as soon as we start to read any phrase or text. 
There are some very powerful tools already available for use in the computational approach: ATN-based systems, Marcus-type parsers and charts. In particular, the chart, both in its role as an indexing scheme for constituents and as an active parsing agent, can be seen as a very desirable tool, its main virtues being, besides its 'bookkeeping' role, the fact that it allows independent pending hypotheses to be treated independently, and also its enormous flexibility and perspicuousness. Chart parsing is often seen as a rival approach to Marcus-parsing. Marcus parsing is an appealing method, since the parser can look for local clues which enable it to select properly what to do next. This idea seems advantageous for free word order languages, since rich inflection makes the local clues more explicit and the parser's expectations more precise. But there are problems with writing grammar rules for a deterministic parser; on the one hand, there is a heavy responsibility on the grammar-writer to find a way of writing his grammar so that the parser will be efficient and linguistically interesting (Ritchie, 1983: 79); on the other hand, it does not predict certain properties universal in languages (Briscoe, 1983: 67). From the point of view of the actual implementation of natural language processing systems, applications have developed in at least two distinct directions in the past few years. On one hand, there has been a marked trend towards the construction of large, efficient parsers for use as front ends to databases or intelligent systems. On the other hand, as a result of a re-evaluation of the fundamental notions of parsing and syntactic structure, viewed from the perspective of programs that understand natural language, systems are being designed which attempt to capture in their design the same kinds of generalizations that linguists and psycholinguists posit as theories of language structure and language use. 
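The 'bookkeeping' role of the chart mentioned above can be illustrated minimally; the toy sketch below is my own representation, not that of any particular system. Complete edges record a category over a span and are indexed by their start vertex, so that competing hypotheses over the same span simply coexist and remain independently available.

```python
from collections import defaultdict

class Chart:
    """Minimal passive chart: complete edges indexed by start vertex."""

    def __init__(self):
        # start vertex -> list of (category, end vertex)
        self.edges = defaultdict(list)

    def add(self, start, end, category):
        """Record an edge, ignoring exact duplicates."""
        if (category, end) not in self.edges[start]:
            self.edges[start].append((category, end))

    def at(self, start):
        """All complete edges beginning at this vertex."""
        return self.edges[start]

chart = Chart()
chart.add(0, 1, "Det")
chart.add(1, 2, "N")
chart.add(0, 2, "NP")  # a competing hypothesis over 0-2 coexists
assert ("NP", 2) in chart.at(0)
```

An active chart parser would add incomplete (active) edges and a fundamental rule for combining them, but even this passive skeleton shows why no hypothesis need ever be recomputed or discarded prematurely.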
Thus the Marcus parser, for instance, can be seen as an embodiment of an approach, however fragile as yet, in which attention is directed towards the interaction between the structural facts about syntax and the control structures for implementing the parsing process. Also, the current trend is away from simple methods of applying grammars (as with PS grammars), toward more integrated approaches. This could be spurred on by the fact that parsers are being called upon to handle more 'natural' text, including discourse, conversation, and sentence fragments. These involve aspects of language that cannot easily be described in the conventional, grammar-based models.