Formal grammar
Formal grammar
A formal grammar is a mathematical logical structure with a set of training rules that define the permissible character strings in a given formal language or natural language . Formal grammars appear in several different contexts: mathematical logic , computer science and theoretical linguistics , often with divergent methods and interests.
In a formal language, strings formed according to the rules of formal grammar are called well-formed formulas , and the set of all well-formed formulas constitutes a formal language . A formal grammar does not describe the meaning of well-formed formulas, but only their form. The theory of formal languages studies formal grammar and formal languages, and is a branch of applied mathematics . Its applications are found in theoretical computational science , linguistics , formal semantics , mathematical logic and other areas.
In formal language theory , a formal grammar (sometimes simply called grammar ) is a set of string production rules in a formal language , that is, an object that allows you to specify a language or language. The rules describe how to form strings – from the language alphabet – that are valid according to the language syntax . A grammar does not describe the meaning of strings or what can be done with them in context – only their forms.
The expression “formal grammar” can have the meanings:
- grammar of a formal language ;
- Formal description of part of the grammar of a natural language .
The Theory of Formal Language , discipline studying grammars and formal languages, is a branch of applied mathematics . Its applications can be found in Theoretical Computer Science , Theoretical Linguistics , Formal Semantics , Logical Mathematics , among other areas.
A formal grammar is a set of rules for rewriting strings, starting with an initial symbol from which rewriting begins. Therefore, a grammar is usually regarded as a language generator. However, it can also be used as a basis for a recognizer – a computational function that determines whether a given string belongs to the language or is grammatically incorrect. To describe such recognizers, formal language theory uses distinct formalisms, known as Automata Theory . One of the most interesting results of automaton theory is that it is not possible to draw a recognizer for certain formal languages.
Syntactic parsing is the process of recognizing an utterance (chain in natural language ) by breaking it into a set of symbols and parsing each according to language grammar. The meaning of the utterances of most languages is structured according to the syntax of such language – a practice known as compositional semantics . As a result, the first step in describing the meaning of a language utterance is to break it apart, and to verify its analytical form (known as the ” tree of analysis ” in Computer Science, and as “deep structure” in generative grammar ).
Introductory Example
A grammar consists mainly of a set of rules for string transformation. (If it only consisted of these rules, it would be considered a semi-Thue system .) To generate a string in language, you start with a string consisting of only one leading symbol . The production rules are then applied in any order until a chain that contains neither the start symbol nor the so – called nonterminal symbols., be produced. A production rule is applied to a string by replacing an occurrence on the left side of the grammar with one on the right. The language formed by grammar consists of all the distinct strings that can be generated in this way. Any particular sequence of production rules about the initial symbol produces a distinct chain in language. If there are several ways to generate the same string, the grammar is called ambiguous.
For example, we assume that the alphabet consists of a and b , the initial symbol is S , and we have the following rules of production:
- 1.
- 2.
So we start with S , and we can choose a rule to apply to it. If we chose rule 1, we would get the string aSb . If we choose the rule 1 again, we would replace the S ‘s aSb by aSb , and we would get the chain aaSbb . Now we choose rule 2, we would replace the S ‘s aaSbb by ba and we would get the chain aababb, ending the process. We can write this series of choices more succinctly using symbols: . The language grammar is then set to infinity , wherein the k isa repeated k times (and n in particular represents the number of times production rule 1 has been applied).
Formal Definition
The Grammar Syntax
In the formalization of generative grammars first proposed by Noam Chomsky in the 1950s, [ 1 ] [ 2 ] a G grammar was composed of the following components:
- A finite set N of non-terminal symbols , which is disjoint from the chains formed from L .
- A finite set of terminal symbols that is disjoint from C .
- A finite set P of production rules , each with the form where it is the Kleene operator and denotes the union between sets . This means that each production rule maps one symbol chain to another, where the first contains an arbitrary number of symbols, at least one of which is nonterminal. The second string would consist only of the empty string – that is, they contain no symbols – and is sometimes represented with special notations to avoid confusion ( , and or ).
- A distinguished symbol that is the initial symbol , also called the sentence symbol .
A grammar is formally defined as a tuple . Such formal grammar is sometimes referred to as the rewrite system or phrase structure grammar in literature. [ 3 ] [ 4 ]
Formal grammar in theoretical linguistics
A formal grammar is a mathematical model (more exactly an algebraic structure) composed of a series of syntactic categories that are combined with each other by means of syntactic rules that define how a syntactic category is created by means of other or grammar symbols. There are several types of historically important formal grammars:
- The categorical formal grammars ( C-grammars ) using a bottom-up analysis and require the use of tags for each category or sequence formed syntactic constituent itself. There is a single superior category that denotes complete and valid strings.
- The syntagmatic structure grammars ( ES-grammars , in English PS-grammars ) based on rewriting rules and with a top-down analysis. Like C-grammars, they are based on the notion of a syntactic constituent.
- The associative grammars (from left) ( A-grammars , English LA-grammars ), which uses uses a bottom-up analysis, which allow analysis of linear complexity , but ignore the concept of syntactic constituent.
The first two types have obvious connection points with the notion of syntactic constituency and analysis using syntax trees . However, the syntactic analyzers for the sentences formed according to them cannot be based on the rules of generation (speaking-listening asymmetry), which suggests that they cannot be good models of the speakers’ intuition. In addition, natural language models based on them seem to have a polynomial or exponential complexity, which does not seem to agree with the speed with which speakers process natural languages. In contrast, A-Grammars generally have linear complexity, symmetry between speakers and listeners, however, they ignore the classic constituents of syntactic analysis . However, they are still used for the parsers used in computing.
By means of these constituent elements a mechanism of specification is defined consisting of repeating the mechanism of substitution of a category by its constituents according to the rules beginning with the superior category and ending when the sentence no longer contains any category. In this way, grammar can generate or produce each of the corresponding language strings and only these strings
Limitation of formal grammars
The ES-grammar as used in the first models of generative grammar require certain restrictions to be computationally treatable. To understand this restriction, the interaction between a speaker and a listener must be considered, the first generates a sentence or sequence according to the grammar rules, the second to understand this sequence must analyze the sequence to understand it, finding the formative elements, interpreting them and rebuilding the relationship between them (internal structure). For that second to be possible, the internal structure must have a sufficiently simple structure to be able to parse the sequences with a low degree of ambiguity. Well, computationally it has been found that the complexity class against the reverse analysis of certain grammars is excessive.
Formal grammar in mathematics and logic
Within the formalistic and axiomatic approach of mathematics it was conceived that certain areas of mathematics could be conceived as a logical-deductive system of formulas subject to manipulation restrictions. The formal grammar of these systems would be the set of combinatorial rules according to certain deductive principles.
A formal language in logic or mathematics is a triplet {\ displaystyle \ scriptstyle (A, R_ {F}, {\ mathcal {A}}, {\ mathcal {D}})} where {\ displaystyle \ scriptstyle A} denotes the alphabet or set of signs used, the rule set {\ displaystyle \ scriptstyle R_ {F}}It explains which combinations of signs are well defined and allows defining what a well formed formula is (in that sense{\ displaystyle \ scriptstyle R_ {F}}defines the morphology of the words of the formal language). The set of well-formed formulas constitute the vocabulary or lexicon, while the pair{\ displaystyle \ scriptstyle ({\ mathcal {A}}, {\ mathcal {D}})}describes the set of axioms and the set of valid deduction rules. These last two allow to establish sequences of well-formed formulas (words of formal language) that constitute valid demonstrations within the formal system (they are somehow equivalent to the syntax of the formal language).