Formal Languages

Languages (Linguagem) are a fundamental concept in computer science, defined by an alphabet (alfabeto) and words (palavras, cadeia de caracteres). The alphabet is defined as a basic abstract entity, containing a finite amount of letters and digits. Therefore, the infinite set is NOT an alphabet. However, the empty set ∅ is an alphabet.

  An example of alphabet is the set {a, b, c}. A more practical example of an alphabet is the set of all symbols used to write a program. Writing a C program, for example, includes letters, numbers, special characters like ">", "/", etc. The binary alphabet is the domain of values for a bit, such as {0, 1}. It is analogous to the inner workings of real computers.

  While an alphabet, represented by Σ (sigma) is a finite sequence of symbols, a symbol-less string would be represented by "ε" (Epsilon). This symbol represents an empty string or word, as well as the neutral element (elemento neutro) in formal languages, meaning it does not change words in any way. Prefixes and sufixes are examples of subwords, or strings of symbols contained within a larger string / word. With the word "abcb" as an example, made with the alphabet {a, b, c}, we have the prefixes "ε", "a", "ab", "abc" and "abcb".

Concatenation is a binary operation on a set of words. It associates a pair of words, forming a new one from the juxtaposition of the first with the second. Concatenating any word with ε results in the same word, unchanged. Concatenation is also associative, so concatenating the words " v(wt) " and " (vw)t " will yield the same result. The parentheses may be omitted.

  If we have two words, "v = baa" and "w = bb" from the alphabet Σ = {a, b}, then the concatenation of two will look like such: vw = baabb. Other interactions with concatenation include:

  • wn will concatenate w with itself n times. Can be represented recursively as wn = wwn-1, n > 0
  • w0 = ε, the neutral element
  • a5 = aaaaa, a being a symbol of an alphabet.

  If we have an alphabet Σ, then:

  • Σ* is the set of all possible words on Σ
  • Σ+ = Σ* - {ε}
  The alphabet Σ* is inductively defined. The base of the induction is: ε ∈ Σ* -> for every x ∈ Σ, x ∈ Σ*. Each step of the induction is: If u and v are words of Σ*, then the concatenation uv is a word of Σ*. Thus, the set of all words can be formed by the alphabet Σ. Example:
    if Σ = {a, b}, then:
  • Σ+ = {a, b, aa, ab, ba, bb, aaa, aab, ...}
  • Σ* = {ε, a, b, aa, ab, ba, bb, aaa, aab, ...}

  The length of a word is represented by |w|. It is a function of domain (domínio) Σ* and range (codomínio) N. It returns the number of symbols in a word. For example: |abcb| = 4 and |ε| = 0.

  A formal language is a component to an alphabet. The set of parts of Σ*, 2Σ*, represents the set of all languages on an alphabet. An example of language is every program made in Pascal, or C, or Python, for example. The set of all programs is infinite.

Grammar (gramática) is a finite set of rules which, when applied successively, generate words. The set of all words generated by a grammer defines a language. For example, with the rules of C programming, you can generate programs in that language. It applies to programming languages the same way it does to natural languages, like English. Chomsky's Grammar is, then:

    G = (V, T, P, S)
  • V is the finite set of variable / non-terminal symbols.
  • T is the finite set of terminal symbols of V.
  • P is the union of V and T, a finite relation.
  • S, distinct of V, is the initial symbol / variable.

Derivation (derivação) is the substitution of a subword according to a rule of production. It is denoted by an arrow, ⇒, with a domain in (V∪T)+ and range in (V∪T)*. ⇒* denotes a transitive and reflexive relation, ⇒+ denotes a transitive relation, ⇒i denotes i steps of successive derivation.

  if G = (V, T, P, S), then we can use the example:

  • V = {N, D}
  • T = {0, 1, 2, ..., 9}
  • P = {N → D, N → DN, D → 0 | 1 | ... | 9}
  This way, the set of natural numbers is syntactically-generated, although leftmost 0s are distinct (123 is different to 0123). Another example that can be shown is the derivation of the number 243:
  • N ⇒   N → DN
  • DN ⇒   D → 2
  • 2N ⇒   N → DN
  • 2DN ⇒   D → 4
  • 24N ⇒   N → D
  • 24D ⇒   D → 3
  • 243
  • Therefore: S ⇒* 243 and S ⇒+ 243.

Automatons

  Finite Automatons (Autômatos Finitos) are state machines with a finite number of states. They are efficient algorithms that read inputs one symbol at a time. Any finite automaton is equally efficient. Each state in a state machine contains only previous information, necessary to determine the next step.

  States in a finite automaton may be Non-deterministic (Não-determinista) when they output a random value. The random choice can be within a system or outside of it, but its main point is helping to reduce the complexity of an automaton for easier understanding. Non-deterministic automatons can also perform "empty moves", represented by ε, usually symbolizing that a state machine will connect to another one at a random moment, without changing the inputted string. Deterministic Automatons (AFDs) can become Non-Deterministic Automatons (AFNs), and vice versa.

  The input of an Automaton is the "tape" (fita), consisting of a finite amount of symbols to be processed. The Automaton will read each symbol, one at a time, and determine what the next state will be based on the previous one. The symbols of the tape follow the rules of Formal Languages, such as belonging to an alphabet, forming words, etc. The tape is read by the Automaton's control unit. Automatons are not capable of storing any data, so its "memory" is simply the current state.

  A Partial Function (Função Parcial), is a state machine in which not all states and transitions are plotted out, such as a state machine going from state 1 to state 2 with input A, with no consideration for other inputs.

M = (Σ, Q, δ, q0, F) represents a Finite Automaton, where:

  • Σ is the input alphabet.
  • Q is the set of possible states of the automaton.
  • δ (delta) is a program (function) or transition function (parcial function).   δ: Q x Σ → Q
  • q0 is a distinct element of Q, the initial state.
  • F is a subset of Q, the set of final states.

A simple automaton diagram
The initial state and final state representations, as well as transition examples
A table representing the automaton that reaches its final state if it contains the subword "aa" or "bb"
The same automaton as before, now represented as a diagram

  It is important to note that, in an automaton, if a word does not reach a final state by the end of its processing, it is not part of that grammar. The automaton accepts an input if, after processing the last symbol, it assumes a final state, rejecting it otherwise. Any finite automaton can be viewed as a finite graph (grafo).

The function δ* represents a δ function extended for entire words, defined inductively, where:
  • δ*(q, ε) = q, this being the end of a recursive sequence.
  • δ*(q, aw) = δ*(δ(q, a), w), meaning it recursively travels across the word.
A sequence of lines showing off the recursive function at work. Note that the "q" changes in number based on the current state of the machine.

  The sets of accepted and rejected words in an automaton can be represented as:

    M = (Σ, Q, δ, q0, F)
  • Accepted(M) ∩ Rejected(M) = ∅ (empty set)
  • Accepted(M) ∪ Rejected(M) = Σ* (all possible words formed from an alphabet)
  • -Accepted(M) = Rejected(M)
  • -Rejected(M) = Accepted(M)
  • Note: if two distinct automatons' Accepted(M) sets are equal, they are Equivalent Finite Automatons.

  If an automaton has no final state, its language will be L = ∅. If an automaton's only state is a final state, then its language will be L = Σ*.

An automaton whose language contains words with an even amount of As and Bs. A negation of this automaton would include either an odd amount of As, an odd amount of Bs, or an odd amount of both.
Graphs showing the computations vs the paths of a graph.

  When it comes to Non-Deterministic Automatons, they may not always be more efficient. They are generally for the study of formal languages, used to determine the capacity to recognize languages and solve problems. Non-determinism in a program is a partial function, such as the transition from one state to a random one when assuming a value. A non-deterministic automaton accepts the entry after processing the last symbol in the tape and there exists at least one final state among the set of alternative states reached. It will reject the entry if all alternative states reached are non-final, or the program is undefined for the argument.

A non-deterministic automaton, where p can transition to 3 random states.