String Searching is a subject of both theoretical and practical interest in computer science. String-matching is a very important subject in the wider domain of text processing. String-matching algorithms are basic components used in implementations of practical software existing under most operating systems. Moreover, they emphasize programming methods that serve as paradigms in other fields of computer science (system or software design). Finally, they also play an important role in theoretical computer science by providing challenging problems and can be extended to areas such as
* pattern matching/recognition
* molecular biology, comparative genomics, …
* information retrieval
* data/text mining
* data/text compression, coding, encryption
* string processing in large databases
A string is a sequence of symbols drawn from some well defined set called the alphabet.
Examples of alphabets include:
* ASCII code, Unicode
* binary alphabet {0,1}
* System of DNA base-pairs {A,C,G,T}
* Latin, Greek, Chinese alphabet
Examples of strings:
* Java/C/ADA programs, HTML/XML documents
* DNA sequences
* image/video/audio files
String-matching consists of finding an exact match, or more generally, a pattern in a text. We shall examine this using Finite Automata as an implementation of regular expression matching.
Regular Expression Matching Algorithms
There are two fundamental different types of regex engines: one called DFA (Deterministic Finite Automata) and one called NFA (Non-Deterministic Finite Automata). Before a discussion of the algorithms are made, we shall first cover some background information about Finite State Machines.
Mathematical Background
The theory of Finite Automata can be classified into several categories, but the one we need for the sake of regular expression recognition is the notion of determinism. Something is deterministic when it involves no chance - everything is known and can be prescribed and simulated beforehand. On the other hand, non-determinism is about chance and probabilities. It is commonly defined as "A property of a computation which may have more than one result".
An NFA N consists of:
1. A finite set I of input symbols
2. A finite set S of states
3. A next-state function f from S x I into P(S)
4. A subset Q of S of accepting states
5. An initial state s0 from S
Denoted as N(I, S, f, Q, s0)
A DFA is a special case of NFA in which:
1. There are no Epsilon transitions
2. All transitions leading away from a state are done on different input characters
3. All states must have a valid transition on an input character
Regex-Directed Versus Text-Directed
The NFA and DFA reflect fundamental differences in algorithms for applying a regular expression to a string. The NFA is “regex-directed” and the DFA is “text-directed”.
During the course of an NFA match, the same character of the target might be checked by many different parts of the regex. Even if a sub-expression can match, it might be applied again and again as it works in concert with the rest of the regex to find a match. This is due to Backtracking mechanism of a NFA match.
The implications of NFA usage are that the details of how a match is made are very important. The writer of a regex can exercise more control simply by changing how the regex is written.
A DFA engine behaves oppositely. The engine keeps track of all matches simultaneously. There could be multiple ways to achieve the same result, but since the DFA keeps track of all them simultaneously. The DFA is essentially a “normalized” NFA. It is due to this normalization that the DFA gains in efficiency over NFA with a slightly added level of complexity in implementation.
While a DFA may be more efficient, an NFA engine can support many things that a DFA cannot. Among them are:
* Capturing text matched by a parenthesized sub-expression
* After-match information indicating where in the matched text each parenthesized sub-expression matched
* Look-around, and other complex zero-width assertions
* Possessive quantifiers and atomic grouping
Example Implementation of an NFA
Outside of the string matching world, NFA’s are used in theory and in practice for several concepts and physical machines that make use of this “abstract computer machine”. Attached is source code for you to examine. Note that common regex functionality such as Kleene Closures, Unions, and Intersections have not been implemented; and construction of the NFA has to be done manually.