I have been given an assignment focused on regular expressions (regexp) and parsing. This assignment is split into two parts. A parser and compiler that converts a regular expression into a set of Finite State Machines (FSM) that can then be loaded into a second piece of software that searches a piece of text for patterns that match the regular expression. I will work on the first part and my partner will work on the second.
The parser and compiler will adhere to a set of rules that have been laid out for us in the assignment document. An expression will then be run through that set of rules and the program will output a set of FSM that the second piece can then read. This will require format to be agreed with my partner but I feel this should be relatively simple. The rules for this compiler are as follows:
Rules
- Any non-special character is a literal
- . (period) is a wild card that can be any literal
- adjacent regexps are concatenated to a single regexp
- * indicates closure (zero or more) on the preceding regexp
- ? indicates that the preceding regexp can occur zero or one times
- | is an infix alternation operator such that if r and e are regexps r|e means r or e
- ( and ) may enclose a regexp to raise its precedence such that (e) is a regexp that is equivilent to e
- \ is an escape character that matches λ but indicates that the following character loses any special meaning and is a literal
- Operator precedence is as follows:
- Escaped characters
- Paretheses
- Repetition
- Concatenation
- Alternation
- Optional: ! not operator
An example of this would be (r | e) a*:
State | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
Type | br | br | r | e | br | a | λ |
Next 1 | 1 | 2 | 4 | 4 | 5 | 4 | . |
Next 2 | 1 | 3 | . | . | 6 | . | . |
I will continue to work on this over the next few weeks and provide updates as I go along with the development process.