Difference between revisions of "Regular Expressions"

Latest revision as of 09:48, 23 August 2023

A regular expression is a notation for defining all the valid strings of a formal language.

https://www.youtube.com/watch?v=FNeaf1zm01g&index=4&list=PLCiOXwirraUAnbNTfWFxkoq5MoIair49B

TRC Video

https://www.youtube.com/watch?v=n2de-NKWBSA

Examples of Regular Expression Notation

Regular Expression	Meaning
a	Matches a string consisting of just the symbol a
b	Matches a string consisting of just the symbol b
ab	Matches a string consisting of the symbol a followed by the symbol b
a*	Matches a string consisting of zero or more a’s
a+	Matches a string consisting of one or more a’s
abb?	Matches the string ab or the string abb. The ? symbol indicates zero or one of the preceding element
a|b	Matches a string consisting of either the symbol a or the symbol b
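
The rows of this table can be checked with a short C# sketch. The FullMatch helper is a hypothetical name introduced here, not part of .NET; it wraps the pattern in brackets and adds ^ and $ so that the whole input has to match.

```csharp
using System;
using System.Text.RegularExpressions;

class NotationDemo
{
    // Hypothetical helper: true only when the whole input matches the pattern.
    // The pattern is wrapped in brackets so that ^ and $ apply to all of it.
    public static bool FullMatch(string pattern, string input)
    {
        return Regex.IsMatch(input, "^(" + pattern + ")$");
    }

    static void Main()
    {
        Console.WriteLine(FullMatch("a*", ""));        // True: zero a's is allowed
        Console.WriteLine(FullMatch("a*", "aaa"));     // True
        Console.WriteLine(FullMatch("a+", ""));        // False: needs at least one a
        Console.WriteLine(FullMatch("abb?", "ab"));    // True: the second b is optional
        Console.WriteLine(FullMatch("abb?", "abb"));   // True
        Console.WriteLine(FullMatch("abb?", "abbb"));  // False: at most two b's
        Console.WriteLine(FullMatch("a|b", "b"));      // True
        Console.WriteLine(FullMatch("a|b", "ab"));     // False: a or b, not both
    }
}
```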

Ways to Remember the Symbols

'?' is questioning whether it is there or not, so zero or one of the preceding element.

'+' suggests you have one already and you want to add more, so it's one or more of the preceding element.

'*' is often used as a wildcard character and suggests that anything goes, so zero or more of the preceding element.

Precedence Rules

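When using regular expressions, the order of precedence is as follows:

The repetition symbols (*, + and ?) are applied first

Concatenation (i.e. joining elements together) is done next

| comes last

Brackets override this order, in the same way as in arithmetic: whatever is inside the brackets is treated as a single element.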

Examples

In 'ab+', the '+' operates only on the 'b'.

The '+' is evaluated first and then the joining, so 'ab+' matches 'ab', 'abb', 'abbb', etc.

If you wanted at least one 'ab' pattern you would need to use brackets to get '(ab)+'.

In 'ab|cd', 'ab' and 'cd' are each joined before the '|' is looked at, so it matches 'ab' or 'cd'.

If you wanted an 'a' followed by a 'b' or 'c', and finishing with a 'd', you would need to use brackets to get 'a(b|c)d'.
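
A minimal C# sketch of the precedence examples above. FullMatch is a hypothetical helper, not a .NET method. It adds its own outer brackets because, by the same precedence rules, '^ab|cd$' would otherwise be read as '^ab' or 'cd$'.

```csharp
using System;
using System.Text.RegularExpressions;

class PrecedenceDemo
{
    // Hypothetical helper: the outer brackets make ^ and $ apply to the whole pattern.
    public static bool FullMatch(string pattern, string input)
    {
        return Regex.IsMatch(input, "^(" + pattern + ")$");
    }

    static void Main()
    {
        // '+' binds to the 'b' only
        Console.WriteLine(FullMatch("ab+", "abbb"));    // True
        Console.WriteLine(FullMatch("ab+", "abab"));    // False

        // Brackets make '+' apply to the whole of 'ab'
        Console.WriteLine(FullMatch("(ab)+", "abab"));  // True

        // Concatenation happens before '|'
        Console.WriteLine(FullMatch("ab|cd", "cd"));    // True
        Console.WriteLine(FullMatch("ab|cd", "abd"));   // False

        // Brackets limit the '|' to the middle symbol
        Console.WriteLine(FullMatch("a(b|c)d", "abd")); // True
        Console.WriteLine(FullMatch("a(b|c)d", "acd")); // True
        Console.WriteLine(FullMatch("a(b|c)d", "ab"));  // False
    }
}
```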

More Examples

Examples of regular expressions using the alphabet {a, b, c}

abc defines the language with only the string ‘abc’
abc | cba defines the language with two strings: ‘abc’ and ‘cba’
(a | b) c (a | b) gives four strings: ‘aca’, ‘acb’, ‘bca’, ‘bcb’
a+ gives an infinite number of strings: ‘a’, ‘aa’, ‘aaa’, etc
ab* gives an infinite number of strings: ‘a’, ‘ab’, ‘abb’, ‘abbb’, etc
(ab)* gives an infinite number of strings: ‘’, ‘ab’, ‘abab’, ‘ababab’, etc
(a | c)+ gives every non-empty string made up only of a’s and c’s (the empty string is not included)
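
One way to see which strings a regular expression defines is to try every short string over the alphabet. The sketch below does this in C# for strings over {a, b, c}. StringsInLanguage is a hypothetical helper written for this page. The patterns are written without spaces, because .NET treats a space in a pattern as a literal character by default.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class LanguageDemo
{
    // Hypothetical helper: lists every string over {a, b, c} of length 0..maxLength
    // that the pattern matches in full, shortest first.
    public static List<string> StringsInLanguage(string pattern, int maxLength)
    {
        var regex = new Regex("^(" + pattern + ")$");
        var results = new List<string>();
        var current = new List<string> { "" };  // all strings of the current length
        for (int length = 0; length <= maxLength; length++)
        {
            var next = new List<string>();
            foreach (string s in current)
            {
                if (regex.IsMatch(s)) results.Add(s);
                // Extend s by one symbol to build the strings of the next length
                foreach (char symbol in "abc") next.Add(s + symbol);
            }
            current = next;
        }
        return results;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(", ", StringsInLanguage("(a|b)c(a|b)", 3))); // aca, acb, bca, bcb
        Console.WriteLine(string.Join(", ", StringsInLanguage("ab*", 3)));         // a, ab, abb
        Console.WriteLine(string.Join(", ", StringsInLanguage("(a|c)+", 2)));      // a, c, aa, ac, ca, cc
    }
}
```

For the patterns that define an infinite language, the length limit only cuts the list short; raising it produces longer strings of the same form.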

Regular expression meta-characters

You are only expected to know ? * + | and the use of brackets; the specification is limited to just these characters.

Symbol	Meaning	Example
|	Used to separate alternatives	a|b (means a or b)
?	Used to denote zero or one of the preceding element	a? (0 or 1 a’s; matches with ‘’ and ‘a’)
*	Used to denote zero or more of the preceding element	a* (0 or more a’s; matches with ‘’, ‘a’, ‘aa’, etc.)
+	Used to denote one or more of the preceding element	a+ (1 or more a’s; matches with ‘a’, ‘aa’, etc.)
( )	Used to group characters together, to indicate the scope of another operator	(ab)* (0 or more ab’s; matches with ‘’, ‘ab’, ‘abab’, etc.)
[ ]	Another way of denoting alternatives (instead of vertical bar). Defines a character class	[ab] (means a or b)
\	The escape character (this turns the metacharacter into an ordinary character)	a\* (the a character followed by the * character. Note: \ is needed as a* would mean zero or more a’s.)
^	Used to indicate the negation of a character class. Also used to match the position before the first character in a string	a[^bc] (a followed by a character that is not a b or c); ^abc (will match with abc only if it is at the beginning of a string)
$	Used to match with the position after the last character in a string	abc$ (will match with abc only if it is at the end of a string)
.	Matches with any single character	a.a (will match with any string that has an a followed by any character followed by an a e.g. ‘aca’, ‘aba’)
-	Used to specify a range of values in a character class	[A-Z] (character in the range of A to Z)

Remember you are expected to know ? * + | and the use of brackets; the others should be explained to you in the question.
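
For the symbols outside the specification, a short C# sketch may help show what they do. Matches is a hypothetical wrapper around Regex.IsMatch, which reports a match if the pattern is found anywhere in the input unless ^ or $ says otherwise.

```csharp
using System;
using System.Text.RegularExpressions;

class MetaCharacterDemo
{
    // Hypothetical wrapper: true if the pattern is found anywhere in the input.
    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }

    static void Main()
    {
        // ^ and $ tie the match to the start or end of the input
        Console.WriteLine(Matches("^abc", "abcx"));    // True
        Console.WriteLine(Matches("^abc", "xabc"));    // False
        Console.WriteLine(Matches("abc$", "xabc"));    // True
        Console.WriteLine(Matches("abc$", "abcx"));    // False

        // [ ] is a character class, [^ ] is a negated class, - gives a range
        Console.WriteLine(Matches("^[ab]$", "b"));     // True
        Console.WriteLine(Matches("^a[^bc]$", "ad"));  // True
        Console.WriteLine(Matches("^a[^bc]$", "ab"));  // False
        Console.WriteLine(Matches("^[A-Z]$", "Q"));    // True
        Console.WriteLine(Matches("^[A-Z]$", "q"));    // False

        // . stands for any single character
        Console.WriteLine(Matches("^a.a$", "aca"));    // True
        Console.WriteLine(Matches("^a.a$", "aa"));     // False
    }
}
```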

Escape Characters

\ is the escape character. If it precedes a symbol, that symbol loses its special function and is treated as an ordinary character:

a\* matches exactly the string ‘a*’

Previous exam questions have used escape sequences, and the questions have explained what each one means. They were used for categories of character, e.g. \a meant any alphabetic character and \n meant any numeric digit.
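
A short C# sketch of escaping. Note that the exam meanings of \a and \n do not carry over to C#: in .NET patterns \a is the bell character and \n is a newline. The sketch therefore uses the character classes [A-Za-z] and [0-9] instead. The pattern "a letter followed by one or more digits" is a made-up illustration, not taken from an exam paper, and FullMatch is a hypothetical helper.

```csharp
using System;
using System.Text.RegularExpressions;

class EscapeDemo
{
    // Hypothetical helper: true only when the whole input matches the pattern.
    public static bool FullMatch(string pattern, string input)
    {
        return Regex.IsMatch(input, "^(" + pattern + ")$");
    }

    static void Main()
    {
        // The @ prefix makes a verbatim string, so \ reaches the regex engine unchanged
        Console.WriteLine(FullMatch(@"a\*", "a*"));   // True: \* is a literal asterisk
        Console.WriteLine(FullMatch(@"a\*", "aaa"));  // False
        Console.WriteLine(FullMatch("a*", "a*"));     // False: unescaped * means zero or more a's

        // Made-up example: one letter followed by one or more digits
        string letterThenDigits = "[A-Za-z][0-9]+";
        Console.WriteLine(FullMatch(letterThenDigits, "x42")); // True
        Console.WriteLine(FullMatch(letterThenDigits, "42"));  // False
    }
}
```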

C# Regex Example

To use regular expressions in C#, add System.Text.RegularExpressions to the using section of your program. By default, a pattern will return partial matches as well as full matches. For example, the pattern 'ab+' tested against the input 'aba' will return a match because the input contains 'ab', even though it isn't a full match. To fix this, use '^' at the start of the pattern to signify that the match must begin at the start of the input, and '$' at the end to signify that it must finish at the end of the input.

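using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // ^ and $ anchor the pattern, so the whole input must match 'ab+'
        Regex regex = new Regex(@"^ab+$");
        Match match = regex.Match("abb");
        if (match.Success)
        {
            // Prints "MATCH VALUE: abb"
            Console.WriteLine("MATCH VALUE: " + match.Value);
        }
        else
        {
            Console.WriteLine("NO MATCH");
        }
    }
}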
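
The difference between a partial and a full match can be seen by running the same inputs with and without the anchors. This is a minimal sketch; FirstMatch is a hypothetical helper that returns the matched text, or null when there is no match.

```csharp
using System;
using System.Text.RegularExpressions;

class PartialMatchDemo
{
    // Hypothetical helper: the text of the first match, or null if nothing matches.
    public static string FirstMatch(string pattern, string input)
    {
        Match match = Regex.Match(input, pattern);
        return match.Success ? match.Value : null;
    }

    static void Main()
    {
        // Without anchors the pattern is found inside the input
        Console.WriteLine(FirstMatch("ab+", "aba"));            // ab
        Console.WriteLine(FirstMatch("ab+", "xxabbbyy"));       // abbb

        // With anchors the whole input has to match
        Console.WriteLine(FirstMatch("^ab+$", "aba") == null);  // True
        Console.WriteLine(FirstMatch("^ab+$", "abb"));          // abb
    }
}
```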

Regular Language

https://www.youtube.com/watch?v=qGfPe2g8VOs&list=PLCiOXwirraUAnbNTfWFxkoq5MoIair49B&index=5

A regular language is a formal language that can be accepted by a finite state machine (FSM). Any language defined by a regular expression can also be defined by an FSM. An example question:

The FSM below defines the language of all strings that contain at least two consecutive 1s or at least two consecutive 0s.
The strings 0110, 00 and 01011 are all accepted by the FSM and so are valid strings in the language.
The strings 1010 and 01 are not accepted by the FSM and so are not valid strings in the language.

[Figure: Regexfsm.png]

For 3 marks you need to write the regular expression that matches this FSM.
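
One candidate answer, going only by the description above (the FSM diagram itself is not reproduced here), is (0|1)*(00|11)(0|1)*: any string of 0s and 1s, then a 00 or a 11, then any string of 0s and 1s. This is not taken from a mark scheme. The C# sketch below checks it against the five example strings; Candidate and Accepts are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

class FsmRegexDemo
{
    // Candidate answer (an assumption based on the description of the FSM)
    public const string Candidate = "(0|1)*(00|11)(0|1)*";

    // Hypothetical helper: true when the whole input matches the candidate expression.
    public static bool Accepts(string input)
    {
        return Regex.IsMatch(input, "^(" + Candidate + ")$");
    }

    static void Main()
    {
        // The first three should be accepted, the last two rejected
        string[] examples = { "0110", "00", "01011", "1010", "01" };
        foreach (string s in examples)
        {
            Console.WriteLine(s + ": " + Accepts(s));
        }
    }
}
```

The program prints True for 0110, 00 and 01011, and False for 1010 and 01, which agrees with the strings listed in the question.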
