Hello everyone. This time I'll cover regular expressions, one of my favorite topics in programming, because I have built many projects that rely on them for parsing and extracting data.
A regular expression is a way to match a pattern against text. The .NET Framework ships a regular expression engine that you can use to match and extract text according to the pattern you want.
Creating and Defining Regular Expressions
A regular expression pattern is built from a combination of characters and operators.
1. Character Escapes
These are special characters, introduced by a backslash, that tell the engine how to interpret the character that follows. The table below lists the character escapes:
Escape character | Description | Pattern | Matches |
\a | Matches a bell character, \u0007. | \a | "\u0007" in "Warning!" + '\u0007' |
\b | In a character class, matches a backspace, \u0008. | [\b]{3,} | "\b\b\b\b" in "\b\b\b\b" |
\t | Matches a tab, \u0009. | (\w+)\t | "Name\t", "Addr\t" in "Name\tAddr\t" |
\r | Matches a carriage return, \u000D. (\r is not equivalent to the newline character, \n.) | \r\n(\w+) | "\r\nHello" in "\r\nHello\nWorld." |
\v | Matches a vertical tab, \u000B. | [\v]{2,} | "\v\v\v" in "\v\v\v" |
\f | Matches a form feed, \u000C. | [\f]{2,} | "\f\f\f" in "\f\f\f" |
\n | Matches a new line, \u000A. | \r\n(\w+) | "\r\nHello" in "\r\nHello\nWorld." |
\e | Matches an escape, \u001B. | \e | "\x001B" in "\x001B" |
\nnn | Uses octal representation to specify a character (nnn consists of up to three digits). | \w\040\w | "a b", "c d" in "a bc d" |
\xnn | Uses hexadecimal representation to specify a character (nn consists of exactly two digits). | \w\x20\w | "a b", "c d" in "a bc d" |
\cX, \cx | Matches the ASCII control character that is specified by X or x, where X or x is the letter of the control character. | \cC | "\x0003" in "\x0003" (Ctrl-C) |
\unnnn | Matches a Unicode character by using hexadecimal representation (exactly four digits, as represented by nnnn). | \w\u0020\w | "a b", "c d" in "a bc d" |
\ | When followed by a character that is not recognized as an escaped character, matches that character. | \d+[\+-x\*]\d+ | "2+2" and "3*9" in "(2+2) * 3*9" |
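To make the table a bit more concrete, here is a small sketch that runs a few of the escapes above through Regex.Matches. The class name EscapeDemo and the helper FindAll are made up for this illustration; only Regex.Matches comes from the .NET library.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: returns every substring of input matched by pattern.
    public static string[] FindAll(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // \t matches a tab, so each word followed by a tab is one match.
        Console.WriteLine(FindAll("Name\tAddr\t", @"(\w+)\t").Length);

        // \x20 and \u0020 both denote the space character (U+0020).
        Console.WriteLine(string.Join(", ", FindAll("a bc d", @"\w\x20\w")));
        Console.WriteLine(string.Join(", ", FindAll("a bc d", @"\w\u0020\w")));

        // A backslash in front of + or * makes the operator a literal character.
        Console.WriteLine(string.Join(", ", FindAll("(2+2) * 3*9", @"\d+[\+-x\*]\d+")));
    }
}
```

Note the verbatim string literals (@"..."): they keep the backslashes intact so the regex engine, not the C# compiler, interprets the escapes.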
2. Character Classes
Character classes describe a set of characters, any one of which may match. The character classes are:
Character class | Description | Pattern | Matches |
[character_group] | Matches any single character in character_group. By default, the match is case-sensitive. | [mn] | "m" in "mat"; "m", "n" in "moon" |
[^character_group] | Negation: Matches any single character that is not in character_group. By default, characters in character_group are case-sensitive. | [^aei] | "v", "l" in "avail" |
[first-last] | Character range: Matches any single character in the range from first to last. | [b-d]irds | "birds", "cirds", "dirds" in "birds cirds dirds" |
. | Wildcard: Matches any single character except \n. | a.e | "ave" in "have"; "ate" in "mate" |
\p{name} | Matches any single character in the Unicode general category or named block specified by name. | \p{Lu} | "C", "L" in "City Lights" |
\P{name} | Matches any single character that is not in the Unicode general category or named block specified by name. | \P{Lu} | "i", "t", "y" in "City" |
\w | Matches any word character. | \w | "R", "o", "o", "m", "1" in "Room#1" |
\W | Matches any non-word character. | \W | "#" in "Room#1" |
\s | Matches any white-space character. | \w\s | "D " in "ID A1.3" |
\S | Matches any non-white-space character. | \s\S | " _" in "int __ctr" |
\d | Matches any decimal digit. | \d | "4" in "4 = IV" |
\D | Matches any character other than a decimal digit. | \D | " ", "=", " ", "I", "V" in "4 = IV" |
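As a quick check of the rows above, the sketch below prints the matches for a handful of character classes. CharClassDemo and JoinMatches are names invented for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Hypothetical helper: joins all matches of pattern in input with commas.
    public static string JoinMatches(string input, string pattern)
    {
        return string.Join(",", Regex.Matches(input, pattern)
                                     .Cast<Match>()
                                     .Select(m => m.Value));
    }

    public static void Main()
    {
        Console.WriteLine(JoinMatches("moon", "[mn]"));            // positive group
        Console.WriteLine(JoinMatches("avail", "[^aei]"));         // negated group
        Console.WriteLine(JoinMatches("City Lights", @"\p{Lu}"));  // uppercase letters
        Console.WriteLine(JoinMatches("Room#1", @"\W"));           // non-word characters
        Console.WriteLine(JoinMatches("4 = IV", @"\d"));           // decimal digits
    }
}
```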
3. Anchors
Anchors make a match succeed or fail depending on the current position in the string. The anchors available in regular expressions are:
Assertion | Description | Pattern | Matches |
^ | The match must start at the beginning of the string or line. | ^\d{3} | "567" in "567-777-" |
$ | The match must occur at the end of the string or before \n at the end of the line or string. | -\d{4}$ | "-2012" in "8-12-2012" |
\A | The match must occur at the start of the string. | \A\w{4} | "Code" in "Code-007-" |
\Z | The match must occur at the end of the string or before \n at the end of the string. | -\d{3}\Z | "-007" in "Bond-901-007" |
\z | The match must occur at the end of the string. | -\d{3}\z | "-333" in "-901-333" |
\G | The match must occur at the point where the previous match ended. | \G\(\d\) | "(1)", "(3)", "(5)" in "(1)(3)(5)[7](9)" |
\b | The match must occur on a boundary between a \w (alphanumeric) and a \W (nonalphanumeric) character. | \b\w+\s\w+\b | "them theme", "them them" in "them theme them them" |
\B | The match must not occur on a \b boundary. | \Bend\w*\b | "ends", "ender" in "end sends endure lender" |
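The difference between \Z and \z, and the way \G chains matches, is easier to see in code. The sketch below is only an illustration; AnchorDemo and FirstMatch are hypothetical names, and the trailing "\n" in the input is an assumption added to show where the two end anchors differ.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Hypothetical helper: returns the first match, or an empty string if none.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : "";
    }

    public static void Main()
    {
        // ^ and $ anchor to the start and end of the string.
        Console.WriteLine(FirstMatch("567-777-", @"^\d{3}"));
        Console.WriteLine(FirstMatch("8-12-2012", @"-\d{4}$"));

        // \Z tolerates one trailing newline; \z does not.
        Console.WriteLine(FirstMatch("Bond-901-007\n", @"-\d{3}\Z"));
        Console.WriteLine(FirstMatch("Bond-901-007\n", @"-\d{3}\z") == "");

        // \G forces each match to start where the previous one ended,
        // so the chain stops at "[7]" and "(9)" is never reached.
        Console.WriteLine(Regex.Matches("(1)(3)(5)[7](9)", @"\G\(\d\)").Count);
    }
}
```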
4. Grouping Constructs
Grouping constructs match against groups of characters or substrings (subexpressions). The grouping constructs are:
Grouping construct | Description | Pattern | Matches |
(subexpression) | Captures the matched subexpression and assigns it a one-based ordinal number (group 0 is the whole match). | (\w)\1 | "ee" in "deep" |
(?<name>subexpression) | Captures the matched subexpression into a named group. | (?<double>\w)\k<double> | "ee" in "deep" |
(?<name1-name2>subexpression) | Defines a balancing group definition. | (((?'Open'\()[^\(\)]*)+((?'Close-Open'\))[^\(\)]*)+)*(?(Open)(?!))$ | "((1-3)*(3-1))" in "3+2^((1-3)*(3-1))" |
(?:subexpression) | Defines a noncapturing group. | Write(?:Line)? | "WriteLine" in "Console.WriteLine()" |
(?imnsx-imnsx:subexpression) | Applies or disables the specified options within subexpression. | A\d{2}(?i:\w+)\b | "A12xl", "A12XL" in "A12xl A12XL a12xl" |
(?=subexpression) | Zero-width positive lookahead assertion. | \w+(?=\.) | "is", "ran", and "out" in "He is. The dog ran. The sun is out." |
(?!subexpression) | Zero-width negative lookahead assertion. | \b(?!un)\w+\b | "sure", "used" in "unsure sure unity used" |
(?<=subexpression) | Zero-width positive lookbehind assertion. | (?<=19)\d{2}\b | "99", "50", "05" in "1851 1999 1950 1905 2003" |
(?<!subexpression) | Zero-width negative lookbehind assertion. | (?<!19)\d{2}\b | "51", "03" in "1851 1999 1950 1905 2003" |
(?>subexpression) | Nonbacktracking (or "greedy") subexpression. | [13579](?>A+B+) | "1ABB", "3ABB", and "5AB" in "1ABB 3ABBC 5AB 5AC" |
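Here is a short sketch of how a few of these grouping constructs behave in practice: a named group, lookahead, both lookbehinds, and a noncapturing group. GroupingDemo, FindAll, and DoubledLetter are names I made up for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class GroupingDemo
{
    // Hypothetical helper: returns the values of all matches.
    public static string[] FindAll(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    // Uses a named group to capture the letter that appears twice in a row.
    public static string DoubledLetter(string word)
    {
        Match m = Regex.Match(word, @"(?<double>\w)\k<double>");
        return m.Success ? m.Groups["double"].Value : "";
    }

    public static void Main()
    {
        Console.WriteLine(DoubledLetter("deep"));

        // Lookahead: words directly followed by a period (the period is not consumed).
        Console.WriteLine(string.Join(", ",
            FindAll("He is. The dog ran. The sun is out.", @"\w+(?=\.)")));

        // Lookbehind: last two digits of years that do / do not start with 19.
        string years = "1851 1999 1950 1905 2003";
        Console.WriteLine(string.Join(", ", FindAll(years, @"(?<=19)\d{2}\b")));
        Console.WriteLine(string.Join(", ", FindAll(years, @"(?<!19)\d{2}\b")));

        // A noncapturing group adds no entry to Groups (only group 0 remains).
        Match w = Regex.Match("Console.WriteLine()", "Write(?:Line)?");
        Console.WriteLine(w.Value + " " + w.Groups.Count);
    }
}
```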
5. Quantifiers
Quantifiers specify how many times an element (a character, a group, or a character class) must occur in the input text.
Quantifier | Description | Pattern | Matches |
* | Matches the previous element zero or more times. | \d*\.\d | ".0", "19.9", "219.9" |
+ | Matches the previous element one or more times. | be+ | "bee" in "been", "be" in "bent" |
? | Matches the previous element zero or one time. | rai?n | "ran", "rain" |
{n} | Matches the previous element exactly n times. | ,\d{3} | ",043" in "1,043.6", ",876", ",543", and ",210" in "9,876,543,210" |
{n,} | Matches the previous element at least n times. | \d{2,} | "166", "29", "1930" |
{n,m} | Matches the previous element at least n times, but no more than m times. | \d{3,5} | "166", "17668", "19302" in "193024" |
*? | Matches the previous element zero or more times, but as few times as possible. | \d*?\.\d | ".0", "19.9", "219.9" |
+? | Matches the previous element one or more times, but as few times as possible. | be+? | "be" in "been", "be" in "bent" |
?? | Matches the previous element zero or one time, but as few times as possible. | rai??n | "ran", "rain" |
{n}? | Matches the preceding element exactly n times. | ,\d{3}? | ",043" in "1,043.6", ",876", ",543", and ",210" in "9,876,543,210" |
{n,}? | Matches the previous element at least n times, but as few times as possible. | \d{2,}? | "16" in "166", "29", "19" and "30" in "1930" |
{n,m}? | Matches the previous element between n and m times, but as few times as possible. | \d{3,5}? | "166", "176" in "17668", "193" and "024" in "193024" |
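The greedy and lazy variants are the part that trips most people up, so here is a minimal side-by-side sketch. QuantifierDemo and First are hypothetical names, and the HTML-like input is just sample data I picked for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Hypothetical helper: value of the first match, or an empty string.
    public static string First(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : "";
    }

    public static void Main()
    {
        // Greedy + takes as many 'e' characters as it can; lazy +? takes one.
        Console.WriteLine(First("been", "be+"));
        Console.WriteLine(First("been", "be+?"));

        // Greedy .* runs to the last closing tag; lazy .*? stops at the first.
        string html = "<b>one</b><b>two</b>";
        Console.WriteLine(First(html, "<b>.*</b>"));
        Console.WriteLine(First(html, "<b>.*?</b>"));

        // {n} is an exact count: only groups of exactly three digits after a comma.
        Console.WriteLine(Regex.Matches("9,876,543,210", @",\d{3}").Count);
    }
}
```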
6. Backreference Constructs
Backreferences match a subexpression that was captured earlier in the same regular expression.
Backreference construct | Description | Pattern | Matches |
\number | Backreference. Matches the value of a numbered subexpression. | (\w)\1 | "ee" in "seek" |
\k<name> | Named backreference. Matches the value of a named expression. | (?<char>\w)\k<char> | "ee" in "seek" |
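A common use of backreferences is spotting a word typed twice in a row. The sketch below assumes a simple definition of "word" (\w+ separated by white space); BackreferenceDemo and RepeatedWord are illustrative names, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Returns the first word that is immediately repeated, or an empty string.
    public static string RepeatedWord(string text)
    {
        // (\w+) captures a word; \1 requires the same text again after white space.
        Match m = Regex.Match(text, @"\b(\w+)\s+\1\b");
        return m.Success ? m.Groups[1].Value : "";
    }

    public static void Main()
    {
        Console.WriteLine(RepeatedWord("this is is a test"));

        // The named form from the table finds the doubled letter in "seek".
        Console.WriteLine(Regex.Match("seek", @"(?<char>\w)\k<char>").Value);
    }
}
```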
7. Alternation Constructs
Alternation constructs match one of several patterns, so a match has several alternatives.
Alternation construct | Description | Pattern | Matches |
| | Matches any one element separated by the vertical bar (|) character. | th(e|is|at) | "the", "this" in "this is the day. " |
(?( expression )yes | no ) | Matches yes if expression matches; otherwise, matches the optional no part. Expression is interpreted as a zero-width assertion. | (?(A)A\d{2}\b|\b\d{3}\b) | "A10", "910" in "A10 C103 910" |
(?( name )yes | no ) | Matches yes if the named capture name has a match; otherwise, matches the optional no. | (?<quoted>")?(?(quoted).+?"|\S+\s) | Dogs.jpg, "Yiska playing.jpg" in "Dogs.jpg "Yiska playing.jpg"" |
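The first two rows of the table can be tried out with the sketch below. AlternationDemo and FindAll are names invented for this example; the patterns and inputs are the ones from the table.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Hypothetical helper: returns the values of all matches.
    public static string[] FindAll(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Plain alternation: "th" followed by one of three endings.
        Console.WriteLine(string.Join(", ", FindAll("this is the day. ", "th(e|is|at)")));

        // Conditional: if the next character is "A", match A plus two digits;
        // otherwise match a standalone three-digit number.
        Console.WriteLine(string.Join(", ",
            FindAll("A10 C103 910", @"(?(A)A\d{2}\b|\b\d{3}\b)")));
    }
}
```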
8. Substitutions
Substitutions are constructs used in a replacement pattern to build the replacement text from parts of the match.
Character | Description | Pattern | Replacement pattern | Input string | Resulting string |
$number | Substitutes the substring matched by group number. | \b(\w+)(\s)(\w+)\b | $3$2$1 | "one two" | "two one" |
${name} | Substitutes the substring matched by the named group name. | \b(?<word1>\w+)(\s)(?<word2>\w+)\b | ${word2} ${word1} | "one two" | "two one" |
$$ | Substitutes a literal "$". | \b(\d+)\s?USD | $$$1 | "103 USD" | "$103" |
$& | Substitutes a copy of the whole match. | (\$*(\d*(\.+\d+)?){1}) | **$& | "$1.30" | "**$1.30**" |
$` | Substitutes all the text of the input string before the match. | B+ | $` | "AABBCC" | "AAAACC" |
$' | Substitutes all the text of the input string after the match. | B+ | $' | "AABBCC" | "AACCCC" |
$+ | Substitutes the last group that was captured. | B+(C+) | $+ | "AABBCCDD" | "AACCDD" |
$_ | Substitutes the entire input string. | B+ | $_ | "AABBCC" | "AAAABBCCCC" |
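The rows above map directly onto the static Regex.Replace(input, pattern, replacement) overload. A minimal sketch follows; the class name SubstitutionDemo is made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubstitutionDemo
{
    public static void Main()
    {
        // $1..$3 refer to numbered groups: swap the two words, keep the separator.
        Console.WriteLine(Regex.Replace("one two", @"\b(\w+)(\s)(\w+)\b", "$3$2$1"));

        // ${name} refers to a named group.
        Console.WriteLine(Regex.Replace("one two",
            @"\b(?<word1>\w+)(\s)(?<word2>\w+)\b", "${word2} ${word1}"));

        // $$ emits a literal dollar sign, followed here by group 1.
        Console.WriteLine(Regex.Replace("103 USD", @"\b(\d+)\s?USD", "$$$1"));

        // $_ inserts the entire input string in place of the match.
        Console.WriteLine(Regex.Replace("AABBCC", "B+", "$_"));
    }
}
```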
9. Miscellaneous Constructs
A few additional constructs that are useful when extracting strings.
Construct | Definition | Example |
(?imnsx-imnsx) | Sets or disables options such as case insensitivity in the middle of a pattern. | \bA(?i)b\w+\b matches "ABA", "Able" in "ABA Able Act" |
(?#comment) | Inline comment. The comment ends at the first closing parenthesis. | \bA(?#Matches words starting with A)\w+\b |
# [to end of line] | X-mode comment. The comment starts at an unescaped # and continues to the end of the line. | (?x)\bA\w+\b#Matches words starting with A |
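Inline options and x-mode comments are handy for keeping longer patterns readable. The sketch below is one way to try them; MiscDemo and FindAll are hypothetical names, and the multi-line layout of the commented pattern is my own choice.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class MiscDemo
{
    // Hypothetical helper: returns the values of all matches.
    public static string[] FindAll(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // (?i) switches to case-insensitive matching from that point on,
        // so the leading A must be uppercase but the b may be either case.
        Console.WriteLine(string.Join(", ", FindAll("ABA Able Act", @"\bA(?i)b\w+\b")));

        // (?x) ignores unescaped white space in the pattern and treats
        // everything from # to the end of the line as a comment.
        string commented = @"(?x)
            \b A      # a word boundary, then a literal A
            \w+ \b    # the rest of the word";
        Console.WriteLine(string.Join(", ", FindAll("ABA Able Act", commented)));
    }
}
```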
The Regex Class
The Regex class is the class used in C# to perform regular expression operations. Some of its commonly used methods are listed below.
Sr.No | Method | Description |
1 | public bool IsMatch(string input) | Indicates whether the regular expression specified in the Regex constructor finds a match in a specified input string. |
2 | public bool IsMatch(string input, int startat) | Indicates whether the regular expression specified in the Regex constructor finds a match in the specified input string, beginning at the specified starting position in the string. |
3 | public static bool IsMatch(string input, string pattern) | Indicates whether the specified regular expression finds a match in the specified input string. |
4 | public MatchCollection Matches(string input) | Searches the specified input string for all occurrences of a regular expression. |
5 | public string Replace(string input, string replacement) | In a specified input string, replaces all strings that match a regular expression pattern with a specified replacement string. |
6 | public string[] Split(string input) | Splits an input string into an array of substrings at the positions defined by a regular expression pattern specified in the Regex constructor. |
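Before the full programs, here is a compact sketch that calls each of the six methods listed above on one Regex instance. The class name RegexMethodsDemo, the pattern \d+, and the sample inputs are all choices made for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexMethodsDemo
{
    // One compiled pattern reused by every call: one or more digits.
    public static readonly Regex Digits = new Regex(@"\d+");

    public static void Main()
    {
        // 1. Instance IsMatch: is there a match anywhere in the input?
        Console.WriteLine(Digits.IsMatch("abc 123"));

        // 2. IsMatch with startat: the search begins at index 3, after "123".
        Console.WriteLine(Digits.IsMatch("123 abc", 3));

        // 3. Static IsMatch: the pattern is passed as the second argument.
        Console.WriteLine(Regex.IsMatch("abc", @"\d+"));

        // 4. Matches: every occurrence, as a MatchCollection.
        Console.WriteLine(Digits.Matches("10 20 30").Count);

        // 5. Replace: every match is replaced by the replacement string.
        Console.WriteLine(Digits.Replace("a1b22c", "#"));

        // 6. Split: the matches act as separators.
        Console.WriteLine(string.Join("|", Digits.Split("a1b22c")));
    }
}
```

Creating the Regex once and reusing it, as done here with the static field, avoids re-parsing the pattern on every call.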
The following example program matches words that start with the letter 'S'.
using System;
using System.Text.RegularExpressions;

namespace RegExApplication
{
    class Program
    {
        // Prints the pattern, then every match found in the text.
        private static void ShowMatch(string text, string expr)
        {
            Console.WriteLine("The Expression: " + expr);
            MatchCollection mc = Regex.Matches(text, expr);
            foreach (Match m in mc)
            {
                Console.WriteLine(m);
            }
        }

        static void Main(string[] args)
        {
            string str = "A Thousand Splendid Suns";
            Console.WriteLine("Matching words that start with 'S': ");
            // \b = word boundary, S = literal 'S', \S* = any run of non-white-space characters
            ShowMatch(str, @"\bS\S*");
            Console.ReadKey();
        }
    }
}
When the program runs, it produces the following output:
Matching words that start with 'S':
The Expression: \bS\S*
Splendid
Suns
The next example program matches words that start with 'm' and end with 'e':
using System;
using System.Text.RegularExpressions;

namespace RegExApplication
{
    class Program
    {
        // Prints the pattern, then every match found in the text.
        private static void ShowMatch(string text, string expr)
        {
            Console.WriteLine("The Expression: " + expr);
            MatchCollection mc = Regex.Matches(text, expr);
            foreach (Match m in mc)
            {
                Console.WriteLine(m);
            }
        }

        static void Main(string[] args)
        {
            string str = "make maze and manage to measure it";
            Console.WriteLine("Matching words start with 'm' and ends with 'e':");
            // \bm = 'm' at the start of a word, \S* = any non-white-space run, e\b = 'e' at the end of a word
            ShowMatch(str, @"\bm\S*e\b");
            Console.ReadKey();
        }
    }
}
When the program runs, it produces the following output:
Matching words start with 'm' and ends with 'e':
The Expression: \bm\S*e\b
make
maze
manage
measure
The following example program replaces each run of white space with a single space:
using System;
using System.Text.RegularExpressions;

namespace RegExApplication
{
    class Program
    {
        static void Main(string[] args)
        {
            // Input with repeated spaces between and after the words.
            string input = "Hello   World   ";
            // \s+ matches one or more consecutive white-space characters.
            string pattern = "\\s+";
            string replacement = " ";
            Regex rgx = new Regex(pattern);
            string result = rgx.Replace(input, replacement);
            Console.WriteLine("Original String: {0}", input);
            Console.WriteLine("Replacement String: {0}", result);
            Console.ReadKey();
        }
    }
}
When the program runs, it produces the following output:
Original String: Hello   World   
Replacement String: Hello World 
OK everyone, that was a quick tour of regex. If you feel confused, that's normal; regex really is confusing the first time you touch it. With time, though, you will get used to wielding this powerful weapon. See you in the next tutorial: Exception Handling.