Understanding Regular Expressions
• Very powerful and quite cryptic
• Fun once you get to use them
• Regular expressions are a language unto themselves
• A language of "marker characters" - programming with characters
• It is kind of an "old school" language - compact
Regular Expression Quick Guide
^ Matches the beginning of a line
$ Matches the end of the line
. Matches any character (except a newline)
\s Matches whitespace
\S Matches any non-whitespace character
* Repeats a character zero or more times
*? Repeats a character zero or more times (non-greedy)
+ Repeats a character one or more times
+? Repeats a character one or more times (non-greedy)
[aeiou] Matches a single character in the listed set
[^XYZ] Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
( Indicates where string extraction is to start
) Indicates where string extraction is to end
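A few of the entries above can be tried directly in Python. This is only a minimal sketch: the sample strings and variable names are invented for illustration, and re.findall() is used simply because it returns every match as a list.

```python
import re

# All sample strings below are made up for this illustration.

# Greedy vs non-greedy: '+' grabs as much as it can, '+?' as little as it can
greedy = re.findall('F.+:', 'From: Using the : character')
non_greedy = re.findall('F.+?:', 'From: Using the : character')

# A character set, a negated set, and a range
vowels = re.findall('[aeiou]', 'regex')
not_xyz = re.findall('[^XYZ]', 'XaYbZ')
numbers = re.findall('[0-9]+', 'My 2 favorite numbers are 19 and 42')

# \S+ matches runs of non-whitespace characters
words = re.findall(r'\S+', 'two  words')

# Parentheses mark the part of the match to extract
address = re.findall(r'^From: (\S+@\S+)', 'From: alice@example.com')

print(greedy)
print(non_greedy)
print(vowels, not_xyz, numbers, words, address)
```

Comparing the first two results shows the effect of the non-greedy "?" suffix: the greedy pattern runs to the last colon, the non-greedy one stops at the first.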
The Regular Expression Module
• Before you can use regular expressions in your program, you must import the library using "import re"
• You can use re.search() to see if a string matches a regular expression, similar to using the find() method for strings
• You can use re.findall() to extract portions of a string that match your regular expression, similar to a combination of find() and slicing: var[5:10]
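The comparison with find() and slicing can be sketched as follows. The sample line and variable names are assumptions made up for this example, and re.findall() is shown as one way to do the extraction.

```python
import re

# Hypothetical sample line, invented for this illustration
line = 'From: alice@example.com'

# re.search() returns a match object or None; find() returns an index or -1
found_with_re = re.search('From:', line) is not None
found_with_find = line.find('From:') >= 0

# Extraction with find() and slicing: skip past the first space
start = line.find(' ') + 1
sliced = line[start:]

# Extraction with a regular expression: parentheses mark the part to return
extracted = re.findall(r'^From: (\S+)', line)

print(found_with_re, found_with_find)
print(sliced)
print(extracted)
```

Note that re.findall() returns a list of matches, while slicing returns a single string.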
Using re.search() like find()
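import re

# Print every line that contains 'From:' anywhere, using a regular expression
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)

# The same result using the find() string method
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.find('From:') >= 0:
        print(line)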
Using re.search() like startswith()
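import re

# Print every line that begins with 'From:', using the ^ anchor
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        print(line)

# The same result using the startswith() string method
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.startswith('From:'):
        print(line)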
We fine-tune what is matched by adding special characters to the string
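To see what a single special character changes, the sketch below runs both patterns over a few in-memory lines instead of a file. The sample lines are invented for illustration and are not taken from mbox-short.txt.

```python
import re

# Hypothetical sample lines, invented for this illustration
lines = [
    'From: alice@example.com',
    'Reply-To: see the From: header',
    'Subject: hello',
]

# 'From:' may appear anywhere in the line
anywhere = [line for line in lines if re.search('From:', line)]

# '^From:' must appear at the start of the line
at_start = [line for line in lines if re.search('^From:', line)]

print(anywhere)
print(at_start)
```

Only the first line survives once the ^ is added, which is the same result startswith('From:') would give.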
Wild-Card Characters
• The dot character matches any character
• If you add the asterisk character, the preceding character may repeat "any number of times" (zero or more)
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
X-DSPAM-Confidence: 0.8475
X-Content-Type-Message-Body: text/plain
^X.*:
^ Match the start of the line
. Match any character
* Many times
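The pattern can be checked against the four header lines shown above. In this minimal sketch the extra 'Subject: hello' line and the variable names are assumptions added for illustration.

```python
import re

lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-DSPAM-Confidence: 0.8475',
    'X-Content-Type-Message-Body: text/plain',
    'Subject: hello',  # added for contrast; not from the original examples
]

# Keep the lines that start with X, followed by any characters, then a colon
matched = [line for line in lines if re.search('^X.*:', line)]

# group() shows exactly which part of a line the pattern consumed
consumed = re.search('^X.*:', 'X-Sieve: CMU Sieve 2.3').group()

print(matched)
print(consumed)
```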
Fine-Tuning Your Match
• Depending on how "clean" your data is and the purpose of your application, you may want to narrow your match down a bit
X-Sieve: CMU Sieve 2.3
X-DSPAM-Result: Innocent
XPlane is behind schedule: two weeks
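^X.*:
^ Match the start of the line
. Match any character
* Many times
This pattern also matches the "XPlane is behind schedule: two weeks" line, which is not a header.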
^X-\S+:
^ Match the start of the line
\S Match any non-whitespace character
+ One or more times
The XPlane line no longer matches, because it has no "-" after the X.
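The difference between the loose and the narrowed pattern can be sketched on the three sample lines from this slide. The variable names are invented for illustration.

```python
import re

lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'XPlane is behind schedule: two weeks',
]

# Loose pattern: X, then anything, then a colon somewhere later in the line
loose = [line for line in lines if re.search('^X.*:', line)]

# Narrowed pattern: X-, then one or more non-whitespace characters, then a colon
narrow = [line for line in lines if re.search(r'^X-\S+:', line)]

print(loose)
print(narrow)
```

The loose pattern accepts all three lines, while the narrowed one keeps only the two real headers.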