Saturday, February 21, 2015

Regular Expression

Regular expressions are a pain in the something. I have to learn them today, and I had better write this down before I forget everything. The tutorial I am following is Regex Python.

The main command is:
re.sub('abc', '', 'abc111abc')
# '111'
which acts like .replace() in Python, except that the first argument is a pattern rather than a fixed string. A couple of special characters:
  • "." means any character except the newline character. To make it match newlines too, we pass flags=re.S to re.sub
  • "*" means repeat the previous character zero times, one time, or any number of times
  • "^" at the start of a bracketed set such as [^abc] means "not these characters". Outside brackets it denotes the beginning of a line
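
Here is a quick sketch of the three characters above. The sample strings and variable names are made up for illustration.

```python
import re

# "." matches any character except a newline, so "a\nc" is left alone.
dot_default = re.sub('a.c', '', 'a-c a\nc')
# With flags=re.S, "." matches the newline too. Pass flags by keyword:
# the fourth positional argument of re.sub is count, not flags.
dot_all = re.sub('a.c', '', 'a-c a\nc', flags=re.S)
# "*" repeats the previous item zero or more times.
star = re.sub('ab*', '', 'a ab abbb c')
# "^" first inside [...] negates the set; elsewhere it anchors the match.
digits_only = re.sub('[^0-9]', '', 'tel: 555-1234')
anchored = re.sub('^abc', '', 'abcabc')

print(repr(dot_default))  # ' a\nc'
print(repr(dot_all))      # ' '
print(repr(star))         # '   c'
print(digits_only)        # 5551234
print(anchored)           # abc
```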
Let's see some examples:

re.sub('abc.*(def).*xyz', '', '222abc1def11xyz333')
# '222333'
re.sub('abc.*(def).*xyz', '', '222abc1de1f11xyz333')
# '222abc1de1f11xyz333'
This means: search for a pattern that starts with "abc", continues with any characters (or none), contains the phrase "def" somewhere in the middle, continues with any characters, and ends with "xyz". The second call leaves the string unchanged because "de1f" does not contain "def". This kind of search is greedy: it tries to find the longest pattern possible.
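
To make the greediness visible, here is a small sketch with a made-up string that has two possible "xyz" endings. It uses re.search to look at what was matched instead of replacing it.

```python
import re

# Hypothetical sample with two "xyz" endings.
text = '222abc1def11xyz333xyz'
m = re.search('abc.*(def).*xyz', text)
matched = m.group(0)   # the whole match
inner = m.group(1)     # what the parentheses captured
remaining = re.sub('abc.*(def).*xyz', '', text)

print(matched)    # abc1def11xyz333xyz
print(inner)      # def
print(remaining)  # 222
```

The match runs to the last "xyz", not the first one, so everything after "222" is removed.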

re.sub('abc((?!def).)*xyz', '', '222abc1def11xyz333')
# '222abc1def11xyz333'
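This means almost the same, except that the phrase "def" must not appear anywhere in the middle: (?!def) is a negative lookahead that is checked before each character. The string contains "def", so nothing is replaced. If we want the shortest possible match instead, being lazy rather than greedy, we put a question mark after the "*":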

re.sub('abc((?!dee).)*?xyz', '', '000abc111def222xyz333xyz')
# '000333xyz'
This means: search for the shortest pattern that starts with 'abc', ends with 'xyz', and does not contain 'dee'. There are also the symbols ^ to denote the beginning of a line and $ to denote the end. With them, re.sub('^abc.*(def).*xyz$', '', '222abc1def11xyz333') no longer matches, because the string neither starts with "abc" nor ends with "xyz". In conclusion, we have

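import re

def StripCBlock(s, w1 = 'pre', w2 = 'pre'):
    # Remove a block that starts a line with w1 and ends a line with w2.
    return re.sub('^' + w1 + '((?!' + w1 + ').)*' + w2 + '$', '', s,
                  flags = re.MULTILINE | re.S)
def StripInline(s, w1 = 'code', w2 = 'code'):
    # Remove the shortest span from w1 to the next w2.
    return re.sub(w1 + '((?!' + w1 + ').)*?' + w2, '', s,
                  flags = re.MULTILINE | re.S)
# The remaining steps clean up a string s, one after another.
# Raw strings keep backslashes such as \- and \s intact for the regex engine.
s = re.sub(r'[_/\-,\.]', ' ', s)
s = re.sub(r'[\t\n\r\f\v]', ' ', s)
s = re.sub(r'[^a-zA-Z ]', ' ', s).lower()
s = re.sub(r'\s+', ' ', s).strip()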
The first function searches for a pattern that starts with w1 at the beginning of a line (which is why "^" is in front), goes as far as possible, and ends with w2, but must not contain w1 in the middle. The $ means w2 must sit at the end of a line, that is, right before a newline character or at the end of the string. The first flag, re.MULTILINE, makes "^" and "$" match at the start and end of every line rather than only at the start and end of the whole string. The second flag, re.S, lets "." match newlines, so the pattern can span several lines.

The second function searches for any pattern that starts with w1, ends with w2 as soon as possible, and does not contain w1 in the middle. The line after it replaces each of the listed punctuation characters (_ / - , .) with a space. The next line replaces each whitespace character (tab, newline, and so on) with a space. The line after that keeps only letters and spaces, replacing everything else with a space, and converts uppercase to lowercase. The final line collapses runs of whitespace into a single space and trims both ends.
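
To check all of this on something concrete, here is a small sketch. The two functions are copied from above. The helper name clean and all the input strings are my own inventions for illustration.

```python
import re

def StripCBlock(s, w1='pre', w2='pre'):
    # Same pattern as above: w1 must start a line and w2 must end one.
    return re.sub('^' + w1 + '((?!' + w1 + ').)*' + w2 + '$', '', s,
                  flags=re.MULTILINE | re.S)

def StripInline(s, w1='code', w2='code'):
    # Same pattern as above: shortest span from w1 to the next w2.
    return re.sub(w1 + '((?!' + w1 + ').)*?' + w2, '', s,
                  flags=re.MULTILINE | re.S)

def clean(s):
    # Hypothetical helper that chains the four cleanup substitutions.
    s = re.sub(r'[_/\-,\.]', ' ', s)
    s = re.sub(r'[\t\n\r\f\v]', ' ', s)
    s = re.sub(r'[^a-zA-Z ]', ' ', s).lower()
    return re.sub(r'\s+', ' ', s).strip()

# Made-up inputs, only for illustration.
block = StripCBlock('intro\npre\nx = 1\npre\noutro')
untouched = StripCBlock('a pre b pre')  # "pre" does not start the line
inline = StripInline('use code x+1 code here')
cleaned = clean('Hello,\tWorld_2015/ok.\n')

print(repr(block))      # 'intro\n\noutro'
print(repr(untouched))  # 'a pre b pre'
print(repr(inline))     # 'use  here'
print(repr(cleaned))    # 'hello world ok'
```

Note that w1 and w2 are pasted into the pattern as they are. If they could contain special regex characters, wrapping them in re.escape would probably be a sensible precaution.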
