原文链接:http://oracle-base.com/articles/misc/regular-expressions-support-in-oracle.php
相干文章:
Oracle 10g introduced support for regular expressions in SQL andPL/SQL with the following functions.
Oracle 10g开始支持在SQL和PLSQL中使用以下正则表达式:
Oracle 11g introduced two new features related to regularexpressions.
11g开始引入2个新特性:
Learning to write regular expressions takes a little time. If youdon't do it regularly, it can be a voyage of discovery each time. The generalrules for writing regular expressions are available here.You can read the Oracle Regular Expression Support here.
Rather than trying to repeat the formal definitions, I'll presenta number of problems I've been asked to look at over the years, where asolution using a regular expression has been appropriate.
此处不重复正则表达式的定义,代之以1组问题导向的正则表达式利用实例:
The data in a column is free text, but may include a 4 digit year.
数据在字段中以自由文本寄存,但是可能包括4个精度的年份数据。
If we needed to return rows containing a specific year we coulduse the LIKE
operator (WHERE data LIKE '%2014%'
),but how do we return rows using a comparison (<, <=, >, >=,<>)?
One way to approach this is to pull out the 4 figure year andconvert it to a number, so we don't accidentally do an ASCII comparison. That'spretty easy using regular expressions.
如果我们需要返回包括指定年份的数据我们可使用LIKE操作符(…),但是如何通过不等操作符返回行?1条路是抽出4个数字的年份并转换为数字。通过正则表达式可以很容易实现。
We can identify digits using the "d" or"[0⑼]" operators. We want a group of four of them, which isrepresented by the "{4}" operator. So our regular expression will be"d{4}" or "[0⑼]{4}". The REGEXP_SUBSTR
functionreturns the string matching
the regular expression, so that can be used toextract the text of interest. We then just need to convert it to a number andperform our comparison.
我们使用d或[0⑼]来辨认数字。我们需要4个1组,可使用{4}表示。至此,我们的正则表达式为:d{4}或[0⑼]{4}。REGEXP_SUBSTR函数返回匹配指定正式表达式的字符串,所以可以用来提取我们感兴趣的文本。然后我们只需将其转换为数字并履行比较便可。
Given a source string, how do we split it up into separatecolumns, based on changes of case and alpha-to-numeric, such that this.
给定1个元字符串,要求依照指定规则(基于字母大小写和字母到数字的变化)分割为多个列:
Becomes this. 分割后:
The source data is set up like this. 元数据以下:
The first part of the string is an initcap word, so it starts witha capital letter between "A" and "Z". We identify a singlecharacter using the "[]" operator, and ranges are represented using"-", like "A-Z", "a-z" or "0⑼". So ifwe are looking for a single character that is a capital letter, we need to lookfor "[A-Z]". That needs to be followed by lower case letters, whichwe now know is "[a-z]", but we need 1 or more of them, which issignified by the "+" operator. So to find an initcap word, we need tosearch for "[A-Z][a-z]+". Since we want the first occurrence of this,we can use the following.
字符串第1部份为大写字母,可能为A-Z。我们使用[]操作符辨认单个字符,至于范围则用“-”,例如“A-Z”,"a-z"或"0⑼"。所以如果我们需要找大写的首字母则用“[A-Z]”。其后紧随着的是若干小写字母,可以用+表示若干(1个或多个)。组合起来的正则表达式即为:[A-Z][a-z]+,这样拆分出的第1列方法有了。
The second part of the string is a group of 1 or more uppercaseletters. We know we need to use the "[A-Z]+" pattern, but we need tomake sure we don't get the first capital letter, so we look for the secondoccurrence.
第2部份是1组包括1个或多个大写字母。我们知道需要用模式:[A-Z]+,但是为了不和第1部份冲突,我们指明匹配其第2次出现的文本。
The next part is the first occurrence of a group of numbers.
下1部份是1组纯数字。
The next part is a group of lower case letters. We don't to pickup those from the initcap word, so we must look for the second occurrence oflower case letters.
下1部份是1组小写字母,一样斟酌了不和第1部份冲突:
Finally, we have a group of numbers, which is the secondoccurrence of this pattern.
最后,是1组数字:
Putting that all together, we have the following query, whichsplits the data into separate columns.
将以上每部份正则表达式的输出分别作为独立字段:
We need to pull out a group of characters from a "/"delimited string, optionally enclosed by double quotes. The data looks likethis.
我们需要从1个字符串(含有分隔字符/和双引号” ”)中提取1组字符,原始数据以下:
We are looking for 1 or more characters that are not"/", which we do using "[^/]+". The "^" in thebrackets represents NOT and "+" means 1 or more. We also want toremove optional double quotes, so we add that as a character we don't want,giving us "[^/"]+". So if we want the data from the thirdcolumn, we need the third occurrence of this pattern.
我们要找1个或多个非“/“字符,可使用”[^/]+“。^在方括号中表示NOT。我们还需要移除可选的双引号所以需要使用[^/”]+。所以如果我们需要获得第3次出现的字符串:
We need to take an initcap string and separate the words. The datalooks like this.
我们需要提取首字母大写的字符串并将其分离。原始数据以下:
We need to find each uppercase character "[A-Z]". Wewant to keep that character we find, so we will make that pattern asub-expression "([A-Z])", allowing us to refer to it later. For eachmatch, we want to replace it with a space, plus the matching character. Thespace is pretty obvious, but we need to use "1" to signify the textmatching the first sub expression. So we will replace the matching pattern witha space and itself, " 1". We don't want to replace the first letterof the string, so we will start at the second occurrence.
我们需要使用[A-Z]找到每一个大写字符。我们需要保存找到的字符,所以我们使用1个子表达式([A-Z]),以便后续对其援用。对每个匹配,我们想使用1个空格替换,加上匹配到的字符。空格是相当明显的,但我们需要使用”1”表示第1个子表达式匹配的文本。所以我们替换匹配模式使用1个空格和其本身,即”1”。我们不想替换字符串的第1个字母,所以我们从第2个字符开始:
We have a specific pattern of digits (9 99:99:99) and we want toknow the location of the pattern in our data.
我们有1个指定数字模式(999:99:99)并且我们想知道模式在我们数据中所处位置。
We know we are looking for groups of numbers, so we can use"[0⑼]" or "d". We know the amount of digits in eachgroup, which we can indicate using the "{n}" operator, so we simplydescribe the pattern we are looking for.
我们知道我们正在找1组数字,所以使用"[0⑼]"或"d"。我们知道每组数字的数量,所以可使用{n}操作符,所以我们简单描写1下模式:
We have strings containing parentheses. We want to return the textwithin the parentheses for those rows that contain parentheses.
我们有包括在括号内的字符串。我们想只想返回括号内的字符串。
The basic pattern for text between parentheses is"(.*)". The "" characters are escapes for theparentheses, making them literals. Without the escapes they would be assumed todefine a sub-expression. That pattern alone is fine to identify the rows of interestusing
a REGEXP_LIKE
operator,but it is not appropriate in a REGEXP_SUBSTR
, as itwould return the parentheses also. To omit the parentheses we need to include asub-expression inside the literal parentheses "((.*))". We can then REGEXP_SUBSTR
using
thefirst sub expression.
匹配括号内文本的模式基本写法为:“(.*)”。是转义字符,使跟在其后的字符变成字面值。但是这个模式用在REGEXP_SUBSTR时会连括号1起返回。为了疏忽括号我们需要在字面括号内部包括子表达式:"((.*))".
REGEXP_SUBSTR(data, '((.*))', 1, 1, 'i', 1) AS without_parentheses
注意:REGEXP_SUBSTR(data,'((.*))', 1, 1, 'i', 1) 中最后的i代码不辨别大小写,最后1个“1”代表返回哪一个子表达式匹配的文本。(范围0⑼)
We need to know how many times a block of 4 digits appears intext. The data looks like this.
我们需要知道4个数字的块在字符串中出现的次数。看原始数据:
We can identify digits using "d" or "[0⑼]"and the "{4}" operator signifies 4 of them, so using"d{4}" or "[0⑼]{4}" with the REGEXP_COUNT
functionseems to be a valid option.
我们可以用表达式:d 或[0⑼]和{4}操作符辨认4个数字的块。
We need to identify invalid email addresses. The data looks likethis.
我们需要校验邮箱地址,原始数据以下:
The following test gives us email addresses that approximate toinvalid email address formats.
以下测试给我们近似不合法的邮箱。
-----------------------------
Dylan Presents.
上一篇 计算机网络综合实验