Unix search formats can be used when creating Virtual Folders or searching for specific files within a Virtual Folder.
Unix-style search patterns allow very precise and complex searches. The method of forming search patterns in FileBoss is very similar to the standard notation defined for the Unix editor ed.
A Unix search pattern is made up of one or more regular expressions (RE). An RE is a string that specifies a character or group of characters that should be matched. For instance, in the following search pattern,
[A-Za-z]* 1994
there are six REs that make up the search string:
[A-Za-z]*
a space
the four digits 1, 9, 9 and 4.
These, along with all the other forms of REs recognized by FileBoss, are detailed below.
For users who are already familiar with REs in Unix, the end of this section lists the differences between FileBoss's implementation of REs and the common implementations in the Unix utilities awk, ed, grep, lex and regex.
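As a rough illustration, the sample pattern [A-Za-z]* 1994 can be tried in Python, whose re module happens to accept the same notation for this particular pattern. Python's engine is only a stand-in here, not FileBoss's matcher, and the helper find_match and the file names are made up for the example.

```python
import re

# Stand-in for the FileBoss pattern; Python's re accepts the same notation here.
PATTERN = re.compile(r"[A-Za-z]* 1994")


def find_match(text):
    """Return the text matched by the sample pattern, or None if nothing matches."""
    found = PATTERN.search(text)
    return found.group(0) if found else None


# Hypothetical file names, used only for illustration.
for name in ["Report March 1994.doc", "Budget 1994.xls", "March 1995.doc"]:
    print(name, "->", find_match(name))
```

Because [A-Za-z]* may match zero letters, a name containing only a space followed by 1994 would also match.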
Simple Regular Expressions
The simplest form of an RE is a single character or an escaped character (a character preceded by a backslash, such as \.) which matches one character. There are three types of simple REs:
Ordinary Characters
A one-character RE that matches itself. The range of characters is 0-256.
Period (.)
A period is a one-character RE that matches any character.
Backslash (\)
A backslash followed by a special character is an RE that turns the special character into an ordinary character. Thus the RE \. (a backslash followed by a period) matches a literal period; it does not match a backslash followed by any character.
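The three simple REs behave the same way in Python's re module, so a short sketch with it can show the difference between a period and an escaped period. Python is used only as a stand-in for FileBoss, and the helper name matches is an assumption of this example.

```python
import re


def matches(pattern, text):
    """True if the whole of text matches pattern (Python's re as a stand-in)."""
    return re.fullmatch(pattern, text) is not None


print(matches(r"abc", "abc"))    # ordinary characters match themselves
print(matches(r"a.c", "abc"))    # a period matches any character, here 'b'
print(matches(r"a.c", "a.c"))    # ...including a literal period
print(matches(r"a\.c", "a.c"))   # an escaped period matches only a period
print(matches(r"a\.c", "abc"))   # so 'b' is rejected here
```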
Character Classes
Character classes are REs which specify a range of characters which can be matched such as all capital letters or all lower case letters between 'a' and 'm'. Character classes are specified by a string enclosed in square brackets ([]). The characters which make up the enclosed string can specify ranges, single characters and more. Each of these is explained below.
dash (-)
The dash may be used to indicate a range of consecutive ASCII characters. For example, [0-9] is equivalent to [0123456789], [A-Z] is equivalent to all upper case letters and [A-Z0-9] is equivalent to all upper case letters and all digits.
Note that the dash loses its special meaning whenever:
it is the first character after the opening square bracket,
it occurs directly after the initial circumflex (^),
it is the last character before the closing square bracket, or
it is the first character after a character range. For instance, the RE [0-9-A] would match any digit, a dash or the letter 'A'.
circumflex (^)
If the first character of the string is a circumflex ^, the RE matches any character except what the RE would otherwise match. The ^ has this special meaning only if it occurs first in the string. For example, [^0-9] would match any character which is not a digit. Note that the circumflex affects all the following characters within the square brackets. Thus [^A-Z0-9?] would match any character which is not a capital letter, not a digit and not the question mark character.
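The character-class rules above can be checked with a small sketch. It relies on Python's re module, which treats ranges, a literal dash and a leading circumflex in the same way for these particular classes; it is not FileBoss's own matcher, and the helper accepted is hypothetical.

```python
import re


def accepted(char_class, candidates):
    """Return the characters from candidates that the class matches."""
    return "".join(c for c in candidates if re.fullmatch(char_class, c))


sample = "aM7-?^"
print(accepted(r"[A-Z0-9]", sample))     # capital letters and digits only
print(accepted(r"[0-9-A]", "3-AB"))      # dash after a range is a literal dash
print(accepted(r"[-a]", "a-b"))          # dash first in the class is literal
print(accepted(r"[^A-Z0-9?]", sample))   # everything the class would otherwise reject
```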
Complex Regular Expressions
Complex REs are collections of REs which can be treated as a whole. The most frequent complex REs use parentheses and the asterisk (*) or plus (+) characters to specify grouping and repetition matching.
The following rules may be used to construct REs from other REs:
()
REs enclosed within parentheses are treated as a single RE. For instance, the RE (ab)* specifies zero or more occurrences of 'ab', whereas without the parentheses, ab* specifies the letter 'a' followed by zero or more occurrences of the letter 'b'.
|
REs separated by a vertical bar | form an RE that is matched by any string that matches at least one of the component REs. For example, (as)|(ax)|(az) is matched by as, ax, or az.
*
An RE followed by an asterisk * matches zero or more occurrences of the RE. Note that the * will find the longest match.
ab(ba)*cb Searches for all occurrences of 'ab' followed by zero or more occurrences of 'ba' followed by 'cb'. The patterns 'abbacb', 'abbabacb', 'abbabababacb', and 'abcb' would all be treated as matching this RE.
ab(ba)* Searches for all occurrences of 'ab' followed by zero or more occurrences of ba. If more than one sequence of 'ba' follows an 'ab' in the text, the match will be made to the entire sequence.
(ba)* This will always match the beginning of the string because it specifies zero or more occurrences of 'ba'.
+
An RE followed by a plus (+) is an RE that matches one or more occurrences of the RE. Note that the + will find the longest match. If you want the first (shortest) match instead, use {1,}; the {} notation is explained below.
ab(ba)+ Searches for all occurrences of 'ab' followed by one or more occurrences of 'ba'. If more than one sequence of 'ba' follows an 'ab' in the text, e.g., 'abbababa', the match will be made to the entire sequence.
Note that the only difference between the asterisk and the plus sign is that the asterisk matches 0 or more occurrences and the plus sign matches 1 or more.
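Grouping, alternation, the asterisk and the plus sign are written the same way in Python's re module, so the examples above can be reproduced with it. This is only a sketch with a different engine standing in for FileBoss; the helpers whole and first are assumptions of the example.

```python
import re


def whole(pattern, text):
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


def first(pattern, text):
    """Return the first (leftmost, longest-repetition) match, or None."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


# Grouping: (ab)* repeats 'ab', while ab* repeats only the 'b'.
print(whole(r"(ab)*", "ababab"), whole(r"ab*", "abbb"), whole(r"ab*", "abab"))

# Alternation: any one of the three alternatives is enough.
print([w for w in ["as", "ax", "az", "ay"] if whole(r"(as)|(ax)|(az)", w)])

# Asterisk: zero or more repetitions, taking the longest run available.
print(all(whole(r"ab(ba)*cb", s)
          for s in ["abbacb", "abbabacb", "abbabababacb", "abcb"]))
print(repr(first(r"ab(ba)*", "xxabbababa")))
print(repr(first(r"(ba)*", "xyz")))   # empty match at the start of the string

# Plus: like the asterisk, but at least one repetition is required.
print(first(r"ab(ba)+", "abbababa"), first(r"ab(ba)+", "abcb"))
```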
Positional Regular Expressions
The positional RE is used to indicate where in a line of text a match must occur. It is indicated by angle brackets <> enclosing one or more numbers. Some examples follow:
<0>
is an RE that matches the null string at position 0, the beginning of the string.
<0,5,10>
is an RE that matches the null string at position 0, or the null string at position 5, or the null string at position 10.
~
End Of Line Specification: If the position is preceded by a tilde ~, then the position is measured from the end of the string.
<~0>
matches the null string at the end of the string.
<~4>
matches the null string at position 4 counting from the end of the string.
-
Range Specification: If two positions are separated by a dash (-), a range of positions is used.
<0-5>
matches any of the null strings at positions 0 through 5.
<5-~5>
matches any null string from position 5 counting from the beginning to position 5 counting from the end. In a range specification, the second position specified must not occur before the first position specified.
<5-~5>
will always fail to match in a string of 9 characters or less, since 5 positions from the beginning occurs after 5 positions from the end.
<~0-~5>
always fails, because the first position occurs after the second; <~5-~0> is the correct form.
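Python's re module has no counterpart to the <...> notation, so the sketch below only models which offsets a positional RE would allow in a string of a given length. The helper resolve_positions is hypothetical. It assumes that ~k means "length minus k", that commas and ranges may be mixed in one specification, and that offsets falling outside the string are ignored; the last two points are guesses rather than documented FileBoss behavior.

```python
def resolve_positions(spec, length):
    """Return the set of offsets that a positional RE such as '<0,5-~5>' allows."""

    def offset(token):
        token = token.strip()
        if token.startswith("~"):
            # Assumption: a tilde counts back from the end of the string.
            return length - int(token[1:])
        return int(token)

    allowed = set()
    for part in spec.strip("<>").split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            start, end = offset(first), offset(last)
        else:
            start = end = offset(part)
        # A range whose end precedes its start contributes no offsets.
        allowed.update(p for p in range(start, end + 1) if 0 <= p <= length)
    return allowed


print(resolve_positions("<0,5,10>", 10))
print(resolve_positions("<5-~5>", 9))     # empty: the range is reversed
print(resolve_positions("<5-~5>", 10))
print(resolve_positions("<~5-~0>", 10))
```

An empty result corresponds to a positional RE that can never match in a string of that length.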
Replication Counts
An RE followed by {m}, {m,}, {,n} or {m,n} is an RE that matches a range of occurrences of the RE. The values of 'm' and 'n' must be non-negative integers.
{m}
indicates exactly 'm' occurrences of the RE.
{m,n}
If 'm' is less than 'n', then {m,n} indicates at least 'm' occurrences of the RE and no more than 'n' occurrences. In cases where the RE occurs more than the minimum number of times specified by 'm', the match will be made to the shortest sequence.
{0,1}
This specifies that the RE must occur 0 or 1 times.
ab(ba){2,4}
Given the string 'abbababababa', the match will be made to 'abbaba', i.e. 'ab' followed by two 'ba's.
If 'm' is greater than or equal to 'n', then {m,n} indicates at least 'n' occurrences of the RE and no more than 'm' occurrences. In cases where the RE occurs more than the minimum number of times specified by 'n', the match will be made to the longest sequence up to and including the maximum number specified by 'm'.
{1,0}
This specifies that the RE will occur 1 or 0 times.
ab(ba){4,2}
Given the string 'abbababababa', the match will be made to 'abbabababa', i.e. 'ab' followed by four 'ba's.
{m,}
is equivalent to {m,infinity}
{,n}
is equivalent to {infinity,n}.
Note that the asterisk and the plus sign are equivalent to {0,} and {1,} respectively.
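Python's re module has no reversed form such as {4,2}; its closest analogues are the greedy quantifier {m,n} (longest match) and the lazy quantifier {m,n}? (shortest match). The sketch below translates a FileBoss-style count into a Python quantifier, following the {m,n} rules exactly as stated above and treating a missing value as infinity. The function to_python_quantifier is hypothetical and is not part of FileBoss or of Python.

```python
import re


def to_python_quantifier(count):
    """Translate a FileBoss-style count such as '{2,4}' into a Python quantifier."""
    m_text, comma, n_text = count.strip("{}").partition(",")
    if not comma:
        return "{%s}" % m_text              # {m}: exactly m occurrences
    if m_text and n_text:
        m, n = int(m_text), int(n_text)
        if m < n:
            return "{%d,%d}?" % (m, n)      # m < n: shortest sequence (lazy)
        return "{%d,%d}" % (n, m)           # m >= n: longest sequence (greedy)
    if m_text:
        return "{%s,}?" % m_text            # {m,} read as {m,infinity}
    return "{%s,}" % n_text                 # {,n} read as {infinity,n}


text = "abbababababa"
for count in ["{2,4}", "{4,2}"]:
    pattern = "ab(?:ba)" + to_python_quantifier(count)
    print(count, "->", pattern, "->", re.search(pattern, text).group(0))
```

With the sample string from the text, the {2,4} form stops after two 'ba's and the {4,2} form continues through four.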
Assignments
$
An RE followed by $c, where c is a letter, matches whatever the RE alone would match and assigns the matched text to the variable c. (Upper and lower case letters are equivalent.) The expression <c>, where c is a letter, is an RE that matches whatever value has been assigned to c. If no previous assignment has been made, it matches the null string in any position.
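The nearest analogue in Python's re module is a named group with a backreference: (?P<a>...) plays the role of the $a assignment and (?P=a) plays the role of <a>. The sketch below uses that analogue to mimic the FileBoss expression [a-zA-Z]$a<a>, which matches doubled letters. The syntax shown is Python's, not FileBoss's, and the helper doubled_letters is made up for the example.

```python
import re

# Python analogue of the FileBoss expression [a-zA-Z]$a<a>:
# capture one letter as 'a', then require the same text again.
DOUBLED = re.compile(r"(?P<a>[a-zA-Z])(?P=a)")


def doubled_letters(text):
    """Return every doubled-letter pair found in text, left to right."""
    return [found.group(0) for found in DOUBLED.finditer(text)]


print(doubled_letters("bookkeeper"))
print(doubled_letters("plain"))
```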
Precedence
The suffix operators *, +, {}, have the highest precedence. Concatenation has next highest precedence. Alternation, |, has the lowest precedence. The order of operation may be modified by grouping with parentheses.
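Python's re module uses the same precedence order, so it can illustrate how an ungrouped expression is read. In the sketch below, ab|cd* is treated as "ab, or c followed by any number of d"; parentheses are needed to get any other reading. Python is only a stand-in for FileBoss here, and whole is a hypothetical helper.

```python
import re


def whole(pattern, text):
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Suffix binds tightest, then concatenation, then alternation.
for candidate in ["ab", "c", "cddd", "acd", "abab"]:
    print(candidate,
          whole(r"ab|cd*", candidate),      # (ab) | (c(d*))
          whole(r"a(b|c)d*", candidate),    # alternation limited by grouping
          whole(r"(ab|cd)*", candidate))    # repetition applied to the group
```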
Differences Between Unix REGEX REs and FileBoss REs
The syntax of REs in FileBoss is almost a superset of the REs used in the Unix utility REGEX. The additional features offered by FileBoss include the following:
1. Alternation (searching for either one string or a second or a third, etc.) is allowed.
2. Generalized positional checking is allowed (as opposed to only testing for the beginning and ending of the string).
3. Assignment to variables and referencing of variables can be used in the search expression itself. This allows for context searching. For example, [a-zA-Z]$a<a> will match doubled letters.
4. All operations may act on sub expressions by grouping.
FileBoss Versus Other Programs Using Regular Expressions
                     FileBoss  awk   ed     grep  lex   regex
start of line        <0>       ^     ^      ^     ^     ^
end of line          <~0>      $     $      $     $     $
any char             .         .     .      .     .     .
char class           []        []    []     []    []    []
alternation          |         |     NS     |     |     NS
grouping             ()        ()    NS     ()    ()    ()
position             <>        NS    NS     NS    NS    NS
assignment           $         yes   NS     NS    yes   (RE)$
REPLICATION COUNTS:  1)        2)    2)     2)    2)    1)
0 or more            *         *     *      *     *     *
0 or 1               {0,1}     ?     {0,1}  ?     ?     {0,1}
1 or more            +         +     NS     +     +     +
specified #          {}        {}    {}     {}    NS    {}
range of #'s         {,}       {,}   {,}    {,}   {,}   {,}
NS = Not Supported
1) = Supports replication counts for all REs.