UPattern

PHP version: 5

Required modules: standard, spl, simplexml, dom, pcre

Required packages: all.php, errors.php, autoload.php, AutoloadException.php, cast.php, Types.php, ArrayBothType.php, TypeInterface.php, Printable.php, ArrayIntType.php, ArrayStringType.php, BooleanType.php, ClassType.php, FloatType.php, IntType.php, NullType.php, ObjectType.php, ResourceType.php, StringType.php, CastException.php, UGroup.php, UElement.php, UString.php, Sortable.php, Comparable.php, Hashable.php, UTF8.php, Strings.php, Codepoints.php, Hash.php, TestUnit.php, Equality.php, Floats.php, Integers.php, Group.php, Element.php, Pattern.php



class it\icosaedro\regex\UPattern
      |  
      +--(it\icosaedro\regex\UElement)
      `--(it\icosaedro\containers\Printable)

Parses a Unicode subject string according to a pattern given by a regular expression

Version: $Date: 2012/05/06 16:07:11 $

Author: Umberto Salsi <salsi@icosaedro.it>

An instance of this class compiles and holds an internal representation of the regular expression, which may be used several times to match against different subject strings. After every successful match, designated matching sub-parts, the elements, can be extracted. The subject string may or may not match the pattern; only if it matches can the parts of the subject we are interested in be extracted.

Syntax of the pattern. The pattern is the logical OR of one or more terms separated by vertical bar. Using the EBNF formalism, this statement can be expressed as follows:

 	expression = term {"|" term};
 

The matching between the expression and the subject string always starts from the beginning of the subject string, trying every term one by one, in order, searching for a matching term. If no term matches, the whole matching fails.

A term is a sequence of one or more factors:

 	term = factor {factor};
 

The term matches if all the factors match, in order. Factors have several forms that may represent a single character, a set of characters, a sub-expression and some other special symbols, and may include a repetition quantifier:

 	factor = "^" | "$"
 		| "." [quantifier]
 		| "(" expression ")" [quantifier]
 		| "{" set "}" [quantifier]
 		| character [quantifier];
 

where:

. (dot) matches a single character.

^ matches the beginning of the subject.

$ matches the ending of the subject.

(E) is a sub-expression, also named element throughout this document. Elements can be introduced to alter the order of evaluation between terms and factors, or to group a sequence of factors to which a quantifier has to be applied. Any part of the subject string that matches an element, at any nesting level, can be extracted from the result of the parsing through the it\icosaedro\regex\UElement and the it\icosaedro\regex\UGroup interfaces, as will be explained later.

set is a list of characters, any of which may match a single character of the subject. Ranges can be expressed as a-b. A leading exclamation character ! yields the complement of the set that follows. If the hyphen character has to be included literally, it can be inserted either as the first character or in the last position in the set; if the exclamation mark has to be included literally, it cannot appear as the first character. The empty set {} always fails. The complement of the empty set {!} matches any character.
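
The rules for sets can be illustrated with a minimal sketch. The function setContains() below is hypothetical and is not part of this package; it receives the text found between the set delimiters, assumes single-byte ASCII characters, ignores back-slash escapes, and only mirrors the rules stated above (ranges, a leading ! for the complement, a literal hyphen in first or last position).

```php
<?php
// Hypothetical helper, NOT part of the UPattern API.
// $set is the body of a set (without its delimiters), $c a single ASCII char.
function setContains($set, $c)
{
    $negate = false;
    if ($set !== '' && $set[0] === '!') {
        // A leading "!" selects the complement of the set that follows.
        $negate = true;
        $set = (string) substr($set, 1);
    }
    $found = false;
    $n = strlen($set);
    for ($i = 0; $i < $n; $i++) {
        if ($i + 2 < $n && $set[$i + 1] === '-') {
            // A hyphen strictly between two characters denotes a range.
            if (ord($set[$i]) <= ord($c) && ord($c) <= ord($set[$i + 2])) {
                $found = true;
            }
            $i += 2;
        } elseif ($set[$i] === $c) {
            // Any other character, including a first or last hyphen, is literal.
            $found = true;
        }
    }
    return $negate ? !$found : $found;
}

var_dump(setContains('0-9', '5'));   // bool(true)
var_dump(setContains('-+', '-'));    // bool(true)
var_dump(setContains('!0-9', '5'));  // bool(false)
var_dump(setContains('', 'a'));      // bool(false)
var_dump(setContains('!', 'a'));     // bool(true)
```

With an empty body the loop never runs, so the plain set fails and its complement succeeds for every character, which is the behaviour described above.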

Normally every factor matches exactly once or it fails. If a quantifier is added, the factor may match the desired number of times, possibly with several attempts performed with different numbers of matching factors. The most general quantifier is the interval [min,max], where min and max are two non-negative integer numbers that give the minimum and the maximum number of times the factor may match. Both these numbers can be omitted: if min is omitted it defaults to 0; if max is omitted it defaults to PHP_INT_MAX, which is also the maximum allowed number. Some common abbreviations are also allowed:

F? is the same as F[0,1] (optional factor F)

F* is the same as F[0,] (zero or more)

F+ is the same as F[1,] (one or more)
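
As an illustration of how the abbreviations and the omitted bounds expand, here is a small sketch. The function quantifierBounds() is hypothetical (it is not part of this package) and accepts only the forms listed above, without white space inside the brackets.

```php
<?php
// Hypothetical helper, NOT part of the UPattern API: maps a quantifier
// to its array(min, max) bounds as described in the text.
function quantifierBounds($q)
{
    if ($q === '?') return array(0, 1);
    if ($q === '*') return array(0, PHP_INT_MAX);
    if ($q === '+') return array(1, PHP_INT_MAX);
    if (preg_match('/^\[([0-9]*),([0-9]*)\]$/', $q, $m) !== 1) {
        throw new InvalidArgumentException("not a quantifier: $q");
    }
    // An omitted min defaults to 0, an omitted max to PHP_INT_MAX.
    $min = $m[1] === '' ? 0 : (int) $m[1];
    $max = $m[2] === '' ? PHP_INT_MAX : (int) $m[2];
    return array($min, $max);
}

list($min, $max) = quantifierBounds('[,3]');
echo "$min..$max\n";   // 0..3
```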

If the quantifier is present, the matching algorithm operates in possessive mode, where the maximum number of matches is attempted and no further attempts are made. For example, the pattern .* consumes all the remaining subject string, then the matching either succeeds or fails without further attempts.

Two modifiers can follow the quantifier to select two more alternative algorithms:

? performs the reluctant algorithm: the minimum number of matches is attempted first and, if the rest of the expression fails, min+1, min+2, ..., up to max matches are attempted in turn until the expression succeeds. For example, the expression .*? first tries to match the empty string (which always succeeds), then consumes 1 character and retries, and so on.

* performs the greedy algorithm: it first tries to consume up to max factors (but no fewer than min) and continues with the rest of the expression; if the rest of the expression does not match, it backtracks, retries with fewer factors and evaluates the rest of the expression again; the evaluation of the factor stops when fewer than min matches are possible. For example, the pattern .** first tries to consume the whole remaining subject string and, if the rest of the expression fails, makes further attempts consuming fewer characters.
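
The difference between the three algorithms can be sketched with a toy matcher. The function matchDotRepeated() below is hypothetical and is not how this class is implemented; it handles only a quantified dot followed by a literal string, assumes single-byte ASCII subjects, and returns the number of characters the dot ends up consuming, or -1 on failure.

```php
<?php
// Toy illustration, NOT the UPattern implementation.
// Matches ".[min,max]" followed by the literal $rest at the start of $subject.
function matchDotRepeated($subject, $min, $max, $rest, $mode)
{
    // "." matches any character, so every remaining character is available.
    $avail = min($max, strlen($subject));
    if ($avail < $min) return -1;

    // Does the literal $rest appear right after $n consumed characters?
    $restAt = function ($n) use ($subject, $rest) {
        return (string) substr($subject, $n, strlen($rest)) === $rest;
    };

    switch ($mode) {
    case 'possessive':   // take the maximum, never give anything back
        return $restAt($avail) ? $avail : -1;
    case 'reluctant':    // try min first, then min+1, min+2, ...
        for ($n = $min; $n <= $avail; $n++) {
            if ($restAt($n)) return $n;
        }
        return -1;
    case 'greedy':       // try the maximum first, then backtrack down to min
        for ($n = $avail; $n >= $min; $n--) {
            if ($restAt($n)) return $n;
        }
        return -1;
    }
    throw new InvalidArgumentException("unknown mode: $mode");
}

echo matchDotRepeated("abab", 0, PHP_INT_MAX, "b", 'possessive'), "\n"; // -1
echo matchDotRepeated("abab", 0, PHP_INT_MAX, "b", 'reluctant'), "\n";  // 1
echo matchDotRepeated("abab", 0, PHP_INT_MAX, "b", 'greedy'), "\n";     // 3
```

In the possessive case the dot consumes all four characters and nothing is left for the trailing b, so the match fails without further attempts.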

Encoding of the special characters. The following characters have a special meaning and can match their literal value only if escaped by back-slash:

 	\  .  |  (  )  [  ]  {  }  ?  *  +  ^  $
 

Characters that are special under PHP must be further escaped. For example, a literal back-slash must be written as a double back-slash to meet the requirements of this class, and each of those must be doubled again inside a PHP string literal, ending up with 4 back-slashes in the final PHP string "\\\\" just to match a single literal back-slash. Escaping non-special characters is forbidden, to leave room for future enhancements of this specification.
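
The two escaping layers can be checked with plain PHP. In the sketch below, quotePatternLiteral() is a hypothetical helper (it is not part of this package) that prefixes each of the special characters listed above with a back-slash; whether the same escaping applies inside a set is not covered here.

```php
<?php
// Hypothetical helper, NOT part of the UPattern API: escapes every special
// character so that the resulting pattern text matches $text literally.
function quotePatternLiteral($text)
{
    $special = '\\.|()[]{}?*+^$';   // single-quoted: "\\" is one back-slash
    $out = '';
    for ($i = 0; $i < strlen($text); $i++) {
        $c = $text[$i];
        if (strpos($special, $c) !== false) {
            $out .= '\\';
        }
        $out .= $c;
    }
    return $out;
}

// The PHP literal "\\\\" holds two back-slashes, which the pattern
// syntax reads as one escaped (literal) back-slash.
echo strlen("\\\\"), "\n";                // 2
echo quotePatternLiteral('1+1=2'), "\n";  // 1\+1=2
```

The loop works byte by byte; since all the special characters are ASCII, multi-byte UTF-8 sequences pass through unchanged.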

Example 1 - Matching an integer number. An integer number can have a sign followed by one or more digits. In the chunk of code below, we compile the regular expression first and then we test if a given string does match it:

 	# The sign is optional, hence the "?" quantifier after the first set.
 	$p = new UPattern( UString::fromASCII("{-\\+}?{0-9}+\$") );
 	$s = UString::fromASCII("1234");
 	if( $p->match($s) )
 		echo "ok";
 

The same compiled pattern can be applied several times. Note how the special characters must be escaped. Also note that a leading ^ is not required because expressions are always applied starting from the beginning of the subject string.

Enumerating and extracting groups and elements. Sub-expressions enclosed between round parentheses are elements. The element along with its quantifier is a group of elements that match zero, one or several times. For example, the group

(X)[1,3]

may match the element (X) from 1 up to 3 times. Since the body of the element, X, may in turn contain other groups, this class provides an interface to retrieve these sub-groups as well, as detailed below.

The whole pattern must be considered as the element number 0, as if it were enclosed between parentheses. This zero element may contain several sub-groups that are numbered starting from 0, so that the first group is identified by the sequence of numbers 0.0, the second group by 0.1, the third group by 0.2 and so on. These sub-groups may in turn contain sub-sub-groups that are numbered starting from 0, and so on:

0.0(0.0.0(A)B0.0.1(C))0.1(0.1.0(D)E)

The UPattern class implements the it\icosaedro\regex\UElement interface, which gives access to the outermost element number 0: the it\icosaedro\regex\UElement::start() method returns the offset of the beginning of the portion of the subject string that matches the whole pattern; the it\icosaedro\regex\UElement::end() method returns the offset of the ending of that portion; finally, the it\icosaedro\regex\UElement::value() method returns this portion of the subject string:

 	$p->start() => start offset of the matching
 	$p->end()   => end offset of the matching
 	$p->value() => portion of the subject string that matches
 

The UElement interface also provides the it\icosaedro\regex\UElement::group($g) that retrieves the specified group as instance of the it\icosaedro\regex\UGroup interface. Looking at the example above, $g can be only 0 or 1. The UGroup::count() method retrieves the number of matches for the given element, and it\icosaedro\regex\UGroup::elem($i) retrieves the element number $i with 0 <= $i < count().

Still referring to the example above, since there are no quantifiers, every element must match exactly once for every group, so the argument of the elem($i) method is always 0 in this case:

 	$p->value() => "ABCDE" (as UString object)
 	$p->group(0)->elem(0)->value() => "ABC" (as UString object)
 	$p->group(0)->elem(0)->group(0)->elem(0)->value() => "A" (as UString object)
 	$p->group(0)->elem(0)->group(1)->elem(0)->value() => "C" (as UString object)
 	$p->group(1)->elem(0)->value() => "DE" (as UString object)
 	$p->group(1)->elem(0)->group(0)->elem(0)->value() => "D" (as UString object)
 

Note that for every element retrieved, the list of the group($g) arguments exactly matches the path that leads from the outermost element 0 to the requested group, so for example 0.1.0 is the group (D).
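
The correspondence between a dotted path and the chain of group($g) calls can be sketched as follows. The function groupAtPath() is hypothetical and is not part of this package; it relies only on the group() and elem() methods shown above, always follows the first match elem(0) of each intermediate group, and performs no error checking.

```php
<?php
// Hypothetical helper, NOT part of the UPattern API.
// $root is the outermost element (for example the UPattern after a
// successful match), $path a dotted path such as "0.1.0".
// Returns the group at that path, or null for the bare path "0".
function groupAtPath($root, $path)
{
    $parts = explode('.', $path);
    array_shift($parts);          // the leading 0 is the root element itself
    $elem = $root;
    $group = null;
    foreach ($parts as $g) {
        if ($group !== null) {
            // Descend into the first match of the group reached so far.
            $elem = $group->elem(0);
        }
        $group = $elem->group((int) $g);
    }
    return $group;
}
```

Under these assumptions, groupAtPath($p, "0.1.0") is equivalent to $p->group(1)->elem(0)->group(0).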

Example 2 - Parsing a string of key=value pairs. Suppose each input line has the form

 	$line = UString::fromASCII("alpha = 1, beta = 2, gamma = 3");
 

We start by compiling the pattern:

 	# A key is a letter or underscore followed by letters, digits or underscores:
 	$K = "{a-zA-Z_}{a-zA-Z_0-9}*";
 	# A value is an integer number with an optional sign:
 	$V = "{-\\+}?{0-9}+";
 	# White space:
 	$SP = "{ \t}*";
 	# One key=value pair, then one or more ",key=value" pairs up to the end:
 	$pattern = UString::fromASCII("$SP($K)$SP=$SP($V)$SP(,$SP($K)$SP=$SP($V))+$SP\$");
 	$p = new UPattern($pattern);
 

For each line of input, we test if it matches the pattern and we extract groups and elements:

 	if( $p->match($line) ){
 		echo $p->group(0)->elem(0)->value()->toASCII(); # => "alpha"
 		echo $p->group(1)->elem(0)->value()->toASCII(); # => "1"
 		$group2 = $p->group(2);
 		for($i = 0; $i < $group2->count(); $i++){
 			echo $group2->elem($i)->group(0)->elem(0)->value()->toASCII(); # => "beta" and "gamma"
 			echo $group2->elem($i)->group(1)->elem(0)->value()->toASCII(); # => "2" and "3"
 		}
 	}
 

Note that more complex results can be easily explored by a recursive algorithm.
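
A recursive exploration might look like the sketch below. The function dumpElement() is hypothetical and is not part of this package; it assumes that count() on an element returns the number of groups it contains, and it uses value()->toASCII() as in Example 2, so it is meant for ASCII subjects only.

```php
<?php
// Hypothetical helper, NOT part of the UPattern API.
// Collects one line per matched element: the path of group numbers
// followed by the matched text between double quotes.
function dumpElement($elem, $path = '0', $lines = array())
{
    $lines[] = $path . ' "' . $elem->value()->toASCII() . '"';
    // Assumption: count() on an element gives its number of groups.
    for ($g = 0; $g < $elem->count(); $g++) {
        $group = $elem->group($g);
        // A group holds one element per repetition of its sub-expression.
        for ($i = 0; $i < $group->count(); $i++) {
            $lines = dumpElement($group->elem($i), "$path.$g", $lines);
        }
    }
    return $lines;
}
```

After a successful match, echo implode("\n", dumpElement($p)) would list every matched element, one per line, each preceded by its path.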


{

void __construct(
        it\icosaedro\utils\UString $re)

Compiles the specified regular expression for later usage

Once compiled, the same pattern can be applied several times to different subject strings.

Parameters:
$re   The regular expression to compile.

Throws:

boolean match(
        it\icosaedro\utils\UString $s,
        int $start = 0)

Tells if the subject string matches this pattern

Parameters:
$s   The subject string.
$start   Matching of the subject string starts from this offset.

Return: True if the subject string matches this pattern.


int start()
implements it\icosaedro\regex\UElement

int end()
implements it\icosaedro\regex\UElement

it\icosaedro\utils\UString value()
implements it\icosaedro\regex\UElement

int count()
implements it\icosaedro\regex\UElement

it\icosaedro\regex\UGroup group(
        int $g)

implements it\icosaedro\regex\UElement


string __toString()
implements it\icosaedro\containers\Printable

Returns this pattern in canonical ASCII form

Return: This pattern in canonical ASCII form.


it\icosaedro\utils\UString resultAsUString(
        it\icosaedro\utils\UString $separator)

Returns the result of the last successful match as a structured string

Mostly useful for testing. The returned string may have a form similar to this one, although it might vary in future implementations:

 0 "alpha = 1, beta = 2, gamma = 3"
 0.0 "alpha"
 0.1 "1"
 0.2 ", beta = 2"
 0.2.0 "beta"
 0.2.1 "2"
 0.2 ", gamma = 3"
 0.2.0 "gamma"
 0.2.1 "3"
 
Every line is an element; the numbers separated by dots are paths of groups; the text between double quotes is the literal representation of the matching string.

Parameters:
$separator   Separator string between elements.

Return: Readable representation of all the matched groups and elements.

Throws:

static boolean matches(
        it\icosaedro\utils\UString $re,
        it\icosaedro\utils\UString $s,
        int $start = 0)

Tells if the regular expression matches a given subject string

Convenience method for simple one-shot tests.

Parameters:
$re   Regular expression.
$s   Subject string. NULL behaves just like the empty string.
$start   Matching of the subject string starts from this offset.

Return: True if the subject string matches the regular expression.

Throws:

}

Private items

it\icosaedro\regex\UEmptyGroup
it\icosaedro\regex\UMatchedGroup
it\icosaedro\regex\UMatchedElement

Generated by PHPLint Documentator