UTF8

PHP version: 5

Required modules: standard, spl, simplexml, dom, pcre

Required packages: none



class it\icosaedro\utils\UTF8

Utility functions for UTF-8 BMP string encoding

Version: $Date: 2012/04/02 09:13:24 $

Author: Umberto Salsi <salsi@icosaedro.it>

This class provides only very basic functions, mostly intended to be used in other, higher-level packages.

WARNING. These functions do not check the actual encoding of the strings they receive; they always assume these strings are properly UTF-8 encoded. If arbitrary data are passed, unexpected results may arise.

ATTENTION. In this document the term byte always refers to a single byte of a generic string; the term character refers to a single Unicode character, which may be encoded as a sequence of 1, 2 or 3 bytes; the term codepoint refers to the numerical code of a single Unicode character, in the range [0,65535].


{

static string sanitize(
        string $s)

Sanitizes the string by removing invalid bytes

Invalid bytes, incomplete UTF-8 sequences, non-minimal sequences and invalid BMP codepoints are removed.

Parameters:
$s   The string to sanitize, possibly NULL.

Return: Properly encoded UTF-8 BMP string. If the subject string is NULL, NULL is returned as well.
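
The removal rules above can be illustrated with a single byte-oriented regular expression that matches only well-formed sequences. This is a minimal sketch, not the class's actual implementation: the function name sketch_sanitize is hypothetical, and "invalid BMP codepoints" is assumed here to mean only the surrogate range U+D800..U+DFFF.

```php
<?php
/**
 * Hypothetical sketch: keeps only well-formed UTF-8 sequences of 1 to 3
 * bytes and drops every other byte. Not the library's implementation.
 */
function sketch_sanitize($s)
{
    if ($s === null) {
        return null;
    }
    // Alternatives, in order:
    //  - ASCII bytes;
    //  - 2-byte sequences (lead byte 0xC2..0xDF, so the non-minimal
    //    0xC0/0xC1 forms never match);
    //  - 3-byte sequences starting with 0xE0, second byte restricted to
    //    reject non-minimal forms;
    //  - the remaining 3-byte sequences, except lead byte 0xED;
    //  - lead byte 0xED with a restricted second byte, which rejects
    //    surrogates (assumed to be the "invalid BMP codepoints").
    $valid = '/[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]/';
    // Without the /u modifier PCRE scans byte by byte, so bytes that do
    // not start a valid sequence are simply skipped.
    preg_match_all($valid, $s, $matches);
    return implode('', $matches[0]);
}
```

Under these assumptions, a truncated sequence such as "a\xC3b" comes back as "ab", and a 4-byte sequence (outside the BMP) is dropped entirely.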


static string chr(
        int $code)

Returns the codepoint as a UTF-8 string of bytes

Parameters:
$code   Codepoint [0,65535].

Return: String of bytes that represents the given codepoint.

Throws:
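
A minimal sketch of how a BMP codepoint maps to 1, 2 or 3 bytes. The function name sketch_utf8_chr is hypothetical, and the exception type is an assumption, since this document does not list what the real method throws. The sketch does not reject surrogate codepoints.

```php
<?php
/**
 * Hypothetical sketch: encodes a codepoint in [0,65535] as UTF-8 bytes.
 * The OutOfRangeException is an assumption, not the library's contract.
 */
function sketch_utf8_chr($code)
{
    if ($code < 0 || $code > 0xFFFF) {
        throw new OutOfRangeException("codepoint out of [0,65535]: $code");
    }
    if ($code < 0x80) {
        // 7 bits: a single ASCII byte.
        return chr($code);
    }
    if ($code < 0x800) {
        // 11 bits: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($code >> 6))
            . chr(0x80 | ($code & 0x3F));
    }
    // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
    return chr(0xE0 | ($code >> 12))
        . chr(0x80 | (($code >> 6) & 0x3F))
        . chr(0x80 | ($code & 0x3F));
}
```

For example, codepoint 0x20AC (the euro sign) becomes the three bytes 0xE2 0x82 0xAC.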

static boolean isCont(
        int $b)

Returns true if the passed byte is a continuation byte of a UTF-8 sequence

Parameters:
$b   Subject byte.

Return: True if the subject byte is a continuation byte of a UTF-8 sequence.
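
A continuation byte has the bit pattern 10xxxxxx, that is a value in [0x80,0xbf]. The test can be sketched as follows; the function name sketch_is_cont is hypothetical, and treating values outside [0,255] as "not a continuation byte" is an assumption.

```php
<?php
/**
 * Hypothetical sketch: true if $b (a byte code) has the form 10xxxxxx.
 * For byte codes in [0,255] this is equivalent to ($b & 0xC0) === 0x80.
 */
function sketch_is_cont($b)
{
    return $b >= 0x80 && $b <= 0xBF;
}
```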


static int sequenceLength(
        int $b)

Return the length of the UTF-8 sequence given its starting byte

Byte code ranges are as follows (by increasing code):

 [0x00,0x7f]  1 byte sequence (ASCII) -- returns 1
 [0x80,0xbf]  continuation byte -- returns 0
 [0xc0,0xc1]  unused byte codes -- returns 0
 [0xc2,0xdf]  2 bytes seq. starts -- returns 2
 [0xe0,0xef]  3 bytes seq. starts -- returns 3
 [0xf0,0xff]  unused byte codes -- returns 0
 

Parameters:
$b   First byte of the sequence in [0,255].

Return: Length of the sequence in bytes, that is 1, 2 or 3. Returns 0 if the byte code is invalid or out of the range [0,255].
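
The table of byte code ranges translates directly into a chain of comparisons. The following is a sketch under that reading; the function name sketch_sequence_length is hypothetical.

```php
<?php
/**
 * Hypothetical sketch: length of the UTF-8 sequence that starts with
 * byte code $b, following the ranges listed in this document.
 */
function sketch_sequence_length($b)
{
    if ($b < 0x00 || $b > 0xFF) {
        return 0;   // not a byte code at all
    }
    if ($b <= 0x7F) {
        return 1;   // ASCII
    }
    if ($b <= 0xC1) {
        return 0;   // continuation byte, or unused 0xc0/0xc1
    }
    if ($b <= 0xDF) {
        return 2;   // start of a 2-byte sequence
    }
    if ($b <= 0xEF) {
        return 3;   // start of a 3-byte sequence
    }
    return 0;       // 0xf0..0xff: unused (outside the BMP)
}
```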


static int codepointAtByteIndex(
        string $s,
        int $byte_index)

Returns the codepoint at a given position in a string

Parameters:
$s   UTF-8 encoded string.
$byte_index   Byte index of the sequence.

Return: The code of the codepoint.

Throws:
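
Decoding the sequence that starts at a byte index can be sketched as below. The function name sketch_codepoint_at_byte_index is hypothetical, and both exception types are assumptions, since this document does not list what the real method throws. Continuation bytes are not validated, in line with the warning that strings are assumed to be well formed.

```php
<?php
/**
 * Hypothetical sketch: decodes the 1, 2 or 3 byte sequence starting at
 * $byte_index. Exception types are assumptions of this sketch.
 */
function sketch_codepoint_at_byte_index($s, $byte_index)
{
    $n = strlen($s);
    if ($byte_index < 0 || $byte_index >= $n) {
        throw new OutOfRangeException("byte index out of range: $byte_index");
    }
    $b0 = ord($s[$byte_index]);
    if ($b0 < 0x80) {
        return $b0;                               // ASCII
    }
    if ($b0 >= 0xC2 && $b0 <= 0xDF) {
        $len = 2;
    } elseif ($b0 >= 0xE0 && $b0 <= 0xEF) {
        $len = 3;
    } else {
        throw new InvalidArgumentException("no sequence starts at byte $byte_index");
    }
    if ($byte_index + $len > $n) {
        throw new InvalidArgumentException("truncated sequence at byte $byte_index");
    }
    $b1 = ord($s[$byte_index + 1]) & 0x3F;        // low 6 bits
    if ($len === 2) {
        return (($b0 & 0x1F) << 6) | $b1;
    }
    $b2 = ord($s[$byte_index + 2]) & 0x3F;
    return (($b0 & 0x0F) << 12) | ($b1 << 6) | $b2;
}
```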

static int byteIndex(
        string $s,
        int $codepoint_index)

Return the byte index given the UTF-8 sequence index

Parameters:
$s   UTF-8 encoded string.
$codepoint_index   Index of the UTF-8 sequence, ranging from 0 (the first sequence) up to the length of the string in characters. An index equal to that length does not designate an existing sequence; in that case the returned byte index points to the byte just past the end of the string.

Return: Byte index of the UTF-8 sequence.

Throws:
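
One way to obtain the byte index is to walk the string from the start, stepping over one whole sequence per character. The sketch below assumes a well-formed string; the function name sketch_byte_index and the exception type are hypothetical.

```php
<?php
/**
 * Hypothetical sketch: byte index of the sequence number
 * $codepoint_index, found by skipping whole sequences from the start.
 * Assumes $s is well-formed UTF-8 restricted to the BMP.
 */
function sketch_byte_index($s, $codepoint_index)
{
    if ($codepoint_index < 0) {
        throw new OutOfRangeException("negative index: $codepoint_index");
    }
    $n = strlen($s);
    $byte = 0;
    for ($seen = 0; $seen < $codepoint_index; $seen++) {
        if ($byte >= $n) {
            // Fewer than $codepoint_index characters in the string.
            throw new OutOfRangeException("index too large: $codepoint_index");
        }
        $b = ord($s[$byte]);
        // The lead byte tells how many bytes to step over.
        $byte += ($b < 0x80) ? 1 : (($b < 0xE0) ? 2 : 3);
    }
    return $byte;
}
```

With the 3-character string "a\xC3\xA9\xE2\x82\xAC", indexes 0, 1, 2 and 3 map to byte indexes 0, 1, 3 and 6; the last one is the position just past the end of the string.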

static int codepointIndex(
        string $s,
        int $byte_index)

Returns the codepoint index given its byte index

Parameters:
$s   UTF-8 encoded string.
$byte_index   Byte index of the codepoint, in [0,strlen($s)]. Note that if $byte_index is exactly equal to strlen($s), then the result is the length of the string in codepoints.

Return: Codepoint index of this byte, that is the number of UTF-8 sequences from the beginning of the string up to that byte index.

Throws:
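
Since every sequence has exactly one byte that is not a continuation byte, the codepoint index can be sketched as a count of such bytes before the given position. The function name sketch_codepoint_index and the exception type are hypothetical; unlike the real method may do, this sketch does not check that $byte_index falls on the start of a sequence.

```php
<?php
/**
 * Hypothetical sketch: number of sequences that start before
 * $byte_index. Assumes $s is well-formed UTF-8.
 */
function sketch_codepoint_index($s, $byte_index)
{
    if ($byte_index < 0 || $byte_index > strlen($s)) {
        throw new OutOfRangeException("byte index out of range: $byte_index");
    }
    $count = 0;
    for ($i = 0; $i < $byte_index; $i++) {
        $b = ord($s[$i]);
        // Count every byte that is not a continuation byte.
        if ($b < 0x80 || $b > 0xBF) {
            $count++;
        }
    }
    return $count;
}
```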

int codepointAt(
        string $s,
        int $codepoint_index)

Returns the code of the codepoint at the given index

Parameters:
$s   UTF-8 encoded string.
$codepoint_index   Index of the codepoint, in the range from 0 up to the length of the string minus one. Note that for an empty string there is no valid range.

Return: Code of the codepoint.

Throws:

static int length(
        string $s)

Returns the length of the string as a number of characters

Parameters:
$s   UTF-8 encoded string.

Return: Length of the string as a number of characters.
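
For a well-formed BMP string, the number of characters equals the number of bytes that can start a sequence. A minimal sketch using the pcre module follows; the function name sketch_utf8_length is hypothetical and this is not necessarily how the class computes the length.

```php
<?php
/**
 * Hypothetical sketch: counts the bytes that can start a sequence,
 * that is ASCII bytes and lead bytes in [0xc2,0xef].
 * Assumes $s is well-formed UTF-8 restricted to the BMP.
 */
function sketch_utf8_length($s)
{
    // Byte-oriented match (no /u modifier); the return value of
    // preg_match_all() is the number of full pattern matches.
    return preg_match_all('/[\x00-\x7F\xC2-\xEF]/', $s, $matches);
}
```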


string charAt(
        string $s,
        int $i)

Returns the character at the given index

Parameters:
$s   UTF-8 encoded string.
$i   Index of the character in the range from 0 up to UTF8::length($s)-1.

Return: The character as a UTF-8 string. The returned string may contain from 1 up to 3 bytes.

Throws:
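
Extracting a character by index can be sketched by splitting the string into characters with PCRE's UTF-8 mode. The function name sketch_char_at and both exception types are hypothetical. Note that PCRE also accepts 4-byte sequences, so this sketch is less strict than the BMP-only scope of this class.

```php
<?php
/**
 * Hypothetical sketch: returns the character number $i of $s.
 * Exception types are assumptions of this sketch.
 */
function sketch_char_at($s, $i)
{
    // With the /u modifier, the empty pattern splits between characters
    // rather than between bytes.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('string is not valid UTF-8');
    }
    if ($i < 0 || $i >= count($chars)) {
        throw new OutOfRangeException("character index out of range: $i");
    }
    return $chars[$i];
}
```

Splitting the whole string on every call costs time proportional to its length; a byte-walking approach that stops at the requested character would avoid building the array.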

}


Generated by PHPLint Documentator