PHP version: 5
Required modules:
standard, spl, simplexml, dom, pcre
Required packages: none
class it\icosaedro\utils\UTF8
Utility functions for UTF-8 BMP string encoding
Version: $Date: 2012/04/02 09:13:24 $
Author: Umberto Salsi <salsi@icosaedro.it>
This class provides only very basic functions, mostly intended for use in other, higher-level packages.
WARNING. These functions do not check the actual encoding of the strings passed to them and always assume that these strings are properly UTF-8 encoded. If arbitrary data are passed, unexpected results may arise.
ATTENTION. In this document the term byte always refers to a single byte of a generic string; the term character refers to a single Unicode character, that may be encoded as a sequence of 1, 2 or 3 bytes; the term codepoint refers to the numerical code of a single Unicode character in the range [0,65535].
{
static string sanitize(string $s)
Sanitizes the string by removing invalid bytes.
Invalid bytes, incomplete UTF-8 sequences, non-minimal sequences and invalid BMP codepoints are removed.
Parameters:
$s: The string to sanitize, possibly NULL.
Return: Properly encoded UTF-8 BMP string. If the subject string is NULL, NULL is returned as well.
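The removal rules above can be illustrated with a minimal sketch. This is not the class's actual code: the function name sanitize_sketch is hypothetical, and treating the surrogate range U+D800..U+DFFF as the "invalid BMP codepoints" is an assumption, since the original does not say which codepoints it rejects.

```php
<?php
// Hypothetical helper, not part of it\icosaedro\utils\UTF8.
// Keeps only well-formed, minimal 1-, 2- and 3-byte sequences.
function sanitize_sketch($s)
{
    if ($s === null) {
        return null;
    }
    $out = '';
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $b = ord($s[$i]);
        if ($b <= 0x7f) {
            // ASCII byte: always valid.
            $out .= $s[$i];
            $i++;
        } elseif ($b >= 0xc2 && $b <= 0xdf
            && $i + 1 < $n && (ord($s[$i + 1]) & 0xc0) === 0x80) {
            // Complete 2-byte sequence (lead bytes 0xc0 and 0xc1 are never accepted).
            $out .= substr($s, $i, 2);
            $i += 2;
        } elseif ($b >= 0xe0 && $b <= 0xef
            && $i + 2 < $n
            && (ord($s[$i + 1]) & 0xc0) === 0x80
            && (ord($s[$i + 2]) & 0xc0) === 0x80) {
            $code = (($b & 0x0f) << 12)
                | ((ord($s[$i + 1]) & 0x3f) << 6)
                | (ord($s[$i + 2]) & 0x3f);
            // Drop non-minimal encodings (< 0x800) and, by assumption, surrogates.
            if ($code >= 0x800 && !($code >= 0xd800 && $code <= 0xdfff)) {
                $out .= substr($s, $i, 3);
            }
            $i += 3;
        } else {
            // Stray continuation byte, unused lead byte or truncated sequence.
            $i++;
        }
    }
    return $out;
}
```

In this sketch a rejected 3-byte sequence is dropped as a whole, while a lead byte without its continuation bytes is dropped one byte at a time; the real implementation may resynchronize differently.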
static string chr(int $code)
Returns the codepoint as a UTF-8 string of bytes.
Parameters:
$code: Codepoint in [0,65535].
Return: String of bytes that represents the given codepoint.
Throws:
- unchecked OutOfRangeException: If the codepoint is invalid.
static boolean isCont(int $b)
Returns true if the passed byte is a continuation byte of a UTF-8 sequence.
Parameters:
$b: Subject byte.
Return: True if the subject byte is a continuation byte of a UTF-8 sequence.
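As a rough illustration of what encoding a BMP codepoint and testing for a continuation byte involve, here is a minimal sketch. The names chr_sketch and is_cont_sketch are hypothetical, and the sketch only rejects codepoints outside [0,65535]; whether the real chr() also rejects other values is not stated in this document.

```php
<?php
// Hypothetical helpers, not part of it\icosaedro\utils\UTF8.

// Encodes a codepoint in [0,65535] as 1, 2 or 3 UTF-8 bytes.
function chr_sketch($code)
{
    if ($code < 0 || $code > 0xffff) {
        throw new OutOfRangeException("invalid codepoint: " . $code);
    }
    if ($code < 0x80) {
        return chr($code);
    }
    if ($code < 0x800) {
        return chr(0xc0 | ($code >> 6))
            . chr(0x80 | ($code & 0x3f));
    }
    return chr(0xe0 | ($code >> 12))
        . chr(0x80 | (($code >> 6) & 0x3f))
        . chr(0x80 | ($code & 0x3f));
}

// Continuation bytes have the bit pattern 10xxxxxx, that is [0x80,0xbf].
function is_cont_sketch($b)
{
    return ($b & 0xc0) === 0x80;
}
```

For example, under these assumptions chr_sketch(0x20ac) yields the three bytes 0xe2 0x82 0xac, of which only the last two satisfy is_cont_sketch().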
static int sequenceLength(int $b)
Returns the length of the UTF-8 sequence given its starting byte.
Byte code ranges are as follows (by increasing code):
[0x00,0x7f]  1-byte sequence (ASCII)   returns 1
[0x80,0xbf]  continuation byte         returns 0
[0xc0,0xc1]  unused byte codes         returns 0
[0xc2,0xdf]  2-byte sequence start     returns 2
[0xe0,0xef]  3-byte sequence start     returns 3
[0xf0,0xff]  unused byte codes         returns 0
Parameters:
$b: First byte of the sequence in [0,255].
Return: Length of the sequence in bytes, that is 1, 2 or 3. Returns 0 if the byte code is invalid or out of the range [0,255].
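The byte ranges listed above translate directly into a chain of comparisons. The following is a minimal sketch of that mapping; the function name sequence_length_sketch is hypothetical and the code is not taken from the class itself.

```php
<?php
// Hypothetical helper mirroring the documented byte code ranges.
function sequence_length_sketch($b)
{
    if ($b < 0x00 || $b > 0xff) {
        return 0; // out of the range [0,255]
    }
    if ($b <= 0x7f) {
        return 1; // ASCII
    }
    if ($b <= 0xc1) {
        return 0; // continuation byte [0x80,0xbf] or unused [0xc0,0xc1]
    }
    if ($b <= 0xdf) {
        return 2; // start of a 2-byte sequence
    }
    if ($b <= 0xef) {
        return 3; // start of a 3-byte sequence
    }
    return 0;     // [0xf0,0xff]: not used for BMP strings
}
```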
static int codepointAtByteIndex(string $s, int $byte_index)
Returns the codepoint at a given position in a string.
Parameters:
$s: UTF-8 encoded string.
$byte_index: Byte index of the sequence.
Return: The code of the codepoint.
Throws:
- unchecked OutOfRangeException: If the index is invalid.
static int byteIndex(string $s, int $codepoint_index)
Returns the byte index given the UTF-8 sequence index.
Parameters:
$s: UTF-8 encoded string.
$codepoint_index: Index of the UTF-8 sequence, ranging from 0 (the first sequence) up to the length of the string in characters. Note that no sequence exists at this last index: the byte index returned in that case points just past the end of the string.
Return: Byte index of the UTF-8 sequence.
Throws:
- unchecked OutOfRangeException: If the parameter is out of the range from 0 up to the length of the string in characters.
static int codepointIndex(string $s, int $byte_index)
Returns the codepoint index given its byte index.
Parameters:
$s: UTF-8 encoded string.
$byte_index: Byte index of the codepoint, in [0,strlen($s)]. Note that if $byte_index is exactly equal to strlen($s), then the result is the length of the string in codepoints.
Return: Index of this codepoint, that is the number of UTF-8 sequences from the beginning of the string up to that byte index.
Throws:
- unchecked OutOfRangeException: If $byte_index is out of the range [0,strlen($s)].
int codepointAt(string $s, int $codepoint_index)
Returns the code of the codepoint at the given index.
Parameters:
$s: UTF-8 encoded string.
$codepoint_index: Index of the codepoint, in the range from 0 up to the length of the string minus one. Note that for an empty string there is no valid index.
Return: Code of the codepoint.
Throws:
- unchecked OutOfRangeException: If the index is invalid.
static int length(string $s)
Returns the length of the string as a number of characters.
Parameters:
$s: UTF-8 encoded string.
Return: Length of the string as a number of characters.
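The relationship between byte indices, codepoint indices and the length in characters can be sketched as follows. This is only an illustration under the class's own assumption that the input is well-formed UTF-8; the function names are hypothetical, and the sketch assumes that a byte index always points at the start of a sequence.

```php
<?php
// Hypothetical helpers, not part of it\icosaedro\utils\UTF8.

// Counts characters by counting the bytes that are not continuation bytes.
function length_sketch($s)
{
    $count = 0;
    $n = strlen($s);
    for ($i = 0; $i < $n; $i++) {
        if ((ord($s[$i]) & 0xc0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

// Maps a codepoint index in [0, length] to a byte index in [0, strlen].
function byte_index_sketch($s, $codepoint_index)
{
    if ($codepoint_index < 0) {
        throw new OutOfRangeException("negative index");
    }
    $n = strlen($s);
    $byte_index = 0;
    for ($k = 0; $k < $codepoint_index; $k++) {
        if ($byte_index >= $n) {
            throw new OutOfRangeException("index beyond end of string");
        }
        // Skip the leading byte, then any continuation bytes that follow it.
        $byte_index++;
        while ($byte_index < $n && (ord($s[$byte_index]) & 0xc0) === 0x80) {
            $byte_index++;
        }
    }
    return $byte_index;
}

// Maps a byte index in [0, strlen] to the number of sequences before it.
function codepoint_index_sketch($s, $byte_index)
{
    if ($byte_index < 0 || $byte_index > strlen($s)) {
        throw new OutOfRangeException("byte index out of range");
    }
    return length_sketch(substr($s, 0, $byte_index));
}
```

With the 7-byte string "a\xc3\xa8\xe2\x82\xacz" (four characters), codepoint index 2 corresponds to byte index 3, and byte index 7, one past the last byte, corresponds to codepoint index 4, the length in characters.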
string charAt(string $s, int $i)
Returns the character at the given index.
Parameters:
$s: UTF-8 encoded string.
$i: Index of the character in the range from 0 up to UTF8::length($s)-1.
Return: The character as a UTF-8 string. The returned string may contain from 1 up to 3 bytes.
Throws:
- unchecked OutOfRangeException: If the index is invalid.
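Extracting a character by index, and decoding it to its codepoint, might look like the sketch below. The names char_at_sketch and codepoint_of_sketch are hypothetical, and the code assumes well-formed UTF-8 BMP input, as the class itself does.

```php
<?php
// Hypothetical helpers, not part of it\icosaedro\utils\UTF8.

// Returns the 1- to 3-byte substring holding the character at index $i.
function char_at_sketch($s, $i)
{
    $n = strlen($s);
    $pos = 0;
    $index = 0;
    while ($pos < $n) {
        $b = ord($s[$pos]);
        // Sequence length from the leading byte (well-formed input assumed).
        $len = ($b <= 0x7f) ? 1 : (($b <= 0xdf) ? 2 : 3);
        if ($index === $i) {
            return substr($s, $pos, $len);
        }
        $pos += $len;
        $index++;
    }
    throw new OutOfRangeException("index out of range: " . $i);
}

// Decodes a single UTF-8 encoded BMP character to its codepoint.
function codepoint_of_sketch($c)
{
    $b = ord($c[0]);
    if ($b <= 0x7f) {
        return $b;
    }
    if ($b <= 0xdf) {
        return (($b & 0x1f) << 6) | (ord($c[1]) & 0x3f);
    }
    return (($b & 0x0f) << 12)
        | ((ord($c[1]) & 0x3f) << 6)
        | (ord($c[2]) & 0x3f);
}
```

Because the sketch scans from the start of the string on every call, each lookup costs time proportional to the index; code that walks a whole string would normally advance by byte index instead.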
}
Generated by PHPLint Documentator