Figure 1 - Matching
As illustrated in the figure in the previous section, a filter is matched against directory information to determine which directory entries are to be returned to the accessing user. However, such matching is not straightforward. It is not a simple octet-for-octet comparison. Matching is complicated by the following:
- The information in the filter and the stored directory information may be encoded using different character sets, for example Latin-1 and the Universal Multiple-Octet Coded Character Set (UCS).
- Even when both the filter and the stored directory information use UCS, different encodings might be applied. Each character may be encoded in different ways:
- UCS-4 (UTF-32) - word oriented
- BMP (UCS-2) - half word oriented
- UTF-8 - octet oriented
It is necessary to bring the filter information and the stored directory information into a common format.
- Even when the filter information and the directory information are normalized into the same UCS encoding, some characters have alternative encodings. As an example, the Danish 'Å' has three different encodings:
- The Danish 'Å' has its own code point. In UTF-8 it is then C3 85'H.
- The Danish 'Å' may be a combination of 'A' (41'H in UTF-8) followed by COMBINING RING ABOVE (030A'H in UCS-2 or CC 8A'H in UTF-8).
- The Danish 'Å' may also be coded as the Angstrom sign (E2 84 AB'H in UTF-8).
- Most matching ignores case (case ignore matching). As an example, when case ignore matching is used, Copenhagen will match CopenHagen and copenhagen.
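The complications listed above can be sketched in a few lines of Python. This is only an illustration, not a prescribed matching procedure: the function names canonical_key and matches are hypothetical, Unicode Normalization Form C is assumed as the common form for the alternative encodings of 'Å', Python's str.casefold() is assumed as a stand-in for case ignore matching, and UCS-2 is assumed to be decodable as big-endian UTF-16 (true for BMP characters without surrogates).

```python
import unicodedata


def canonical_key(octets: bytes, encoding: str) -> str:
    """Reduce an encoded value to a form suitable for comparison.

    Hypothetical helper; the three steps mirror the complications in the text.
    """
    # Step 1: bring the octets into a common format by decoding them
    # (for example "latin-1", "utf-8", "utf-16-be", "utf-32-be").
    text = octets.decode(encoding)
    # Step 2: collapse alternative encodings of the same character.
    # NFC maps 'A' + COMBINING RING ABOVE and the Angstrom sign to U+00C5.
    text = unicodedata.normalize("NFC", text)
    # Step 3: ignore case, then normalize again because case folding
    # can produce sequences that are no longer in NFC.
    return unicodedata.normalize("NFC", text.casefold())


def matches(filter_octets: bytes, filter_encoding: str,
            stored_octets: bytes, stored_encoding: str) -> bool:
    """Compare a filter value with a stored value after canonicalization."""
    return (canonical_key(filter_octets, filter_encoding)
            == canonical_key(stored_octets, stored_encoding))


if __name__ == "__main__":
    # The three encodings of the Danish 'Å' from the text, in UTF-8.
    precomposed = b"\xc3\x85"
    combining = b"\x41\xcc\x8a"
    angstrom = b"\xe2\x84\xab"
    print(matches(precomposed, "utf-8", combining, "utf-8"))
    print(matches(precomposed, "utf-8", angstrom, "utf-8"))
    print(matches(b"Copenhagen", "latin-1", b"copenhagen", "utf-8"))
```

With these assumptions, an octet-for-octet comparison is applied only after both values have been decoded, normalized, and case folded, so the filter and the stored value may arrive in different character sets and encodings.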