Unicode Normalization for Vietnam

Test Page 1

A Test and Demonstration Page by Stefan Probst and James Do

Version 1.0: 2001-12-26 Stefan Probst -- initial idea and first part
    format of original file: "charset=ISO 8859-1"
Version 1.1: 2001-12-27 jDo -- added UTF-8 section, format of file therefore "UTF-8"
Version 1.2: 2001-12-28 Stefan Probst -- minor editorial touch up and added "Notes" for educational purposes.
Version 1.3: 2002-01-05 Stefan Probst -- added search test and test form for feedback

 

A) Unicode Characters outside the ASCII table
are encoded using "Numeric Character References" (NCR)

(When only NCRs are used, the file itself contains nothing but 7-bit ASCII characters,
so the HTML file may also be specified as "charset=ISO-8859-1".)
This is only usable with applications that can handle NCRs, e.g. web browsers.
Those applications recognize an NCR by its initial "&#" escape sequence.
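
As a rough illustration of how such NCRs can be produced, the following Python sketch replaces every non-ASCII character with its reference. The helper name to_ncr is hypothetical and not part of this test page; the built-in "xmlcharrefreplace" error handler gives the same decimal result.

```python
def to_ncr(text: str, hex_form: bool = False) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)              # plain ASCII stays as it is
        elif hex_form:
            out.append(f"&#x{cp:X};")   # hexadecimal form
        else:
            out.append(f"&#{cp};")      # decimal form
    return "".join(out)

precomposed = "Vi\u1ec7t Nam"           # fully precomposed spelling (form 1)
print(to_ncr(precomposed))              # Vi&#7879;t Nam
print(to_ncr(precomposed, True))        # Vi&#x1EC7;t Nam

# The standard library offers the decimal variant directly:
print(precomposed.encode("ascii", "xmlcharrefreplace"))  # b'Vi&#7879;t Nam'
```

The output is pure ASCII, which is why a file built this way does not depend on an 8-bit charset declaration.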

1) Fully Precomposed:
Unicode Normalization Form C ("NFC"):
Việt Nam: Việt Nam Việt Nam

2) "VN-Combining", i.e. characters pre-composed, tone marks combining
Note: There is no international standard for this behaviour:
Viê  ̣t Nam: Việt Nam Việt Nam

3) "VN-Canonical", i.e. only combining characters, tone marks sorted last
Note: There is no international standard for this order:
Vie ^  ̣t Nam: Việt Nam Việt Nam

4) Only combining characters, sorted by canonical order:
Unicode Normalization Form D ("NFD"):
Vie  ̣^t Nam: Việt Nam Việt Nam
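
The four spellings above can be compared with a short Python sketch. The code points are written out explicitly so that the example does not depend on how this page itself is encoded; the dictionary name FORMS and the helper code_points are made up for illustration.

```python
import unicodedata

# The four spellings of "Viet Nam" from this section, as explicit code points.
FORMS = {
    "1 NFC":          "Vi\u1ec7t Nam",          # precomposed U+1EC7
    "2 VN-Combining": "Vi\u00ea\u0323t Nam",    # e-circumflex + combining dot below
    "3 VN-Canonical": "Vie\u0302\u0323t Nam",   # e + circumflex + dot below
    "4 NFD":          "Vie\u0323\u0302t Nam",   # e + dot below + circumflex
}

def code_points(s: str) -> str:
    """Return the code points of s in U+XXXX notation."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

for label, s in FORMS.items():
    nfc = unicodedata.normalize("NFC", s)
    nfd = unicodedata.normalize("NFD", s)
    print(f"{label:16} {code_points(s)}")
    print(f"{'':16} already NFC: {s == nfc}, already NFD: {s == nfd}")
```

All four inputs normalize to the same NFC string and to the same NFD string; only spelling 1 is already in NFC and only spelling 4 is already in NFD.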

 

B) Unicode Characters outside the ASCII table
are encoded using UTF-8 format

(i.e. each such character is encoded as a sequence of two or more bytes, all taken from the right side (0x80-0xFF) of the 8-bit table.)
In order to instruct the browser or any other application to interpret those sequences
as representations of a Unicode character (and not to render each byte separately),
the file format has to be specified as "charset=utf-8".
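
A small Python sketch of what happens at the byte level, and of what a reader sees when the charset is not honoured. The choice of "cp1252" as the wrongly assumed charset is only an assumption for illustration; other 8-bit charsets give different but equally wrong results.

```python
text = "Vi\u1ec7t"                 # precomposed spelling
raw = text.encode("utf-8")

print(raw)                         # b'Vi\xe1\xbb\x87t'
print(raw.hex(" "))                # 56 69 e1 bb 87 74

# Correct interpretation: the three bytes e1 bb 87 form one character.
print(raw.decode("utf-8") == text)  # True

# Wrong interpretation: every byte is rendered as a separate character.
print(raw.decode("cp1252"))        # Viá»‡t
```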

1) Fully Precomposed:
Unicode Normalization Form C ("NFC"):
Việt Nam: Việt Nam Việt Nam

2) "VN-Combining", i.e. characters pre-composed, tone marks combining
Note: There is no international standard for this behaviour:
Viê  ̣t Nam: Việt Nam Việt Nam

3) "VN-Canonical", i.e. only combining characters, tone marks sorted last
Note: There is no international standard for this order:
Vie ^  ̣t Nam: Việt Nam Việt Nam

4) Only combining characters, sorted by canonical order:
Unicode Normalization Form D ("NFD"):
Vie  ̣^t Nam: Việt Nam Việt Nam


Notes:
- Both file encodings (and a few more) can be used to represent "Unicode characters".
- NCRs can use decimal values (&#...;) or hex values (&#x...;).
- In HTML files, the charset specification in the header not only tells the browser
      how to read the file, but also how to encode input from forms
      when sending the data back to the server.
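
The notes on NCR notation and on form encoding can be sketched in Python as follows. The percent-encoding call only approximates what a browser does when it submits a form from a page declared as "charset=utf-8"; actual browser behaviour may differ in detail.

```python
import html
from urllib.parse import quote

# Decimal and hexadecimal NCRs denote the same character.
decimal_ncr = html.unescape("Vi&#7879;t Nam")
hex_ncr = html.unescape("Vi&#x1EC7;t Nam")
print(decimal_ncr == hex_ncr == "Vi\u1ec7t Nam")   # True

# Form input sent back from a UTF-8 page: the UTF-8 bytes are percent-encoded.
print(quote("Vi\u1ec7t", encoding="utf-8"))       # Vi%E1%BB%87t
```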

Test Instructions:
Open this file in a web browser (e.g. Internet Explorer).
- do all the characters appear ok?
Print it.
- do all the characters appear ok?
Do a search test:
- copy all eight forms (from A1 to B4) of the Vietnamese word "Viet Nam" one by one,
      paste them into your browser's "find/search" dialog box,
      and check in each case which of the eight occurrences are found.
Then copy the whole page and paste it into your word processor (e.g. MS Word).
- are the characters ok?
Change the font size of the whole document, e.g. to "12", "14", etc.
- how are the characters rendered? (i.e. quality of the "ệ": is the dot exactly below it?)
Repeat the search test as for the browser.
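
Why the search test matters can be shown with a minimal Python sketch: a plain code-point comparison finds only the spelling that was pasted in, while a search that normalizes both sides first finds all of them. The names FORMS, raw_hits and normalized_hits are hypothetical, and this is not a description of how any particular browser or word processor implements its search.

```python
import unicodedata

# The four spellings of "Viet Nam", keyed by their number on this page.
FORMS = {
    "1": "Vi\u1ec7t Nam",
    "2": "Vi\u00ea\u0323t Nam",
    "3": "Vie\u0302\u0323t Nam",
    "4": "Vie\u0323\u0302t Nam",
}

def raw_hits(needle: str) -> list:
    """Spellings found by a plain code-point search."""
    return [k for k, text in FORMS.items() if needle in text]

def normalized_hits(needle: str, form: str = "NFC") -> list:
    """Spellings found when both sides are normalized before searching."""
    n = unicodedata.normalize(form, needle)
    return [k for k, text in FORMS.items()
            if n in unicodedata.normalize(form, text)]

for key, needle in FORMS.items():
    print(f"find version {key}: raw {raw_hits(needle)}, "
          f"normalized {normalized_hits(needle)}")
```

With the plain search each version finds only itself; with normalization every version finds all four.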

Copy the following form into your email program,
fill it in as far as possible, and send it to Unicode-Tests@isoc-vn.org

*******************************************
Results of Unicode Tests
Used Testpage: 1.3

1) Platform:
OS (kind, version)     :
Browser (incl. version):
Wordprocessor          :
Printer                :

2) Results:
Display in Browser
                     ok:
                 not ok:
               comments:
Print from Browser
                     ok:
                 not ok:
               comments:
Find in Browser
  find version A1 finds:
  find version A2 finds:
  find version A3 finds:
  find version A4 finds:
  find version B1 finds:
  find version B2 finds:
  find version B3 finds:
  find version B4 finds:
Display in Wordprocessor
                     ok:
                 not ok:
               comments:
Print from Wordprocessor
                     ok:
                 not ok:
               comments:
Find in Wordprocessor
  find version A1 finds:
  find version A2 finds:
  find version A3 finds:
  find version A4 finds:
  find version B1 finds:
  find version B2 finds:
  find version B3 finds:
  find version B4 finds:

Other comments         :
Tested by              :
*******************************************
Example:
*******************************************
Results of Unicode Tests
Used Testpage: 1.3

1) Platform:
OS (kind, version)     : Windows ME
Browser (incl. version): Internet Explorer 5.5
Wordprocessor          : MS Word 97
Printer                : HP Laserjet 4L

2) Results:
Display in Browser
                     ok: all
                 not ok:
               comments:
Print from Browser
                     ok: A2, A3, A4, B2, B3, B4
                 not ok: A1, B1 (prints question marks "?")
               comments:
Find in Browser
  find version A1 finds: A1, B1
  find version A2 finds: A2, B2
  find version A3 finds: A3, B3
  find version A4 finds: A4, B4
  find version B1 finds: A1, B1
  find version B2 finds: A2, B2
  find version B3 finds: A3, B3
  find version B4 finds: A4, B4
Display in Wordprocessor
                     ok: A1, A2, B1, B2
                 not ok: A3, A4, B3, B4 (squares)
               comments: dot in A2 and B2 in some font sizes
                         not exactly below the "e", but far left.
Print from Wordprocessor
                     ok: A1, A2, B1, B2
                 not ok: A3, A4, B3, B4 (squares)
               comments: dot in A2 and B2 in some font sizes
                         not exactly below the "e", but far left.
Find in Wordprocessor
  find version A1 finds: A1, B1
  find version A2 finds: A2, B2
  find version A3 finds: A3, B3
  find version A4 finds: A4, B4
  find version B1 finds: A1, B1
  find version B2 finds: A2, B2
  find version B3 finds: A3, B3
  find version B4 finds: A4, B4

Other comments         :
Tested by              : Stefan Probst
*******************************************