Title: Unicode Research and Programming
Post by: RayRay on September 05, 2011, 04:53:55 PM

I was going to explain Unicode further on the Unicode Snowman topic, but SMF suggested that I should make another topic.

Remember that I said you could produce a Unicode or ASCII character with Alt and the Keypad?

After learning some hexadecimal (base 16, which uses the sixteen digits 0-9 and A-F, where decimal uses ten), I learned the TRUE meaning of Unicode. (You could look this stuff up on Wikipedia, but I just like to sum it up here.)

I was learning about how NES programming works, so I went to read about bytes and how each byte is written as two hexadecimal digits, each one standing for a 4-bit half called a 'nibble'. When I read about nibbles, I saw this chart:
(http://upload.wikimedia.org/wikipedia/commons/b/bf/Octets_in_CP866_ordered_by_nibbles.png)
I looked at the bottom two rows, and it looked really familiar.
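Here's a quick Python sketch of the nibble idea, in case it helps. The function name split_nibbles is just something made up for this example; it isn't from Wikipedia or the chart.

```python
def split_nibbles(byte):
    """Return the (high, low) nibbles of a byte value from 0 to 255."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value from 0 to 255")
    high = byte >> 4      # top 4 bits: the first hex digit
    low = byte & 0x0F     # bottom 4 bits: the second hex digit
    return high, low


high, low = split_nibbles(0x0D)
print(f"{high:X} {low:X}")  # 0 D
```

So a byte like 0D is looked up in a chart like that one with 0 as one nibble and D as the other.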

Later I went to read about Unicode. Turns out that there aren't just 10,000 characters. The code points (the index numbers) run from 0 to 10FFFF in hex, which is 1,114,112 of them! However, most of those code points don't have characters assigned to them; so far only a bit over 100,000 are actually in use, not a million.
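The 1,114,112 figure is easy to check. A small Python sketch (the variable names are only for illustration):

```python
LAST_CODE_POINT = 0x10FFFF       # highest Unicode code point, in hex

total = LAST_CODE_POINT + 1      # +1 because counting starts at 0
planes = total // 0x10000        # the range splits into blocks of 65,536

print(total)   # 1114112
print(planes)  # 17
```

In other words, the range is 17 blocks of 65,536 code points each.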

Unicode isn't really a product; it's a standard, and Windows comes with support for it built in. It covers emoticons, symbols, letters and numbers, Japanese characters, and the box-drawing pieces used for old text-mode window graphics. Also, text files have an encoding with a certain width. They can use normal 8-bit characters, which gives exactly the 256 characters you see in the picture. (Some characters are a bit different, actually.) But some files are 16-bit, or even 32-bit. If you open a 32-bit file in an 8-bit editor, you should notice that every 4 characters shown is really just 1 character.
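To see the 4-characters-for-1 effect without a hex editor, here is a rough Python sketch. I'm assuming "16-bit" and "32-bit" mean the UTF-16 and UTF-32 encodings, and I'm using Latin-1 as a stand-in for "an 8-bit editor"; the little-endian variants are picked only so no byte-order mark gets added.

```python
text = "Hi"

as_8bit = text.encode("latin-1")      # 1 byte per character
as_16bit = text.encode("utf-16-le")   # 2 bytes per character here
as_32bit = text.encode("utf-32-le")   # 4 bytes per character

print(as_8bit.hex())    # 4869
print(as_16bit.hex())   # 48006900
print(as_32bit.hex())   # 4800000069000000

# Read the 32-bit bytes back as if each byte were its own character,
# which is roughly what an 8-bit editor would do.
misread = as_32bit.decode("latin-1")
print(len(text), len(misread))  # 2 8
```

The two real characters come back as eight, with three filler bytes after each letter.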

A nifty little fact I found is that each time you press the Enter button on Windows, you are actually sending two 8-bit characters, 16 bits in total. In ASCII they are 0D (carriage return) followed by 0A (line feed); you can find them in the picture by high and low nibble, where they are drawn as ♪ and ◙. Windows uses 0D 0A (hex) for a newline, while Unix-style systems use just 0A. I looked in a hex editor, and found that I was correct. If you try to do Alt+3338 (0D0A read as one number is 3338 in decimal), you just get an inverted circle, since Alt codes aren't built to do that. (Input here only reaches the first 256 characters and wraps if you exceed 255, so 3338 wraps around to 10, which is 0A, the ◙.) Google 'newline' if you don't get it.
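The numbers here can be checked with a few lines of Python. This is only a sketch of the arithmetic; it doesn't simulate the Alt key itself, and the wrap-at-256 step just follows the behaviour described above.

```python
windows_newline = "\r\n".encode("ascii")   # carriage return + line feed
print(windows_newline.hex())               # 0d0a

# Treat the two bytes as one big number, like typing 0D0A as an Alt code.
as_number = int.from_bytes(windows_newline, "big")
print(as_number)                           # 3338

# Anything above 255 wraps around, leaving only the low byte.
wrapped = as_number % 256
print(wrapped, hex(wrapped))               # 10 0xa
```

The leftover value 10 is 0A, the line feed, which is the character that chart draws as ◙.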

All of this stuff is straight from Wikipedia.

Title: Re: Unicode Research and Programming
Post by: ARTgames on September 05, 2011, 06:00:34 PM

Yeah dude, you have the idea. Before Unicode it was a messy world. ASCII was OK, but 1 byte was way too small to hold every symbol for every language in the world that needed to be on a computer. So in other parts of the world they came up with other encoding methods that were not compatible with each other. It was a mess.

Then Unicode was invented and things got a lot easier. You do still come across problems here and there, but it's still a great standard.

Stick Online Forums
