
Character encoding knowledge front-end developers need to understand

2022-04-29 12:38:39 · jimojianghu

Character set and character encoding

A character set is a collection of characters, such as the common ASCII character set, the GB2312 character set, and the Unicode character set. The biggest difference between these character sets is the number of characters they contain.

A character encoding is the actual set of rules used to represent a character set so that a computer can parse the characters, such as GB2312, GBK, and UTF-8. In essence, a character encoding defines how characters are represented with binary bytes.

A character set and its encodings have a one-to-many relationship: the same character set may have multiple encodings. The Unicode character set, for example, has UTF-8, UTF-16, and others.

In front-end development, JavaScript programs use the Unicode character set, and JavaScript source text is usually encoded as UTF-8.
Strings in JavaScript code, however, are UTF-16 encoded. This mismatch is one reason API responses can appear garbled on the front end: most services use UTF-8, so the encodings are inconsistent if not handled correctly.

As for the history of character sets, it can be summed up in one sentence: almost all of them are extensions of the ASCII character set.

ASCII

We know that computers process information in binary.
Each binary bit has two states, 0 and 1. A byte consists of 8 bits and therefore has 256 possible states.

ASCII is a single-byte character set based on the Latin alphabet and used mainly for English. Its codes map one-to-one to characters. Because it uses only the 8 bits of a single byte, it can represent no more than 256 characters.

Standard ASCII defines 128 characters (2^7): 33 control characters (codes 0–31 and 127) and 95 printable characters, including the familiar upper- and lowercase letters, digits, and punctuation. Because it occupies only the low 7 bits of a byte, the highest bit of the byte is usually set to 0.

'a'.charCodeAt() // 97
'A'.charCodeAt() // 65
'9'.charCodeAt() // 57
'.'.charCodeAt() // 46

As shown above, each character corresponds to a code (a numeric identifier), ranging from 0 to 127. The complete ASCII table is easy to find online.
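Since standard ASCII occupies only code points 0–127, we can test whether a string is pure ASCII by checking each code point (a minimal sketch; `isAscii` is a hypothetical helper name, not a standard API):

```javascript
// Pure 7-bit ASCII means every code point is at most 0x7F (127).
const isAscii = str => [...str].every(ch => ch.codePointAt(0) <= 0x7F);

isAscii('hello, world!') // true
isAscii('好') // false
```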

Looking at the ASCII table, you may notice that the lowercase letters are not placed immediately after the uppercase ones. This is deliberate, to make case conversion easy: 'A' sits at position 65 (64 + 1), while 'a' sits at position 97 (64 + 32 + 1).

65 ^ 32 = 97
// A ^ 32 = a
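Because upper- and lowercase letters differ only in the bit worth 32, XOR with 32 toggles the case of an ASCII letter. A small sketch (`toggleCase` is a hypothetical helper name):

```javascript
// Flip bit 5 (value 32) of the code to switch an ASCII letter's case.
function toggleCase(ch) {
  return String.fromCharCode(ch.charCodeAt(0) ^ 32);
}

toggleCase('A') // 'a'
toggleCase('a') // 'A'
```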

The development history of character sets

ASCII is the foundation of almost all character sets.

Standard ASCII can identify at most 128 characters, which is enough for English-speaking countries, but other countries have many more characters, so it naturally falls short.
This is where the highest bit comes in: by using the high bit of the ASCII byte, extended ASCII can also cover the special symbols some countries need.

But the non-Latin-script countries of Asia, Africa, and Latin America have thousands of characters, so new approaches were needed.
Chinese, for example, extended the scheme again: characters below 128 keep the same meaning as standard ASCII, while a Chinese character is identified with 2 bytes, each greater than 127. This multi-byte character set is GB2312. It was later extended repeatedly — for traditional characters, additional symbols, and even minority-language scripts — producing character sets such as GBK.
In this way, many countries developed their own coded character sets, all based on ASCII.

Although each of these character sets is compatible with standard ASCII, the inconvenience in communication is obvious, and garbled text was everywhere. The Unicode character set was born to solve this problem.

Unicode

Unicode, formulated by an international standards body, is a character-set scheme designed to hold all the scripts and symbols in the world.
Its first 128 characters are the same as ASCII. After that it expands, mapping characters to numbers from 0 to 0x10FFFF, which allows up to 1,114,112 characters. Only a fraction of that space is currently assigned.
The code points of most common characters can be written with two bytes.

  • Code points
    Unicode assigns each character a numeric identifier, called a code point.
    Code points are written in the form U+hex, where U+ is the Unicode prefix and hex is a hexadecimal number. The range runs from U+0000 to U+10FFFF.

Each code point corresponds to a character. Most common characters are among the first 65536, in the range U+0000 to U+FFFF.
The code points of common Chinese characters generally fall in the interval U+2E80 - U+9FFF.
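Based on that interval, we can sketch a rough test for common CJK characters (a simplification of the real Unicode block layout; `isCJK` is a hypothetical helper name):

```javascript
// Rough check: does the character's code point fall in the
// common CJK interval U+2E80–U+9FFF mentioned above?
const isCJK = ch => {
  const cp = ch.codePointAt(0);
  return cp >= 0x2E80 && cp <= 0x9FFF;
};

isCJK('好') // true
isCJK('a') // false
```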

  • Character planes
    Unicode is currently divided into 17 groups, called planes, each holding 65536 code points.
    The first plane is the Basic Multilingual Plane, range U+0000 - U+FFFF; most common characters fall in this range.
    The other planes are supplementary planes, range U+10000 - U+10FFFF — home of the emoji we often see on the web.

  • Code units
    A code unit is the smallest basic unit used when encoding code points, treated as an indivisible whole. The job of a character encoding is to convert Unicode code points into sequences of code units.
    The common Unicode encodings are UTF-8, UTF-16, and UTF-32; UTF stands for Unicode Transformation Format.
    UTF-8 uses 8-bit single-byte code units, UTF-16 uses 16-bit two-byte code units, and UTF-32 uses 32-bit four-byte code units.

Encoding   Code unit   Bytes per encoded character
UTF-8      8 bits      1-4 bytes
UTF-16     16 bits     2 or 4 bytes
UTF-32     32 bits     4 bytes

Incidentally, why are code points and similar data so often written in hexadecimal?
Because two hex digits correspond exactly to the 8 bits of one byte: 0xff = 0b11111111.

UTF-8

UTF-8 is a variable-length character encoding that uses 1 to 4 bytes per character.
It is the most widely used encoding of the Internet era and the one front-end developers encounter most.

Note that a Chinese character generally takes 3 bytes, and an emoji generally takes 4 bytes.
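We can confirm those byte counts directly with TextEncoder, which is covered later in this article (`utf8ByteLength` is a hypothetical helper name):

```javascript
// TextEncoder.encode returns the UTF-8 bytes of a string,
// so its length is the UTF-8 byte count.
const utf8ByteLength = str => new TextEncoder().encode(str).length;

utf8ByteLength('a')  // 1
utf8ByteLength('好') // 3
utf8ByteLength('😄') // 4
```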

The UTF-8 encoding rules:

  • A 1-byte character starts with 0; the remaining 7 bits hold the code point, identical to ASCII.
  • For an n-byte character, the first byte starts with n ones followed by a 0, which tells you how many bytes the character occupies. Each of the remaining bytes starts with the 2 bits 10.
    These fixed bits are all prefixes; the bits of the character's code point are sliced up and placed, in order, into the positions left over after the prefixes — which is why UTF-8 is also called a prefix code.
    The format is shown in the table:
Bytes   Code point bits   Code point range    Encoding pattern
1       7                 U+0000~U+007F       0xxxxxxx
2       11                U+0080~U+07FF       110xxxxx 10xxxxxx
3       16                U+0800~U+FFFF       1110xxxx 10xxxxxx 10xxxxxx
4       21                U+10000~U+10FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

With the rules in the table above, we can carry out all kinds of conversions.
Let's take the UTF-8 encoding of a Chinese character as an example, say '好':

  1. The Unicode code point of '好': '好'.codePointAt() // 22909;
  2. 22909 falls in the 3-byte row of the table, code point range U+0800 (2048) to U+FFFF (65535);
  3. The binary value of 22909 is 101100101111101, which is 15 bits;
  4. The 3-byte pattern requires 16 bits, so pad a 0 in front and split into 3 groups per the table: 0101 100101 111101;
  5. Fill in the corresponding prefixes: 11100101 10100101 10111101, obtaining 3 bytes;
  6. Convert the three bytes to hexadecimal: E5 A5 BD. So the UTF-8 encoding of the Chinese character '好' is E5 A5 BD.

We can verify this with encodeURI, which converts Chinese characters to percent-encoded UTF-8:

encodeURI('好') // '%E5%A5%BD'

Remove the percent signs and the result matches exactly.
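The six steps above can be condensed into a small encoder that follows the table directly (an illustrative sketch, not production code; `utf8Bytes` is a hypothetical helper name):

```javascript
// Encode a single Unicode code point into its UTF-8 bytes,
// mirroring the prefix table: 0xC0 = 110xxxxx, 0xE0 = 1110xxxx,
// 0xF0 = 11110xxx, 0x80 = 10xxxxxx continuation bytes.
function utf8Bytes(cp) {
  if (cp <= 0x7F) return [cp];
  if (cp <= 0x7FF) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
  if (cp <= 0xFFFF)
    return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

utf8Bytes('好'.codePointAt(0)).map(b => b.toString(16)) // ['e5', 'a5', 'bd']
```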

UTF-16

The UTF-16 encoding scheme: characters in the basic plane occupy 2 bytes (U+0000 to U+FFFF), while characters in the supplementary planes occupy 4 bytes (U+10000 to U+10FFFF).
In other words, a UTF-16 encoding is either 2 bytes or 4 bytes long. In the 2-byte case, it is simply the code point itself.

There is one more principle: in the Unicode Basic Multilingual Plane, the code points from U+D800 to U+DFFF are not assigned to any character. UTF-16 uses exactly these code points to encode the characters of the supplementary planes.
The specific rules:

  • If the code point is at most U+FFFF, it is a basic character: no processing needed, it is used directly as two bytes.
  • Otherwise, the code point is split into two code units, four bytes in total, where cp is the code point:
    1. High surrogate: ((cp - 65536) / 1024) + 0xD800, in the range 0xD800~0xDBFF;
    2. Low surrogate: ((cp - 65536) % 1024) + 0xDC00, in the range 0xDC00~0xDFFF.

See the following examples:

  1. The Chinese character '好': '好'.codePointAt() // 22909. The code point is at most U+FFFF, so convert it directly to hexadecimal: 597D.
  2. The emoji '😄': '😄'.codePointAt() // 128516. The code point must be split:
    • High surrogate: Math.floor((128516 - 65536) / 1024) + 0xD800 // 55357, i.e. D83D
    • Low surrogate: ((128516 - 65536) % 1024) + 0xDC00 // 56836, i.e. DE04
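The split above can be written as a small helper (a minimal sketch of the rule; `toSurrogatePair` is a hypothetical name, not a standard API):

```javascript
// Split a supplementary-plane code point into its UTF-16 surrogate pair.
function toSurrogatePair(cp) {
  const offset = cp - 0x10000;
  const high = 0xD800 + (offset >> 10);  // top 10 bits -> 0xD800–0xDBFF
  const low = 0xDC00 + (offset & 0x3FF); // low 10 bits -> 0xDC00–0xDFFF
  return [high, low];
}

toSurrogatePair(128516).map(n => n.toString(16)) // ['d83d', 'de04']
```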

Verify with the String.fromCharCode method:

String.fromCharCode(0xD83D, 0xDE04)  // '😄'

One thing needs to be clear: strings in JavaScript are sequences of UTF-16 code units.

UTF-32 is a fixed-length encoding: every code point is encoded as four bytes. Its advantage is a one-to-one correspondence with Unicode code points; its disadvantage is that it wastes too much space.

Comparison

Below, a letter, a Chinese character, and an emoji are compared across the two encodings:

// UTF-8
'a': 97 - 0x61
'好': 22909 - (0xE5 0xA5 0xBD)
'😄': 128516 - (0xF0 0x9F 0x98 0x84)

// UTF-16
'a': 97 - 0x0061
'好': 22909 - 0x597D
'😄': 128516 - (0xD83D, 0xDE04)

As you can see, UTF-8 is variable-length at 1-4 bytes with 8-bit code units, while UTF-16 takes 2 or 4 bytes with 16-bit code units.
Keeping the UTF-16 code units in mind will help in understanding the questions below.

Coding in front-end development

As mentioned earlier, strings in JavaScript are UTF-16 encoded, so computing string lengths requires understanding UTF-16.
Let's look at the problems you may run into when processing strings.

String length calculation

A string's length property actually counts the number of UTF-16 code units:

  • ASCII characters and most Chinese characters are one code unit each
  • Emoji and other special characters are two code units

So when a character consists of 2 code units, its length is 2 even though it is a single character.

'a'.length // 1
'好'.length // 1, most Chinese characters are in the Basic Multilingual Plane: a single code unit, length 1
'😄'.length // 2
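To count characters (code points) rather than code units, the string iterator can be used, since it walks code points (`codePointLength` is a hypothetical helper name):

```javascript
// The spread operator iterates by code point, so the resulting
// array has one entry per character, not per UTF-16 code unit.
const codePointLength = str => [...str].length;

codePointLength('😄') // 1
codePointLength('a好😄') // 3
```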

Length of combined characters

There is a special case: combining characters, generally a base character carrying a diacritical mark, such as é.

'é'.length // 2 (the combining form: 'e' plus U+0301)
'e\u0301'.length // 2

// When reading the code point, the combining mark is ignored
// and the base letter's code point is returned
'e\u0301'.codePointAt() // 101
'e'.codePointAt() // 101

To operate on combining characters normally, use normalize():

'e\u0301'.normalize().length // 1

Operating on multi-code-unit characters

Indexing into a multi-code-unit character yields its individual code units:

'😄'[0] // '\uD83D'
'😄'[1] // '\uDE04'
'123'[0] // '1'

Looping with a plain for loop produces garbage, while for-of works correctly:

let smile = '😄'
for (let i = 0; i < smile.length; i++) {
  console.log(smile[i])
}
// �
// �

for (let tt of smile) {
  console.log(tt)
}
// 😄

The character can, however, be accessed by spreading the string into an array:

[...'😄'][0] // '😄'
Array.from('😄') // ['😄']

Code points work as well:

String.fromCodePoint('😄'.codePointAt()) // '😄'

For these special characters, the following string methods will split the code units:
split(), slice(), charAt(), charCodeAt(), substr(), substring().

'😄'.slice(0, 2) // '😄'
'😄'.slice(0, 1) // '\uD83D'
'😄'.slice(1, 2) // '\uDE04'
'😄'.substr(0, 1) // '\uD83D'
'😄'.substr(0, 2) // '😄'

'😄'.split('') // ['\uD83D', '\uDE04']
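Combining the spread technique above with Array.prototype.slice gives a slice that never cuts a surrogate pair in half (an illustrative helper, not a standard API; `codePointSlice` is a hypothetical name):

```javascript
// Slice by code point: spread to an array of characters,
// slice that array, then join back into a string.
function codePointSlice(str, start, end) {
  return [...str].slice(start, end).join('');
}

codePointSlice('a😄b', 0, 2) // 'a😄'
codePointSlice('a😄b', 1, 2) // '😄'
```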

The u modifier in regular expressions

ES6 added the u modifier to regular expressions, which correctly handles Unicode characters above \uFFFF — that is, characters that take four bytes in UTF-16.

/^\S$/.test('😄') // false
/^\S$/u.test('😄') // true

But for combining characters, the u modifier does not help:

/^\S$/u.test('é') // false (the combining form, two code points)
/^\S$/u.test('e\u0301') // false
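One way to treat a base letter plus its combining marks as a single "character" is the Unicode property escapes added in ES2018, which also require the u flag (a sketch; `singleGrapheme` is a hypothetical name):

```javascript
// \p{L} matches any letter, \p{M}* matches any run of combining
// marks, so the pattern accepts both composed and decomposed forms.
const singleGrapheme = /^\p{L}\p{M}*$/u;

singleGrapheme.test('e\u0301') // true (decomposed é)
singleGrapheme.test('\u00E9') // true (precomposed é)
```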

Escape characters

We also need to note how escape characters are measured: the result is based on the actual character they represent:

'\x3f'.length // 1
'?'.length // 1

Reads also behave normally:

'\x3f'[0] // '?'
'\x3f'.split('') // ['?']

Commonly used APIs

For Unicode handling, the front end provides some useful APIs that make these problems convenient to deal with in practice.

Working with code points and characters

  • charAt(index)
    Returns the specified character from a string. For a multi-code-unit character, it still returns individual code units:
'a'.charAt() // 'a'
'😄'.charAt() // '\uD83D'
'😄'.charAt(1) // '\uDE04'
  • charCodeAt(index)
    Returns an integer between 0 and 65535. For a character whose code point is greater than U+FFFF, it returns the first code-unit value; an index argument reads the following code units.
  • codePointAt(pos)
    Returns the Unicode code point, including the full code-point value of a multi-code-unit character. codePointAt also accepts an index; for a multi-code-unit character, index 1 returns the second code-unit value.
// At most U+FFFF
'好'.codePointAt() // 22909
'好'.charCodeAt() // 22909

// Greater than U+FFFF
'😄'.charCodeAt() // 55357
'😄'.charCodeAt(1) // 56836
'😄'.codePointAt() // 128516
'😄'.codePointAt(1) // 56836
  • String.fromCharCode(num1[, ...[, numN]])
    Returns a string created from the specified sequence of UTF-16 code units. Parameters range from 0 to 65535; anything larger is truncated, giving inaccurate results.
    For a multi-code-unit character, two code units combine into the character.
  • String.fromCodePoint(num1[, ...[, numN]])
    Returns a string created from the specified sequence of code points. It handles the full code-point values of multi-code-unit characters.
String.fromCharCode(55357, 56836, 123) // '😄{'
String.fromCodePoint(128516, 123, 8776) // '😄{≈'

TextEncoder

TextEncoder encodes a stream of code points into a stream of bytes using UTF-8; TextDecoder decodes.
The default encoding for both is UTF-8, which makes the pair a good fit for converting characters to and from UTF-8.

const txtEn = new TextEncoder()
const enVal = txtEn.encode('好')
// Uint8Array(3) [229, 165, 189]
const txtDe = new TextDecoder()
txtDe.decode(enVal) // '好'

IE does not support it.

String.prototype.normalize()

For accents and diacritics, Unicode provides two approaches: one is a precomposed accented character, such as é (code point 233); the other is a combining sequence, as mentioned above (base-letter code point 101 plus a combining mark).
The code points differ but the characters are essentially the same, and JavaScript cannot tell them apart:

'\u00E9' === 'e\u0301' // false

The normalize() method was introduced to solve exactly this problem: it unifies the different representations of a character into a single standard form:

'\u00E9' === 'e\u0301'.normalize() // true
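A comparison helper built on normalize() makes such strings compare equal regardless of form (a simple sketch; `sameText` is a hypothetical helper name):

```javascript
// Normalize both sides to NFC (precomposed form) before comparing,
// so composed and decomposed representations are treated as equal.
const sameText = (a, b) => a.normalize('NFC') === b.normalize('NFC');

sameText('\u00E9', 'e\u0301') // true
'\u00E9' === 'e\u0301' // false
```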

UTF-8 encoding and decoding in URLs

URL-encoded links are also very common in the web pages front-end developers work with, for example: 'http%3A%2F%2Fbaidu.com%2F%E4%B8%AD%E5%9B%BD'. What is involved here is UTF-8 encoding.
JavaScript provides four URL encoding/decoding methods that convert non-ASCII characters — Chinese characters, special characters, emoji, and so on — to and from UTF-8:

  • encodeURI() and encodeURIComponent()
  • decodeURI() and decodeURIComponent()

Their limitation is equally obvious: ASCII characters such as English letters and digits are passed through unchanged.
The conversion works like this: take the character's UTF-8 bytes, then prefix each byte with % to produce the encoded result.

encodeURI('好') // '%E5%A5%BD'
decodeURI('%E5%A5%BD') // '好'
encodeURIComponent('好') // '%E5%A5%BD'
decodeURIComponent('%E5%A5%BD') // '好'
encodeURI('hello') // 'hello'
encodeURIComponent('hello') // 'hello'
encodeURIComponent('😄') // '%F0%9F%98%84'
The difference between encodeURI and encodeURIComponent

The difference between the two lies in how they handle URL metacharacters.

URL metacharacters: semicolon (;), comma (,), slash (/), question mark (?), colon (:), at (@), ampersand (&), equals sign (=), plus (+), dollar sign ($), hash (#).

encodeURIComponent encodes these URL metacharacters, but encodeURI does not:

encodeURIComponent(';,/@&=') // '%3B%2C%2F%40%26%3D'
encodeURI(';,/@&=') // ';,/@&='
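This difference is why encodeURIComponent is the right choice when assembling query parameters: values containing metacharacters like / or & must be escaped so they don't break the URL structure (a sketch; `buildQuery` is a hypothetical helper name):

```javascript
// Encode each key and value individually with encodeURIComponent,
// then join the pairs with the literal & and = separators.
function buildQuery(params) {
  return Object.entries(params)
    .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
    .join('&');
}

buildQuery({ q: '好', path: 'a/b' }) // 'q=%E5%A5%BD&path=a%2Fb'
```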

Copyright notice
Author: jimojianghu. Please include the original link when reprinting, thank you.
https://en.qdmana.com/2022/119/202204291108291273.html
