UTF16

struct UTF16

A codec for translating between Unicode scalar values and UTF-16 code units.

Inheritance UnicodeCodec View Protocol Hierarchy →
Associated Types
CodeUnit = UInt16

A type that can hold code unit values for this encoding.

Import import Swift

Initializers

init()

Creates an instance of the UTF-16 codec.

Declaration

init()

Static Methods

static func encode(_:into:)

Encodes a Unicode scalar as a series of code units by calling the given closure on each code unit.

For example, the musical fermata symbol ("𝄐") is a single Unicode scalar value (\u{1D110}) but requires two code units for its UTF-16 representation. The following code encodes a fermata in UTF-16:

var codeUnits: [UTF16.CodeUnit] = []
UTF16.encode("𝄐", into: { codeUnits.append($0) })
print(codeUnits)
// Prints "[55348, 56592]"

Parameters: input: The Unicode scalar value to encode. processCodeUnit: A closure that processes one code unit argument at a time.

Declaration

static func encode(_ input: UnicodeScalar, into processCodeUnit: (UTF16.CodeUnit) -> Swift.Void)
static func isLeadSurrogate(_:)

Returns a Boolean value indicating whether the specified code unit is a high-surrogate code unit.

Here's an example of checking whether each code unit in a string's utf16 view is a lead surrogate. The apple string contains a single emoji character made up of a surrogate pair when encoded in UTF-16.

let apple = "🍎"
for unit in apple.utf16 {
    print(UTF16.isLeadSurrogate(unit))
}
// Prints "true"
// Prints "false"

This method does not validate the encoding of a UTF-16 sequence beyond the specified code unit. Specifically, it does not validate that a low-surrogate code unit follows x.

x: A UTF-16 code unit. Returns: true if x is a high-surrogate code unit; otherwise, false.

See Also: UTF16.width(_:), UTF16.leadSurrogate(_:)

Declaration

static func isLeadSurrogate(_ x: UTF16.CodeUnit) -> Bool
static func isTrailSurrogate(_:)

Returns a Boolean value indicating whether the specified code unit is a low-surrogate code unit.

Here's an example of checking whether each code unit in a string's utf16 view is a trailing surrogate. The apple string contains a single emoji character made up of a surrogate pair when encoded in UTF-16.

let apple = "🍎"
for unit in apple.utf16 {
    print(UTF16.isTrailSurrogate(unit))
}
// Prints "false"
// Prints "true"

This method does not validate the encoding of a UTF-16 sequence beyond the specified code unit. Specifically, it does not validate that a high-surrogate code unit precedes x.

x: A UTF-16 code unit. Returns: true if x is a low-surrogate code unit; otherwise, false.

See Also: UTF16.width(_:), UTF16.leadSurrogate(_:)

Declaration

static func isTrailSurrogate(_ x: UTF16.CodeUnit) -> Bool
static func leadSurrogate(_:)

Returns the high-surrogate code unit of the surrogate pair representing the specified Unicode scalar.

Because a Unicode scalar value can require up to 21 bits to store its value, some Unicode scalars are represented in UTF-16 by a pair of 16-bit code units. The first and second code units of the pair, designated leading and trailing surrogates, make up a surrogate pair.

let apple: UnicodeScalar = "🍎"
print(UTF16.leadSurrogate(apple)
// Prints "55356"

x: A Unicode scalar value. x must be represented by a surrogate pair when encoded in UTF-16. To check whether x is represented by a surrogate pair, use UTF16.width(x) == 2. Returns: The leading surrogate code unit of x when encoded in UTF-16.

See Also: UTF16.width(_:), UTF16.trailSurrogate(_:)

Declaration

static func leadSurrogate(_ x: UnicodeScalar) -> UTF16.CodeUnit
static func trailSurrogate(_:)

Returns the low-surrogate code unit of the surrogate pair representing the specified Unicode scalar.

Because a Unicode scalar value can require up to 21 bits to store its value, some Unicode scalars are represented in UTF-16 by a pair of 16-bit code units. The first and second code units of the pair, designated leading and trailing surrogates, make up a surrogate pair.

let apple: UnicodeScalar = "🍎"
print(UTF16.trailSurrogate(apple)
// Prints "57166"

x: A Unicode scalar value. x must be represented by a surrogate pair when encoded in UTF-16. To check whether x is represented by a surrogate pair, use UTF16.width(x) == 2. Returns: The trailing surrogate code unit of x when encoded in UTF-16.

See Also: UTF16.width(_:), UTF16.leadSurrogate(_:)

Declaration

static func trailSurrogate(_ x: UnicodeScalar) -> UTF16.CodeUnit
static func transcodedLength(of:decodedAs:repairingIllFormedSequences:)

Returns the number of UTF-16 code units required for the given code unit sequence when transcoded to UTF-16, and a Boolean value indicating whether the sequence was found to contain only ASCII characters.

The following example finds the length of the UTF-16 encoding of the string "Fermata 𝄐", starting with its UTF-8 representation.

let fermata = "Fermata 𝄐"
let bytes = fermata.utf8
print(Array(bytes))
// Prints "[70, 101, 114, 109, 97, 116, 97, 32, 240, 157, 132, 144]"

let result = transcodedLength(of: bytes.makeIterator(),
                              decodedAs: UTF8.self,
                              repairingIllFormedSequences: false)
print(result)
// Prints "Optional((10, false))"

Parameters: input: An iterator of code units to be translated, encoded as sourceEncoding. If repairingIllFormedSequences is true, the entire iterator will be exhausted. Otherwise, iteration will stop if an ill-formed sequence is detected. sourceEncoding: The Unicode encoding of input. repairingIllFormedSequences: Pass true to measure the length of input even when input contains ill-formed sequences. Each ill-formed sequence is replaced with a Unicode replacement character ("\u{FFFD}") and is measured as such. Pass false to immediately stop measuring input when an ill-formed sequence is encountered. Returns: A tuple containing the number of UTF-16 code units required to encode input and a Boolean value that indicates whether the input contained only ASCII characters. If repairingIllFormedSequences is false and an ill-formed sequence is detected, this method returns nil.

Declaration

static func transcodedLength<Input, Encoding where Input : IteratorProtocol, Encoding : UnicodeCodec, Encoding.CodeUnit == Input.Element>(of input: Input, decodedAs sourceEncoding: Encoding.Type, repairingIllFormedSequences: Bool) -> (count: Int, isASCII: Bool)?
static func width(_:)

Returns the number of code units required to encode the given Unicode scalar.

Because a Unicode scalar value can require up to 21 bits to store its value, some Unicode scalars are represented in UTF-16 by a pair of 16-bit code units. The first and second code units of the pair, designated leading and trailing surrogates, make up a surrogate pair.

let anA: UnicodeScalar = "A"
print(anA.value)
// Prints "65"
print(UTF16.width(anA))
// Prints "1"

let anApple: UnicodeScalar = "🍎"
print(anApple.value)
// Prints "127822"
print(UTF16.width(anApple))
// Prints "2"

x: A Unicode scalar value. Returns: The width of x when encoded in UTF-16, either 1 or 2.

Declaration

static func width(_ x: UnicodeScalar) -> Int

Instance Methods

mutating func decode(_:)

Starts or continues decoding a UTF-16 sequence.

To decode a code unit sequence completely, call this method repeatedly until it returns UnicodeDecodingResult.emptyInput. Checking that the iterator was exhausted is not sufficient, because the decoder can store buffered data from the input iterator.

Because of buffering, it is impossible to find the corresponding position in the iterator for a given returned UnicodeScalar or an error.

The following example decodes the UTF-16 encoded bytes of a string into an array of UnicodeScalar instances. This is a demonstration only---if you need the Unicode scalar representation of a string, use its unicodeScalars view.

let str = "✨Unicode✨"
print(Array(str.utf16))
// Prints "[10024, 85, 110, 105, 99, 111, 100, 101, 10024]"

var codeUnitIterator = str.utf16.makeIterator()
var scalars: [UnicodeScalar] = []
var utf16Decoder = UTF16()
Decode: while true {
    switch utf16Decoder.decode(&codeUnitIterator) {
    case .scalarValue(let v): scalars.append(v)
    case .emptyInput: break Decode
    case .error:
        print("Decoding error")
        break Decode
    }
}
print(scalars)
// Prints "["\u{2728}", "U", "n", "i", "c", "o", "d", "e", "\u{2728}"]"

input: An iterator of code units to be decoded. input must be the same iterator instance in repeated calls to this method. Do not advance the iterator or any copies of the iterator outside this method. Returns: A UnicodeDecodingResult instance, representing the next Unicode scalar, an indication of an error, or an indication that the UTF sequence has been fully decoded.

Declaration

mutating func decode<I where I : IteratorProtocol, I.Element == UTF16.CodeUnit>(_ input: inout I) -> UnicodeDecodingResult