Using ICU transforms in Swift
Feb 28 2019
The International Components for Unicode (ICU) provides powerful libraries for working with Unicode. In this post, I'll explain the basics of how to use ICU transforms in Swift. There are many areas to explore in ICU and Swift, but we will focus on using the transforms defined in the ICU User Guide. There is currently no support for arbitrary (custom) ICU transform rules, but the basics are handy.
In a future post I will discuss the use of Unicode properties. Swift 5 will add Unicode properties to Unicode.Scalar, which will give us more ways to find out what our Strings represent and to write software that better handles different scripts and languages. But for now, let's focus on ICU transforms.
ICU transforms
In its most basic sense, ICU transforms provide the ability to process Unicode text: changing case (uppercase, lowercase, title case, etc.), converting between scripts (for example, changing Latin characters to Greek characters with Latin-Greek;), and so on. These transforms are expressed by rules defined in the ICU User Guide.
ICU Examples
Case Examples:
Change all characters to lower case:
Lower; # Would change "Hello, World!" to "hello, world!"
Change all characters to upper case:
Upper; # Would change "Hello, World!" to "HELLO, WORLD!"
Change all characters to title case:
Title; # Would change "HELLO, world!" to "Hello, World!"
Normalization Examples:
Characters that appear on the screen might be constructed in different ways; for example, the letter á can be constructed in the following two ways:
á U+00E1 \N{LATIN SMALL LETTER A WITH ACUTE}
á U+0061 U+0301 \N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}
They are canonically equivalent. To compare them, we need to use normalisation functions. We can decompose á (U+00E1) with the function NFD to obtain U+0061 U+0301 and then compare it to á (U+0061 U+0301). In the same manner, we could compose á (U+0061 U+0301) with the function NFC to obtain á (U+00E1) and then compare them.
This is how the ICU transforms will look:
NFD; # D for Decompose
NFC; # C for Compose
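To see the two forms concretely, here is a small Swift sketch. Note that it does not use ICU transform IDs: it relies on the Foundation properties decomposedStringWithCanonicalMapping (NFD) and precomposedStringWithCanonicalMapping (NFC), which I am using here only as a convenient way to inspect the result. The variable names are made up for illustration.

```swift
import Foundation

let precomposed = "\u{00E1}"        // á as a single scalar
let decomposed = "\u{0061}\u{0301}" // a followed by a combining acute accent

// The two strings are built from a different number of Unicode scalars...
print(precomposed.unicodeScalars.count) // 1
print(decomposed.unicodeScalars.count)  // 2

// ...but Swift's String comparison treats canonically equivalent strings as equal.
print(precomposed == decomposed) // true

// NFD: decompose the single scalar into base letter + combining mark.
let nfd = precomposed.decomposedStringWithCanonicalMapping
print(nfd.unicodeScalars.map { String($0.value, radix: 16) }) // ["61", "301"]

// NFC: compose base letter + combining mark back into one scalar.
let nfc = decomposed.precomposedStringWithCanonicalMapping
print(nfc.unicodeScalars.map { String($0.value, radix: 16) }) // ["e1"]
```

The difference only shows up at the scalar level, which is exactly the level that normalisation works on.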
Ok, after this brief introduction, how do we interact with ICU in Swift? Let's talk about it now.
Swift and ICU
In Swift, String (with Foundation imported) has access to ICU transforms through the method applyingTransform, which has the following signature:
func applyingTransform(_ transform: StringTransform, reverse: Bool) -> String?
The transform parameter is a StringTransform, a struct that represents ICU transforms (transliterations, stripDiacritics, etc.). To see the full list, check Apple's documentation for StringTransform.
For example, if we would like to remove the accents from a string we could do something like this:
import Foundation
let tree = "árbol"
// applyingTransform returns a String?, so the result is optional
let strippedTree = tree.applyingTransform(.stripDiacritics, reverse: false) // arbol
or we could use transliterations, like this:
let hello = "你好"
let transliteration = hello.applyingTransform(.mandarinToLatin, reverse: false) //nǐ hǎo
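Because applyingTransform returns an optional String, chaining two transforms means unwrapping between steps. Here is a minimal sketch of one way to do that. The helper applying(_:to:) is hypothetical (it is not part of Foundation); it only uses the two StringTransform constants shown above.

```swift
import Foundation

// Hypothetical helper: applies each transform in order and
// returns nil as soon as one of them cannot be applied.
func applying(_ transforms: [StringTransform], to text: String) -> String? {
    var result = text
    for transform in transforms {
        guard let next = result.applyingTransform(transform, reverse: false) else {
            return nil
        }
        result = next
    }
    return result
}

// Transliterate to Latin first, then strip the tone marks.
if let plain = applying([.mandarinToLatin, .stripDiacritics], to: "你好") {
    print(plain) // ni hao
}
```

The order matters here: stripping diacritics before transliterating would have nothing to strip.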
That in itself is quite powerful, but not all of the basic ICU transforms are exposed as StringTransform constants: there is no constant for decomposing or changing case, as we saw in the previous examples. How do we access them? We build a StringTransform from the ICU transform ID and use NSMutableString and its applyTransform method. Notice the difference in the names: this one is applyTransform, while previously we used applyingTransform. This is the signature of the method:
func applyTransform(_ transform: StringTransform, reverse: Bool, range: NSRange, updatedRange resultingRange: NSRangePointer?) -> Bool
With that in mind we could do the following:
let greeting = NSMutableString(string: "HELLO, WORLD!")
var range = NSRange(location: 0, length: greeting.length )
let titleCase = StringTransform(rawValue: "Title;")
greeting.applyTransform(titleCase, reverse: false, range: range, updatedRange: &range)
print(greeting) //Hello, World!
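If you find yourself repeating the NSMutableString setup, it can be wrapped once. The following is only a sketch: applyingICUTransform(_:) is a hypothetical name of my choosing, not an existing API. It passes nil for updatedRange (the signature above allows it) and returns nil when applyTransform reports failure.

```swift
import Foundation

extension String {
    // Hypothetical convenience: applies an ICU transform ID such as "Title;".
    func applyingICUTransform(_ id: String) -> String? {
        let buffer = NSMutableString(string: self)
        let wholeString = NSRange(location: 0, length: buffer.length)
        let applied = buffer.applyTransform(StringTransform(rawValue: id),
                                            reverse: false,
                                            range: wholeString,
                                            updatedRange: nil)
        // applyTransform returns false when the transform could not be applied.
        return applied ? buffer as String : nil
    }
}

print("HELLO, WORLD!".applyingICUTransform("Title;") ?? "transform failed") // Hello, World!
print("Hello, World!".applyingICUTransform("Upper;") ?? "transform failed") // HELLO, WORLD!
```

Returning an optional keeps the call site similar to applyingTransform on String.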
That looks like a lot of work just to title-case a string, and it is, but the real power comes from the ability to combine rules:
let question = NSMutableString(string: "TE GUSTÓ?")
var questionRange = NSRange(location: 0, length: question.length)
// Lowercase, decompose (NFD), then remove all marks ([:M:]), which strips the accents
let lowerStripDiacritics = StringTransform(rawValue: "Lower; NFD; [:M:] Remove;")
question.applyTransform(lowerStripDiacritics, reverse: false, range: questionRange, updatedRange: &questionRange)
print(question) // te gusto?
That is useful: we changed the case to lowercase and applied some additional transforms. First we decomposed the characters, then we removed the diacritics, and with that we changed the meaning of the question from "Did you like it?" to "Do you like me?".
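Losing the accent changes the meaning for a reader, but that same folding is handy when you want to match text regardless of case and accents. Below is a minimal sketch of that idea. The function searchKey(_:) is hypothetical, and the trailing "NFC;" is my addition to the rule chain, there to recompose whatever is left after the marks are removed.

```swift
import Foundation

// Hypothetical helper: builds a case- and accent-insensitive key for matching.
func searchKey(_ text: String) -> String {
    let buffer = NSMutableString(string: text)
    let wholeString = NSRange(location: 0, length: buffer.length)
    let fold = StringTransform(rawValue: "Lower; NFD; [:M:] Remove; NFC;")
    // If the transform cannot be applied, fall back to the original text.
    guard buffer.applyTransform(fold, reverse: false, range: wholeString, updatedRange: nil) else {
        return text
    }
    return buffer as String
}

print(searchKey("TE GUSTÓ?"))                           // te gusto?
print(searchKey("TE GUSTÓ?") == searchKey("te gusto?")) // true
```

Keep the original string around for display and use the key only for comparison, since the key no longer says the same thing as the original.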
We can also get more information about the characters themselves, such as their Unicode names:
let string = NSMutableString(string: "á")
var stringRange = NSRange(location: 0, length: string.length)
let nameTransform = StringTransform(rawValue: "Any-Name")
string.applyTransform(nameTransform, reverse: false, range: stringRange, updatedRange: &stringRange)
print(string) // \N{LATIN SMALL LETTER A WITH ACUTE}
let string2 = NSMutableString(string: "á")
var stringRange2 = NSRange(location: 0, length: string2.length)
let nameTransform2 = StringTransform(rawValue: "NFD; Any-Name;")
string2.applyTransform(nameTransform2, reverse: false, range: stringRange2, updatedRange: &stringRange2)
print(string2) // \N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}
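As far as I know, Foundation also ships a constant for this particular transform, StringTransform.toUnicodeName, and it is documented as reversible. The sketch below assumes that constant maps to "Any-Name" and uses the reverse flag to turn the name back into the character; the variable names are made up for illustration.

```swift
import Foundation

let accented = "\u{00E1}" // á, precomposed

// Character to Unicode name.
let unicodeName = accented.applyingTransform(.toUnicodeName, reverse: false)
print(unicodeName ?? "transform failed") // \N{LATIN SMALL LETTER A WITH ACUTE}

// Unicode name back to character, using the same transform in reverse.
let roundTrip = unicodeName?.applyingTransform(.toUnicodeName, reverse: true)
print(roundTrip == accented) // true
```

This is the first place in the post where the reverse parameter is set to true; for transforms that are not reversible it has no meaningful effect.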
I think you can see how useful it is to have access to the ICU base transforms via NSMutableString and applyTransform. I hope this was helpful. Let me know if you have any questions or comments. You can find the playground with all the examples on my GitHub: https://github.com/rderik/rderikUnicodeICUPlayground