Apple's Natural Language Framework basics - Status of Tibetan support
Mar 11 2019
The Natural Language framework, introduced by Apple in 2018, gives developers tools to process and analyse text. It provides the following capabilities:
- Identification of language and scripts
- Tokenization
- Tagging
- Lemmatisation
- Named entity recognition
I'm interested in Tibetan text processing, so let's see how we can use this framework for Tibetan.
Language and script identification
Identifying language and script is a simple matter of providing some text and asking the Natural Language framework (NL from now on) to identify the language. We can do this in two different ways: one is straightforward, using NLLanguageRecognizer; the second uses NLTagger with the language tagScheme. The following code uses NLLanguageRecognizer:
import Foundation
import NaturalLanguage

let string = """
ང་དེ་ལ་སྒྲོ་ཡ་ཡིང།
"""

let languageRecognizer = NLLanguageRecognizer()
languageRecognizer.processString(string)
if let languageCode = languageRecognizer.dominantLanguage?.rawValue {
    print("String's language: \(languageCode)") // String's language: bo
}
I'll explain NLTagger later, but for the sake of completeness, here is the same check using NLTagger:
import Foundation
import NaturalLanguage

let string = """
ང་དེ་ལ་སྒྲོ་ཡ་ཡིང།
"""

let tagger = NLTagger(tagSchemes: [.language])
tagger.string = string
if case let (tag?, _) = tagger.tag(at: string.startIndex, unit: .document, scheme: .language) {
    print("String's language: \(tag.rawValue)") // String's language: bo
}
Tokenization
An essential step in Natural Language Processing (NLP, from now on) is breaking a character sequence into chunks called tokens for further tagging and processing. Tokens can be words, sentences, paragraphs or any arbitrary sub-sequence of characters.
Look at the following example:
let string = """
Hello, my name is Derik. I live in 123 Guatemala
"""
let tokenizer = NLTokenizer(unit: .sentence)
tokenizer.string = string
let stringRange = string.startIndex ..< string.endIndex
tokenizer.enumerateTokens(in: stringRange) { (range, _) -> Bool in
print("Look: \(string[range])")
return true
}
//Look: Hello, my name is Derik.
//Look: I live in 123 Guatemala
Alternatively, we could split by words and check whether each word is numeric:
let string = """
Hello, my name is Derik. I live in 123 Guatemala
"""
let tokenizer = NLTokenizer(unit: .word)
tokenizer.string = string
let stringRange = string.startIndex ..< string.endIndex
tokenizer.enumerateTokens(in: stringRange) { (range, attributes) -> Bool in
if attributes.contains(.numeric) {
print("Contains this number: \(string[range])")
}
return true
}
//Contains this number: 123
This doesn't work for Tibetan. I'm not sure what the current implementation is, but it shouldn't be hard to detect numeric words by checking the Unicode range of each character. Breaking text into sentences is more complicated: how do we define a sentence in Tibetan? Maybe from shad to shad (the ། mark), but we would have to consult a linguist.
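As a rough illustration of the Unicode-range idea, here is a sketch of a helper that flags tokens made entirely of Tibetan digits (U+0F20 to U+0F29). The function isTibetanNumeric is my own invention, not part of the framework:

import Foundation

// Returns true when every scalar in the token is a Tibetan digit (༠-༩).
func isTibetanNumeric(_ token: Substring) -> Bool {
    let tibetanDigits: ClosedRange<Unicode.Scalar> = "\u{0F20}" ... "\u{0F29}"
    return !token.isEmpty && token.unicodeScalars.allSatisfy { tibetanDigits.contains($0) }
}

print(isTibetanNumeric("༡༢༣")) // true: Tibetan digits for 1, 2, 3
print(isTibetanNumeric("ང་"))  // false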
Word tokenization seems to work for Tibetan, but a more exhaustive examination is required. A quick check is shown below.
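Here is that quick sanity check: running NLTokenizer with the word unit over the sample string. The exact token boundaries may differ between OS versions:

import Foundation
import NaturalLanguage

let tibetan = "ང་དེ་ལ་སྒྲོ་ཡ་ཡིང།"
let wordTokenizer = NLTokenizer(unit: .word)
wordTokenizer.string = tibetan
wordTokenizer.enumerateTokens(in: tibetan.startIndex ..< tibetan.endIndex) { (range, _) -> Bool in
    print(tibetan[range]) // prints each word token on its own line
    return true
}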
Tagging
Once we've extracted tokens, we can add information to them; this process is called tagging. We can tag tokens by lexical class (noun, verb, etc.), syntactically (word, punctuation, etc.), or by language characteristics (language and script).
The following is an example of tagging words by lexical class in English:
let string = """
Hello my name is Derik
It is a rainy day
"""
let tagger = NLTagger(tagSchemes: [ .lexicalClass])
tagger.string = string
let options: NLTagger.Options = [.omitWhitespace,
.omitPunctuation]
let strRange = string.startIndex ..< string.endIndex
tagger.enumerateTags(in: strRange,
unit: .word,
scheme: .lexicalClass,
options: options) { (tag, tagRange) in
if let lexicalClass = tag?.rawValue {
print("\(string[tagRange]): \(lexicalClass)")
}
return true
}
//Hello: Interjection
//my: Determiner
//name: Noun
//is: Verb
//Derik: Noun
//It: Pronoun
//is: Verb
//a: Determiner
//rainy: Adjective
//day: Noun
Status for Tibetan: not working by default, but we can train an MLWordTagger with Create ML. Check the 2018 WWDC video (start at minute 14:30, demo at 21:08) for an example, or the sketch below.
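Here is a rough sketch of what that training could look like with Create ML in a macOS playground. The file paths, column names, and the training data itself are hypothetical; the real work would be building a tagged Tibetan corpus:

import CreateML
import Foundation

// Training data: JSON rows pairing token sequences with their labels, e.g.
// [{"tokens": ["ང", "དེ", "ལ"], "labels": ["PRON", "DET", "ADP"]}]
let data = try MLDataTable(contentsOf: URL(fileURLWithPath: "/path/to/tibetan-pos.json"))

let wordTagger = try MLWordTagger(trainingData: data,
                                  tokenColumn: "tokens",
                                  labelColumn: "labels")

// Save the trained model so it can be loaded through NLModel at runtime.
try wordTagger.write(to: URL(fileURLWithPath: "/path/to/TibetanPOS.mlmodel"))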
Lemmatisation
Lemmatisation is the process of obtaining the base form of a word. For example, am, are, and is all have the base form be.
Let's look at an example:
let string = """
am is are
"""
let tagger = NLTagger(tagSchemes: [.lemma])
tagger.string = string
let options: NLTagger.Options = [.omitWhitespace,
.omitPunctuation]
let strRange = string.startIndex ..< string.endIndex
tagger.enumerateTags(in: strRange,
unit: .word,
scheme: .lemma,
options: options) { (tag, tagRange) in
if let lemma = tag?.rawValue {
print("\(string[tagRange]): \(lemma)")
}
return true
}
//am: be
//is: be
//are: be
Status for Tibetan: not working; support for Tibetan lemmatisation would have to be added by Apple.
Named entity recognition
We could add places and people to a custom Core ML model using the same approach we used for the word tagger; again, check the 2018 WWDC video (start at minute 14:30, demo at 21:08) for an example. That way we could identify names and entities with the NL framework. It would have to be a custom model: there is no built-in entity recognition for Tibetan.
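For completeness, here is a sketch of how a custom model could be wired into NLTagger once trained. The scheme name, model path, and any tags it would produce are all hypothetical:

import Foundation
import NaturalLanguage

let scheme = NLTagScheme(rawValue: "TibetanEntities") // hypothetical custom scheme
let model = try NLModel(contentsOf: URL(fileURLWithPath: "/path/to/TibetanEntities.mlmodelc"))

let text = "ང་དེ་ལ་སྒྲོ་ཡ་ཡིང།"
let entityTagger = NLTagger(tagSchemes: [scheme])
entityTagger.setModels([model], forTagScheme: scheme)
entityTagger.string = text
entityTagger.enumerateTags(in: text.startIndex ..< text.endIndex,
                           unit: .word,
                           scheme: scheme,
                           options: [.omitWhitespace, .omitPunctuation]) { (tag, range) -> Bool in
    if let tag = tag {
        print("\(text[range]): \(tag.rawValue)") // token: custom entity label
    }
    return true
}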
Conclusions
The Natural Language framework has a broad range of tools that are useful for Natural Language Processing. If you are using a language like English or Spanish, you get more data out of the box. If you are using a different language, you get the basics, but you will have to work with custom models. It is still much more straightforward than building an NLP model on your own.
Apple has done a great job with Swift; its text processing capabilities are remarkably good, and the Natural Language framework makes it exciting to work with. I'm looking forward to seeing how it evolves, and specifically to seeing its support for Tibetan grow.
I'll try to keep this post updated as new features or news come my way. If you have any questions or comments, don't hesitate to send me a message; I'll be glad to talk about this topic.