Apple's Natural Language Framework basics - Status of Tibetan support
Mar 11 2019
The Natural Language framework, introduced by Apple in 2018, gives developers tools to process and analyse text. It provides the following capabilities:
- Identification of language and scripts
- Tokenization
- Tagging
- Lemmatisation
- Named entity recognition
I'm interested in Tibetan text processing, so let's see how we can use this framework for Tibetan.
Language and script identification
Identifying language and script is a simple matter of providing some text and asking the Natural Language framework (NL from now on) to identify the language. We can do this in two different ways: one is straightforward, using NLLanguageRecognizer; the second uses NLTagger with the language tagScheme. The following code uses NLLanguageRecognizer:
import Foundation
import NaturalLanguage

let string = """
ང་དེ་ལ་སྒྲོ་ཡ་ཡིང།
"""

let languageRecognizer = NLLanguageRecognizer()
languageRecognizer.processString(string)
if let languageCode = languageRecognizer.dominantLanguage?.rawValue {
    print("String's language: \(languageCode)") // String's language: bo
}
I'll explain NLTagger later, but for the sake of completeness, here is the same check using NLTagger:
import Foundation
import NaturalLanguage

let string = """
ང་དེ་ལ་སྒྲོ་ཡ་ཡིང།
"""

let tagger = NLTagger(tagSchemes: [.language])
tagger.string = string
if case let (tag?, _) = tagger.tag(at: string.startIndex, unit: .document, scheme: .language) {
    print("String's language: \(tag.rawValue)") // String's language: bo
}
Tokenization
An essential step in Natural Language Processing (NLP, from now on) is breaking a character sequence into chunks called tokens for further tagging and processing. Tokens can be words, sentences, paragraphs or any arbitrary sub-sequence of characters.
Look at the following example:
let string = """
Hello, my name is Derik. I live in 123 Guatemala
"""
let tokenizer = NLTokenizer(unit: .sentence)
tokenizer.string = string
let stringRange = string.startIndex ..< string.endIndex
tokenizer.enumerateTokens(in: stringRange) { (range, _) -> Bool in
print("Look: \(string[range])")
return true
}
//Look: Hello, my name is Derik.
//Look: I live in 123 Guatemala
Alternatively, we could split by words and check whether each word is numeric:
let string = """
Hello, my name is Derik. I live in 123 Guatemala
"""
let tokenizer = NLTokenizer(unit: .word)
tokenizer.string = string
let stringRange = string.startIndex ..< string.endIndex
tokenizer.enumerateTokens(in: stringRange) { (range, attributes) -> Bool in
if attributes.contains(.numeric) {
print("Contains this number: \(string[range])")
}
return true
}
//Contains this number: 123
This doesn't work for Tibetan. I'm not sure what the current implementation is, but it shouldn't be hard to detect numeric words by checking the Unicode range of each character. Breaking text into sentences is more complicated: how do we define a sentence in Tibetan? Maybe from shad to shad (the ། mark), but we would have to consult a linguist.
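As a rough illustration of the Unicode-range idea, here is a sketch of a helper that flags tokens made entirely of Tibetan digits (U+0F20 to U+0F29). The function isTibetanNumeric is my own invention, not part of the framework:

import Foundation

// Returns true when every scalar in the token is a Tibetan digit (༠-༩).
func isTibetanNumeric(_ token: Substring) -> Bool {
    let tibetanDigits: ClosedRange<Unicode.Scalar> = "\u{0F20}" ... "\u{0F29}"
    return !token.isEmpty && token.unicodeScalars.allSatisfy { tibetanDigits.contains($0) }
}

print(isTibetanNumeric("༡༢༣")) // true: Tibetan digits for 1, 2, 3
print(isTibetanNumeric("ང་"))  // false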
Word tokenization seems to work for Tibetan, but a more exhaustive examination is required. A quick check is shown below.
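Here is that quick sanity check: running NLTokenizer with the word unit over the sample string. The exact token boundaries may differ between OS versions:

import Foundation
import NaturalLanguage

let tibetan = "ང་དེ་ལ་སྒྲོ་ཡ་ཡིང།"
let wordTokenizer = NLTokenizer(unit: .word)
wordTokenizer.string = tibetan
wordTokenizer.enumerateTokens(in: tibetan.startIndex ..< tibetan.endIndex) { (range, _) -> Bool in
    print(tibetan[range]) // prints each word token on its own line
    return true
}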
Tagging
Once we've extracted tokens, we can add information to them; this process is called tagging. We can tag tokens by lexical class (noun, verb, etc.), syntactically (word, punctuation, etc.), or by language characteristics (language and script).
The following is an example of tagging words by lexical class in English:
let string = """
Hello my name is Derik
It is a rainy day
"""
let tagger = NLTagger(tagSchemes: [ .lexicalClass])
tagger.string = string
let options: NLTagger.Options = [.omitWhitespace,
.omitPunctuation]
let strRange = string.startIndex ..< string.endIndex
tagger.enumerateTags(in: strRange,
unit: .word,
scheme: .lexicalClass,
options: options) { (tag, tagRange) in
if let lexicalClass = tag?.rawValue {
print("\(string[tagRange]): \(lexicalClass)")
}
return true
}
//Hello: Interjection
//my: Determiner
//name: Noun
//is: Verb
//Derik: Noun
//It: Pronoun
//is: Verb
//a: Determiner
//rainy: Adjective
//day: Noun
Status for Tibetan: not working by default, but we can train an MLWordTagger with Create ML. Check the 2018 WWDC video (start at minute 14:30, demo at 21:08) for an example, or the sketch below.
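Here is a rough sketch of what that training could look like with Create ML in a macOS playground. The file paths, column names, and the training data itself are hypothetical; the real work would be building a tagged Tibetan corpus:

import CreateML
import Foundation

// Training data: JSON rows pairing token sequences with their labels, e.g.
// [{"tokens": ["ང", "དེ", "ལ"], "labels": ["PRON", "DET", "ADP"]}]
let data = try MLDataTable(contentsOf: URL(fileURLWithPath: "/path/to/tibetan-pos.json"))

let wordTagger = try MLWordTagger(trainingData: data,
                                  tokenColumn: "tokens",
                                  labelColumn: "labels")

// Save the trained model so it can be loaded through NLModel at runtime.
try wordTagger.write(to: URL(fileURLWithPath: "/path/to/TibetanPOS.mlmodel"))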
Lemmatisation
Lemmatisation is the process of obtaining the base form of a word. For example, am, are, and is all have the base form be.
Let's look at an example:
let string = """
am is are
"""
let tagger = NLTagger(tagSchemes: [.lemma])
tagger.string = string
let options: NLTagger.Options = [.omitWhitespace,
.omitPunctuation]
let strRange = string.startIndex ..< string.endIndex
tagger.enumerateTags(in: strRange,
unit: .word,
scheme: .lemma,
options: options) { (tag, tagRange) in
if let lemma = tag?.rawValue {
print("\(string[tagRange]): \(lemma)")
}
return true
}
//am: be
//is: be
//are: be
Status for Tibetan: not working; support for Tibetan lemmatisation would have to be added by Apple.
Named entity recognition
We could add places and people to a custom Core ML model using the same approach we used for the word tagger; again, check the 2018 WWDC video (start at minute 14:30, demo at 21:08) for an example. That way we could identify names and entities with the NL framework. It would have to be a custom model: there is no built-in entity recognition for Tibetan.
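For completeness, here is a sketch of how a custom model could be wired into NLTagger once trained. The scheme name, model path, and any tags it would produce are all hypothetical:

import Foundation
import NaturalLanguage

let scheme = NLTagScheme(rawValue: "TibetanEntities") // hypothetical custom scheme
let model = try NLModel(contentsOf: URL(fileURLWithPath: "/path/to/TibetanEntities.mlmodelc"))

let text = "ང་དེ་ལ་སྒྲོ་ཡ་ཡིང།"
let entityTagger = NLTagger(tagSchemes: [scheme])
entityTagger.setModels([model], forTagScheme: scheme)
entityTagger.string = text
entityTagger.enumerateTags(in: text.startIndex ..< text.endIndex,
                           unit: .word,
                           scheme: scheme,
                           options: [.omitWhitespace, .omitPunctuation]) { (tag, range) -> Bool in
    if let tag = tag {
        print("\(text[range]): \(tag.rawValue)") // token: custom entity label
    }
    return true
}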
Conclusions
The Natural Language framework has a broad range of tools that are useful for Natural Language Processing. If you are using a language like English or Spanish, you get more data out of the box. If you are using a different language, you get the basics, but you will have to work with custom models. It is still much more straightforward than building an NLP model on your own.
Apple has done a great job with Swift; its text processing capabilities are remarkably good, and the Natural Language framework makes it exciting to work with. I'm looking forward to seeing how it evolves, and specifically to seeing its support for Tibetan grow.
I'll try to keep this post updated as new features or news come my way. If you have any questions or comments, don't hesitate to send me a message; I'll be glad to talk about this topic.