Objective:
Given an input text (utterance), list of words or regex patterns (entities) and the rules defined by the combination of entities (intents). Match the utterance with the corresponding intent.
Motivation:
Though this problem sounds just a text parsing problem, where we just need to do lookup and regex matches, but think of a scenario, where we have M lists of entities for lookups and N intents, which are combination of M entities with the following possibilties:
Intent (I1):
- Necessary to have entities E1 and E2
- E3 is not necessary, but optional
- Should have either of E4 and E5
Now, we can think of this an algorithmic challenging problem which very high time complexity. Thus, we borrow some graph datas tructures to address these types of issues for us.
Techniques:
In this section we will discuss the following modules, which are the backbone of this Rule Parsing Engine.
- Entity: Lets define an entity (E1). E1 can have list of words and phrases and then given a word, we need to check if it exists or not. Notation,
E ε [e1, e2, e3, ..., el]
.
Pythonic way,# Tuple of entity name and corresponding list of words. E1 = ('E1', [e1, e2, e3, e4])
To address this problem, we will use a data structure called Trie and match the string similarity based on edit distance.
What is Edit Distance ?
Given two strings str1 and str2 and below operations that can performed on str1. Find minimum number of edits (operations) required to convert ‘str1’ into ‘str2. Following are the allowed operations, having equal cost:- Add
- Delete
- Substitute
Reference to minimum edit distance
What is Trie ?
A Trie, (also known as a prefix tree) is a special type of tree used to store associative data structures. A trie (pronounced try) gets its name from retrieval — its structure makes it a stellar matching algorithm.Now we will use both of these approaches to solve the above problem. Let’s see what we want to achieve with the data structure.
trie = Trie(edit_distance_max=1) trie.insert("helloworld") results = list(trie.lookup("helloworl")) results = list(trie.lookup("elloworld"))
Now we have got a lookup for entities, we want to have entities with no overlaps with the neighbors, i.e. given a list of tagged entities,expand out valid parse results. A parse result is considered valid if it contains no overlapping spans. Since total confidence of a parse result is based on the sum of confidences of the entities, there is no sense in yielding any potential parse results that are a subset/sequence of a larger valid parse result. Thus, we will leverage “Expander Graphs”.