Courier’s Speech Act Classifier: Modeling Intent

Paulo Malvar
5 min read · Oct 18, 2017

As pointed out in a previous post, conversational emails are defined as text-based asynchronous conversations. Drawing on John L. Austin's and John R. Searle's body of work on speech act theory, conversational utterances can be analyzed and classified, from a pragmatic point of view, according to their illocutionary force, that is, their intention and their effect in the world.

Using Stolcke et al.'s (2000) taxonomy of speech acts, we have created a simplified and refined list of speech acts that we can apply to sentences in emails and exploit to produce email summaries that are relevant from a pragmatic point of view.

The simplified list of speech acts that we use in this project is:

Command/Request: Utterances that compel someone to perform an action.
Commitment: Utterances that express an agent's commitment to perform an action.
Question: Utterances that ask for information and/or opinion.
Statement: Utterances that simply convey information and/or opinions.
Desire/Need: Utterances that convey a desire or need that an agent has.
Other: Any other utterance that doesn't fit into the previously defined classes.
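In code, this taxonomy is just a small fixed label set. Below is a minimal sketch of how it could be represented; the class names and string values are illustrative, not Courier's internal identifiers:

```python
from enum import Enum

class SpeechAct(Enum):
    """Simplified speech act taxonomy used to label email sentences (illustrative names)."""
    COMMAND_REQUEST = "command/request"  # compels someone to perform an action
    COMMITMENT = "commitment"            # the speaker commits to performing an action
    QUESTION = "question"                # asks for information and/or an opinion
    STATEMENT = "statement"              # conveys information and/or an opinion
    DESIRE_NEED = "desire/need"          # expresses a desire or need
    OTHER = "other"                      # anything that doesn't fit the classes above
```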

Training Corpus

We compiled a corpus with a total of 492 emails: 246 emails from the Enron Corpus (Klimt & Yang, 2004) and 246 from one of our personal inboxes, used both for personal and business purposes. Content from these emails was parsed using our in-house email header and content parser, tokenized, sentence-split and finally annotated using the speech act categories defined above.
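Our header and content parser is in-house, but the rest of the preprocessing pipeline can be approximated with off-the-shelf tools. The following minimal sketch uses Python's standard email module and NLTK for sentence splitting; it merely stands in for, and does far less than, the actual parser:

```python
import email
from email import policy
import nltk  # assumes the 'punkt' sentence tokenizer has been downloaded

def email_to_sentences(raw_message: str) -> list[str]:
    """Parse a raw email, keep the plain-text body, and split it into sentences."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return nltk.sent_tokenize(body)

# Each resulting sentence would then be annotated by hand with one of the
# six speech act labels defined above.
```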

The distribution of classes after annotation is the following:

Speech acts distribution

From this graph, it’s interesting to note that the majority of annotated utterances belong to the Statement category, which reinforces the assumption that email is a communication medium used to convey information and opinions.

However, leaving this category aside, we find it relevant to note that the second most frequent category is one (Command/Request) that accounts for utterances compelling someone to perform an action. As will be discussed in a future blog post, this has informed our decisions regarding what features to add to our email client. In particular, we're referring to Courier's capability to detect and highlight tasks that users receive via email, which are essential pieces of information, especially when users rely on email as a medium of communication in their work environment.

The Classifier

At Codeq, over the four years of intense research that went into developing Courier, we have had to build from scratch most of the technology that powers our Natural Language Processing (NLP) engine. This means we have had to compile our own datasets, like the one used to train this speech act classifier. Given our limited resources and the complexity of this project, we have had to apply Machine Learning (ML) algorithms to small collections of data.

Working with what is known as Small Data has its own inherent set of challenges. Whereas working with Big Data requires engineering ML models that can scale, in a Small Data setup one has to make do with the available data and invest significant effort in formalizing linguistic and pragmatic knowledge as features that can capture the nature of the classes one is trying to model.

For our speech act classifier we have engineered more than 50 manually crafted features that our ML classifier can use to maximize the probability of the predictions it makes.
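To give a flavor of what such hand-crafted features can look like, here is an illustrative sketch; these few cues are hypothetical examples, not Courier's actual feature set:

```python
def extract_features(sentence: str) -> dict:
    """Illustrative hand-crafted features for one sentence (not Courier's actual set)."""
    lowered = sentence.lower()
    tokens = lowered.split()
    first = tokens[0] if tokens else ""
    return {
        "ends_with_question_mark": sentence.rstrip().endswith("?"),
        "starts_with_wh_word": first in {"who", "what", "when", "where", "why", "how"},
        "starts_with_please": first == "please",
        "has_modal_of_obligation": any(t in {"must", "should", "need"} for t in tokens),
        "has_first_person_commitment": "i will" in lowered or "i'll" in lowered,
        "num_tokens": len(tokens),
    }
```

Feature dictionaries like these can then be vectorized (for example with scikit-learn's DictVectorizer) and fed to a probabilistic classifier.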

Using this specialized set of features, we were able to achieve relatively good performance on our development corpus:

Initial performance on the development corpus.

However, to make the predictions more robust, we used a well-known ML technique called bootstrapping, which consists of the following steps (a code sketch follows the list):

  1. Training an initial model
  2. Classifying unlabeled examples
  3. Adding to the training corpus those automatically classified examples whose classification probability is equal to or greater than a predefined threshold
  4. Iterating this process, starting with step 1, until no new bootstrapped data can be extracted and added to the training corpus
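The sketch below illustrates this loop under the assumption of a scikit-learn-style probabilistic classifier; the actual model, features, and stopping behavior used for Courier may differ:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def bootstrap(labeled_feats, labels, unlabeled_feats, thresholds):
    """Self-training loop: repeatedly fold in confidently classified unlabeled examples.

    labeled_feats / labels: hand-annotated sentences as feature dicts plus their labels.
    unlabeled_feats: feature dicts for unannotated sentences.
    thresholds: maps each class label to its minimum acceptance probability.
    """
    labeled_feats, labels = list(labeled_feats), list(labels)
    unlabeled = list(unlabeled_feats)
    vec = DictVectorizer()
    while True:
        # 1. Train a model on the current labeled set.
        clf = LogisticRegression(max_iter=1000)
        clf.fit(vec.fit_transform(labeled_feats), labels)

        if not unlabeled:
            break

        # 2. Classify the remaining unlabeled examples.
        probs = clf.predict_proba(vec.transform(unlabeled))
        newly_labeled, still_unlabeled = [], []
        for feats, row in zip(unlabeled, probs):
            label = clf.classes_[row.argmax()]
            # 3. Keep an example only if its probability clears its class threshold.
            if row.max() >= thresholds[label]:
                newly_labeled.append((feats, label))
            else:
                still_unlabeled.append(feats)

        # 4. Iterate until no new bootstrapped data can be added.
        if not newly_labeled:
            break
        labeled_feats += [f for f, _ in newly_labeled]
        labels += [l for _, l in newly_labeled]
        unlabeled = still_unlabeled

    return clf, vec
```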

After 9 bootstrapping iterations using variable probability thresholds ranging from 0.93 to 0.97, we were able to add the following number of additional automatically labeled training instances:

Extra data automatically added to the training corpus.

As mentioned above, a distinct probability threshold was assigned to each class. Class probability threshold assignments were calculated using an interpolation technique that assigns a number within a given range to each class according to its distribution in the training corpus. This way, classes with a higher number of training examples were assigned higher probability thresholds than classes with fewer training examples. We followed this approach to account for the fact that dominant classes tend to yield higher prediction probabilities; making it harder for those classes to bootstrap additional automatically labeled instances prevents them from becoming even more dominant and skewing predictions in their favor, which usually hurts other classes' recall.
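A simple linear interpolation along these lines, shown here only as an illustrative sketch (not necessarily the exact formula we used, and with made-up class counts), looks like this:

```python
def class_thresholds(class_counts, low=0.93, high=0.97):
    """Linearly interpolate a per-class probability threshold from class frequency.

    The most frequent class gets the highest threshold (hardest to bootstrap);
    the least frequent class gets the lowest.
    """
    lo_count, hi_count = min(class_counts.values()), max(class_counts.values())
    span = (hi_count - lo_count) or 1  # avoid division by zero if all counts are equal
    return {
        label: low + (count - lo_count) / span * (high - low)
        for label, count in class_counts.items()
    }

# Example with made-up counts:
# class_thresholds({"statement": 900, "command/request": 300, "question": 150})
# -> statement 0.97, command/request 0.938, question 0.93
```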

Results obtained on the development corpus after 9 bootstrapping iterations are as follows:

Performance on the development corpus after 9 rounds of bootstrapping.

The most significant change observed after bootstrapping is the increase in recall for a couple of categories. While this change is not dramatic (results were already quite high for the initial training), an increase in recall means the classifier is able to identify more instances of each of those categories.

Finally, to assess the generalization power of this classifier, we show below the results obtained on the test corpus, which we did not use in any way during training. In addition, with the intent of maximizing precision in production, we decided to apply an aggressive probability threshold; after extensive experimentation we settled on 0.8.
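In practice, this amounts to accepting a predicted label only when the model's probability clears the threshold. The sketch below illustrates the idea; falling back to the Other class on low-confidence predictions is an assumption made for illustration, not a documented detail of Courier:

```python
def predict_speech_act(clf, vec, features, threshold=0.8, fallback="other"):
    """Return the predicted label only if its probability clears the threshold."""
    probs = clf.predict_proba(vec.transform([features]))[0]
    best = probs.argmax()
    if probs[best] >= threshold:
        return clf.classes_[best]
    return fallback  # low-confidence prediction: fall back to the default class
```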

Performance on the test corpus with probability threshold 0.8.

Conclusion

In this blog post we have shown the procedure we used to train Courier’s speech act classifier.

We believe that capturing pragmatic phenomena when performing automated text analysis tasks is crucial for the success of Natural Language Understanding projects.

In future blog posts we'll dive into other exciting features we've built to help us process and extract meaning from conversational text in the context of asynchronous conversations.

References

Austin, J.L. (1962). How to do things with words. Cambridge, MA: Harvard University Press.

Klimt, B., & Yang, Y. (2004). The Enron Corpus: A New Dataset for Email Classification Research. ECML 2004, LNCS 3201, pp. 217–226.

Searle, J.R. (1969). Speech acts: An essay in the philosophy of language. Cambridge: Cambridge University Press.

Stolcke, A., Ries, K., Coccaro, N., Shriberg, E., Bates, R., Jurafsky, D., et al. (2000). Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3), 339–373.
