Courier’s Question Classifier

Paulo Malvar
Nov 13, 2017

As an extension to our speech act classifier that we introduced in a previous blog post, in Courier we have implemented a module that further categorizes information requests, that is, questions, into different types.

This module has three main purposes:

1) identify questions in emails that need to be answered by a user when responding to email conversations;

2) provide finer-grained information about the nature of sentences to our email conversation summarizer; and

3) allow for more advanced post-processing rules that select relevant sentences to be included in email summaries.

Training Corpora

To train this question type classifier we extracted information requests from two relatively well-known corpora:

- ICSI-MRDA (Shriberg et al, 2004): The ICSI MRDA Corpus consists of hand-annotated dialog act, adjacency pair, and hotspot labels for the 75 meetings in the ICSI meeting corpus.

- Switchboard SWBD-DAMSL (Jurafsky et al, 1997): The Switchboard corpus is composed of approximately 2,400 telephone conversations between unacquainted adults.

The two corpora share a common tagset, which is discussed in detail below. Since both corpora are transcriptions of spoken conversations, their sentences lack the grammaticality, consistency and polish of written exchanges. In order to adapt the extracted information requests to a written setting like email conversations, we manually cleaned the corpora to remove repetitions, vacillations and other artifacts that are typical of spoken conversations and that their transcriptions reflect.
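The kind of cleanup involved can be illustrated with a small sketch. The filler list, regular expressions and function name below are hypothetical, for illustration only; the actual cleaning described here was done manually:

```python
import re

# Hypothetical patterns for typical spoken-language artifacts; the real
# cleanup of the corpora was performed by hand.
FILLERS = re.compile(r"\b(uh-huh|you know|uh|um)\b,?\s*", re.IGNORECASE)
REPEATS = re.compile(r"\b(\w+)(\s+\1\b)+", re.IGNORECASE)

def clean_utterance(text):
    """Strip filler words and immediate word repetitions, then
    normalize whitespace."""
    text = FILLERS.sub("", text)
    text = REPEATS.sub(r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()

print(clean_utterance("well, uh, do do you want to to meet today?"))
# well, do you want to meet today?
```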

Tagset

Our question type classifier inherits its tagset from the SWBD-DAMSL, which is also shared by the ICSI-MRDA corpus. The definition of the classes defined by this set of tags can be found in “Switchboard SWBD-DAMSL Shallow-Discourse-Function Annotation. Coders Manual, Draft 13” (Jurafsky et al, 1997).

The original SWBD-DAMSL tagset had the following set of tags: qy, qw, qo, qr, qrr, d, g and qh (described below). Except for the “d” and “g” classes (declarative and tag questions, respectively), all classes are mutually exclusive. This means that a question labeled, for example, as “qy” can also be labeled “d” and/or “g”, but cannot be labeled as “qo”, “qw” or “qr” at the same time. Drawing from this original set of tags and respecting the mutual exclusivity of the original question types, we reviewed, restructured and adapted the annotations of both ICSI-MRDA and SWBD-DAMSL to contain the following set of annotations:

1) Yes/No questions, label “qy”

“qy” is used for yes-no questions only if they both have the pragmatic force of a yes-no-question and if they have the syntactic and prosodic markings of a yes-no question (i.e. subject-inversion, question intonation).

2) Wh- questions, label “qw”

Wh-interrogative questions. These must have subject-inversion. “Echo-questions” with wh-in-place are considered “declarative questions” (marked with d, see below).

3) Open-ended questions, label “qo”

Open-ended questions are meant to address the kind of questions which we think place few if any syntactic constraints on the form of the answer.

4) Or-questions, label “qr”

Or-questions are questions that offer 2 or more choices as answers to the listener or reader. We have restructured the original tagset so that qrr questions, which are “or-clause tacked on after a y/n question”, are considered or-questions too.

5) Declarative questions, label “d”

Declarative questions are utterances which function pragmatically as questions but that do not have “question form” and/or subject-verb inversion.

6) Tag questions, label “g”

A ‘tag’ question consists of a statement and a ‘tag’ which seeks confirmation of the statement. Because the tag gives the statement the force of a question, the tag question is coded ‘qy^g’. The tag may also be transcribed as a separate slash unit, in which case it is coded ‘^g’.

7) Rhetorical questions, label “qh”

The original category of information request defined for the SWBD-DAMSL corpus did not consider rhetorical questions to be true questions, since this type of question does not fall within the definition of “utterances that are jointly pragmatically, semantically, and syntactically questions.”

We acknowledge that rhetorical questions present a challenge for a classifier purely based on lexical and morphological features, given that they “are examples of utterances whose form does not match their function. They have the structure of a question but the force of an assertion and so are generally defined as questions that neither seek information nor elicit an answer. […] This makes them unique within semantic and pragmatic analyses since most utterances are assumed to be informative or at least information-seeking.” (Rohde, 2006: 134)

In our opinion, even though rhetorical questions are neither pragmatically nor semantically questions, they are syntactically expressed as questions, and not attempting to classify them would lead the classifier to assign them one of the other supported question classes. Therefore, we believe it is necessary to classify this type of question in order to set it apart from the set of true information requests.

8) Task questions, label “qt”

This is a new category that was not present in the original SWBD-DAMSL tagset. It includes questions originally classified as rhetorical because, although they lack the intention of seeking information, they encapsulate an action request: “Why don’t you go ahead and call him today?”.

This new category allows us to distinguish questions traditionally considered as rhetorical, such as “Who cares?” or “Who knows?”, from questions that intend to compel the listener or reader to take an action.
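Encoded as data, the resulting tagset and its exclusivity constraint might look like the following sketch (the names and the validity check are ours, for illustration only):

```python
# The six core question types are mutually exclusive with one another;
# "d" (declarative) and "g" (tag) may overlay any of them.
CORE_TYPES = {"qy", "qw", "qo", "qr", "qh", "qt"}
OVERLAY_TYPES = {"d", "g"}

def is_valid_labeling(labels):
    """A labeling is valid if it uses at most one core question type
    and only known tags."""
    return len(labels & CORE_TYPES) <= 1 and labels <= (CORE_TYPES | OVERLAY_TYPES)

print(is_valid_labeling({"qy", "d"}))   # True: a declarative yes/no question
print(is_valid_labeling({"qy", "qw"}))  # False: two mutually exclusive types
```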

The distribution of labels/classes across the cleaned and reviewed instances of the ICSI-MRDA and Switchboard SWBD-DAMSL corpora is as follows:

Question types distribution.

The Classifier

Since we need to identify 8 different types of questions, some of which are mutually exclusive, we adopted a two-phase training and fine-tuning approach:

1) a chained binary classifier approach,

2) a probability-based approach that prevents questions from being assigned mutually exclusive question type labels.

Chained Binary Classifiers

For the first training phase, we trained 8 different binary classifiers by creating 8 different splits from the clean set of questions extracted from the above-mentioned corpora. Each split contains sentences that belong to the category we are trying to identify (class 1) and sentences labeled as a different question type (class 0). Splits were designed to be balanced in terms of the number of sentences assigned to class 1 and class 0. Thus, if class 1 was underrepresented with respect to class 0, class 0 was shrunk to match class 1’s size, and vice versa.
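The balancing strategy amounts to downsampling the larger class, roughly as in this sketch (function and variable names are ours, not the actual implementation):

```python
import random

def balanced_binary_split(positives, negatives, seed=0):
    """Downsample the larger class so both classes contribute the same
    number of sentences, then shuffle the combined, labeled examples."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    data = [(s, 1) for s in rng.sample(positives, n)] + \
           [(s, 0) for s in rng.sample(negatives, n)]
    rng.shuffle(data)
    return data

data = balanced_binary_split(["q1", "q2"], ["q3", "q4", "q5", "q6"])
print(len(data))  # 4: two sentences per class
```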

As already discussed in a previous post, given our limited resources and the complexity of this project, we have been forced to apply Machine Learning (ML) algorithms to small collections of data. The question type classifier that we’re discussing here is no different in this respect, as the number of training examples is not very high, especially for some of the question types we’re trying to learn.

Thus, in a Small Data setup like this one, one has to make do with the available data and apply a significant amount of effort to formalize linguistic and pragmatic knowledge in the form of features that can capture the nature of the classes one is trying to model.

For our question classifier we have engineered more than 30 manually crafted features that our ML classifier can use in order to maximize the probability of the predictions it makes.

Besides lexical and morphological features that capture linguistic phenomena, we have also implemented features that inform the classifiers whether a question has already been assigned another question type label that belongs to the set of mutually exclusive types. In practice, this is achieved by chaining the classifiers in a cascade-like fashion. For example, if a question has been classified as ‘qy’, the next classifier in the chain, the ‘qw’ classifier, can use a feature that lets it know that the question has already been assigned the ‘qy’ label. The classifier uses this information, along with the rest of the features, to output a new label for the question with a particular probability. Assigned question types are later refined using the probability for each type assignment in order to decide which question type is more likely.
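The cascade can be sketched as follows; the classifier interface shown here (a callable returning a probability) is hypothetical, and the 0.5 cutoff is illustrative:

```python
def cascade_predict(features, classifiers):
    """Run binary classifiers in a fixed order. Each classifier receives,
    besides the question's features, the set of labels assigned so far,
    which it can consume as extra features."""
    assigned = {}  # label -> probability
    for label, predict_proba in classifiers:
        p = predict_proba(features, set(assigned))
        if p >= 0.5:  # illustrative cutoff; per-class thresholds come later
            assigned[label] = p
    return assigned

# Toy classifiers: the "qw" classifier backs off when "qy" is already assigned.
classifiers = [
    ("qy", lambda feats, prev: 0.9),
    ("qw", lambda feats, prev: 0.2 if "qy" in prev else 0.8),
]
print(cascade_predict({}, classifiers))  # {'qy': 0.9}
```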

At training time the compiled corpus is dynamically split into training (92% of the corpus), development and test corpora (4% of the corpus each). Results achieved by each individual binary classifier on the development corpus are as follows:

Individual binary classifiers results on the development corpus.

Although these results look very good for each individual binary classifier, global precision and recall don’t look as good when we put together all these classifiers in the cascade-like fashion, as described above. In the next section we’ll discuss how we improved its performance by thresholding the classifiers and by taking into account the fact that there are some mutually exclusive classes.

Global performance on the development corpus without thresholded predictions and mutual exclusivity knowledge.

Probability-based Thresholding And Mutual Exclusivity

In order to boost the overall performance of the chained binary classifiers and, most importantly, to ensure that no mutually exclusive classes are assigned to the same question, we conducted a second training phase, fine-tuning question type assignment using different probability thresholds for each individual classifier.

Using the development corpus, we conducted a set of experiments to find the optimal class assignment probability threshold for each question type. In addition, we added code that allows only the most likely mutually exclusive class to be assigned to each question.

Finally, we added an extra class, ‘UNKNOWN’, that gets assigned when no other class probability is high enough. This lowers overall recall, although not substantially, but increases precision.
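Putting per-class thresholding, mutual exclusivity and the UNKNOWN fallback together, the final decision step can be sketched as follows (the threshold values and names are illustrative, not the tuned ones):

```python
MUTUALLY_EXCLUSIVE = {"qy", "qw", "qo", "qr", "qh", "qt"}

def final_labels(probabilities, thresholds):
    """Keep labels that clear their per-class threshold, retain only the
    single most likely mutually exclusive class, and fall back to UNKNOWN
    when nothing survives."""
    surviving = {lbl: p for lbl, p in probabilities.items()
                 if p >= thresholds.get(lbl, 0.5)}
    exclusive = {lbl: p for lbl, p in surviving.items()
                 if lbl in MUTUALLY_EXCLUSIVE}
    if exclusive:
        best = max(exclusive, key=exclusive.get)
        surviving = {lbl: p for lbl, p in surviving.items()
                     if lbl not in MUTUALLY_EXCLUSIVE or lbl == best}
    return surviving or {"UNKNOWN": 1.0}

print(final_labels({"qy": 0.7, "qw": 0.6, "d": 0.8},
                   {"qy": 0.6, "qw": 0.55, "d": 0.5}))
# {'qy': 0.7, 'd': 0.8}
```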

Results obtained for this second training phase are as follows:

Global performance on the development corpus with thresholded predictions and mutual exclusivity knowledge.

Finally, below we show the final results obtained for the test corpus applying the same setup found to be optimal for the development corpus:

Global performance on the test corpus with thresholded predictions and mutual exclusivity knowledge.

Conclusion

We have shown in this blog post how we trained our question type classifier and how we had to come up with an original setup to overcome challenges like the mutual exclusivity of the classes we needed to learn.

Creative solutions like the implementation of binary classifiers chained in a cascade-like fashion were not a choice made to showcase our expertise but an absolute necessity in order to train a well-performing classifier.

References

Jurafsky, D., Shriberg, E., and Biasca, D. (1997). Switchboard SWBD-DAMSL Shallow-Discourse-Function Annotation Coders Manual, Draft 13, University of Colorado, Boulder. Institute of Cognitive Science Technical Report 97–02. http://web.stanford.edu/~jurafsky/ws97/manual.august1.html (accessed 10–17–2017)

Shriberg, E., Dhillon, R., Bhagat, S., Ang, J., and Carvey, H. (2004). The ICSI Meeting Recorder Dialog Act (MRDA) Corpus. In Proceedings of the 5th SIGDIAL Workshop on Discourse and Dialogue.

Rohde, H. (2006). Rhetorical questions as redundant interrogatives. San Diego Linguistic Papers, (2), 134–168.
