Deep Learning in Courier: Thoughts, Tips And Whatnot

--

Courier features a complex Natural Language Processing and Understanding system that allows users to spend less time reading email.

When we started working on this project we only had a handful of tools, like our tokenizer and sentence splitter, so we had to build pretty much the entire system from the ground up. Our guiding philosophy has been to build every module that is essential and for which we have the necessary resources and knowledge. This philosophy has given us complete control of our software stack, granting us the power to customize every component to whatever level we have needed.

The Big Migration

During four years of intense development we’ve built many, many Machine Learning (ML) modules in Courier. When we started this project, though, the landscape in terms of ML popularity was very different. It is true that ML was starting to attract attention from a larger audience, but it was far from the ubiquity that it enjoys nowadays.

In 2013, Deep Learning (DL) was just starting to gain some popularity, but the barrier to entry was very high and only a handful of highly specialized practitioners had access to it.

At that time, other families of ML algorithms were the cool kids on the block: Logistic Regression, Support Vector Machines, Random Forests, and Gaussian Mixtures, among others. Working with these powerful algorithms entailed, in many cases, a laborious process of feature engineering in order to capture all the linguistic phenomena we needed to model.

In 2015, though, everything changed. DL tools went through a process of democratization that gave a much larger audience access to the power of this new family of ML algorithms. Keras was released in March of that year, offering many ML practitioners much easier access to Theano as a tool for applying DL algorithms to many different tasks. Later that year, Google released TensorFlow, which Keras quickly adopted as a second backend engine.

At Codeq we started retraining many of our classifiers to take advantage of these powerful new tools. But we soon realized that several considerations had to be weighed for that migration to be possible.

A Radical Paradigm Shift

[Figure: Typical feedforward neural network architecture.]

It is true that one can just implement a simple deep feedforward neural network with fully connected dense layers that can learn far more complex functions than our old Support Vector Machine (SVM) models. But if one wants to take advantage of more powerful algorithms like Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs), that implies a far more radical paradigm shift.
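
To make the contrast concrete, here is a minimal sketch, in Keras, of the kind of feedforward network with fully connected dense layers described above. It is not our production code: the layer sizes, the 300-dimensional input and the binary output are illustrative assumptions.

    # A minimal feedforward classifier over fixed-size feature vectors.
    # Layer sizes and the 300-dim input are illustrative, not Courier's settings.
    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    model = Sequential([
        Dense(128, activation="relu", input_shape=(300,)),  # hidden layer over the input features
        Dropout(0.5),                                       # regularization
        Dense(64, activation="relu"),
        Dense(1, activation="sigmoid"),                     # binary decision
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    # Toy random data, just to show the training call
    X = np.random.rand(1000, 300)
    y = np.random.randint(0, 2, size=(1000,))
    model.fit(X, y, epochs=5, batch_size=32, verbose=0)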

For starters, one needs a lot, I repeat, a lot more data to use raw text as input for these algorithms. If feature engineering was the most time-consuming task when working with, for example, SVM models, then data collection and annotation became the task we had to ramp up when we started training more complex DL models.

But not everybody is Google or Facebook in terms of the amount of data one has access to; nor does everybody have the resources to annotate tens of thousands of data points to feed these data-hungry algorithms.

[Figure: Processing time comparison between an SVM and an RNN model trained for emotion classification.]

Another important consideration has to do with the performance one needs out of an ML classifier deployed in production: not performance in terms of precision and recall, although that is very important too, but performance in terms of running time.

When we started migrating our models to the Keras+TensorFlow combo, we noticed a substantial increase in the processing time needed to completely analyze emails. What used to take tens of milliseconds now took several hundred milliseconds. Moving from a standard SVM setup trained on a few thousand examples to a DL architecture with one or two LSTM layers meant roughly 3 to 5 times longer processing times when running a model to classify new, unseen data.
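
For readers who want to reproduce this kind of comparison, below is a small, self-contained sketch of how one might time inference for a scikit-learn SVM versus a Keras LSTM. It uses toy random data rather than our real models or features, so the absolute numbers are meaningless; the relative gap is the point.

    # Toy latency comparison: scikit-learn SVM vs. Keras LSTM (random data).
    import time
    import numpy as np
    from sklearn.svm import SVC
    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    rng = np.random.RandomState(0)
    X_feat = rng.rand(500, 100)                    # dense feature vectors for the SVM
    X_seq = rng.randint(1, 5000, size=(500, 50))   # padded token-id sequences for the LSTM
    y = rng.randint(0, 2, size=500)

    svm = SVC().fit(X_feat, y)

    lstm = Sequential([
        Embedding(input_dim=5000, output_dim=64, input_length=50),
        LSTM(64),
        Dense(1, activation="sigmoid"),
    ])
    lstm.compile(optimizer="adam", loss="binary_crossentropy")
    lstm.fit(X_seq, y, epochs=1, batch_size=32, verbose=0)

    def avg_seconds(predict_fn, data, n_runs=10):
        """Average wall-clock time of predict_fn(data) over n_runs calls."""
        start = time.perf_counter()
        for _ in range(n_runs):
            predict_fn(data)
        return (time.perf_counter() - start) / n_runs

    print("SVM :", avg_seconds(svm.predict, X_feat))
    print("LSTM:", avg_seconds(lstm.predict, X_seq))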

[Figure: Processing time comparison between a CNN and an RNN model trained for emotion classification.]

We’ve spent a significant amount of time researching how to optimize our DL architectures so that we could use them in production. One strategy is to use, when possible, a CNN over an RNN. CNNs and RNNs conceptualize language in very different ways: a CNN treats language as a bag of chunks (more or less like a bag of n-grams), whereas an RNN conceptualizes language as a sequence of tokens. So if it’s possible to get equally good results on a problem using a CNN architecture, one can run the model in production 3 to 5 times faster than a comparable model trained with an RNN architecture, depending of course on the complexity of the models.
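
As an illustration of these two views of language, here is a rough Keras sketch of both styles of text classifier. Vocabulary size, sequence length, filter sizes and unit counts are illustrative assumptions, not the settings of any Courier model.

    from keras.models import Sequential
    from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, LSTM, Dense, Dropout

    # CNN view: convolution filters slide over the embedded tokens, roughly
    # treating the text as a bag of n-gram-like chunks; pooling keeps the
    # strongest activation of each filter regardless of position.
    cnn = Sequential([
        Embedding(input_dim=20000, output_dim=128, input_length=200),
        Conv1D(filters=128, kernel_size=5, activation="relu"),  # ~5-gram "chunks"
        GlobalMaxPooling1D(),
        Dropout(0.5),
        Dense(64, activation="relu"),
        Dense(1, activation="sigmoid"),
    ])
    cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    # RNN view: a recurrent layer reads the tokens strictly in order,
    # modeling the text as a sequence rather than a bag of chunks.
    rnn = Sequential([
        Embedding(input_dim=20000, output_dim=128, input_length=200),
        LSTM(128),
        Dense(1, activation="sigmoid"),
    ])
    rnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])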

Deep Learning in Courier

A modern but modest ML practitioner who has experience with both the “old” ML algorithms (like SVM) and the new and shiny DL algorithms still needs to study and understand each problem he or she wants to solve and decide which path is the best one to take.

Courier is in fact a testament to this!

[Figure: Courier’s emotion classifier RNN architecture.]

We have built modules that feature complex DL algorithms, like our emotion classifier, which is powered by a single RNN+Attention encoder with 6 different output layers, or our task classifier, which uses a deep CNN architecture to identify tasks in conversational emails.
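
As a rough illustration of that kind of multi-output design, here is a sketch, using the Keras functional API, of a shared recurrent encoder feeding several output heads. It is not the actual Courier architecture: the attention mechanism is simplified to global max pooling, and the layer sizes, the number of heads and the emotion labels are made-up examples.

    from keras.models import Model
    from keras.layers import Input, Embedding, Bidirectional, LSTM, GlobalMaxPooling1D, Dense

    tokens = Input(shape=(200,), name="token_ids")
    x = Embedding(input_dim=20000, output_dim=128)(tokens)
    x = Bidirectional(LSTM(64, return_sequences=True))(x)  # shared recurrent encoder
    encoded = GlobalMaxPooling1D()(x)                       # stand-in for attention pooling

    # One independent binary head per label, all reading the same encoding
    emotion_labels = ["joy", "sadness", "anger", "fear", "surprise", "love"]  # hypothetical labels
    outputs = [Dense(1, activation="sigmoid", name=label)(encoded) for label in emotion_labels]

    model = Model(inputs=tokens, outputs=outputs)
    model.compile(optimizer="adam",
                  loss={label: "binary_crossentropy" for label in emotion_labels})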

At the same time, Courier features another module, our speech act classifier, that uses carefully hand-crafted features fed into a feedforward neural network to identify the illocutionary force of conversational utterances.

[Figure: Courier’s task classifier CNN architecture.]

Another module that uses manually engineered features is our question classifier, which extends the speech act classifier, further analyzing questions into different question types. This module, however, still uses scikit-learn’s SVM to make its predictions.
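
The general pattern behind these feature-based modules looks roughly like the sketch below: a small function turns each utterance into a fixed-size vector of hand-crafted features, which then feeds a scikit-learn SVM. The features and labels shown here are hypothetical examples, not the ones our classifiers actually use.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    WH_WORDS = {"who", "what", "when", "where", "why", "how", "which"}

    def extract_features(question):
        """Turn one question string into a fixed-size numeric feature vector."""
        tokens = question.lower().rstrip("?").split()
        return [
            len(tokens),                                        # question length
            1.0 if tokens and tokens[0] in WH_WORDS else 0.0,   # starts with a wh-word?
            1.0 if "or" in tokens else 0.0,                     # alternative ("A or B") question?
            1.0 if tokens and tokens[0] in {"can", "could", "would", "will"} else 0.0,  # request-like?
        ]

    # A couple of toy training examples, just to show the interface
    questions = ["What time works for you?", "Can you send the report?"]
    labels = ["wh_question", "yes_no_question"]

    X = np.array([extract_features(q) for q in questions])
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    clf.fit(X, labels)

    print(clf.predict(np.array([extract_features("Where should we meet?")])))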

Deep Learning, Pros and Cons

To finish up this blog post, we would like to quickly review some of the pros and cons that we have identified while transitioning to the DL paradigm.

Pros

  • Deep Learning models don’t stop learning: Even with modest DL architectures, one of the first things an ML practitioner notices is that DL models have a much higher learning capacity than many traditional ML algorithms. Andrew Ng has a short video on his YouTube channel that discusses this.
  • More data doesn’t mean longer processing times: Another nice property of DL is that, if the complexity of the model architecture is kept unchanged, adding more training data doesn’t mean longer processing times on new, unseen data. This isn’t true for SVM models, for example, since more data means more support vectors, and more support vectors mean longer processing times.
  • No need for feature engineering: As discussed above, DL models learn patterns directly from raw text, so there’s no need to spend a significant amount of time formalizing those patterns in the form of features, as used to be the case with many old ML algorithms.
  • Deep Learning algorithms can capture complex phenomena: If fed the right amount of data, which tends to be a lot of data, DL algorithms can learn very complex patterns in the training data.
  • External knowledge can be easily added: One very nice feature of DL algorithms is that external knowledge, that is, knowledge that is not present in the training corpus, can be easily added in the form of embeddings pre-trained using algorithms like word2vec or GloVe (see the sketch after this list).
  • Trained Deep Learning models don’t break frequently: One important benefit that came with the migration from scikit-learn to Keras is that already trained DL models don’t break as frequently when upgrading the library from one version to another. For years this was a recurring problem with scikit-learn, even when upgrading dot releases.
  • Deep Learning algorithms can be easily sped up using GPUs: It’s become easier and easier to take advantage of the processing power of modern GPUs when training and running DL models in production. Recently, Keras has even implemented specialized versions of its RNN and CNN layers designed to squeeze every bit of performance out of Nvidia GPUs.
  • Deep Learning architectures are very flexible: With enough training and practice one can put together DL architectures that can model several input streams at once and combine them in whatever shape or form.
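
As promised in the bullet on external knowledge above, here is a minimal sketch of loading pre-trained GloVe vectors into a frozen Keras Embedding layer. The file name, the tiny word index and the 100-dimensional vectors are illustrative assumptions; in practice the word index comes from your own tokenizer.

    import numpy as np
    from keras.layers import Embedding

    EMBEDDING_DIM = 100
    word_index = {"meeting": 1, "tomorrow": 2, "thanks": 3}  # token -> id, normally from your tokenizer

    # Parse the standard GloVe text format: one word followed by its vector per line
    glove_vectors = {}
    with open("glove.6B.100d.txt", encoding="utf-8") as f:  # path is an assumption
        for line in f:
            parts = line.rstrip().split(" ")
            glove_vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")

    # Build the weight matrix; words missing from GloVe keep a zero vector
    embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    for word, idx in word_index.items():
        if word in glove_vectors:
            embedding_matrix[idx] = glove_vectors[word]

    # Freeze the layer so the external knowledge is not overwritten during training
    embedding_layer = Embedding(input_dim=len(word_index) + 1,
                                output_dim=EMBEDDING_DIM,
                                weights=[embedding_matrix],
                                trainable=False)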

Cons

  • Slower processing times: As discussed above, one of the first things an ML practitioner notices is that DL models can easily take significantly longer to run than old SVM models on the same hardware.
  • Using CNNs to speed up some models is not always viable: Some problems are better solved with an RNN, so replacing it with a CNN architecture to speed up processing times doesn’t necessarily yield the same performance in terms of precision and recall.
  • Hyperparameters curse: Switching to DL architectures entails optimizing more hyperparameters than many traditional ML algorithms require. We’ve observed that even the number of epochs and the batch size are hyperparameters that have to be tuned when training DL models.
  • Deep Learning architectures are too flexible: The flexibility of DL architectures is a double-edged sword. Using out-of-the-box algorithms is fairly straightforward, but once you start building more complex architectures the learning curve becomes fairly steep.
  • A lot less documentation and working examples: Combining the previous bullet point with the fact that there’s still a lack of good documentation and working examples makes it harder to get started applying DL algorithms to your own problems.
  • Deep Learning algorithms are more of a black box than more traditional ML algorithms: A common complaint one hears, even from experienced DL practitioners, is that many times it is not clear why one architecture works better than another. This is especially concerning given that some jurisdictions, namely the EU, are approving laws that grant their citizens a “right to explanation” for ML algorithms that impact their lives.

Last Words

In this blog post we have told the story of our journey of applying DL in Courier and our transition from more traditional ML methods. We think many people who were already doing ML before the DL revolution may share a similar experience with us.

We’ve also discussed the benefits and drawbacks of working with DL models in production, and how ML practitioners still need to know the problems they’re trying to solve very well and decide which methods, whether DL or older ML algorithms, best fit their production requirements.
