# Speeding up your Neural Network with Theano and the GPU

Get the code: The full code is available as an Jupyter/iPython Notebook on Github!

In a previous blog post we build a simple Neural Network from scratch. Let’s build on top of this and speed up our code using the Theano library. With Theano we can make our code not only faster, but also more concise!

## What is Theano?

Theano describes itself as a Python library that lets you to define, optimize, and evaluate mathematical expressions, especially ones with multi-dimensional arrays. The way I understand Theano is that it allows me to define graphs of computations. Under the hood Theano optimizes these computations in a variety of ways, including avoiding redundant calculations, generating optimized C code, and (optionally) using the GPU. Theano also has the capability to automatically differentiate mathematical expressions. By modeling computations as graphs it can calculate complex gradients using the chain rule. This means we no longer need to compute the gradients ourselves!

Because Neural Networks are easily expressed as graphs of computations, Theano is a great fit. It’s probably the main use case and Theano includes several convenience functions for neural networks.

## The Setup

The setup is identical to that in Implementing a Neural Network from Scratch, which I recommend you read (or at least skim) first. I’ll just quickly recap: We have two classes (red and blue) and want to train a Neural Network classifier that separates the two. We will train a 3-layer Neural Network, with input layer size 2, output layer size 2, and hidden layer size 3. We will use batch gradient descent with a fixed learning rate to train our network.

## Defining the Computation Graph in Theano

The first thing we need to is define our computations using Theano. We start by defining our input data matrix X and our training labels y:

import theano
import theano.tensor as T

# Our data vectors
X = T.matrix('X') # matrix of doubles
y = T.lvector('y') # vector of int64

Here’s crucial thing to understand: We have not assigned any values to X or y. All we have done is defined mathematical expressions for them. We can use these expressions in subsequent calculations. If we want to evaluate an expression we can call its eval method. For example, to evaluate the expression X * 2 for a given value of X we could do the following:

(X * 2).eval({X : [[1,1],[2,2]] })

Theano handles the type checking for us, which is very useful when defining more complex expressions. Trying to assign a value of the wrong data type to X would result in an error. Here is the full list of Theano types.

X and y above are stateless. Whenever we want to evaluate an expression that depends on them we need to provide their values. Theano also has something called shared variables, which have internal state associated with them. Their value that is kept in memory and can be shared by all functions that use them. Shared variables can also be updated, and Theano includes low-level optimizations that makes updating them very efficient, especially on GPUs. Our network parameters $W_1, b_1, W_2, b_2$ are constantly updated using gradient descent, so they are perfect candidates for shared variables:

# Shared variables with initial values. We need to learn these.
W1 = theano.shared(np.random.randn(nn_input_dim, nn_hdim), name='W1')
b1 = theano.shared(np.zeros(nn_hdim), name='b1')
W2 = theano.shared(np.random.randn(nn_hdim, nn_output_dim), name='W2')
b2 = theano.shared(np.zeros(nn_output_dim), name='b2')

Next, let’s define expressions for our forward propagation. The calculations are identical to what we did in our previous implementation, just that we are defining Theano expressions. Again, remember that these expressions are not evaluated, we are just defining them. You can think of them as lambda expressions that require input values when called. We also use some of Theano’s convenience functions like nnet.softmax and nnet.categorical_crossentropy to replace our manual implementations:

# Forward propagation
# Note: We are just defining the expressions, nothing is evaluated here!
z1 = X.dot(W1) + b1
a1 = T.tanh(z1)
z2 = a1.dot(W2) + b2
y_hat = T.nnet.softmax(z2) # output probabilties

# The regularization term (optional)
loss_reg = 1./num_examples * reg_lambda/2 * (T.sum(T.sqr(W1)) + T.sum(T.sqr(W2)))
# the loss function we want to optimize
loss = T.nnet.categorical_crossentropy(y_hat, y).mean() + loss_reg

# Returns a class prediction
prediction = T.argmax(y_hat, axis=1)

We saw how we can evaluate a Theano expression by calling its eval method. A much more convenient way is to create a Theano function for expressions we want to evaluate. To create a function we need to define its inputs and outputs. For example, to calculate the loss, we need to know the values for $X$ and $y$. Once created, we can call it function just like any other Python function.

# Theano functions that can be called from our Python code
forward_prop = theano.function([X], y_hat)
calculate_loss = theano.function([X, y], loss)
predict = theano.function([X], prediction)

# Example call: Forward Propagation
forward_prop([[1,2]])

Now is a good time to get a sense of how Theano constructs a computational graph. Looking at the expressions for $\hat{y}$, we can see that it depends on $z2$, which in turn depends on $a_1$, $W_2$ and $b_2$, and so on. Theano lets us visualize this:

This is the optimized computational graph that Theano has constructed for our forward_prop function. We can also get a textual description:

SoftmaxWithBias [@A] ''   4
|Dot22 [@B] ''   3
| |Elemwise{Composite{tanh((i0 + i1))}}[(0, 0)] [@C] ''   2
| | |Dot22 [@D] ''   1
| | | |X [@E]
| | | |W1 [@F]
| | |InplaceDimShuffle{x,0} [@G] ''   0
| |   |b1 [@H]
| |W2 [@I]
|b2 [@J]

What’s left is defining the updates to the network parameters we use with gradient descent. We previously calculated the gradients using backpropagation. We could express the same calculations using Theano (see code that’s commented out below), but it’s much easier if we let Theano calculate the derivatives for us! We need the derivates of our loss function $L$ with respect to our parameters: $\frac{\partial L}{\partial W_2}$, $\frac{\partial L}{\partial b_2}$, $\frac{\partial L}{\partial W_1}$, $\frac{\partial L}{\partial b_1}$:

# Easy: Let Theano calculate the derivatives for us!

# Backpropagation (Manual)
# Note: We are just defining the expressions, nothing is evaluated here!
# y_onehot = T.eye(2)[y]
# delta3 = y_hat - y_onehot
# dW2 = (a1.T).dot(delta3) * (1. + reg_lambda)
# db2 = T.sum(delta3, axis=0)
# delta2 = delta3.dot(W2.T) * (1. - T.sqr(a1))
# dW1 = T.dot(X.T, delta2) * (1 + reg_lambda)
# db1 = T.sum(delta2, axis=0)

Because we defined $W_2, b_2, W_1, b_1$ as shared variables we can use Theano’s update mechanism to update their values. The following function (without return value) does a single gradient descent update given $X$ and $y$ as inputs:

gradient_step = theano.function(
[X, y],
updates=((W2, W2 - epsilon * dW2),
(W1, W1 - epsilon * dW1),
(b2, b2 - epsilon * db2),
(b1, b1 - epsilon * db1)))

Note that we don’t need to explicitly do a forward propagation here. Theano knows that our gradients depend on our predictions from the forward propagation and it will handle all the necessary calculations for us. It does everything it needs to update the values.

Let’s now define a function to train a Neural Network using gradient descent. Again, it’s equivalent to what we had in our original code, only that we are now calling the gradient_step function defined above instead of doing the calculations ourselves.

# This function learns parameters for the neural network and returns the model.
# - num_passes: Number of passes through the training data for gradient descent
# - print_loss: If True, print the loss every 1000 iterations
def build_model(num_passes=20000, print_loss=False):

# Re-Initialize the parameters to random values. We need to learn these.
# (Needed in case we call this function multiple times)
np.random.seed(0)
W1.set_value(np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim))
b1.set_value(np.zeros(nn_hdim))
W2.set_value(np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim))
b2.set_value(np.zeros(nn_output_dim))

# Gradient descent. For each batch...
for i in xrange(0, num_passes):
# This will update our parameters W2, b2, W1 and b1!

# Optionally print the loss.
# This is expensive because it uses the whole dataset, so we don't want to do it too often.
if print_loss and i % 1000 == 0:
print "Loss after iteration %i: %f" %(i, calculate_loss(train_X, train_y))
# Build a model with a 3-dimensional hidden layer
build_model(print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(x))
plt.title("Decision Boundary for hidden layer size 3")

That’s it! We’ve just ported our code over to Theano. I got a 2-3x speedup on my Macbook (it would likely be more if we had larger matrix multiplications). Note that we’re not using a GPU yet. Let’s do that next!

## Running on a GPU

To speed things up further Theano can make use of your GPU. Theano relies on Nvidia’s CUDA platform, so yo need a CUDA-enabled graphics card. If you don’t have one (for example if you’re on a Macbook with an integrated GPU like me) you could spin up a GPU-optimized Amazon EC2 instance and try things there. To use the GPU with Theano you must have the CUDA toolkit installed and set a few environment variables. The repository readme contains instructions on how to setup CUDA on an EC2 isntance.

While any Theano code will run just fine with the GPU enabled you probably want to spend time optimizing your code get optimal performance. Running code that isn’t optimized for a GPU may actually be slower on a GPU than a CPU due to excessive data transfers.

Here are some things you must consider when writing Theano code for a GPU:

• The default floating point data type is float64, but in order to use the GPU you must use float32. There are several ways to do this in Theano. You can either use the fulled typed tensor constructors like T.fvector or set theano.config.floatX to float32 and use the simple tensor constructors like T.vector.
• Shared variables must be initialized with values of type float32 in order to be stored on the GPU. In other words, when creating a shared variable from a numpy array you must initialize the array with the dtype=float32 argument, or cast it using the astype function. See the numpy data type documentation for more details.
• Be very careful with data transfers to and from the GPU. Ideally you want all your data on the GPU. Ways to achieve this include initializing commonly used data as shared variables with a float32 data type, and to avoid copying return values back to the host using the gpu_from_host function. Theano includes profiling capabilities that allow you to debug how much time you are spending on data transfers.

With all that in mind, let’s modify our code to run well on a GPU. I’ll highlight the most important changes we here, but the full code is available here or at the bottom of the post.

First, to see a significant speedup we should increase the size of our data so that the computation time will be dominated by matrix multiplication and other mathematical operations (as opposed to looping in Python code and assigning values).

train_X, train_y = sklearn.datasets.make_moons(5000, noise=0.20)
nn_hdim = 1000

Next, we change our scalar values to float32 to ensure that our calculations stay within that type. If we multiplied a float32 value by a float64 value the return value would be a float64.

epsilon = np.float32(0.01)
reg_lambda = np.float32(0.01)

To force the storage of our data on the GPU we use shared variables of type float32. Because we are constantly using the input data X and y to perform gradient updates we store those on the GPU as well:

# These float32 values will be initialized on the GPU
X = theano.shared(train_X.astype('float32'), name='X')
y = theano.shared(train_y_onehot.astype('float32'), name='y')
W1 = theano.shared(np.random.randn(nn_input_dim, nn_hdim).astype('float32'), name='W1')
b1 = theano.shared(np.zeros(nn_hdim).astype('float32'), name='b1')
W2 = theano.shared(np.random.randn(nn_hdim, nn_output_dim).astype('float32'), name='W2')
b2 = theano.shared(np.zeros(nn_output_dim).astype('float32'), name='b2')

We remove the inputs from our functions since the X and y values are stored as shared variables (on the GPU) now. We no longer need to pass them to the function calls.

gradient_step = theano.function(
inputs=[],
updates=((W2, W2 - epsilon * dW2),
(W1, W1 - epsilon * dW1),
(b2, b2 - epsilon * db2),
(b1, b1 - epsilon * db1)))

# Does a gradient step and updates the values of our paremeters
gradient_step()

### Running and Evaluation

To use a GPU you must run the code with the THEANO_FLAGS=device=gpu,floatX=float32 environment variable set. I tested the GPU-optimized code on a g2.2xlarge AWS EC2 instance. Running one gradient_step() on the CPU took around 250ms. With the GPU enabled it merely took 7.5ms. That’s a 40x speedup, and if our dataset or parameter space were larger we would probably see an even more significant speedup.

### Full GPU-optimzied code

# Generate a dataset
np.random.seed(0)
train_X, train_y = sklearn.datasets.make_moons(5000, noise=0.20)
train_y_onehot = np.eye(2)[train_y]

# Size definitions
num_examples = len(train_X) # training set size
nn_input_dim = 2 # input layer dimensionality
nn_output_dim = 2 # output layer dimensionality
nn_hdim = 1000 # hiden layer dimensionality

# Gradient descent parameters (I picked these by hand)
epsilon = np.float32(0.01) # learning rate for gradient descent
reg_lambda = np.float32(0.01) # regularization strength

# GPU NOTE: Conversion to float32 to store them on the GPU!
X = theano.shared(train_X.astype('float32')) # initialized on the GPU
y = theano.shared(train_y_onehot.astype('float32'))

# GPU NOTE: Conversion to float32 to store them on the GPU!
W1 = theano.shared(np.random.randn(nn_input_dim, nn_hdim).astype('float32'), name='W1')
b1 = theano.shared(np.zeros(nn_hdim).astype('float32'), name='b1')
W2 = theano.shared(np.random.randn(nn_hdim, nn_output_dim).astype('float32'), name='W2')
b2 = theano.shared(np.zeros(nn_output_dim).astype('float32'), name='b2')

# Forward propagation
z1 = X.dot(W1) + b1
a1 = T.tanh(z1)
z2 = a1.dot(W2) + b2
y_hat = T.nnet.softmax(z2)

# The regularization term (optional)
loss_reg = 1./num_examples * reg_lambda/2 * (T.sum(T.sqr(W1)) + T.sum(T.sqr(W2)))
# the loss function we want to optimize
loss = T.nnet.categorical_crossentropy(y_hat, y).mean() + loss_reg
# Returns a class prediction
prediction = T.argmax(y_hat, axis=1)

# Note that we removed the input values because we will always use the same shared variable
# GPU NOTE: Removed the input values to avoid copying data to the GPU.
forward_prop = theano.function([], y_hat)
calculate_loss = theano.function([], loss)
predict = theano.function([], prediction)

# GPU NOTE: Removed the input values to avoid copying data to the GPU.
[],
# profile=True,
updates=((W2, W2 - epsilon * dW2),
(W1, W1 - epsilon * dW1),
(b2, b2 - epsilon * db2),
(b1, b1 - epsilon * db1)))

def build_model(num_passes=20000, print_loss=False):
# Re-Initialize the parameters to random values. We need to learn these.
np.random.seed(0)
# GPU NOTE: Conversion to float32 to store them on the GPU!
W1.set_value((np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)).astype('float32'))
b1.set_value(np.zeros(nn_hdim).astype('float32'))
W2.set_value((np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)).astype('float32'))
b2.set_value(np.zeros(nn_output_dim).astype('float32'))

# Gradient descent. For each batch...
for i in xrange(0, num_passes):
# This will update our parameters W2, b2, W1 and b1!

# Optionally print the loss.
# This is expensive because it uses the whole dataset, so we don't want to do it too often.
if print_loss and i % 1000 == 0:
print "Loss after iteration %i: %f" %(i, calculate_loss())

# Profiling
# theano.config.profile = True
# theano.config.profile_memory = True

%timeit gradient_step()

Get the code: The full code is available as an Jupyter/iPython Notebook on Github!

## 28 thoughts on “Speeding up your Neural Network with Theano and the GPU”

1. Abhishek Shivkumar says:

This is a more structured and comprehensive explanation than the Theano documentation itself. Thanks again.

2. Wang Zhiyang says:

I used the following two lines to truncate the data.

train_X = train_X.astype(np.int32)
train_y = train_y.astype(np.int32)

• Denny Britz says:

Thanks! I’ll update that in the code.

• Wang Zhiyang says:

These materials are amazing. I am looking forward to the LSTM part!

3. Phong says:

Hi,

I am not sure if Theano can utilize mutiple cpus in case there is no gpu device?

4. LI Bo says:

Hi,

Thank you for the instruction! However I’ve got an error reported when I was trying to executed your code (on CPU, I haven’t try the GPU one yet). Could you check the attached pics for details and help me with it? Thank you again!

• Denny Britz says:

Hm, I have never seen this before, sorry. I could be related to the OS or Theano version. Are you using Theano 0.7? And it seems like you’re on Windows, which could also be a reason, I haven’t tried running it on Windows. Maybe try reinstalling Theano…

• LI Bo says:

Hi,

Thank you for your reply. I have tried all this, but the result is the same. I even reinstalled Python, but the there is no change. After that I have looked carefully about the code, and I found that the error is because of this line of code:

gradient_step = theano.function([X, y], updates = ((W2, W2 – epsilon * dW2), (W1, W1 – epsilon * dW1), (b2, b2 – epsilon * db2)(b1, b1 – epsilon * db1)))

The system will report the error when running at the line above. However, if I change the line of code to:

gradient_step = theano.function([X, y], updates = ((W1, W1 – epsilon * dW1), (b1, b1 – epsilon * db1)))

It is fine then. So I was wandering maybe it is because my code contain some bugs. But I have checked it many times, and I still can’t understand why this is happening.

So could I send my source code to you, and you can run it on your own computer to find out where the bug is, if it will not cause you any trouble? Please tell me your email if you can help me.

Thank you again and Marry Christmas~~

• Denny Britz says:

Hm, the code on Github works on my machine. The first line of code you posted above has a comma missing (after db2). If that’s not the cause then unfortunately I have no idea….

• LI Bo says:

Finally, the problem is solved by reinstalled the g++.

I have noticed that you didn’t plot the decision boundary after implemented using GPU. And I have tried add the plot_decision_boundary() function after the code with GPU. But there are an error reported directing to “Z = predict(np.c_[xx.ravel(), yy.ravel()])” which in the plot_decision_boundary function. And the error type is “TypeError: Tried to provide value for implicit input: X”. Do you know what cause this error, and how can I fix it?

Thank you~

• Denny Britz says:

In the GPU code the X variable is stored as a shared variable on the GPU and not a graph input, so you can’t provide it as an input. I guess there are two ways: Either make X an input to the graph like in the CPU code, or manually set the value of the shared X variable in the plot function. Take a look at the Theano documentation to understand shared variables better if it’s not clear.

5. Riaan says:

Hey!
First off, nice tutorial!
Second, I have a question. I Have basically used this tutorial to create an object which I can pickle and predict values with. This basically means throwing everything in an object and replacing the softmax with sigmoid.

Now, for some reason I get a typeError when I try to use the forward propagation function with new data. I have tried to put the data in an array a numpy array and explicitly cast this data to a float32.

Do you have any idea how you can make new predictions?

regards,

Riaan

6. underspirit says:

Hey!
I’m doubt that if I use GPU-optimzied code,how can I plot the decision boundary,how to transfer X to predict function?

7. Hai says:

Great tutorial! I’ve followed both this one and the one that implements NN from scratch. When implementing NN from scratch, you get overfitting with nn_hdim=50. But with Theano, you use nn_hdim=100 (on github) and get good result. Can you please explain the reason? Thank you very much!

• Denny Britz says:

I think I never evaluated the Theano implementation on the test set (or plotted the result), but it should do exactly the same as the plain version. In the Thenao GPU code I use 5000 training examples instead of just 200, so that prevents the overfitting.

• Hai says:

Thanks for your reply! But in your github ( https://github.com/dennybritz/nn-theano/blob/master/nn-theano.ipynb ), you set “nn_hdim = 100 # hiden layer dimensionality”. Then you build and plot the result: “# Build a model with a 3-dimensional hidden layer

build_model(print_loss=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(x))

plt.title(“Decision Boundary for hidden layer size 3″)”

I assume the inconsistency between “nn_hdim=100” and “hidden layer size 3” is a typo. But anyway, you trained a NN with 100 hidden units and the result doesn’t seem overfitting.

• Denny Britz says:

Yep, you are right! It has been a while since I wrote this code so I don’t quite remember what I did ;) Running the Theano code and plain code side by side it seems like Theano is slower to converge and doesn’t overfit either. I briefly looked at the code and I didn’t see an immediate reason for why.. Unfortunately I don’t have time to look at it in detail right now. Some guesses are that it could be related to how regularization (and its gradients) are computed or how the gradients are applied (batch vs. single example, maybe Theano does something under the hood, or something with the learning rate), or what theano’s loss function does.

Intuitively the network definitely *should* overfit without any regularization and so little training data. If you have time to look into it and find out please let me know and I will fix it in the notebook/tutorial.

• Denny Britz says:

Quick update: I quickly tested a different learning rate, and it seems like with a high-enough learning rate the Theano code also overfits the same way. So my guess is that Theano’s cross entropy loss function or the gradient calculations divides by the number of examples in the batch (effectively lowering the learning rate since it’ll divide the gradients by a large number), and I don’t do that in the plain code.

• Hai says:

Thank you very much for your quick reply and nice explanation! I’ll also take a look at Theano’s code. I’m new to this field so it may take a while. Thanks again!!!

8. Aaron St John says:

Great tutorial! I want to re-ask a question that’s been alluded to below, but never really answered.

In the GPU version of this code, you remove the inputs to the ‘predict’ function because all of the training data gets pushed to the GPU as part of the optimization.

As a result, it’s not clear how to use the model once it’s been trained to predict results for new inputs. More specifically, the ‘plot_decision_boundary’ method no longer works as a method to view the output as it relies on predicting the model’s results from a random set of points.

Can you show an example of how to use the GPU trained model to predict a new input? Also, is there a version of ‘plot_decision_boundary’ that can use a model trained on the GPU?

Thanks!

• Denny Britz says:

Instead of giving the new data as an input you can set the shared X variable to that data. Something like X.set_value(…). When you use the predict function it will then use the new shared value instead of the training data. You can also modify the plot_decision_boundary to make use of that.

• Aaron St John says:

Thanks for the quick response, I managed to figure it out in the interim and zeroed in on exactly what you are suggesting here. Here’s my updated version of your plotting function for anyone that stumbles on this with a similar question and the image it produces after building the model.

https://gist.github.com/aaronstjohn/0dd049abdfb8bceab724f8932202eb24

NOTE: you just have to call ‘plot_decision_boundary_gpu(predict)’ (i.e., just pass the ‘predict’ function in directly) to generate the plot. There’s no need for a lambda function to pump in values since this new plot function just stuffs them into the shared ‘X’ directly.

Thanks again for this awesome series of articles

• Denny Britz says:

Great, that for sharing this! It’ll be helpful to a lot of people : )

• mTor says:

@denny and @aaron: I get:
X.set_value((np.c_[xx.ravel(), yy.ravel()]).astype(‘float32’))
AttributeError: ‘TensorVariable’ object has no attribute ‘set_value’
when using your suggestion above: what am I doing wrong?

9. Kasirat Kasfi says:

Hi Denny, thank you for the amazing post, really helped me to get an understanding of how Theano works. I am starting with Theano for the first time. I am trying to extend this to a simple sparse autoencoder. I have made some changes, but I can not figure out how to deal with the Nan values I am getting after some of the operations. Could you please help me out with it? Thanks a lot!