Based on the famous check_blas.py script, I wrote this one to check that Theano can in fact use multiple cores:

import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'

import numpy
import theano
import theano.tensor as T


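# NOTE: the definitions of M, N, K, order and iters are missing from this post.
# The values below are placeholders, not the ones behind the quoted timings.
M = N = K = 2000
order = 'C'
iters = 10

a = theano.shared(numpy.ones((M, N), dtype=theano.config.floatX, order=order))
b = theano.shared(numpy.ones((N, K), dtype=theano.config.floatX, order=order))
c = theano.shared(numpy.ones((M, K), dtype=theano.config.floatX, order=order))
f = theano.function([], updates=[(c, 0.4 * c + .8 * T.dot(a, b))])

for i in range(iters):
    f()
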
Running this as python3 check_theano.py shows that 8 threads are being used. More importantly, the code runs approximately 9 times faster than without the os.environ settings, in which case only 1 core is used: 7.863s vs 71.292s on a single run.
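
The order of the statements in the script above matters: the thread-count variables are generally read when the BLAS/OpenMP runtime is loaded, so they have to be set before numpy or theano is imported. A minimal sketch of that pattern, where `set_thread_env` and `THREAD_VARS` are hypothetical names introduced only for illustration:

```python
import os

# The three variables set in the script above.
THREAD_VARS = ("MKL_NUM_THREADS", "GOTO_NUM_THREADS", "OMP_NUM_THREADS")


def set_thread_env(num_threads=None):
    """Set the thread-count variables; call this before importing numpy or theano."""
    if num_threads is None:
        # Assumption: fall back to the number of logical CPUs reported by the OS.
        num_threads = os.cpu_count() or 1
    for name in THREAD_VARS:
        os.environ[name] = str(num_threads)
    return num_threads


set_thread_env(8)
# import numpy, theano  # only after the variables are in place
```

Calling `set_thread_env(1)` instead gives the single-core baseline used for the comparison.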

So, I would expect that Keras now also uses multiple cores when calling fit (or predict, for that matter). However, this is not the case for the following code:

import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'

import numpy
from keras.models import Sequential
from keras.layers import Dense

coeffs = numpy.random.randn(100)

x = numpy.random.randn(100000, 100)
y = numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01

model = Sequential()
model.add(Dense(20, input_shape=(100,)))
model.add(Dense(1, input_shape=(20,)))
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model.fit(x, y, verbose=0, nb_epoch=10)

This script uses only 1 core with this output:

Using Theano backend.

Why does Keras's fit use only 1 core for the same setup? Is the check_blas.py script actually representative of neural network training calculations?


(venv3)herbert@machine:~/ $ python3 -c 'import numpy, theano, keras; print(numpy.__version__); print(theano.__version__); print(keras.__version__);'
(venv3)herbert@machine:~/ $


I created a Theano implementation of a simple MLP as well, which also does not run multi-core:

import os
os.environ['MKL_NUM_THREADS'] = '8'
os.environ['GOTO_NUM_THREADS'] = '8'
os.environ['OMP_NUM_THREADS'] = '8'
os.environ['THEANO_FLAGS'] = 'device=cpu,blas.ldflags=-lblas -lgfortran'

import numpy
import theano
import theano.tensor as T


coeffs = numpy.random.randn(100)
x = numpy.random.randn(100000, 100).astype(theano.config.floatX)
y = (numpy.dot(x, coeffs) + numpy.random.randn(100000) * 0.01).astype(theano.config.floatX).reshape(100000, 1)

x_shared = theano.shared(x)
y_shared = theano.shared(y)

x_tensor = T.matrix('x')
y_tensor = T.matrix('y')

W0_values = numpy.asarray(
    numpy.random.uniform(
        low=-numpy.sqrt(6. / 120),
        high=numpy.sqrt(6. / 120),
        size=(100, 20)
    ),
    dtype=theano.config.floatX
)
W0 = theano.shared(value=W0_values, name='W0', borrow=True)

b0_values = numpy.zeros((20,), dtype=theano.config.floatX)
b0 = theano.shared(value=b0_values, name='b0', borrow=True)

output0 = T.dot(x_tensor, W0) + b0

W1_values = numpy.asarray(
    numpy.random.uniform(
        low=-numpy.sqrt(6. / 120),
        high=numpy.sqrt(6. / 120),
        size=(20, 1)
    ),
    dtype=theano.config.floatX
)
W1 = theano.shared(value=W1_values, name='W1', borrow=True)

b1_values = numpy.zeros((1,), dtype=theano.config.floatX)
b1 = theano.shared(value=b1_values, name='b1', borrow=True)

output1 = T.dot(output0, W1) + b1

params = [W0, b0, W1, b1]
cost = ((output1 - y_tensor) ** 2).sum()

gradients = [T.grad(cost, param) for param in params]

learning_rate = 0.0000001

updates = [
    (param, param - learning_rate * gradient)
    for param, gradient in zip(params, gradients)
]

train_model = theano.function(
    inputs=[],  # x_tensor, y_tensor
    outputs=cost,
    updates=updates,
    givens={
        x_tensor: x_shared,
        y_tensor: y_shared
    }
)

errors = []
for i in range(1000):
    errors.append(train_model())

  • Does it work if you enable OpenMP in Theano? You can do this by adding openmp = True to the theano config.
    – Dr. Snoopy
    Apr 28, 2016 at 10:36
  • @MatiasValdenegro Thank you. You cannot see this in the scripts above, but I did try this and it did not help. However, it now seems that openmp_elemwise_minsize prevents multiple cores from being used. I need to experiment some more to understand this fully.
    – Herbert
    Apr 28, 2016 at 11:45
  • I was going to ask the same question. You are missing the link to the GitHub issue here, where it looks like you are actually able to use multiple cores (improving performance up to 4 threads). So now I am a bit lost, but in my installation I still only see one core being used, and the docs say that by default all the cores should be used.
    – rll
    Sep 2, 2016 at 16:53
  • No :( I did not unfortunately.
    – Herbert
    Oct 12, 2016 at 6:39
  • openmp_elemwise_minsize is the size below which the speedup from parallelization isn't worth the overhead. If you lower that threshold, you'll run code in parallel more often but it might not actually get faster.
    May 3, 2018 at 17:30
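
The two options discussed in the comments, `openmp` and `openmp_elemwise_minsize`, are Theano config options and can be passed through THEANO_FLAGS in the same way as the flags in the question. A minimal sketch, assuming a hypothetical helper `build_theano_flags` and an arbitrary threshold of 1000 (an illustration, not a recommended value); the question's `blas.ldflags` entry is left out for brevity. Whether this actually spreads the work over more cores is the point the comments leave open.

```python
import os


def build_theano_flags(**options):
    """Join keyword options into the comma-separated key=value THEANO_FLAGS format."""
    return ",".join(f"{key}={value}" for key, value in options.items())


flags = build_theano_flags(
    device="cpu",
    openmp=True,
    openmp_elemwise_minsize=1000,  # arbitrary illustrative threshold
)

# As with the thread-count variables, this has to happen before `import theano`.
os.environ["THEANO_FLAGS"] = flags
print(flags)
```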

Keras and TF by themselves don't use all the cores and the full capacity of the CPU! If you are interested in using 100% of your CPU, you can use multiprocessing.Pool, which basically creates a pool of jobs that need doing. The processes will pick up these jobs and run them. When a job is finished, the process will pick up another job from the pool.

NB: If you want to just speed up this model, look into GPUs or changing the hyperparameters like batch size and number of neurons (layer size).

Here's how you can use multiprocessing to train multiple models at the same time (using processes running in parallel on each separate CPU core of your machine).

This answer was inspired by @repploved.

import time
import signal
import multiprocessing

def init_worker():
    ''' Add KeyboardInterrupt exception to multiprocessing workers '''
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def train_model(layer_size):
    '''
    This code is parallelized and runs on each process
    It trains a model with different layer sizes (hyperparameters)
    It saves the model and returns the score (error)
    '''
    import keras
    from keras.models import Sequential
    from keras.layers import Dense

    print(f'Training a model with layer size {layer_size}')

    # build your model here
    model_RNN = Sequential()

    # fit the model (the bit that takes time!)
    model_RNN.fit(...)

    # lets demonstrate with a sleep timer (the duration here is arbitrary)
    time.sleep(5)

    # save trained model to a file
    model_RNN.save(...)

    # you can also return values eg. the eval score
    return model_RNN.evaluate(...)

num_workers = 4
hyperparams = [800, 960, 1100]

pool = multiprocessing.Pool(num_workers, init_worker)

scores = pool.map(train_model, hyperparams)

print(scores)

Output:

Training a model with layer size 800
Training a model with layer size 960
Training a model with layer size 1100
[{'size':960,'score':1.0}, {'size':800,'score':1.2}, {'size':1100,'score':0.7}]

This is easily demonstrated with a time.sleep in the code. You'll see that all 3 processes start the training job, and then they all finish at about the same time. If this were run in a single process, you'd have to wait for each job to finish before starting the next (yawn!).
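
The timing effect described here can be reproduced without Keras. The sketch below is not the answer's code: `fake_train`, `run_serial` and `run_parallel` are hypothetical helpers, the sleep stands in for model fitting, and the returned score is made up so that there is something to collect. The `if __name__ == "__main__":` guard is there because platforms that start workers by spawning a fresh interpreter (Windows, for example) re-import the main module in every worker.

```python
import multiprocessing
import time


def fake_train(layer_size, seconds=0.2):
    """Stand-in for a training job: sleeps instead of fitting a model."""
    time.sleep(seconds)
    # Made-up score, only so that the job returns something.
    return {"size": layer_size, "score": 1000.0 / layer_size}


def run_serial(sizes):
    """Run the jobs one after another and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [fake_train(size) for size in sizes]
    return results, time.perf_counter() - start


def run_parallel(sizes, num_workers=4):
    """Run the jobs in a process pool and return (results, elapsed seconds)."""
    start = time.perf_counter()
    with multiprocessing.Pool(num_workers) as pool:
        results = pool.map(fake_train, sizes)  # results keep the order of `sizes`
    return results, time.perf_counter() - start


if __name__ == "__main__":
    sizes = [800, 960, 1100]
    _, serial_time = run_serial(sizes)
    _, parallel_time = run_parallel(sizes)
    print(f"serial:   {serial_time:.2f}s")
    print(f"parallel: {parallel_time:.2f}s")
```

With three jobs, the serial run takes roughly three sleep periods, while the pooled run should take about one period plus the cost of starting the worker processes.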

  • Your claim that Keras and TF do not use all the cores and the full capacity of the CPU is just not true; it depends on the model size and how well it can be automatically parallelized. When I train large models on CPU, I can see TensorFlow using all available cores.
    – Dr. Snoopy
    Jul 9, 2019 at 8:54
  • When I check in Windows Task Manager, CPU usage never goes above 30%. This was also a problem for many users on SO.
    – Mario
    Jul 9, 2019 at 11:46

