This write-up is for laptops with ‘discrete’ NVidia graphics cards, and makes use of the Bumblebee NVidia installation.

For more typical desktop instructions, see the write-up here.

Bumblebee RPMs

The basic RPM installation is as before, in the previous write-up.


The CUDA installation should be done from the Nvidia downloads site, as usual.

The end result should be the standard tree of folders :: /usr/local/cuda/{include,bin}.

Source fixes

Now, as root, fix up Nvidia's check that disallows gcc versions greater than v4.9.

In the file /usr/local/cuda/include/host_config.h, make the following replacement :

// #if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ > 9)  // Old version commented out
// This is the updated line, which guards against gcc > 5.1.x instead
#if __GNUC__ > 5 || (__GNUC__ == 5 && __GNUC_MINOR__ > 1)
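If you'd rather script the replacement than edit by hand, a sed one-liner along these lines does the job. It is shown here applied to a scratch copy containing just the guard line, so nothing real is touched; on the real system the target would be /usr/local/cuda/include/host_config.h :

```shell
# The same edit, scripted with sed -- demonstrated on a scratch copy
hdr=$(mktemp)
printf '#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ > 9)\n' > "$hdr"
# Rewrite the gcc 4.9 guard into a gcc 5.1 guard ('&' must be escaped in the replacement)
sed -i 's/__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ > 9)/__GNUC__ > 5 || (__GNUC__ == 5 \&\& __GNUC_MINOR__ > 1)/' "$hdr"
cat "$hdr"    # -> #if __GNUC__ > 5 || (__GNUC__ == 5 && __GNUC_MINOR__ > 1)
```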

Now, to set the right path for nvcc, in the user’s ~/.bash_profile add ::

export PATH=$PATH:/usr/local/cuda/bin
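Since ~/.bash_profile gets sourced on every login, a guarded version of the append keeps PATH from accumulating duplicate entries :

```shell
# Append /usr/local/cuda/bin to PATH only if it is not already present
case ":$PATH:" in
  *":/usr/local/cuda/bin:"*) ;;                     # already there - nothing to do
  *) export PATH="$PATH:/usr/local/cuda/bin" ;;
esac
```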

Test the installation

Check that the kernel modules are there :

sudo optirun lsmod | grep nv
nvidia_uvm             69632  0 
nvidia               8380416  28 nvidia_uvm
drm                   331776  7 i915,drm_kms_helper,nvidia
# NB: With no 'optirun'
sudo lsmod | grep nv
## -Nothing-

Looking good:

/usr/local/cuda/bin/nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Thu_Jul_17_21:41:27_CDT_2014
Cuda compilation tools, release 6.5, V6.5.12

This works better than the previous desktop …

optirun nvidia-smi -L
GPU 0: GeForce GT 750M (UUID: GPU-9cabfc96-3f6e-889d-29c5-57057738f794)

(and without the optirun) :

nvidia-smi -L
NVIDIA-SMI couldn't find library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains to your system PATH.

Installation of libgpuarray

Install the bleeding-edge libgpuarray into your virtualenv - first compile the .so and .a libraries, and put them in a sensible place :

. env/bin/activate
cd env
git clone
cd libgpuarray
mkdir Build
cd Build
cmake .. \
  -DCUDA_CUDA_LIBRARY=/usr/lib64/nvidia-bumblebee/ \
  -DCUDA_INCLUDE_DIRS=/usr/local/cuda/include \
  -DOPENCL_LIBRARIES=/usr/lib64/nvidia-bumblebee/
make
sudo make install

This will likely complain about not finding clBLAS, which isn’t a problem here. However, if you know you will require clBLAS in the future (and this is for advanced/experimental users only), see my OpenCL post, since clBLAS needs to be installed before running the cmake above.

Next, install the Python component (after going into the same virtualenv) :

cd env/libgpuarray/
python setup.py build
python setup.py install

And then test it from within a regular user directory (using the same virtualenv) :

optirun python
import pygpu
pygpu.init('cuda0')

A good result is something along the lines of :

<pygpu.gpuarray.GpuContext object at 0x7f1547e79550>
## Errors seen, for other device strings passed to init :
#(A) 'cuda'      :: 
##  pygpu.gpuarray.GpuArrayException: API not initialized = WEIRD

#(B) 'cuda0'     :: 
##  pygpu.gpuarray.GpuArrayException: No CUDA devices available = GO BACK...

#(C) 'opencl0:0' :: 
##  RuntimeError: Unsupported kind: opencl (if OpenCL library not found)

Theano stuff

Store the following to a file :

from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print f.maker.fgraph.toposort()
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'

And then run, successively :

THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=cpu  optirun  python 
""" output is ::
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 5.44066691399 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761  1.62323284]
Used the cpu


THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=gpu  optirun  python 
""" output is ::
Using gpu device 0: GeForce GT 750M
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 1.06558203697 seconds
Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
Used the gpu


THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=cuda0  optirun  python
""" output is ::

Check on the usage of GPU / BLAS

TP=`python -c "import os, theano; print os.path.dirname(theano.__file__)"`
THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=gpu optirun python ${TP}/misc/

Total execution time: 9.38s on CPU (with direct Theano binding to blas).
Total execution time: 0.44s on GPU.
## GPU : 0.44s (GeForce GT 750M)
## CPU : 9.38s (i5-4200U CPU @ 1.60GHz)
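For a sense of scale, the two timings above work out to roughly a 21x speedup for the GPU over the BLAS-linked CPU run :

```shell
# Speedup implied by the timings above (CPU seconds / GPU seconds)
awk 'BEGIN { printf "%.1fx speedup\n", 9.38 / 0.44 }'
# -> 21.3x speedup
```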

Martin Andrews

{Finance, Software, AI} entrepreneur, living in Singapore with my family.
