I’ve been spending some time learning deep learning and tensorflow recently, and as part of that project I wanted to be able to train models using GPUs on EC2. This post contains some notes on what it took to get that working. As many people have commented, the environment setup is often the hardest part of getting a deep learning setup going, so hopefully this will be a useful reference to someone.
The end result of my work was a packer image definition that builds an AMI that boots Ubuntu on an EC2 g2.2xlarge with all the appropriate drivers, CUDA, and cuDNN installed, and a GPU-enabled tensorflow preinstalled into a virtualenv.

I’m not going to make my AMIs themselves public, since this is a side project and I don’t want anyone accidentally depending on my AWS account, but the packer script should Just Work if you packer build it with appropriate credentials.
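For what it’s worth, the build itself is an ordinary packer run; something like the following, with credentials supplied through the standard AWS environment variables (the template file name here is a placeholder):

```bash
# Credentials come from the standard AWS environment variables,
# which packer's amazon-ebs builder picks up automatically.
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...

# "gpu-tensorflow.json" is a placeholder for the actual template name.
packer build gpu-tensorflow.json
```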
I probably could have used someone else’s AMI and saved myself a lot of pain, but I have a near-obsessive need to understand how things work under the hood and Do It Myself.
The rest of this post covers some gotchas I learned that will hopefully help someone else.
Installing the kernel driver 🔗︎
Installing CUDA requires two components: a kernel driver, and the userspace libraries.
The CUDA installer includes a version of the kernel driver. However, the GRID card in the g2.2xlarge instances isn’t supported by the newest NVIDIA driver, so one has to install the driver separately. Amazon has documentation on finding a driver, but the NVIDIA search page it linked gave me a version that told me it was too new when I actually tried to install it. The installer’s error pointed me at the 367.xx series, which ended up working.
NVIDIA’s installers all seem to support automated install modes, but
they aren’t always easy to discover. --accept-license --no-questions --ui=none
was the secret here.
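As a sketch, assuming you’ve already downloaded a 367-series installer (the exact version in the file name is a placeholder), the headless install looks like:

```bash
# Run the NVIDIA driver installer non-interactively.
# The file name is a placeholder for whichever 367.xx build you downloaded.
sudo sh NVIDIA-Linux-x86_64-367.57.run --accept-license --no-questions --ui=none
```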
Building any kernel module requires kernel headers, and requires that
you build the module for the kernel you’re going to be running. On
Ubuntu you can install linux-headers-generic
and
linux-headers-virtual
to ensure you always have the latest
headers. However, if you’re not also running the latest kernel,
you’ll be out of luck.
In general (and this is the technique my packer build uses), I’ve had the best luck with the following sequence (sketched as shell commands just after this list):

- Installing linux-headers-virtual
- Upgrading the kernel (linux-image-virtual)
- Rebooting the machine (to run the latest kernel)
- Explicitly installing linux-headers-$(uname -r), to be sure you have the right headers
- Installing the kernel module
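Concretely, and glossing over how the reboot is sequenced inside the packer build, that works out to roughly:

```bash
# Headers meta-package, so headers track future kernel upgrades.
sudo apt-get update
sudo apt-get install -y linux-headers-virtual

# Upgrade to the latest virtual kernel image.
sudo apt-get install -y linux-image-virtual

# Reboot so that `uname -r` reports the kernel we just installed.
sudo reboot

# --- after the reboot ---

# Make sure the headers for the running kernel are actually present.
sudo apt-get install -y "linux-headers-$(uname -r)"

# Finally, install the NVIDIA kernel module (the driver installer shown earlier).
```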
If you ever upgrade kernels again, you’ll likely need to reinstall the driver. On ec2 I solve this by rebuilding the AMI if I need a new kernel.
One other gotcha is that, on EC2, you’ll also need the linux-image-extra-* package. It contains kernel drivers not normally needed in a virtualized environment, but those include the display drivers and their supporting subsystems, so the NVIDIA module will fail to load without it.
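On the images I was building, that amounted to one more package, keyed to the running kernel (the exact package name varies with the kernel flavor):

```bash
# Extra kernel modules (display subsystems and friends) that the
# virtual images leave out but the NVIDIA driver needs.
sudo apt-get install -y "linux-image-extra-$(uname -r)"
```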
Also, by default, once you’ve got the extra package installed, the system will try to load the upstream nouveau driver, which will grab the GPU before NVIDIA’s driver can. To prevent that, we need to install a blacklist file.
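The blacklist file itself is tiny. A minimal version looks something like this (the file name is arbitrary, as long as it ends up under /etc/modprobe.d/):

```bash
# Stop the in-tree nouveau driver from claiming the GPU before the
# NVIDIA module can load.
cat <<'EOF' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

# Rebuild the initramfs so the blacklist also applies early in boot.
sudo update-initramfs -u
```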
Installing CUDA 🔗︎
The CUDA installer can be downloaded from NVIDIA’s
website. Like the driver, the installer can be used in a headless
mode, but of course it has a different inscrutable set of options. I
found the combination I wanted was the oxymoronic --toolkit --silent --verbose: --toolkit tells it to install the libraries and development toolchain; --silent actually means “headless mode”; and --verbose makes it direct output to stdout as well as to a log file (so that it shows up in the packer output).
I also pass an explicit path for the toolkit, --toolkitpath=/usr/local/cuda-8.0. A CUDA upgrade broke me when NVIDIA changed the default path from /usr/local/cuda to /usr/local/cuda-8.0. I tried --toolkitpath=/usr/local/cuda, but the installer appears to blacklist that. So now I use cuda-8.0 explicitly, for future-proofing, and create a symlink.
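Putting those flags together, the install step ends up looking roughly like this (the runfile name is a placeholder for whichever CUDA 8.0 installer you download):

```bash
# Headless, toolkit-only install of CUDA 8.0 into an explicitly
# versioned prefix. The runfile name is a placeholder.
sudo sh cuda_8.0_linux.run \
  --toolkit --silent --verbose \
  --toolkitpath=/usr/local/cuda-8.0

# Keep the traditional unversioned path working.
sudo ln -sfn /usr/local/cuda-8.0 /usr/local/cuda
```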
Installing cuDNN 🔗︎
cuDNN requires an NVIDIA developer login, which is free but requires you to answer a lot of silly questions about what you’re going to use CUDA and cuDNN for.
Unfortunately, the download links don’t work unless you’re logged in, which is a problem for automated installation. I mirrored the cuDNN installer into an S3 bucket for automated fetching.
Once you’ve got it, the cuDNN installer works, yet again, differently from the previous installers: it’s a tarball that can be unpacked directly into /usr/local on top of your CUDA installation.
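So that step reduces to fetching the tarball from the mirror and untarring it over the CUDA prefix; the bucket and file names here are hypothetical:

```bash
# Fetch the mirrored cuDNN tarball (bucket and file name are hypothetical).
aws s3 cp s3://my-build-mirror/cudnn-8.0-linux-x64-v5.1.tgz /tmp/

# The tarball contains a cuda/ prefix (include/ and lib64/), so unpacking
# it in /usr/local lays the files on top of the CUDA install.
sudo tar -xzf /tmp/cudnn-8.0-linux-x64-v5.1.tgz -C /usr/local
```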
Installing tensorflow 🔗︎
Thanks to recent improvements in Python packaging, and some careful
work on Google’s part, installing tensorflow itself is the easy
part. If you’ve got a virtualenv with a recent pip, pip install tensorflow-gpu is good enough!
Per the
tensorflow docs
you can also provide a direct link to a whl
file hosted by Google,
so that you don’t depend on PyPI to find the package. I found this
necessary for my install to work in TravisCI, for whatever reason.
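A minimal sketch of this step, assuming virtualenv is already available (and leaving the exact whl URL to the tensorflow docs):

```bash
# Create an isolated environment and make sure pip is recent enough
# to understand manylinux wheels.
virtualenv ~/tf
~/tf/bin/pip install --upgrade pip

# The simple route: pull the GPU-enabled package from PyPI.
~/tf/bin/pip install tensorflow-gpu

# Alternatively, pass pip the direct whl URL from the tensorflow docs
# instead of the package name, to avoid depending on PyPI.
```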
Conclusion 🔗︎
NVIDIA makes available pre-built AMIs that probably would work, and there are various services working on providing easier hosted tensorflow environments. So, if you’re lucky, none of this will ever be relevant to you.
But various people still need to install these tools themselves, or even onto physical hardware. So hopefully this experience report will help someone. Let me know if any of this advice is useful!