### Building a Deep Learning instance on GCP

#### Everything from CLI

This is a complete guide to setting up a usable/practical Deep Learning virtual machine on GCP.

I started out by attempting to use the (very attractive-looking) Deep Learning VM images from Google. However, these had several problems:

• Older version of Python
• Rather bloated install
• … and the whole experience seemed to be spiraling away from what I was actually looking for.

The following details my approach for satisfying the following goals:

• Command-line creation/building of the VM (specifically a preemptible one with GPU)
• Reproducible scripting, so that subsequent restarts come back up cleanly
• Ability to make changes on a local machine that would get incorporated into the build

Here goes… (if you want to see what the overall build/boot process looks like, skip forward to the “The Overall Scheme” section below)

### Create a local working directory

This is so that images of software, etc, can be transported up ‘whole’: I didn’t want to give the GCP machine explicit credentials for in-house repos (paranoia, I know).

We’ll create a folder in here (called ../gcp/repobase in the scripts) that gets built out locally, and then zipped up into ../gcp/upload.tar.gz so that it can be sent up to the server in one copy operation.
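A minimal sketch of that packaging step (the folder names follow the scripts described below; what actually goes into repobase is up to your project):

```shell
# Create the local staging area that mirrors what the VM will receive
mkdir -p ../gcp/repobase

# ... copy code/model assets into ../gcp/repobase here ...

# Zip the whole staging folder into a single file for one copy operation
tar -czf ../gcp/upload.tar.gz -C ../gcp repobase
```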

### Create the start-up script

Here’s the kind of script that needs to be run by the GCP machine when it’s started. Because it stores a ~/INSTALLATION_DONE flag, it avoids redoing the installation work when you come to restart the server (after, for instance, it gets preempted or you stop the machine).

After doing the sudo apt install stuff (needed just once), the installation script also builds a venv Python virtual environment, rather than polluting the whole VM. This also only needs to be done once, and as more packages are needed, you can add them to this script (so that when building another machine they get installed on first start-up).

I’ve avoided including references to any requirements.txt files, since these tend to be less well maintained / too restrictive for active development (which is usually using current package versions).

Call this script ../gcp/3-on-start.bash:
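The exact contents will depend on your project; something along the following lines is a reasonable sketch (the package list and env38 naming match the rest of this post, but the Nvidia driver step is left as a placeholder):

```shell
# Write the start-up script into the local staging area; it gets attached
# to the VM later via --metadata-from-file startup-script=...
mkdir -p ../gcp
cat > ../gcp/3-on-start.bash <<'EOF'
#!/bin/bash
# Runs on the GCP machine at every boot (via the startup-script mechanism)

# Skip everything if a previous boot already completed the installation
if [ -f ~/INSTALLATION_DONE ]; then
  exit 0
fi

# One-off system-level installs
sudo apt update
sudo apt install -y python3-venv python3-dev build-essential
# ... Nvidia driver installation goes here (this is the 15-20min step) ...

# Build a venv rather than polluting the whole VM
python3 -m venv ~/env38
source ~/env38/bin/activate
pip install --upgrade pip
# Add further pip packages here as they become needed, e.g.:
# pip install jupyter

# Flag completion, so that subsequent boots skip all of the above
touch ~/INSTALLATION_DONE
EOF
chmod +x ../gcp/3-on-start.bash
```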

You will be able to check the results of running this start-up script by doing the following (within the running instance, when we’ve got it ready):
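For example, by tailing the startup-script service’s log, and looking for the completion flag:

```shell
# On the GCP machine: follow the start-up script's output live
sudo journalctl -u google-startup-scripts.service -f

# ... and check whether the installation has completed
ls -l ~/INSTALLATION_DONE
```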

### Create an ‘asset gatherer’

The purpose of this is to gather up the assets from your project that are required for the VM to serve whatever it’s going to serve.

Let’s call this ../gcp/2-gather-assets.bash (there is also a ../gcp/1-gather-bulk-assets.bash, which follows a very similar pattern but is only used once, since those assets are static, compared to the code+model assets that are added below).
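A sketch of what 2-gather-assets.bash might contain (the rsync lines are placeholders - point them at whatever your project actually needs):

```shell
#!/bin/bash
# ../gcp/2-gather-assets.bash : refresh the staging area, then re-pack it
# (run from within the gcp/ directory)
set -e

mkdir -p repobase

# Copy the current code+model assets into the staging area (placeholders):
# rsync -a ../some-project/src/    repobase/some-project/src/
# rsync -a ../some-project/models/ repobase/some-project/models/

# Pack everything into one archive, ready for a single upload
tar -czf upload.tar.gz repobase
```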

So this will leave us with an upload.tar.gz in our ../gcp directory, which we can upload whenever the GCP machine is ON, and simply tar -xzf upload.tar.gz on the GCP machine to get new code updates, etc, installed:
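Concretely, the refresh cycle looks something like this (with ${INSTANCE_NAME} set as in the ‘Important Commands’ section):

```shell
# Local machine: push the fresh tarball up (the instance must be ON)
gcloud compute scp ../gcp/upload.tar.gz ${INSTANCE_NAME}:~

# GCP machine: unpack the new code/assets over the old ones
tar -xzf upload.tar.gz
```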

And in a separate tab on the local machine (for convenience):
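For instance, keeping a login session to the instance open:

```shell
# Handy to keep open in its own terminal tab
gcloud compute ssh ${INSTANCE_NAME}
```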

### Now Create a Compute Instance

The following (long) command creates a new VM. You’ll need to choose the zone appropriately (make sure you have GPU quota available!). Of course, your project name will be different (as may be some of the choices of GPU, etc).

The key points to note below are:

• Includes a preemptible Tesla T4
• Chooses a recent Ubuntu VM image which has Python 3.8.10
• Includes a decent-sized boot disk (50GB)
• Points to an ‘on-start’ script that we can update easily

NB: It would be a good idea to create the startup-script, and the assets .tar.gz ahead of time, at least in prototype form (as above).
OTOH, doing these steps out-of-order shouldn’t be a deal-breaker, since the process is designed to be run/re-run as required.

These commands are run locally to fix the project/instance we’re focussed on:
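For example (the project id and instance name here are placeholders - substitute your own):

```shell
# Placeholders - use your own project id and preferred instance name
PROJECT_NAME="my-dl-project"
INSTANCE_NAME="dl-vm-t4"

# Fix the project for all subsequent gcloud commands in this shell
gcloud config set project ${PROJECT_NAME}
```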

And then run these to actually create the instance:
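A sketch of that (long) command - the zone is just an example, and flags like the image family will need checking against what’s currently available:

```shell
gcloud compute instances create ${INSTANCE_NAME} \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --preemptible \
  --maintenance-policy=TERMINATE \
  --image-family=ubuntu-2004-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=50GB \
  --metadata-from-file=startup-script=../gcp/3-on-start.bash
```

Note that --maintenance-policy=TERMINATE is required for GPU instances, and --metadata-from-file is what wires in the ‘on-start’ script from the local directory.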

FWIW, the GCP VM pricing estimate is as follows (though the numbers below only apply while it’s switched ON, apart from the persistent disk):

The n1-standard-8 choice may be a bit of overkill for a Deep Learning instance, though (compared to scaling up the GPU side) it’s a relatively low incremental cost for the additional room / cores provided (particularly since, in production, as much of the compute as possible will be moved to CPUs).

### Important Commands

Important command palette (for quick reference).
Execute these on your local machine, once the gcloud config set project ${PROJECT_NAME} and INSTANCE_NAME things are set as above:

### The Overall Scheme

NB: The first time this is run, there won’t be a startup-script present. So the sequence is:

• Run the gcloud compute instances create command (above)
• The new machine starts up (can check this on the GCP ‘console’ panel)
• Do gcloud compute scp upload.tar.gz ${INSTANCE_NAME}:~, which puts the upload.tar.gz file onto the host
• gcloud compute ssh into the running instance to check it’s really there
• And expand out the files from the upload (on the GCP machine):
• tar -xzf upload.tar.gz
• cp ./repobase/some-longish-path/gcp/* .
• Restart the GCP machine
• This should now run the ~/3-on-start.bash file for the first time
• installs the Nvidia drivers (takes 15-20mins)
• sets up ~/env38 etc
• You can monitor this start-up process using sudo journalctl -u google-startup-scripts.service -f on the GCP machine
• Once it’s complete, a ~/INSTALLATION_DONE file will appear in your home directory
• Now the GCP machine can be started/stopped relatively easily
• You can chain additional start-up scripts at the end of this file
• For instance in ./4-next-steps.bash
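The chaining itself can be as simple as a guarded hand-off at the end of 3-on-start.bash (a sketch; 4-next-steps.bash being whatever your project needs next):

```shell
# At the end of ~/3-on-start.bash: hand off to any follow-on script
if [ -f ./4-next-steps.bash ]; then
  bash ./4-next-steps.bash
fi
```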

All done!