1  TPUs


1.0.1 Supercomputer for ML

Google designed Cloud TPUs as matrix processors focused on making neural network training and inference faster and more power efficient. The TPU is built for massive matrix processing: its systolic array architecture dedicates thousands of interconnected multiply-accumulators to the task. Cloud TPU v3, for example, contains two systolic arrays of 128 x 128 ALUs on a single processor. For workloads dominated by matrix multiplication, TPUs can deliver significant speed and efficiency gains.
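To make this concrete, here is a minimal JAX sketch (runnable once JAX is installed as described below) that jit-compiles a matrix multiply, the operation the systolic array is built for; the shapes and values are illustrative only:

import jax
import jax.numpy as jnp

# Two 128 x 128 operands -- matching the tile size of the v3 systolic array.
key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (128, 128))
b = jax.random.normal(key, (128, 128))

# jax.jit hands the matmul to XLA, which maps it onto the TPU's
# multiply-accumulate units.
matmul = jax.jit(jnp.dot)
print(matmul(a, b).shape)  # (128, 128)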

1.0.2 Getting started

Prerequisite: a Google Cloud project.

There are several ways to run the commands below.

* Vertex AI: if you're running notebooks from within Vertex Workbench, simply open a terminal within the notebook instance and run the commands.
* Local: download the gcloud SDK, or open a shell from within the Cloud console.
* Compute Engine VM: run the commands to set up a TPU VM.

1.0.3 Setting up the VM

First, run the following commands to enable the TPU API and set your user and project configuration:

gcloud services enable tpu.googleapis.com
gcloud config set account <your-email-account>
gcloud config set project <your-project>

1.0.4 Create the TPU VM

For more information, an extensive guide can be found in the Cloud TPU documentation.

gcloud compute tpus tpu-vm create <tpu-name> \
  --zone=<zone> \
  --accelerator-type=<accelerator-type> \
  --version=tpu-vm-v4-base

Note: For v2 and v3 configurations use the tpu-vm-base TPU software version. For v4 configurations use tpu-vm-v4-base. The correct version of libtpu.so is automatically installed when JAX is installed on the machine.

Since there is no JAX-specific TPU software version, we have to install JAX manually on the TPU VM.

To see a list of available TPU software versions (such as TensorFlow and PyTorch runtimes), replace <ZONE> with the zone of your project (e.g., us-central1-b) and run:

gcloud compute tpus tpu-vm versions list --zone <ZONE>

For all TPU types, the TPU version is followed by the number of TensorCores (e.g., 8, 32, 128). For example, --accelerator-type=v2-8 specifies a TPU v2 with 8 TensorCores, and v3-1024 specifies a TPU v3 with 1024 TensorCores (a slice of a v3 Pod).
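As a quick sanity check (a sketch, assuming an 8-TensorCore VM such as v2-8 or v3-8 with JAX installed as shown below), the device counts reported by JAX should match the TensorCore count in the accelerator type:

import jax

print(jax.device_count())        # total TensorCores in the slice, e.g. 8
print(jax.local_device_count())  # TensorCores attached to this host, e.g. 8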

1.0.5 Connecting to a TPU VM

From one of the options above (Workbench, local terminal, Compute Engine VM, etc.), adjust the VM name and zone placeholders and run:

gcloud compute tpus tpu-vm ssh <your-tpu-vm-name> --zone <your-zone>

1.0.6 Connecting to a TPU VM via local notebook

One of the most popular ways to connect is via a Jupyter Notebook, either on another VM or on a local machine. This means that rather than developing in notebooks and moving .py files to the TPU VM to run them, all experimentation in the notebook can benefit from the TPU directly.

1.0.6.1 Steps:

Set up the TPU VM as above, and connect from your local machine (or another VM) with a slightly different command (change the parameters within <...>):

gcloud compute tpus tpu-vm ssh <tpu_vm_name> --zone <zone>  -- -L 8888:localhost:8888

Once connected for the first time, install the JAX library:

pip install --upgrade 'jax[tpu]>0.3.0' \
  -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

Let’s check that JAX is working before going further. From the terminal connected to the VM, try the following lines of code (the number of devices will vary depending on configuration):

python3
>>> import jax
>>> num_devices = jax.device_count()
>>> device_type = jax.devices()[0].device_kind
>>> print(f"Using {num_devices} JAX devices of type {device_type}.")
Using 8 JAX devices of type Cloud TPU.
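As a further sanity check, here is a minimal pmap sketch (values illustrative; it assumes the multi-device configuration above) that runs one computation per TensorCore in parallel:

import jax
import jax.numpy as jnp

n = jax.device_count()                  # e.g. 8 on a v2-8 or v3-8 VM
xs = jnp.arange(n * 4.0).reshape(n, 4)  # one row per device

# jax.pmap replicates the function across all devices; the leading axis
# of the input must equal the device count.
ys = jax.pmap(lambda x: x * 2.0)(xs)
print(ys.shape)  # e.g. (8, 4) on an 8-device VM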

It should then be possible to launch a notebook running on the TPU via the usual command:

jupyter-lab

or

jupyter notebook

depending on whether you have JupyterLab or classic Jupyter Notebook installed. Now accessing localhost:8888 in a browser (use the link in the terminal that results from the commands above) should take you to the notebook environment.
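To confirm that the notebook kernel itself is using the TPU, a first cell might look like the following (a minimal sketch; output will vary with your configuration):

import jax
import jax.numpy as jnp

print(jax.default_backend())  # should print "tpu"
x = jnp.ones((1024, 1024))
print(jnp.dot(x, x).sum())    # computed on the TPU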
