Run a Stable Diffusion Model on Mac M1/M2 in under 3 minutes

blogging
jupyter
Stable Diffusion
Macbook
Author

Kashish Mukheja

Published

Wednesday, 19 July 2023

What’s happening here??

So readers, I don’t know why I chose to run a Stable Diffusion model on my MacBook Air M2 when there are already plenty of cloud-based GPU options available, and when I was very consciously aware that it would be a demanding process. Anywho, now that I’ve decided to run the model on my Mac, I’m jotting down how, after spending ~3 hours, I was able to run a pre-trained Stable Diffusion model in under 2 minutes on my M2 MacBook Air. As always, let’s start with installing and importing the necessary libraries.

Code
!pip install -Uq diffusers

Segue: The above command installs and upgrades the diffusers library through pip:

-U: The -U flag stands for --upgrade, which means that if any of the specified packages are already installed, pip will upgrade them to the latest version.

-q: The -q flag stands for --quiet, which makes the installation less verbose by suppressing unnecessary output; only essential progress information is displayed.

Code
from diffusers import StableDiffusionPipeline
import torch
import logging

# Silence library warnings to keep the notebook output clean
logging.disable(logging.WARNING)

As already stated, I’m using a Mac M2 to run the Stable Diffusion model, so it is important that we assign the device to mps. The MPS device enables high-performance training on the GPU for macOS devices with the Metal programming framework. Learn more about it in the official PyTorch docs.

Code
device = "mps" if torch.backends.mps.is_built() else "cpu"
torch_dtype = torch.float16 if device == "mps" else torch.float32

In PyTorch, the torch.float16 dtype stores tensor data in the 16-bit floating-point format, also known as “half precision” or “float16”. Half-precision floating-point numbers occupy 16 bits of memory, half the size of the standard 32-bit single-precision format (torch.float32). The primary purpose of using torch.float16 is to reduce the memory footprint and improve computation speed, especially when running deep learning models on hardware with accelerated float16 support.

Note: I also tried running without torch.float16; however, that made the pipeline execution extremely slow.
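
As a quick illustration (not part of the pipeline itself), you can check the per-element memory of the two dtypes directly in PyTorch:

Code
import torch

# Illustration only: half precision uses half the memory per element
x32 = torch.ones(1000, dtype=torch.float32)
x16 = torch.ones(1000, dtype=torch.float16)
print(x32.element_size())  # 4 bytes per float32 element
print(x16.element_size())  # 2 bytes per float16 element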

Code
print(f'device: {device}, torch_dtype: {torch_dtype}')
device: mps, torch_dtype: torch.float16

The StableDiffusion Pipeline..

Stable Diffusion has something called a Pipeline[1]. If you’re familiar with fast.ai, this is similar to what we call a fastai Learner. The pipeline basically contains all the models, processing, inference, etc. One can save the pipeline to the Hugging Face cloud (also called the Hub). Learn more about the diffusion inference pipeline[2].

One just needs to provide the pre-trained model as a repo id present on the Hugging Face Hub, or a path to a directory containing the pipeline weights, then train it further and generate images of your choice. You can also save your own pipeline to the Hub for other people to use.

Segue

[1]. Many Hugging Face libraries (along with other libraries such as scikit-learn) use the concept of a “pipeline” to indicate a sequence of steps that when combined complete some task.

[2]. Inference means using the model to generate output (i.e., images here), as opposed to training (or fine-tuning) models on new data.
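
For completeness, here is a minimal sketch of saving a pipeline locally or pushing it to the Hub. It assumes the pipe object created in the code cell below and a hypothetical repo id, and pushing requires being logged in to Hugging Face:

Code
# Sketch only: run after the pipeline below has been created
pipe.save_pretrained("my-stable-diffusion-pipeline")   # save weights + configs to a local folder
pipe.push_to_hub("your-username/my-stable-diffusion")  # hypothetical repo id; needs `huggingface-cli login`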

Here, we will be using the CompVis/stable-diffusion-v1-4 model to generate our image from a text prompt.

Code
model_id = "CompVis/stable-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch_dtype)
pipe = pipe.to(device)

What else did I try..

From the limited yet varied set of articles I read, I tried running the pipeline by passing several different arguments to the StableDiffusionPipeline.from_pretrained(..) function. However, the version above worked best for me. Some of the variants I tried include:

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=True,
    revision="fp16", 
    torch_dtype=torch.float16,
    safety_checker=None
).to(torch.device("mps"))
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=True,
    safety_checker=None
).to(torch.device("mps"))
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=True
).to(torch.device("mps"))

All three of the above resulted in the pipeline taking ~35 minutes to generate an image (shown in the next code cell). With a bit of trial and error, I was able to bring the time down from over 30 minutes to under 2 minutes.

It is recommended in multiple places, including the official Hugging Face docs, that if your GPU does not have enough memory to run the pipeline, you should call pipe.enable_attention_slicing().

As described in the docs: “When this option is enabled, the attention module will split the input tensor in slices, to compute attention in several steps. This is useful to save some memory in exchange for a small speed decrease.”

BUT WAITTT!! Running the above command before executing the pipe operation to generate image(s) resulted in a completely black image. I tried various prompts and seeds, but it was never successful on my machine. Hence, I didn’t use this command. If anyone is able to generate an image with attention slicing enabled, please feel free to let me know how you achieved it :)
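
For reference, this is how one would toggle attention slicing on the pipeline; I ultimately left it disabled for the reason above:

Code
# For reference only: attention slicing trades a little speed for lower memory.
# On my machine it produced all-black images, so I left it disabled.
pipe.enable_attention_slicing()
pipe.disable_attention_slicing()  # revert to the default behaviour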

Code
pipe.device
device(type='mps', index=0)
Code
torch.manual_seed(1024)
prompt = ["European interior design room"]
images = pipe(prompt, num_images_per_prompt=1)[0]
RuntimeError: MPS backend out of memory (MPS allocated: 8.29 GB, other allocations: 664.11 MB, max allowed: 9.07 GB). Tried to allocate 512.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
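
If you hit the MPS out-of-memory error shown above, the error message itself suggests one workaround: relaxing the MPS memory watermark via an environment variable. The sketch below simply follows that suggestion; it must be set before PyTorch is imported, and as the message warns, removing the cap may cause system instability, so use it with care.

Code
# Sketch of the workaround suggested by the error message above.
# Set this BEFORE importing torch (e.g. in the very first notebook cell).
# Warning: removing the memory cap can destabilise the system.
import os
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"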
Code
display(images[0])

As you can see, it took less than 3 minutes to generate the image. Albeit, the best I have reached so far is ~1:30 minutes, with most of my applications closed. I know folks have been able to generate images in under 30 seconds on Mac M1/M2 chips, but for me, this is the best I could achieve as of today. In my opinion, this machine is the lowest-end version of the MacBook Air M2, hence the trade-off. I will keep updating this blog post as and when I am able to optimise it further.

Hope the above walk-through helps someone who’s struggling to run pre-trained Stable Diffusion models on their MacBook.

Ahh and yes, don’t forget to upvote this blog if even a tiny ounce of it helped you make some progress. As they always say, an upvote a blog keeps the blogger happy and prompt :)

Thankyouuuusss
