How to run the Llama 2 AI model on your MacBook

Great news, everyone! Meta has just released Llama 2, its next-generation large language model. Most importantly, Meta made the model openly available for both research and commercial use! Llama 2 promises to drive innovation in AI applications, making this a great opportunity to get on board and start building our own applications with this powerful technology.

If you would love to start building your own AI tools but don’t know where to start, fret not! In this guide, I will walk you through the steps to run the Llama 2 AI model on your local MacBook laptop, enabling you to harness its power right from the comfort of your own device.

Prerequisites

Before you begin, ensure that your MacBook meets the minimum requirements for running Llama 2:

MacBook Pro with M1 chip or better
At least 8GB of RAM (16GB or higher is recommended for smoother performance).
macOS 13.4 (Ventura) or later installed.
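
If you'd rather not check these by hand, here is a minimal Python sketch of the same checklist. It is only an illustration: the function names are my own, it assumes that Apple silicon reports its architecture as arm64, and it assumes that os.sysconf exposes SC_PHYS_PAGES on your system. You can run it with any Python 3 interpreter (for example, the one we install in the next section).

```python
import os
import platform


def parse_version(version):
    """Turn a version string such as '13.4.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def check_requirements(machine, macos_version, ram_gb):
    """Return a list of problems; an empty list means the machine qualifies."""
    problems = []
    if machine != "arm64":  # assumption: M1 and newer chips report 'arm64'
        problems.append(f"expected an Apple silicon (arm64) CPU, found {machine!r}")
    if parse_version(macos_version) < (13, 4):
        problems.append(f"macOS 13.4 or later is required, found {macos_version or 'unknown'}")
    if ram_gb < 8:
        problems.append(f"at least 8 GB of RAM is required, found {ram_gb:.0f} GB")
    return problems


if __name__ == "__main__":
    # Total physical memory = page size * number of physical pages.
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    issues = check_requirements(platform.machine(), platform.mac_ver()[0], ram_bytes / 1024**3)
    print("\n".join(issues) if issues else "This machine meets the requirements.")
```

The check itself is a plain function that takes the three values as arguments, so you can also call it with numbers you looked up in About This Mac.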

Prior experience with tools such as Terminal and command line interfaces is helpful, but the guide explains everything step by step, so you'll still be able to follow along. Coding skills are not required, but they will certainly help!

Setting Up Python Environment

To run Llama 2, you’ll need Python and some essential libraries.

First, we will install command line tools (if you haven’t already).

Open Terminal and enter the following command to install the Xcode Command Line Tools:

% xcode-select --install

xcode-select is a command line utility for managing the developer tools that come with Xcode, Apple's IDE (integrated development environment). It lets your system know where the developer tools are located. Its --install option installs the Command Line Tools package, which includes the compilers and system files needed to build the packages you'll need.

In this case, we need it as it is required by Homebrew, a package manager we will use to install the packages needed to run Llama locally.

Let's install it next. Paste the following into your terminal and press return:

% /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Homebrew, as stated on its home page, "installs the stuff you need that Apple (or your Linux system) didn't". Such as Python!

We will go with Python 3.11:

% brew install pkgconfig cmake python@3.11

Once it's complete, take note of where Python was installed. The installer should print the destination in the terminal. It's usually a path that looks like this:

/opt/homebrew/bin/python3.11

We will need this path to create a virtual environment to install Python dependencies.

To do so, we’ll use a Python module called venv, which allows you to create isolated and self-contained environments for your Python projects.

When working on Python projects, especially larger ones, or when collaborating with others, using virtual environments is considered a best practice. Any Python packages installed within the virtual environment won't affect the system-wide Python installation or other projects' dependencies. It also helps keep your system clean and clutter-free: instead of being installed globally, all dependencies are neatly stored within the virtual environment's directory.

In your terminal, navigate to (or create) a directory for our project. Then, inside that directory, run the following command to create a virtual environment, which we'll name venv311:

% /opt/homebrew/bin/python3.11 -m venv venv311

Then activate it:

% source venv311/bin/activate

Once activated, you'll notice that the name of the virtual environment appears at the start of your command prompt.
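
If you want to double-check from Python itself, here is a small sketch (the function name is my own). It relies on the fact that, inside an environment created with venv, sys.prefix points at the environment while sys.base_prefix still points at the Python installation it was created from.

```python
import sys


def in_virtualenv():
    """Inside a venv, sys.prefix differs from sys.base_prefix."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    # With venv311 active, this path should sit inside the venv311 directory.
    print("interpreter:", sys.executable)
```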

Then we’ll install Python dependencies needed to run the AI model:

% pip3 install --pre sentencepiece torch torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cpu
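
To confirm that the install worked, you can ask Python whether it can find the packages. This is just a sketch with a helper name of my own. It only checks that each package can be located, not that it works correctly.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names that Python cannot locate in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["sentencepiece", "torch", "torchvision"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All three packages were found.")
```

Run it with the virtual environment active, otherwise it will look in the wrong Python installation.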

Setting up Llama.cpp

To run Llama 2 on your laptop, we'll need a special ingredient: Llama.cpp. Llama.cpp is a port of Llama's inference code written in C/C++, which makes it possible to run AI models using only the CPU and RAM. It is small, optimized, and can run decent-sized models pretty fast! (for a CPU…)

Clone the code from GitHub and enter its directory:

% git clone https://github.com/ggerganov/llama.cpp
% cd llama.cpp

Next, build Llama.cpp using make. In your terminal:

% make

Downloading Llama 2 Model

It's time to get our AI model! Head to https://ai.meta.com/llama to download it. You'll have to fill in and submit a form. Once your request is approved, you'll receive an e-mail with instructions on how to proceed.

Llama 2 comes in different sizes: 7B, 13B, and 70B. The number indicates how many billions of parameters the model has. The more parameters, the larger the file size and the more RAM needed to run it smoothly. Because of this, I recommend starting with the smallest model, then exploring the larger ones if your system allows it! The 7B model may be the smallest, but it still delivers impressive performance for its size.
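
To get a feel for the numbers, here is a back-of-the-envelope sketch. It assumes the model size is roughly the parameter count times the bits stored per weight, and it ignores the extra memory needed at runtime. The 4.5 bits per weight figure is my approximation for the 4-bit quantisation used later in this guide (4-bit values plus a small per-block scale), and the parameter counts are the nominal ones.

```python
def estimate_size_gb(params_billions, bits_per_weight):
    """Approximate size in GB (10^9 bytes): parameters x bits per weight / 8."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    for size in (7, 13, 70):
        f16 = estimate_size_gb(size, 16)   # 16-bit floating point weights
        q4 = estimate_size_gb(size, 4.5)   # assumption: ~4.5 bits per weight
        print(f"{size}B: ~{f16:.1f} GB at 16-bit, ~{q4:.1f} GB at about 4.5 bits per weight")
```

By this estimate the 7B model is the only one that leaves comfortable headroom on a 16GB machine at 16-bit precision, which is one more reason to start small.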

The model also comes in two flavors: pretrained, and fine-tuned for chat use cases. We will choose the latter, as it provides the best ChatGPT-like experience out of the box.

When following the instructions provided by Meta, download the llama-2-7b-chat model.

If you encounter problems when obtaining the files due to missing wget or md5sum packages, you might have to install them using brew:

% brew install md5sha1sum wget
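
If you're curious what the checksum step does, or want to re-check a download later, here is a rough Python equivalent. It is a sketch under one assumption: that the checklist files use the usual md5sum layout of one "hash, whitespace, file name" entry per line. The function names are my own.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large model files don't have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checklist(checklist_path):
    """Return the names of files that are missing or whose MD5 does not match."""
    checklist = Path(checklist_path)
    mismatched = []
    for line in checklist.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary-mode entries with '*'
        target = checklist.parent / name  # names are relative to the checklist
        if not target.is_file() or md5_of_file(target) != expected.lower():
            mismatched.append(name)
    return mismatched
```

Calling verify_checklist with the path to a checklist file returns an empty list when every file listed in it matches.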

Once you have all the files, move them to the models folder inside the llama.cpp directory.

The models folder should contain the following files and folders:

llama-2-7b-chat
tokenizer_checklist.chk
tokenizer.model
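
A quick way to confirm everything is in place is a few lines of Python. This is only a sketch: the helper name is my own, the checklist file is assumed to carry the .chk extension, and the script only checks that the entries exist, not what is inside them.

```python
from pathlib import Path

# The entries the models folder should contain after the download.
EXPECTED_ENTRIES = ("llama-2-7b-chat", "tokenizer_checklist.chk", "tokenizer.model")


def missing_entries(models_dir, expected=EXPECTED_ENTRIES):
    """Return the expected files or folders that are not present in models_dir."""
    root = Path(models_dir)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    # Run this from inside the llama.cpp directory.
    missing = missing_entries("models")
    print("Missing: " + ", ".join(missing) if missing else "The models folder looks complete.")
```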

Preparing Data

Llama.cpp requires some conversion done to the models before they can be run.

First, let's install any additional dependencies that might be needed. In the terminal, navigate back to the llama.cpp directory and paste:

% python3 -m pip install -r requirements.txt

This will install any dependencies listed by the maintainers of Llama.cpp in the requirements.txt file.

Next, we will convert the model from the pth format (used by models created with PyTorch) to the ggml format using the provided script. The trailing 1 tells the script to write 16-bit floating point (f16) weights:

% python3 convert-pth-to-ggml.py models/llama-2-7b-chat/ 1

Next, quantise the model to 4 bits using the q4_0 method:

% ./quantize ./models/llama-2-7b-chat/ggml-model-f16.bin ./models/llama-2-7b-chat/ggml-model-q4_0.bin q4_0
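
If you're wondering what quantising actually does, here is a toy Python sketch of the general idea: take a block of weights, store one shared scale for the block, and squeeze each weight into a 4-bit integer. To be clear, this is a simplified illustration of my own and not llama.cpp's actual q4_0 code, whose block size and storage layout differ.

```python
def quantize_block(values):
    """Compress a block of floats into 4-bit integers (0..15) plus one shared scale."""
    largest = max(abs(v) for v in values)
    scale = largest / 7 if largest else 1.0  # map the largest weight to +/-7
    # Round each weight to the nearest step, then shift by 8 to make it unsigned.
    quants = [min(15, max(0, round(v / scale) + 8)) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the 4-bit integers."""
    return [(q - 8) * scale for q in quants]


if __name__ == "__main__":
    weights = [0.12, -0.53, 0.98, -0.07]
    scale, quants = quantize_block(weights)
    print("4-bit values:", quants)
    print("restored:", [round(w, 3) for w in dequantize_block(scale, quants)])
```

The restored values are close to the originals but not identical. That small loss of precision is the price paid for a model that is a fraction of the 16-bit size.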

And boom! The installation is complete, and you can try out your own private instance of the Llama 2 AI model!

Begin the Adventure!

To run it, in your terminal, paste:

% ./main -m ./models/llama-2-7b-chat/ggml-model-q4_0.bin -n 256 --mlock --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

Voila!

To better understand the launch command and how to optimize the AI's performance on your local machine, I recommend reading the llama.cpp documentation, which includes a detailed description of each flag and parameter used here!

For now, I used a sample prompt that ships with llama.cpp to better illustrate what's possible with our AI.

I also included the --mlock flag, which keeps the entire model locked in RAM so it isn't swapped out to disk, resulting in much better performance.
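
If you plan to experiment with these settings, it can be handy to assemble the launch command in Python rather than editing one long line by hand. Here is a minimal sketch: the helper name and its parameters are my own invention, and it only uses the flags from the command above, with comments reflecting my reading of them.

```python
import shlex


def build_command(model_path, prompt_file, n_predict=256, reverse_prompt="User:", mlock=True):
    """Assemble the ./main command line used in this guide as a list of arguments."""
    command = [
        "./main",
        "-m", model_path,            # path to the quantised model file
        "-n", str(n_predict),        # number of tokens to generate
        "--repeat_penalty", "1.0",   # 1.0 means repetition is not penalised
        "--color",                   # colourise the output
        "-i",                        # interactive mode
        "-r", reverse_prompt,        # hand control back to you at this string
        "-f", prompt_file,           # file containing the starting prompt
    ]
    if mlock:
        command.append("--mlock")    # keep the model locked in RAM
    return command


if __name__ == "__main__":
    cmd = build_command("./models/llama-2-7b-chat/ggml-model-q4_0.bin", "prompts/chat-with-bob.txt")
    # Print the command so it can be pasted into the terminal from the llama.cpp directory.
    print(shlex.join(cmd))
```

Because the command is a list, you could also hand it to subprocess.run to start the chat directly from a script.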

So there you go, a fully functional ChatGPT-like chat running on your own machine! Let me know what you think and how you will use the AI now that you have it.