At a Glance
- Velero is the most popular free backup solution for Kubernetes, with nearly 100% market share
- Velero offers flexible configuration options, but requires substantial domain knowledge for proper configuration and maintenance.
This blog post is based on the result of our internal Kubernetes backup product research.
Afi.ai itself develops a Kubernetes backup platform (see the product page if you're interested). We attempt to remain as impartial and objective as possible, and we hope that you will find this review useful when evaluating your options.
Kubernetes users may believe that they don’t need backups because they run highly available clusters and everything is deployed from files in git repos and Terraform scripts. However, high availability doesn’t guarantee that Kubernetes applications can be recovered and made fully operational if the configuration or underlying data is modified, corrupted, or lost due to user errors or cybersecurity incidents.
Kubernetes-native backup options can help recover and minimize downtime from misconfigurations, malware and system failures. Velero – with over 20,000 estimated active users – is by far the most popular Kubernetes backup tool.
In this blog post we show how to start using Velero to protect your Kubernetes environment, and walk through its key configuration options.
Velero is an open-source backup tool that automates backup and restore of Kubernetes cluster configuration and persistent volumes. It helps you recover Kubernetes workloads in case of data loss, or migrate your workloads and data to another cluster.
Velero can perform both on-demand and scheduled backups, allowing you to back up your data before major Kubernetes or service updates, and providing continuous protection for your workloads.
As a Kubernetes-native backup tool, Velero backs up and restores both persistent volumes (PVs), which store application data, and Kubernetes application configuration, which includes the deployment manifests, resource allocations, and network configs needed to recover K8s applications automatically.
In contrast, when you use non-native tools to protect Kubernetes, your backups will not include all application configs, and you will need to recover or recreate your system in multiple steps. For example, if you use storage snapshots to back up PVs, you may need to manually reconfigure persistent volume claims (PVCs) and Kubernetes cluster configs.
One distinct non-native K8s backup option is virtual machine (VM) backup. If you rely on VM backup software to back up all Kubernetes nodes, your backups will include all application data and configuration. However, the recovery options will be limited to a full cluster recovery with all nodes: granular application-level or file-level recovery will not be possible, because the backup software has no awareness of Kubernetes configuration. Additionally, some backups may be inconsistent and unrecoverable, since VM backup software is unaware of the Kubernetes applications running inside the VMs (unlike Velero, which addresses this issue as we’ll see in the next section).
Velero offers extensive support for Kubernetes distributions and infrastructure environments, including major cloud providers (Amazon Web Services, Microsoft Azure, Google Cloud Platform, DigitalOcean, Alibaba Cloud) and on-premises environments (VMware vSphere, Rancher, OpenShift, etc.).
Velero has three main options with regard to the backup mechanism and backup data location: plain CSI snapshots, filesystem-level backup, and CSI Snapshot Data Movement.
We're going to describe all three options, but to implement a reliable backup solution, we recommend using option
Regardless of the backup option you choose, Velero always stores Kubernetes cluster configuration/manifests backups offsite, in an object storage bucket. This approach ensures the recovery of Kubernetes configuration in case the cluster has been destroyed or is no longer available.
Plain CSI snapshots are the default backup option. Velero will store Kubernetes configuration backups in an S3 storage bucket, and Persistent Volume (PV) backups will be stored on the same storage from which they are taken.
This option enables fast data recovery since backups are stored locally in the Kubernetes cluster storage infrastructure. However, it poses a risk of losing all backup data in the event of storage malfunction, human error, or malware.
A filesystem-level backup is performed by scanning all files and directories in a Persistent Volume (PV). Velero can use either of two open-source tools, Restic or Kopia, to scan the file systems of PVs in your cluster, identify changes at the file level, and send the changed data (backup increments) to an offsite backup storage.
Because the scanning happens while K8s applications are updating their data on the PVs, the resulting backups may have inconsistencies. For example, if a K8s app updates multiple files while they are being scanned and uploaded to a backup, the backup may contain outdated versions of some files.
To avoid inconsistency in backup data and increase recoverability, applications are normally frozen (put into a paused state) while a filesystem-based backup is running. Scanning a whole filesystem usually takes much longer than making a CSI snapshot of the storage system that contains the filesystem, so the application may stay frozen for extended periods. The resulting downtime may last for hours or days for complex applications with many files.
The filesystem backup option was the only way to back up data and store it offsite before the CSI Snapshot Data Movement option was introduced with the Velero 1.12 release in August 2023. Now, this option is rarely used. One reason you may still need to use the filesystem backup mechanism is if you use EFS, AzureFile, NFS, emptyDir, or any other volume type that doesn’t have a native snapshot concept (making the Option
CSI Snapshot Data Movement is a new Velero backup mode released in August 2023 that helps resolve the performance issues associated with the filesystem backup mechanism described above.
When executing a backup, Velero uses CSI to instruct the underlying storage system to take a PV snapshot. Velero then uses Restic/Kopia to create a file-level backup from the snapshot and move it to an offsite backup storage location.
The main benefit of this new backup mode is that it doesn’t need to perform a file-level scan of a live persistent volume. It instead uses a virtual disk created from a CSI storage snapshot to take a file-level backup (using Restic or Kopia) and move it offsite. Because storage snapshots can be taken fast, this mode allows Velero to create consistent backups with minimal application downtime.
In order for the backup process to work, each storage device must have enough available storage space to store the temporary CSI storage snapshots used to generate the file-level backups. If you run an on-premises Kubernetes cluster, you need to take this into account when sizing your storage system. It’s recommended to have approximately 30% free storage headroom to store the snapshots.
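Data movement can also be requested for an individual backup in the Backup resource itself. The following is a minimal sketch rather than a definitive configuration: it assumes Velero 1.12 or later running in the `velero` namespace, and the backup name `my-app-data-mover`, the namespace `my-app`, and the file name are placeholders we made up for illustration.

```shell
# Write a Velero Backup manifest that requests CSI Snapshot Data Movement.
# Assumptions: Velero 1.12+ in the "velero" namespace; "my-app" and
# "my-app-data-mover" are hypothetical names, replace them with your own.
cat > backup-data-mover.yaml <<'EOF'
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: my-app-data-mover
  namespace: velero
spec:
  includedNamespaces:
    - my-app
  # Take a CSI snapshot, then move its contents to the offsite backup storage.
  snapshotMoveData: true
EOF

# Apply the manifest once the cluster and Velero are ready:
#   kubectl apply -f backup-data-mover.yaml
```

Keeping the setting in the manifest makes it visible per backup, which can help when only some workloads need their volume data moved offsite.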
We are going to use Google Cloud Platform (GCP) for this example. The installation instructions for other cloud providers are very similar.
Before installing Velero, we need to create a Google Cloud Storage (GCS) bucket to store backups. You can use GCP Console or gsutil to create a storage bucket. In this article, we will use "velero-intro-bucket" as the name of the bucket.
Velero needs two types of permissions to use GCP APIs: access to the “velero-intro-bucket”, and the ability to create snapshots and disks. You can find detailed instructions on Velero’s GitHub page. Follow the steps provided at https://github.com/vmware-tanzu/.
After completing the setup, you will have a file named “credentials-velero” in your working directory.
To create CSI snapshots, Velero requires a snapshot class that specifies the storage parameters. “pd.csi.storage.gke.io” is commonly used in GKE as the CSI driver.
Create a snapshot class that uses this CSI driver with all parameters set to their defaults:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  labels:
    velero.io/csi-volumesnapshot-class: "true"
  name: velero-snapclass
driver: pd.csi.storage.gke.io
deletionPolicy: Delete
The key element of this snapshot class definition is the label 'velero.io/csi-volumesnapshot-class'. This label instructs Velero to use this snapshot class unless a backup policy specifies otherwise.
Next, we need to configure encryption for backups, as Velero's default settings are insecure. If you don't set a password, Velero uses a default one, which could allow anyone with access to the velero-intro-bucket to read and decrypt backup contents.
Before installing Velero we have to create a namespace for it, and a secret that holds your encryption password:
%: kubectl create ns velero
%: kubectl -n velero create secret generic velero-repo-credentials \
--from-literal=repository-password=YOUR-PASSWORD
There are two important notes related to the encryption password.
First, the password is stored in the same cluster that Velero protects. If the cluster is unavailable, the password will be inaccessible, making it impossible to restore data from backups. Ensure you have a secure copy of all passwords used with Velero.
Secondly, Velero does not support password changes for existing backups or key rotation.
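Since the password cannot be changed later and must survive the loss of the cluster, it helps to generate a strong one and keep a copy outside Kubernetes from the start. The following sketch shows one possible way to do that; the file name `velero-repo-password.txt` is our own choice, and where you store the copy afterwards (a password manager, a vault) is up to you.

```shell
# Generate a random 32-byte password, base64-encoded into 44 characters,
# and store it without a trailing newline in a file only the owner can read.
# The file name is an arbitrary choice for this example.
(
  umask 077
  head -c 32 /dev/urandom | base64 | tr -d '\n' > velero-repo-password.txt
)

# Create the secret from the file instead of typing the password inline:
#   kubectl -n velero create secret generic velero-repo-credentials \
#     --from-file=repository-password=velero-repo-password.txt
#
# Afterwards, move velero-repo-password.txt to a secure location outside
# the cluster and delete the local copy.
```

Reading the password from a file also keeps it out of your shell history.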
And the final step is to install Velero:
%: velero install \
--provider gcp \
--plugins velero/velero-plugin-for-gcp:v1.9.0,velero/velero-plugin-for-csi:v0.7.0 \
--bucket velero-intro-bucket \
--secret-file ./credentials-velero \
--use-node-agent \
--default-snapshot-move-data \
--features=EnableCSI \
--wait
Let’s walk through a simple example of backup and restore using Velero. We will use a WordPress (WP) instance as our example application.
Let’s install WP first:
%: helm install --namespace wp-0 --create-namespace \
wp-release-0 oci://registry-1.docker.io/bitnamicharts/wordpress \
--wait
%: velero schedule create wp-hourly --include-namespaces wp-0 --schedule "@hourly"
This command instructs Velero to configure a regular backup (schedule in Velero’s terminology) that runs every hour and protects a namespace called wp-0.
You can list regular backups with the following command:
%: velero schedule get
Also, you can request the same information with a call to kubectl:
%: kubectl -n velero get schedules
After installing Velero, you can manage it using the Velero CLI tool or kubectl. All backups can be controlled with kubectl, enabling seamless integration of Velero into your GitOps pipeline.
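Because schedules are ordinary Kubernetes resources, the hourly schedule for wp-0 can also be kept as a manifest in a Git repository. This is a minimal sketch, assuming Velero runs in the `velero` namespace; the file name is an illustrative choice, and the manifest is meant to mirror the `velero schedule create` command used earlier rather than add any new settings.

```shell
# Write a Schedule manifest that mirrors:
#   velero schedule create wp-hourly --include-namespaces wp-0 --schedule "@hourly"
# Assumption: Velero is installed in the "velero" namespace.
cat > wp-hourly-schedule.yaml <<'EOF'
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: wp-hourly
  namespace: velero
spec:
  # When to run the backup.
  schedule: "@hourly"
  # The template holds the same fields as a Backup spec.
  template:
    includedNamespaces:
      - wp-0
EOF

# Apply it directly, or let your GitOps tool sync it:
#   kubectl apply -f wp-hourly-schedule.yaml
```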
Also, you can run a backup of WP manually at any time with the following command:
%: velero backup create wp-bkp-test --include-namespaces wp-0
Once the backup is complete, let us simulate a disaster by deleting the WP namespace:
%: kubectl delete ns wp-0
And we can restore the WP application from the backup using this command:
%: velero restore create wp-after-disaster --from-backup=wp-bkp-test
There are a few important aspects of Velero data protection that require a separate blog post. In this section, we will cover the most critical points on data consistency and overall security measures.
Kubernetes applications may need to be paused or put in a special freeze state before they are backed up to achieve a recoverable backup. This is crucial for applications that write data to storage volumes or databases, as well as for applications running on several nodes. Pausing or quiescing these applications ensures no data changes occur during the backup process, capturing a consistent state of the data.
To take consistent backups of Kubernetes apps, Velero uses a mechanism called pre-/post-backup hooks, which we’ll cover in detail in a separate blog post. In short, hooks are a way to execute scripts that pause an application before a backup to ensure the backups are recoverable.
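To give a rough idea of what hooks look like, they can be declared as annotations on the application's pods. The sketch below writes a patch that adds freeze and unfreeze hooks to a Deployment's pod template. It rests on several assumptions that are ours, not Velero's: a Deployment called `my-app`, a sidecar container called `fsfreeze` that has the `fsfreeze` binary and sufficient privileges, and a data volume mounted at `/data`.

```shell
# Write a patch that adds Velero backup hook annotations to a pod template.
# Hypothetical names: the "fsfreeze" sidecar container and the "/data"
# mount path must match what your application actually runs.
cat > hook-annotations.yaml <<'EOF'
spec:
  template:
    metadata:
      annotations:
        # Runs before the volume is backed up: freeze the filesystem.
        pre.hook.backup.velero.io/container: fsfreeze
        pre.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--freeze", "/data"]'
        # Runs after the backup: unfreeze the filesystem.
        post.hook.backup.velero.io/container: fsfreeze
        post.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--unfreeze", "/data"]'
EOF

# Apply the patch to a (hypothetical) Deployment named "my-app":
#   kubectl -n my-app patch deployment my-app --patch-file hook-annotations.yaml
```

For databases, an application-level command that flushes and locks writes is often a better fit than a filesystem freeze; the right hook depends on the application.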
Velero has a default backup repository password saved in the secret [velero-repo-credentials]. This password is used to encrypt all backup data. It is of utmost importance to change the default password, as its default value is commonly known. Additionally, make sure to keep it secret, as Velero does not support changing the password for existing backup repositories.
Backup options
Velero helps implement automated Kubernetes application backup and accelerate the time to recovery, compared to disaster recovery plans reliant on Kubernetes deployment solutions and storage replication.
Using Velero you can manage backup and recovery of your entire K8s application, including its configs and storage. It is, therefore, much more robust than manual data protection and disaster recovery plans that require you to take multiple steps, including:
At the same time, Velero has a number of important limitations that may complicate its use in an enterprise production environment:
This blog post is based on the results of our internal Kubernetes backup product research. Afi develops a cloud-based Kubernetes backup service that overcomes many of the issues inherent in Velero. Please feel free to check the product page or read our other blog posts to learn more about Kubernetes data protection.