HashiCorp Vault
Objective
Deploy and configure a highly available HashiCorp Vault cluster.
Background
Vault serves several important purposes in the Panfactum stack:
- acts as the root certificate authority for each environment's X.509 certificate infrastructure
- authorizes SSH authentication to our bastion hosts
- provisions (and de-provisions) dynamic credentials for the stack's supported databases
Deploy Vault
We will use the kube_vault infrastructure module to deploy Vault in a highly available manner using its integrated Raft storage backend.
Let's do this now:
1. Create a new directory adjacent to your aws_eks module called kube_vault.
2. Add a terragrunt.hcl to that directory that looks like this.
3. For now, set vpa_enabled to false. We will enable it when we install the autoscalers.
4. For now, set ingress_enabled to false. We will enable it when we install the ingress subsystem.
5. Add a module.yaml that enables the aws, kubernetes, random, and helm providers.
6. Run terragrunt apply.
7. Note that this deployment may or may not succeed, but you should see three Vault instances which are in the unready state.
Checking the logs of one of the Vault pods will show error messages that look like this (the exact order and information may appear different):
    2024-03-19T13:06:57.989Z [WARN] failed to unseal core: error="stored unseal keys are supported, but none were found"
    2024-03-19T13:06:58.600Z [INFO] core: security barrier not initialized
    2024-03-19T13:06:58.600Z [INFO] core.autoseal: recovery seal configuration missing, but cannot check old path as core is sealed
    2024-03-19T13:06:59.710Z [INFO] core: security barrier not initialized
    2024-03-19T13:06:59.712Z [INFO] core: attempting to join possible raft leader node: leader_addr=https://vault-1.vault-internal:8200
    2024-03-19T13:06:59.712Z [INFO] core: attempting to join possible raft leader node: leader_addr=https://vault-2.vault-internal:8200
    2024-03-19T13:06:59.712Z [INFO] core: attempting to join possible raft leader node: leader_addr=https://vault-0.vault-internal:8200
    2024-03-19T13:06:59.717Z [ERROR] core: failed to get raft challenge: leader_addr=https://vault-2.vault-internal:8200 error="error durin
    2024-03-19T13:06:59.719Z [ERROR] core: failed to get raft challenge: leader_addr=https://vault-0.vault-internal:8200 error="error durin
    2024-03-19T13:06:59.719Z [ERROR] core: failed to get raft challenge: leader_addr=https://vault-1.vault-internal:8200 error="error durin
    2024-03-19T13:06:59.719Z [ERROR] core: failed to retry join raft cluster: retry=2s err="failed to get raft challenge"
This is because Vault must be manually initialized on first use.
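If you would rather use kubectl than k9s for this check, it can be sketched as a small shell helper. The function name vault_pod_report is hypothetical and not part of the Panfactum tooling; the vault namespace and the vault-0 pod name are taken from this guide.

```shell
# Hypothetical helper: lists the Vault pods and prints recent logs for one pod.
# Usage: vault_pod_report [pod-name]   (defaults to vault-0)
vault_pod_report() {
  pod="${1:-vault-0}"
  # List the pods in the vault namespace; before initialization they are
  # expected to report as not ready.
  kubectl -n vault get pods
  # Print the last 20 log lines of the chosen pod.
  kubectl -n vault logs "$pod" --tail=20
}
```

Calling vault_pod_report vault-1 would show the same information for the second pod.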
Initialize Vault
Vault must be initialized, which ultimately requires setting the root encryption key used to store data on disk. Once each Vault pod in the Vault cluster is initialized, the Vault cluster will become available for use. [1]
Root Access
The root token is constructed via Shamir's secret sharing algorithm. This means that you will construct N subkeys and require M subkeys (where M ≤ N) to recreate a root token. This will enable you to restore root access to the Vault cluster. These subkeys are called the Recovery Keys in the Vault documentation. [2]
As Vault (a) controls many forms of authentication and (b) needs to be accessible on a public endpoint to authenticate users, these keys are incredibly sensitive as collectively they allow root access to a substantial portion of your ecosystem.
We recommend you consider the following before proceeding:
- How many people do you want to be superusers? (recovery-shares) [3]
- How many superusers must work together to gain root access to Vault? (recovery-threshold) [4]
- How will your organization recommend superusers store these keys? [5]
- How will your organization codify the process and timings of key rotations? [6]
Note that you can change the shares and threshold later.
Once you have answers to the above, you are ready to proceed.
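Before running the initialization, it may help to sanity-check the two numbers you picked. The sketch below is a hypothetical helper (check_recovery_params is not a Vault or Panfactum command) that only verifies that the threshold is at least 1 and no larger than the number of shares; Vault applies its own validation when the init command runs.

```shell
# Hypothetical helper: validates a shares/threshold pair.
# Succeeds only when both are positive integers and 1 <= threshold <= shares.
check_recovery_params() {
  shares="$1"
  threshold="$2"
  # Reject empty or non-numeric input.
  for value in "$shares" "$threshold"; do
    case "$value" in
      ''|*[!0-9]*)
        echo "shares and threshold must be positive integers" >&2
        return 1
        ;;
    esac
  done
  # The threshold cannot exceed the number of shares that exist.
  if [ "$threshold" -lt 1 ] || [ "$threshold" -gt "$shares" ]; then
    echo "threshold must be between 1 and the number of shares" >&2
    return 1
  fi
  echo "ok: $threshold of $shares recovery keys will be required"
}
```

For example, check_recovery_params 3 2 succeeds, while check_recovery_params 2 3 fails because three keys could never be collected from two shares.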
Initializing Each Vault Pod
Right now, the Vault pods are only accessible from inside the Kubernetes cluster. We will leverage k9s to establish a remote terminal session to perform the initialization:
1. Find the Vault pods in the vault namespace in k9s (vault-0, vault-1, vault-2).
2. Establish a shell in vault-0 by pressing s once the pod is highlighted.
3. Run vault operator init -recovery-shares=<...> -recovery-threshold=<...>. Set recovery-shares and recovery-threshold to the values you decided above. You should see a message like the following:

       Recovery Key 1: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXx

       Initial Root Token: hvs.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

       Success! Vault is initialized

       Recovery key initialized with 1 key shares and a key threshold of 1. Please
       securely distribute the key shares printed above.

4. Save both the recovery keys and the root token in a safe location. The root token allows root access to the Vault instance. The recovery keys allow creating new root tokens.
5. Now run vault operator unseal to unseal Vault. Provide the recovery keys from the previous step when prompted. After each unseal operation you should see the below message. Continue running unseal operations until you see Sealed as false in the output (using a different recovery key each time).

       Key                      Value
       ---                      -----
       Recovery Seal Type       shamir
       Initialized              true
       Sealed                   false
       Total Recovery Shares    1
       Threshold                1
       Version                  1.15.2
       Build Date               2023-11-06T11:33:28Z
       Storage Type             raft
       Cluster Name             vault-cluster-0d4df53a
       Cluster ID               ecc7f4ba-0c8e-3eca-2ad6-515ff027202e
       HA Enabled               true
       HA Cluster               https://vault-0.vault-internal:8201
       HA Mode                  active
       Active Since             2024-03-19T14:36:28.434643111Z
       Raft Committed Index     58
       Raft Applied Index       58

6. When this node is unsealed, the other nodes will automatically join the cluster. Verify this by:
   1. Set the VAULT_TOKEN to the root token from the prior steps via export VAULT_TOKEN=<your_token>.
   2. Run vault operator members.
   3. You should see an output with all three members:

          Host Name    API Address                 Cluster Address                        Active Node    Version    Upgrade Version    Redundancy Zone    Last Echo
          ---------    -----------                 ---------------                        -----------    -------    ---------------    ---------------    ---------
          vault-2      http://10.0.114.250:8200    https://vault-2.vault-internal:8201    false          1.15.2     1.15.2             n/a                2024-03-19T15:04:52Z
          vault-0      http://10.0.178.10:8200     https://vault-0.vault-internal:8201    true           1.15.2     1.15.2             n/a                n/a
          vault-1      http://10.0.212.45:8200     https://vault-1.vault-internal:8201    false          1.15.2     1.15.2             n/a                2024-03-19T15:04:52Z

7. Exit the shell by running exit. You should see the pod is now ready.
8. Notice that you should also see multiple Vault services.
These have the following uses:
- vault: All Vault pods
- vault-active: Will always be the active Vault pod, which is the one that you should connect with when performing Vault operations. [7]
- vault-internal: The headless service for the StatefulSet that can be used for addressing the pods individually
- vault-standby: Non-active Vault pods
- vault-ui: The service that exposes the web UI (different port than the Vault API)
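The membership and service checks above can also be run from your local machine with kubectl instead of an interactive k9s shell. This is a minimal sketch under a few assumptions: the helper name vault_cluster_report is hypothetical, VAULT_TOKEN is assumed to already hold the root token in your local shell, and the env utility is assumed to be available inside the Vault container.

```shell
# Hypothetical helper: prints cluster membership and the Vault services.
# Assumes VAULT_TOKEN is exported locally and holds the root token.
vault_cluster_report() {
  if [ -z "${VAULT_TOKEN:-}" ]; then
    echo "VAULT_TOKEN must be set to the root token" >&2
    return 1
  fi
  # Run `vault operator members` inside vault-0. The token is passed as an
  # argument to env, so it is briefly visible in process listings.
  kubectl -n vault exec vault-0 -- env VAULT_TOKEN="$VAULT_TOKEN" vault operator members
  # List the services in the vault namespace.
  kubectl -n vault get services
}
```

Because the token appears on the command line here, the interactive shell approach from the steps above may be preferable in sensitive environments.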
Configure Vault
One of the major benefits of Vault is that it can be configured directly with OpenTofu (Terraform) alongside our other infrastructure components. The Panfactum stack includes several modules for working with Vault. In this section we will deploy some foundational configuration via the vault_core_resources module.
Connect to Vault
As Vault is required to set up our ingress system, we will need to set up connectivity to Vault without it having a publicly available endpoint.
Here is how we will work around this issue:
1. Establish a proxy to the active Vault node from your local machine by running the following command in a free terminal:

       kubectl -n vault port-forward svc/vault-active 8200:8200

2. Set the VAULT_ADDR in your .env file as follows:

       VAULT_ADDR=http://127.0.0.1:8200

3. Use the root token you received in the previous section to set the VAULT_TOKEN in your .env file as follows:

       VAULT_TOKEN=hvs.xxxxxxxxxxxxxxxxxxxxxxx

4. Run vault status to verify you are able to connect to the cluster. You should receive an output that looks similar to below:

       Key                      Value
       ---                      -----
       Seal Type                shamir
       Recovery Seal Type       n/a
       Initialized              true
       Sealed                   false
       Total Recovery Shares    1
       Threshold                1
       Version                  1.15.2
       Build Date               2023-11-06T11:33:28Z
       Storage Type             raft
       Cluster Name             vault-cluster-59bb7d60
       Cluster ID               9a5488b6-9966-5bd9-271d-51373057c52e
       HA Enabled               true
       HA Cluster               https://vault-0.vault-internal:8201
       HA Mode                  active
       Active Since             2024-03-19T15:01:49.867966233Z
       Raft Committed Index     110
       Raft Applied Index       110
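For scripted checks, the connectivity test can rely on the exit code of vault status rather than on reading the table: the Vault CLI documents exit code 0 for unsealed, 2 for sealed, and 1 for an error (for example, when the port-forward is not running). The helper name vault_seal_state below is hypothetical.

```shell
# Hypothetical helper: summarizes `vault status` via its exit code.
# 0 = unsealed, 2 = sealed, anything else = error (e.g. address unreachable).
vault_seal_state() {
  rc=0
  vault status >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error"; return 1 ;;
  esac
}
```

With the proxy from step 1 running and VAULT_ADDR set, vault_seal_state should print unsealed for the cluster configured in this guide.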
Deploy Configuration Module
Let's now deploy vault_core_resources:
1. Create a new directory adjacent to your kube_vault module called vault_core_resources.
2. Add a terragrunt.hcl to that directory that looks like this.
3. Add a module.yaml that enables the kubernetes, time, and vault providers.
4. Run terragrunt apply.
Testing
Before we move on, let's verify a few parts of the Vault infrastructure are working as intended.
Pod Restart
Let's ensure that Vault pods can restart without manual intervention. This entails two pieces of functionality: automatic unsealing and automatic updating of the active service address.
1. Using k9s, delete the vault-0 pod (<ctrl-d>).
2. When it restarts, the container should contain the following log messages (press enter on the pod):

       2024-03-19T16:07:55.131Z [INFO] core: vault is unsealed
       2024-03-19T16:07:55.131Z [INFO] core: entering standby mode
       2024-03-19T16:07:55.195Z [INFO] core: unsealed with stored key

3. Locally, run vault status and verify the connection still works. Likely, you will need to restart the kubectl port-forward command from the previous section to re-establish the proxy since the pod terminated.
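The restart test can also be sketched with kubectl alone. The helper below is hypothetical (restart_vault_pod is not a Panfactum command); it deletes the pod, retries until the replacement pod reports Ready, and then filters the logs for unseal messages like the ones shown above. The retry count and sleep interval are arbitrary choices.

```shell
# Hypothetical helper: deletes a Vault pod, waits for the replacement to
# become Ready, then prints its unseal-related log lines.
# Usage: restart_vault_pod [pod-name]   (defaults to vault-0)
restart_vault_pod() {
  pod="${1:-vault-0}"
  kubectl -n vault delete pod "$pod" || return 1
  # The replacement pod may not exist immediately, so retry the wait a few times.
  attempts=0
  until kubectl -n vault wait --for=condition=Ready "pod/$pod" --timeout=60s; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 10 ]; then
      echo "pod $pod did not become Ready" >&2
      return 1
    fi
    sleep 5
  done
  # Fails (non-zero exit) if no unseal message is present in the logs.
  kubectl -n vault logs "$pod" | grep -i "unseal"
}
```

A non-zero exit from this helper suggests that either the pod never became ready or the automatic unseal did not occur.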
Verify Storage
Each Vault node stores its data on EBS volumes provisioned by the AWS EBS CSI driver. Recall that we set that up in the prior section.
Let's verify the volumes were provisioned:
1. In k9s, you should see three PersistentVolumeClaims, one for each Vault node.
2. Log into the AWS web console. Navigate to EC2 > Elastic Block Store > Volumes. You should see the three underlying EBS volumes.
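The same verification can be approximated from the command line. This sketch assumes that the aws CLI is authenticated against the account and region that host the cluster, and that the PersistentVolumes expose the EBS volume ID in spec.csi.volumeHandle, which is how CSI-provisioned volumes are normally recorded. The helper name vault_volume_report is hypothetical.

```shell
# Hypothetical helper: lists the Vault PVCs and the EBS volumes behind them.
vault_volume_report() {
  # One PersistentVolumeClaim is expected per Vault node.
  kubectl -n vault get pvc
  # Collect the volume handles of all PVs bound to claims in the vault namespace.
  volume_ids="$(kubectl get pv -o jsonpath='{.items[?(@.spec.claimRef.namespace=="vault")].spec.csi.volumeHandle}')"
  if [ -z "$volume_ids" ]; then
    echo "no PersistentVolumes are bound to claims in the vault namespace" >&2
    return 1
  fi
  # volume_ids is intentionally unquoted so that each ID is a separate argument.
  aws ec2 describe-volumes --volume-ids $volume_ids \
    --query 'Volumes[].[VolumeId,State,Size]' --output table
}
```

If the setup matches this guide, the resulting table should contain three volumes.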
Next Steps
Now that Vault is set up, we will use it to set up the cluster's X.509 infrastructure in the next section.
Footnotes
1. Vault only keeps unencrypted data in memory and this includes the encryption key. As a result, when pods restart, the encryption key must be provided again in a process called unsealing. Normally, this is a manual process, but Panfactum's module utilizes AWS KMS to automate the unsealing operation. This significantly reduces the burden of running a Vault cluster on Kubernetes.
2. The recovery keys are not the unseal keys. We do not utilize static unseal keys in the Panfactum setup as we rely on AWS KMS to perform the unsealing for us. While this prevents breaches that rely only on getting access to the Vault database, it does mean that you will not be able to access Vault if you delete the KMS key. For that reason, we replicate the key to a secondary region and add a 30-day countdown timer to any key deletion operation.
3. This is highly dependent on organization size, but we recommend the following: 1 if you are a solo operator, 2 if you have at least one other person working on infrastructure (to reduce your bus factor), and no more than 5 at the largest organization size (to minimize the burden of regular key rotation).
4. It can be tempting to make this a high number, but you will need these keys fairly regularly (expect about once per quarter). We recommend in sensitive environments you use 2 so there is at least one check on root access. In less sensitive environments, you can make this 1 for convenience.
5. This should not be in a location that is accessible by all superusers (e.g., a company password vault).
6. Make sure you consider both offboarding and time-based rotations (once per quarter or year).
7. The other nodes cannot accept Vault commands and are only on standby in case the active pod terminates.