Vault anti-patterns
Introduction
The Vault anti-patterns highlighted in this document are sourced from lessons learned by practitioners operating Vault in the field. As a Vault administrator, you can help keep your Vault environments healthy by avoiding these anti-patterns.
Anti-patterns
Description | Applicable Vault edition |
---|---|
Not adjusting the default lease time | All |
Not using entities for accurate client count | Enterprise, HCP |
Limiting IOPS | Enterprise, Community |
Not testing disaster recovery solution | Enterprise |
Production clusters with no disaster recovery | Enterprise |
Slow upgrade cadence | Enterprise, Community |
Upgrading Vault without proper testing | Enterprise, Community |
Not rotating audit device logs | Enterprise, Community |
Poor metrics or no telemetry data | Enterprise, Community |
No baseline of activity or usage data | Enterprise, Community |
Using the root token for routine actions | All |
Not rekeying Vault after key-holders exit | All |
Not adjusting the default lease time
The default lease time in Vault is 32 days or 768 hours. This time allows for some operations, such as re-authentication or renewal. See lease documentation for more information.
Potential issue:
If you create leases without changing the default time-to-live (TTL), they live in Vault until the default lease time is up. Because Vault stores leases in memory, using the default or a longer TTL may cause performance issues, depending on your infrastructure and available system memory.
Solution:
You should tune the lease TTL value for your needs. Vault holds leases in memory until the lease expires. We recommend keeping TTLs as short as the use case will allow.
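As a minimal sketch of tuning TTLs on a single mount, the Python below builds (but does not send) a request to Vault's `sys/mounts/<path>/tune` HTTP endpoint. The address, token, mount path, and TTL values are placeholders chosen for illustration, not recommendations.

```python
import json
import urllib.request


def build_tune_request(vault_addr, token, mount_path, default_ttl, max_ttl):
    """Build a POST request that tunes the lease TTLs of one mount."""
    url = f"{vault_addr.rstrip('/')}/v1/sys/mounts/{mount_path.strip('/')}/tune"
    payload = {"default_lease_ttl": default_ttl, "max_lease_ttl": max_ttl}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


# Placeholder values: a local dev server and a hypothetical "database" mount.
request = build_tune_request(
    "http://127.0.0.1:8200", "example-token", "database", "1h", "24h"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would apply the change. Building it separately makes the payload easy to review first.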
Note
Tuning or adjusting TTLs does not retroactively affect tokens that were issued. New tokens must be issued after tuning TTLs.
Not using entities for accurate client count
Each Vault client may have multiple accounts with the auth methods enabled on the Vault server.
Potential issue:
Each login through an auth method that is not linked to the user's entity is counted as a new identity, and therefore as a separate client.
Solution:
Since each token adds to the client count, and each unique authentication issues a token, it is best to use identity entities to create aliases that connect each login to a single identity.
- Client count
- Vault identity concepts
- Vault Identity secrets engine
- Identity: Entities and groups tutorial
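The effect of aliases on the client count can be illustrated with a toy model. This is a simplified sketch, not Vault's actual counting logic. The mount names, login names, and the `entity-alice` identifier are hypothetical.

```python
def count_clients(logins, alias_to_entity):
    """Count distinct clients for a list of (auth_mount, login_name) pairs.

    A login whose alias is registered resolves to its entity. Any other login
    is treated as a separate client of its own.
    """
    clients = set()
    for mount, name in logins:
        entity = alias_to_entity.get((mount, name))
        clients.add(entity if entity else ("unlinked", mount, name))
    return len(clients)


# Hypothetical user with accounts on three auth methods.
logins = [("ldap", "alice"), ("github", "alice-gh"), ("userpass", "alice")]

no_aliases = {}
aliases = {login: "entity-alice" for login in logins}

print(count_clients(logins, no_aliases))  # 3
print(count_clients(logins, aliases))     # 1
```

With no aliases, the same person is counted three times. Once all three logins are aliased to one entity, the count drops to one.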
Limiting IOPS
IOPS (input/output operations per second) measures performance for Vault cluster members. Vault is bound by the IO limits of the storage backend rather than the compute requirements.
Potential issue:
Limiting IOPS can have a significant performance impact.
Solution:
Use the HashiCorp reference guidelines for hardware sizing and network considerations for Vault servers.
Note
The Transform (Enterprise) and Transit secrets engines can be resource intensive depending on the client count.
Not testing disaster recovery solution
Your disaster recovery (DR) solution is a key part of your overall disaster recovery plan. Designing and configuring your Vault disaster recovery solution is only the first step. You also need to validate the DR solution. Not doing so can negatively impact your organization's Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
Potential issue:
If you don't test your disaster recovery solution, your key stakeholders will not feel confident that they can carry out the disaster recovery plan effectively. Testing the DR solution removes uncertainty about whether the DR plan will recover the system during an outage.
Solution:
Vault's Disaster Recovery (DR) replication mode provides a warm standby for failover if the primary cluster experiences catastrophic failure. You should periodically test the disaster recovery replication cluster by completing the failover and failback procedure.
- Vault disaster recovery replication failover and failback tutorial
- Vault Enterprise replication
- Monitoring Vault replication
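One way to make a failover drill measurable is to record its timings and compare them with your RPO and RTO targets. The sketch below assumes you capture three timestamps during the drill. The function name, timestamps, and targets are all hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_drill(last_replicated_write, outage_start, service_restored,
                   rpo_target, rto_target):
    """Compare one failover drill against the RPO and RTO targets."""
    data_loss_window = outage_start - last_replicated_write
    recovery_time = service_restored - outage_start
    return {
        "data_loss_window": data_loss_window,
        "recovery_time": recovery_time,
        "rpo_met": data_loss_window <= rpo_target,
        "rto_met": recovery_time <= rto_target,
    }


# Hypothetical drill timings and targets.
result = evaluate_drill(
    last_replicated_write=datetime(2024, 1, 10, 9, 59, 30),
    outage_start=datetime(2024, 1, 10, 10, 0, 0),
    service_restored=datetime(2024, 1, 10, 10, 20, 0),
    rpo_target=timedelta(minutes=1),
    rto_target=timedelta(minutes=15),
)
print(result["rpo_met"], result["rto_met"])  # True False
```

In this example the drill meets the RPO target but misses the RTO target, which is the kind of gap a periodic test is meant to surface before a real outage.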
Production clusters with no disaster recovery
HashiCorp Vault's highly available (HA) Integrated Storage (Raft) backend provides intra-cluster data replication across cluster members. Integrated Storage provides Vault with horizontal scalability and failure tolerance, but it does not provide backup for the entire cluster. Not utilizing disaster recovery for your production environment will negatively impact your organization's Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
Potential issue:
If catastrophic failure occurs, there will be downtime and cost associated with not serving Vault clients in your environment.
Solution:
For cluster-wide issues (e.g., loss of network connectivity), Vault Enterprise Disaster Recovery (DR) replication provides a warm standby cluster containing all primary cluster data. The DR cluster does not service reads or writes, but you can promote it to replace the primary cluster when needed.
- Disaster recovery replication setup
- Disaster recovery (DR) replication
- DR replication API documentation
Slow upgrade cadence
While it might be easy to upgrade Vault whenever you have capacity, not having a frequent upgrade cadence can impact your Vault performance and security.
Potential issue:
- Missing patches for bugs or vulnerabilities documented in the CHANGELOG.
- Missing new features that improve workflows.
- Having to use version-specific documentation rather than the latest documentation.
- Some educational resources require a specific minimum Vault version.
- Upgrades may require a stepped approach that uses an intermediate version before installing the latest binary.
Solution:
We recommend upgrading to the latest version of Vault. Subscribe to releases in Vault's GitHub repository and to notifications from HashiCorp Vault Discuss to learn when a new version of Vault is available.
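To keep track of how far a cluster has drifted, you can compare the running version with the latest release. The sketch below is a simple illustration with hypothetical version numbers. One way to obtain the running version is the `version` field returned by Vault's `sys/health` endpoint.

```python
def parse_version(version):
    """Turn a string such as '1.15.2' or 'v1.15.2+ent' into (1, 15, 2)."""
    core = version.lstrip("v").split("+")[0].split("-")[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def minor_releases_behind(running, latest):
    """Return how many minor releases separate two versions of the same major."""
    run_major, run_minor, _ = parse_version(running)
    new_major, new_minor, _ = parse_version(latest)
    if run_major != new_major:
        raise ValueError("major versions differ; compare release notes manually")
    return max(new_minor - run_minor, 0)


# Hypothetical version numbers for illustration.
print(minor_releases_behind("1.13.4+ent", "1.16.1"))  # 3
```

A growing gap is a signal to check the upgrade documentation for whether a stepped upgrade through an intermediate version is needed.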
Upgrading Vault without proper testing
We recommend testing Vault in a sandbox environment before deploying to production. Although it might be faster to upgrade immediately in production, testing will help identify any compatibility issues.
Review the CHANGELOG and account for any new features, improvements, known issues, and bug fixes in your testing.
Potential issue:
Without adequate testing before upgrading in production, you risk compatibility and performance issues. This could lead to downtime or degradation in your production Vault environment.
Solution:
Test new Vault versions in sandbox environments before upgrading in production and follow our upgrading documentation. We recommend adding a testing phase to your standard upgrade procedure.
Not rotating audit device logs
Audit devices in Vault maintain a detailed log of every authenticated request and response. If you allow the logs for audit devices to grow perpetually without rotating them, you may face a blocked audit device.
Potential issue:
Vault will not respond to requests when none of the enabled audit devices can record them. If the audit log is not maintained and rotated over time, it can consume the local storage.
Solution:
Inspect and rotate audit logs periodically.
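In practice a tool such as logrotate usually handles rotation. The sketch below only illustrates the sequence for a file audit device: rename the log once it grows past a threshold, then signal Vault with SIGHUP so it reopens the log path. The function names, the size threshold, and the injected `reopen` callback are assumptions made for illustration.

```python
import os
import signal
import tempfile
import time


def rotate_audit_log(log_path, reopen, max_bytes=100 * 1024 * 1024):
    """Rename log_path once it reaches max_bytes, then call reopen().

    Returns the rotated file's path, or None when no rotation was needed.
    """
    if not os.path.exists(log_path) or os.path.getsize(log_path) < max_bytes:
        return None
    rotated = f"{log_path}.{time.strftime('%Y%m%dT%H%M%S')}"
    os.rename(log_path, rotated)
    reopen()
    return rotated


def sighup_vault(pid):
    """Ask the Vault process to reopen its audit log file (POSIX only)."""
    os.kill(pid, signal.SIGHUP)


# Demo against a throwaway file; the callback stands in for signalling Vault.
with tempfile.TemporaryDirectory() as tmp:
    demo_log = os.path.join(tmp, "vault_audit.log")
    with open(demo_log, "w") as handle:
        handle.write("x" * 10)
    reopen_calls = []
    rotated = rotate_audit_log(
        demo_log, lambda: reopen_calls.append("reopen"), max_bytes=5
    )
    rotated_exists = os.path.exists(rotated)

print(rotated_exists, reopen_calls)
```

Against a real server, the callback would be something like `lambda: sighup_vault(vault_pid)`, where `vault_pid` is the process ID of the Vault server.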
Poor metrics or no telemetry data
Relying solely on Vault operational logs and data in the Vault UI gives you only a partial picture of how the cluster performs.
Potential issue:
Having a partial insight into cluster activity can leave the business in a reactive state.
Solution:
Continuous monitoring will allow organizations to detect minor problems and promptly resolve them. Migrating from reactive to proactive monitoring will help to prevent system failures. Vault has multiple outputs that help monitor the cluster's activity: audit logs, operational logs, and telemetry data. This data can work with a SIEM (security information and event management) tool for aggregation, inspection, and alerting capabilities.
Adding a monitoring solution:
- Audit device logs and incident response with Elasticsearch
- Monitor telemetry & audit device log data
- Monitor telemetry with Prometheus & Grafana
Note
Vault logs to standard output and standard error by default. When Vault runs as a systemd service, the systemd journal captures this output automatically. You can also direct Vault operational logs to a file.
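As a rough sketch of pulling telemetry for a monitoring pipeline, the Python below builds a request to Vault's `sys/metrics` endpoint in Prometheus format and parses the text response into a dictionary. It assumes Prometheus retention is enabled in the server's telemetry configuration and that samples carry no trailing timestamps. The address, token, and metric names in the sample are placeholders.

```python
import urllib.request


def build_metrics_request(vault_addr, token):
    """Build a GET request for Vault metrics in Prometheus text format."""
    url = f"{vault_addr.rstrip('/')}/v1/sys/metrics?format=prometheus"
    return urllib.request.Request(url, headers={"X-Vault-Token": token})


def parse_prometheus_text(text):
    """Map each sample (name plus labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Hypothetical response body used in place of a live server.
sample_body = """\
# HELP example_requests Hypothetical counter
# TYPE example_requests counter
example_requests{path="secret/"} 42
example_memory_bytes 1.5e+06
"""

metrics_request = build_metrics_request("http://127.0.0.1:8200", "example-token")
parsed = parse_prometheus_text(sample_body)
print(metrics_request.full_url)
print(parsed)
```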
No baseline of activity or usage data
A baseline can provide insight into current utilization and thresholds. Telemetry metrics are valuable, especially when monitored over time. You can use telemetry metrics to gather a baseline of cluster activity, while alerts allow you to see when abnormal activity is present.
Potential issue:
This issue is closely linked to the poor metrics anti-pattern. Telemetry data is only held in memory for a short period of time.
Solution:
Telemetry information can also be streamed directly from Vault to a range of metrics aggregation solutions and saved for aggregation and inspection.
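Once metrics are stored outside Vault, a baseline can be as simple as a mean and standard deviation over past readings. The sketch below is a deliberately basic illustration. The readings and the three-standard-deviation tolerance are hypothetical, and a real monitoring stack will offer more robust alerting.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize historical readings as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_abnormal(value, baseline, tolerance=3.0):
    """Flag a reading further than `tolerance` deviations from the mean."""
    centre, spread = baseline
    return abs(value - centre) > tolerance * spread


# Hypothetical requests-per-minute readings collected over time.
history = [100, 104, 98, 101, 97, 103, 99, 102]
baseline = build_baseline(history)

print(is_abnormal(105, baseline))  # False
print(is_abnormal(180, baseline))  # True
```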
Using the root token for routine actions
When you initialize a Vault server, it emits an initial root token that gives root-level access across all Vault features.
Potential issue:
Root tokens can perform all actions within Vault and never expire. This unrestricted access gives users higher privileges than necessary across all Vault operations and paths. Sharing a root token, or providing access to one, is a security risk.
Solution:
We recommend revoking the root token after initializing Vault within your environment. If elevated access is required, create policies that grant access to the proper paths in Vault. If the root token is required, only keep the token for the shortest time needed to operate.
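One lightweight safeguard is to have routine automation refuse to run with a root token. The sketch below inspects the `policies` list from a token lookup response (for example, from `auth/token/lookup-self`). The function names and the trimmed sample responses are hypothetical.

```python
def uses_root_policy(lookup_response):
    """Return True when the looked-up token carries the root policy."""
    policies = (lookup_response.get("data") or {}).get("policies") or []
    return "root" in policies


def require_non_root(lookup_response):
    """Raise if routine automation is about to run with a root token."""
    if uses_root_policy(lookup_response):
        raise PermissionError("refusing to run routine automation with a root token")


# Trimmed, hypothetical lookup responses.
root_lookup = {"data": {"policies": ["root"]}}
app_lookup = {"data": {"policies": ["default", "app-read"]}}

print(uses_root_policy(root_lookup), uses_root_policy(app_lookup))  # True False
```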
Not rekeying Vault after key-holders exit
Vault's unseal keys are distributed to stakeholders. A quorum of keys is needed to unlock Vault based on your initialization settings.
Potential issue:
If multiple stakeholders leave the organization, there is a risk of not having enough keys to meet quorum.
Solution:
Vault supports rekeying. The process differs depending on the seal type.
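To see how close a cluster is to losing quorum, compare the number of remaining key holders with the threshold. The sketch below also builds the parameters that the `sys/rekey/init` endpoint accepts for a Shamir seal (`secret_shares` and `secret_threshold`). The helper names and the holder list are hypothetical, and the validation is a basic sanity check rather than Vault's full rule set.

```python
def spare_key_holders(threshold, active_holders):
    """Return how many more key holders can leave before quorum is lost."""
    return len(active_holders) - threshold


def build_rekey_init_payload(secret_shares, secret_threshold):
    """Build the request body for starting a rekey with new share settings."""
    if not 1 <= secret_threshold <= secret_shares:
        raise ValueError("threshold must be between 1 and the number of shares")
    return {"secret_shares": secret_shares, "secret_threshold": secret_threshold}


# Hypothetical cluster: 5 shares, threshold 3, two of the five holders have left.
holders = {"ana", "ben", "chris"}

print(spare_key_holders(3, holders))   # 0
print(build_rekey_init_payload(5, 3))
```

A result of zero means one more departure would make it impossible to unseal or rekey, so rekeying should happen before that point.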