Deploying OpenDSO on GCP with Helm

This guide walks through deploying OpenDSO to Google Kubernetes Engine (GKE) from scratch using Helm. It covers every step: provisioning GCP infrastructure, configuring DNS and TLS, preparing Kubernetes secrets, and deploying the Helm chart.


1. Prerequisites

1.1 Required Tools

Install all of the following tools before proceeding.

gcloud CLI

The Google Cloud SDK — required to manage GCP resources.

# Install (Linux/macOS)
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init

See the official install guide for Windows or alternative methods.

kubectl

The Kubernetes CLI. If you already use the gcloud CLI, the simplest way to install it is as a gcloud component:

gcloud components install kubectl

Or download directly:

# Linux
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl && sudo mv kubectl /usr/local/bin/

Helm (v3.8+)

The Kubernetes package manager.

# Linux/macOS (script install)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

# macOS (Homebrew)
brew install helm

# Windows (Chocolatey)
choco install kubernetes-helm

Verify the version:

helm version
# Should show v3.8.0 or later

nk — NATS NKey Tool

Required to generate NATS authentication keys.

# Download the latest release for your OS from:
# https://github.com/nats-io/nkeys/releases
# Example for Linux amd64:
curl -LO https://github.com/nats-io/nkeys/releases/latest/download/nk-linux-amd64.zip
unzip nk-linux-amd64.zip
chmod +x nk && sudo mv nk /usr/local/bin/

jq and openssl

Used in helper scripts.

# Ubuntu/Debian
sudo apt-get install -y jq openssl

# macOS
brew install jq openssl

1.2 Tool Version Reference

Tool | Minimum Version
gcloud CLI | any current
kubectl | any current
Helm | 3.8+
nk (NATS NKey) | any current
cert-manager | 1.12+
Kubernetes (GKE) | 1.24+
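
The requirements above can be checked before provisioning anything. The following is a minimal preflight sketch, not part of the OpenDSO tooling: the function names missing_tools and version_ge are hypothetical, and version_ge assumes a sort that supports -V (GNU coreutils and recent macOS).

```shell
#!/usr/bin/env bash
# Preflight helpers (hypothetical names, not shipped with OpenDSO).

# Print each tool from the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed when version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example usage (run manually once the tools are installed):
#   missing_tools gcloud kubectl helm nk jq openssl
#   helm_version=$(helm version --short | sed 's/^v//; s/+.*//')
#   version_ge "$helm_version" 3.8.0 || echo "Helm 3.8+ required"
```

An empty result from missing_tools means every listed tool was found.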

2. GCP Project Setup

2.1 Create or Select a GCP Project

# Create a new project
gcloud projects create YOUR_PROJECT_ID --name="OpenDSO"

# Or select an existing project
gcloud config set project YOUR_PROJECT_ID

export GCP_PROJECT=YOUR_PROJECT_ID

2.2 Enable Required APIs

gcloud services enable \
container.googleapis.com \
dns.googleapis.com \
compute.googleapis.com \
artifactregistry.googleapis.com \
--project=$GCP_PROJECT

2.3 Configure Billing

Ensure billing is enabled for the project. GKE clusters require an active billing account.

# List billing accounts
gcloud billing accounts list

# Link billing account to project
gcloud billing projects link $GCP_PROJECT \
--billing-account=YOUR_BILLING_ACCOUNT_ID

2.4 Configure Image Pull Access

OpenDSO images are served from GCP Artifact Registry. The GKE node service account must have read access.

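# Run this step after creating the cluster (section 3); it uses CLUSTER_NAME
# and GCP_ZONE from section 3.1.
NODE_SA=$(gcloud container clusters describe $CLUSTER_NAME \
--zone=$GCP_ZONE --project=$GCP_PROJECT \
--format='value(nodeConfig.serviceAccount)')

# If no service account was specified, the value is "default", which means the
# Compute Engine default SA: <project-number>-compute@developer.gserviceaccount.com
if [ "$NODE_SA" = "default" ]; then
PROJECT_NUMBER=$(gcloud projects describe $GCP_PROJECT --format='value(projectNumber)')
NODE_SA="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"
fi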

# Grant Artifact Registry reader role
gcloud projects add-iam-policy-binding $GCP_PROJECT \
--member="serviceAccount:${NODE_SA}" \
--role="roles/artifactregistry.reader"

If Open Energy Solutions (OES) hosts the registry, they will supply the exact registry path and grant access to your project.


3. Create the GKE Cluster

3.1 Set Variables

export GCP_PROJECT=your-project-id
export GCP_ZONE=us-east1-b
export CLUSTER_NAME=opendso-cluster-1
export NAMESPACE=opendso
export RELEASE_NAME=opendso
export DOMAIN=opendso.example.com
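
Every later command relies on these variables, and an unset one produces a malformed command (for example an empty --zone). As a sketch, a small guard can list the variables that are still missing; missing_vars is a hypothetical helper name, not something the chart provides.

```shell
# Print the name of every listed variable that is unset or empty.
# missing_vars is a hypothetical helper, not part of OpenDSO.
missing_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}

# Example usage:
#   unset_vars=$(missing_vars GCP_PROJECT GCP_ZONE CLUSTER_NAME NAMESPACE RELEASE_NAME DOMAIN)
#   [ -z "$unset_vars" ] || echo "Set these first: $unset_vars"
```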

3.2 Create the Cluster

The following command creates a production-ready GKE cluster with autoscaling and all required addons:

gcloud container clusters create $CLUSTER_NAME \
--project=$GCP_PROJECT \
--zone=$GCP_ZONE \
--cluster-version=latest \
--machine-type=e2-standard-4 \
--num-nodes=3 \
--disk-size=50 \
--enable-autoscaling \
--min-nodes=1 \
--max-nodes=10 \
--enable-autorepair \
--enable-autoupgrade \
--enable-ip-alias \
--network=default \
--subnetwork=default \
--no-enable-basic-auth \
--no-issue-client-certificate \
--logging=SYSTEM,WORKLOAD \
--monitoring=SYSTEM \
--addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver

Node sizing guidance:

Use --enable-autoscaling --min-nodes=1 --max-nodes=10 for production to handle variable load.

3.3 Authenticate kubectl

gcloud container clusters get-credentials $CLUSTER_NAME \
--zone=$GCP_ZONE \
--project=$GCP_PROJECT

# Verify
kubectl cluster-info
kubectl get nodes

3.4 Create the Application Namespace

kubectl create namespace $NAMESPACE

4. Install the nginx Ingress Controller

OpenDSO uses nginx as its ingress controller. All external traffic — HTTP, HTTPS, and NATS WebSocket — routes through it.

4.1 Install via Helm

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--create-namespace \
--set controller.service.type=LoadBalancer \
--set controller.service.externalTrafficPolicy=Local \
--wait --timeout 5m

4.2 Get the LoadBalancer IP

The LoadBalancer IP is needed for DNS configuration. Wait until it is assigned:

kubectl get svc -n ingress-nginx ingress-nginx-controller --watch

Once the EXTERNAL-IP column shows an IP (not <pending>):

export LB_IP=$(kubectl get svc -n ingress-nginx ingress-nginx-controller \
-o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "LoadBalancer IP: $LB_IP"
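
The --watch approach is interactive. For a scripted deployment, a polling loop is one alternative; the sketch below assumes a hypothetical helper named wait_for_output that retries a command until it prints something or the attempts run out.

```shell
# Retry a command until it prints non-empty output.
# usage: wait_for_output <attempts> <sleep-seconds> <command...>
# wait_for_output is a hypothetical helper, not part of kubectl or OpenDSO.
wait_for_output() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    out=$("$@" 2>/dev/null) || true
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example usage (up to 60 attempts, 5 seconds apart):
#   LB_IP=$(wait_for_output 60 5 kubectl get svc -n ingress-nginx ingress-nginx-controller \
#     -o jsonpath='{.status.loadBalancer.ingress[0].ip}') || echo "No IP assigned yet"
```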

Important: This IP is stable for the lifetime of the LoadBalancer service. Do not delete and recreate the service unless necessary.


5. Configure DNS with Cloud DNS

OpenDSO uses a single base domain with multiple subdomains (one per service). All subdomains point to the nginx LoadBalancer IP.

5.1 Create a Cloud DNS Managed Zone

export DNS_ZONE_NAME=opendso-zone

gcloud dns managed-zones create $DNS_ZONE_NAME \
--dns-name="${DOMAIN}." \
--description="OpenDSO DNS zone" \
--project=$GCP_PROJECT

5.2 Retrieve the Assigned Nameservers

gcloud dns managed-zones describe $DNS_ZONE_NAME \
--project=$GCP_PROJECT \
--format='value(nameServers)'

You will see 4 nameservers such as:

ns-cloud-a1.googledomains.com.
ns-cloud-a2.googledomains.com.
ns-cloud-a3.googledomains.com.
ns-cloud-a4.googledomains.com.

5.3 Delegate the Domain at Your Registrar

In your domain registrar's control panel, add NS records pointing to the 4 Google nameservers. This step delegates the subdomain to Cloud DNS.

Allow 15–30 minutes for NS record propagation. cert-manager DNS-01 challenges will fail if NS records have not propagated yet.

5.4 Create DNS A Records

# Start a DNS transaction
gcloud dns record-sets transaction start \
--zone=$DNS_ZONE_NAME \
--project=$GCP_PROJECT

# Apex record
gcloud dns record-sets transaction add $LB_IP \
--name="${DOMAIN}." \
--ttl=300 \
--type=A \
--zone=$DNS_ZONE_NAME \
--project=$GCP_PROJECT

# Wildcard record (covers all subdomains)
gcloud dns record-sets transaction add $LB_IP \
--name="*.${DOMAIN}." \
--ttl=300 \
--type=A \
--zone=$DNS_ZONE_NAME \
--project=$GCP_PROJECT

# Commit the transaction
gcloud dns record-sets transaction execute \
--zone=$DNS_ZONE_NAME \
--project=$GCP_PROJECT

5.5 Verify DNS Propagation

# Should return the LoadBalancer IP
dig +short $DOMAIN
dig +short keycloak.$DOMAIN
dig +short api.$DOMAIN

# Or using nslookup
nslookup $DOMAIN

Do not proceed to cert-manager setup until DNS resolves correctly. Let's Encrypt DNS-01 challenges require that your nameservers respond with the correct TXT records.
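
The manual dig checks above can be wrapped in a single pass/fail check. This is a sketch only: resolve_host and check_dns are hypothetical helper names, and resolve_host assumes dig is installed and takes the last line of its short output.

```shell
# Resolve a hostname to its final A record (last line of `dig +short`).
resolve_host() {
  dig +short "$1" | tail -n1
}

# Compare each hostname against the expected IP; fail if any differs.
# usage: check_dns <expected-ip> <host>...
check_dns() {
  expected=$1
  shift
  rc=0
  for host in "$@"; do
    actual=$(resolve_host "$host")
    if [ "$actual" = "$expected" ]; then
      echo "ok   $host -> $actual"
    else
      echo "FAIL $host -> ${actual:-<no answer>} (expected $expected)"
      rc=1
    fi
  done
  return $rc
}

# Example usage:
#   check_dns "$LB_IP" "$DOMAIN" "keycloak.$DOMAIN" "api.$DOMAIN"
```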


6. Install cert-manager and Configure TLS

OpenDSO requires a wildcard TLS certificate (*.yourdomain.com) to serve all subdomains over HTTPS. cert-manager automates certificate issuance from Let's Encrypt using DNS-01 validation.

6.1 Install cert-manager

helm repo add jetstack https://charts.jetstack.io
helm repo update

helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--version v1.14.0 \
--set installCRDs=true \
--wait --timeout 5m

Verify all cert-manager pods are running:

kubectl get pods -n cert-manager

6.2 Apply the GKE cert-manager Fix (Required)

GKE restricts access to the kube-system namespace, which breaks cert-manager's default leader election configuration. Apply the fix before creating any issuers:

# Patch cert-manager to use its own namespace for leader election
kubectl patch deployment cert-manager \
-n cert-manager \
--type=json \
-p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--leader-election-namespace=cert-manager"}]'

# Create lease RBAC in cert-manager namespace
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cert-manager-leaderelection
  namespace: cert-manager
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get","list","watch","create","update","patch","delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cert-manager-leaderelection
  namespace: cert-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cert-manager-leaderelection
subjects:
  - kind: ServiceAccount
    name: cert-manager
    namespace: cert-manager
EOF

# Restart cert-manager to pick up the change
kubectl rollout restart deployment/cert-manager -n cert-manager
kubectl rollout status deployment/cert-manager -n cert-manager

6.3 Create a Service Account for DNS-01 Challenges

cert-manager needs permission to create DNS TXT records in Cloud DNS to prove domain ownership.

export CERT_MANAGER_SA=cert-manager-dns

# Create the service account
gcloud iam service-accounts create $CERT_MANAGER_SA \
--display-name="cert-manager DNS-01 solver" \
--project=$GCP_PROJECT

# Grant DNS admin role
gcloud projects add-iam-policy-binding $GCP_PROJECT \
--member="serviceAccount:${CERT_MANAGER_SA}@${GCP_PROJECT}.iam.gserviceaccount.com" \
--role="roles/dns.admin"

# Create and download a JSON key
gcloud iam service-accounts keys create cert-manager-dns-key.json \
--iam-account="${CERT_MANAGER_SA}@${GCP_PROJECT}.iam.gserviceaccount.com"

# Store the key as a Kubernetes secret in the cert-manager namespace
kubectl create secret generic clouddns-dns01-solver-svc-acct \
--from-file=key.json=cert-manager-dns-key.json \
-n cert-manager

# Clean up the local key file
rm cert-manager-dns-key.json

6.4 Create the Let's Encrypt ClusterIssuer

export LETSENCRYPT_EMAIL=admin@example.com

kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ${LETSENCRYPT_EMAIL}
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - dns01:
          cloudDNS:
            project: ${GCP_PROJECT}
            hostedZoneName: ${DNS_ZONE_NAME}
            serviceAccountSecretRef:
              name: clouddns-dns01-solver-svc-acct
              key: key.json
EOF

Tip: For testing, replace letsencrypt-prod with letsencrypt-staging and point the ACME server to https://acme-staging-v02.api.letsencrypt.org/directory. Staging has much higher rate limits. Let's Encrypt production is limited to 50 certificates per registered domain per week.
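
One way to keep the production and staging issuers in step is to render both from the same template. The sketch below assumes a hypothetical helper named render_cluster_issuer that prints the manifest for either environment; it reuses the variables and secret name from this section.

```shell
# Print a ClusterIssuer manifest for "prod" or "staging".
# render_cluster_issuer is a hypothetical helper, not part of cert-manager.
render_cluster_issuer() {
  case "$1" in
    prod)    server=https://acme-v02.api.letsencrypt.org/directory ;;
    staging) server=https://acme-staging-v02.api.letsencrypt.org/directory ;;
    *) echo "usage: render_cluster_issuer prod|staging" >&2; return 1 ;;
  esac
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-$1
spec:
  acme:
    server: ${server}
    email: ${LETSENCRYPT_EMAIL}
    privateKeySecretRef:
      name: letsencrypt-$1
    solvers:
      - dns01:
          cloudDNS:
            project: ${GCP_PROJECT}
            hostedZoneName: ${DNS_ZONE_NAME}
            serviceAccountSecretRef:
              name: clouddns-dns01-solver-svc-acct
              key: key.json
EOF
}

# Example usage:
#   render_cluster_issuer staging | kubectl apply -f -
```

When testing against staging, the Certificate's issuerRef.name must also be letsencrypt-staging. Staging certificates are not trusted by browsers.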

6.5 Request the Wildcard Certificate

kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: opendso-tls
  namespace: ${NAMESPACE}
spec:
  secretName: ${RELEASE_NAME}-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - "${DOMAIN}"
    - "*.${DOMAIN}"
EOF

Monitor until the certificate is READY=True (typically 2–10 minutes):

kubectl get certificate -n $NAMESPACE --watch
kubectl describe certificate opendso-tls -n $NAMESPACE

If it is taking longer than 15 minutes, check the challenge status:

kubectl get challenges,orders -n $NAMESPACE
kubectl describe challenge -n $NAMESPACE

7. Prepare Kubernetes Secrets

All secrets must exist in the cluster before running helm install. These are not managed by Helm so they persist across upgrades and uninstalls.

7.1 Extract TLS Credentials

Once the Certificate is READY, extract the certificate and key for use in derived secrets:

kubectl get secret ${RELEASE_NAME}-tls-secret -n $NAMESPACE \
-o jsonpath='{.data.tls\.crt}' | base64 -d > /tmp/tls.crt

kubectl get secret ${RELEASE_NAME}-tls-secret -n $NAMESPACE \
-o jsonpath='{.data.tls\.key}' | base64 -d > /tmp/tls.key

7.2 Create Service-Specific TLS Secrets

Several services reference TLS material via their own named secrets:

# Root CA (used by services for trust)
kubectl create secret generic root-ca \
--from-file=ca.pem=/tmp/tls.crt \
-n $NAMESPACE

# server-cert and server-key (legacy aliases used by Keycloak and others)
kubectl create secret generic server-cert \
--from-file=server-cert.pem=/tmp/tls.crt \
-n $NAMESPACE

kubectl create secret generic server-key \
--from-file=server-key.pem=/tmp/tls.key \
-n $NAMESPACE

# MongoDB TLS secret (combined PEM format)
cat /tmp/tls.crt /tmp/tls.key > /tmp/mongodb.pem
kubectl create secret generic ${RELEASE_NAME}-mongodb-tls \
--from-file=mongodb.pem=/tmp/mongodb.pem \
--from-file=ca.crt=/tmp/tls.crt \
-n $NAMESPACE

# Clean up temp files
rm /tmp/tls.crt /tmp/tls.key /tmp/mongodb.pem

7.3 Generate NATS Authentication Keys

NATS uses NKey-based authentication. Generate fresh keys — these must be generated once and stored; they cannot change without reconfiguring NATS.

# Generate seeds
ACCOUNT_SEED=$(nk -gen account)
USER_SEED=$(nk -gen user)
XKEY_SEED=$(nk -gen curve)

# Derive public keys
ACCOUNT_PUB=$(echo "$ACCOUNT_SEED" | nk -inkey /dev/stdin -pubout)
USER_PUB=$(echo "$USER_SEED" | nk -inkey /dev/stdin -pubout)
XKEY_PUB=$(echo "$XKEY_SEED" | nk -inkey /dev/stdin -pubout)

echo "ACCOUNT_PUB: $ACCOUNT_PUB"
echo "USER_PUB: $USER_PUB"
echo "XKEY_PUB: $XKEY_PUB"

# Store private seeds in Kubernetes
kubectl create secret generic ${RELEASE_NAME}-nats-auth-keys \
--from-literal=account.nk="$ACCOUNT_SEED" \
--from-literal=user.nk="$USER_SEED" \
--from-literal=xkey.xk="$XKEY_SEED" \
-n $NAMESPACE

# Save public keys to a values override file for Helm
cat > /tmp/values-nats-auth-generated.yaml <<EOF
nats:
  authCallout:
    enabled: true
    issuer: "${ACCOUNT_PUB}"
    authUser: "${USER_PUB}"
    xkey: "${XKEY_PUB}"

nats-auth-svc:
  natsKeysSecret: "${RELEASE_NAME}-nats-auth-keys"
EOF

Important: Save values-nats-auth-generated.yaml alongside your other values files (it is written to /tmp, which may not survive a reboot). Pass it with -f on every helm upgrade --install, not only on the first install.
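
A common slip is pasting a private seed where a public key belongs. NKey strings encode their type in the leading characters: seeds start with S followed by the type letter, and public keys start with the type letter alone (A for account, U for user, X for curve). The sketch below classifies a key by prefix only and does not verify its checksum; nkey_kind is a hypothetical helper name.

```shell
# Classify an NKey string by its prefix (no checksum validation).
# nkey_kind is a hypothetical helper, not part of the nk tool.
nkey_kind() {
  case "$1" in
    SA*) echo "account-seed" ;;
    SU*) echo "user-seed" ;;
    SX*) echo "curve-seed" ;;
    A*)  echo "account-public" ;;
    U*)  echo "user-public" ;;
    X*)  echo "curve-public" ;;
    *)   echo "unknown" ;;
  esac
}

# Example usage: the values file should only ever contain public keys.
#   [ "$(nkey_kind "$ACCOUNT_PUB")" = "account-public" ] || echo "ACCOUNT_PUB is not an account public key"
#   [ "$(nkey_kind "$USER_PUB")" = "user-public" ] || echo "USER_PUB is not a user public key"
#   [ "$(nkey_kind "$XKEY_PUB")" = "curve-public" ] || echo "XKEY_PUB is not a curve public key"
```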

7.4 Create Grafana Credentials Secret

GRAFANA_ADMIN_PASSWORD=$(openssl rand -base64 16)
CITUS_PASSWORD=$(openssl rand -base64 16)
MONGODB_PASSWORD=$(openssl rand -base64 16)
OPENDSO_APPS_DB_PASSWORD=$(openssl rand -base64 16)

kubectl create secret generic ${RELEASE_NAME}-grafana-credentials \
--from-literal=admin-user="admin" \
--from-literal=admin-password="$GRAFANA_ADMIN_PASSWORD" \
--from-literal=citus-password="$CITUS_PASSWORD" \
--from-literal=mongodb-password="$MONGODB_PASSWORD" \
--from-literal=opendso-apps-db-password="$OPENDSO_APPS_DB_PASSWORD" \
-n $NAMESPACE

echo "Grafana admin password: $GRAFANA_ADMIN_PASSWORD"
echo "Save the above password securely; you will need it to log in to Grafana."

7.5 Create Database Secrets

# OpenDSO Apps DB (TimescaleDB)
APPS_DB_PASSWORD=$(openssl rand -base64 16)
kubectl create secret generic ${RELEASE_NAME}-opendso-apps-db-secret \
--from-literal=password="$APPS_DB_PASSWORD" \
-n $NAMESPACE

# Citus DB
CITUS_DB_PASSWORD=$(openssl rand -base64 16)
kubectl create secret generic ${RELEASE_NAME}-citus-db-secret \
--from-literal=password="$CITUS_DB_PASSWORD" \
-n $NAMESPACE

echo "Save these passwords securely before proceeding."
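
Passwords from openssl rand -base64 can contain +, / and =, which need escaping if a password is ever embedded in a connection URI. This guide does not say whether OpenDSO services do that, so the following is an optional sketch: gen_password is a hypothetical helper that produces 32 hexadecimal characters from 16 random bytes.

```shell
# Generate a 32-character hexadecimal password from 16 random bytes.
# gen_password is a hypothetical helper; it reads /dev/urandom directly.
gen_password() {
  head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \t\n'
}

# Example usage:
#   APPS_DB_PASSWORD=$(gen_password)
```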

7.6 Create License Secret

Contact Open Energy Solutions for your license credentials, then create the secret:

kubectl create secret generic ${RELEASE_NAME}-opendso-license \
--from-literal=LICENSE_KEY="your-license-key" \
--from-literal=LICENSE_INSTALLATION_KEY="your-installation-key" \
--from-literal=LICENSE_ENVIRONMENT_NAME="$(kubectl get namespace kube-system -o jsonpath='{.metadata.uid}')" \
--from-literal=LICENSE_API_URL="https://license.oesinc.com" \
-n $NAMESPACE

8. Deploy with Helm

8.1 Add the Helm Repository

# If using the chart from a Helm repo (contact OES for the repo URL):
helm repo add opendso <OES_HELM_REPO_URL>
helm repo update

# Or if working from a local chart directory:
cd /path/to/opendso-helm-charts

8.2 Download Chart Dependencies

cd /path/to/opendso-helm-charts/opendso
helm dependency update
cd ..

This downloads the Grafana subchart (grafana-10.5.14.tgz) into opendso/charts/.

8.3 Pre-create the Topology ConfigMap

The topology site configuration (cim.xml) is 367 KB — larger than the 262 KB Kubernetes annotation limit for client-side apply. It must be pre-created using server-side apply:

kubectl create configmap ${RELEASE_NAME}-topology-genesis-site-config \
--from-file=cim.xml=opendso/configs/ieee13/topology-genesis/cim.xml \
--from-file=cimex.config=opendso/configs/ieee13/topology-genesis/cimex.config \
--namespace $NAMESPACE \
--dry-run=client -o yaml \
| kubectl apply --server-side --field-manager=helm-deployer -f -
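
Client-side apply stores the whole object in the last-applied-configuration annotation, and Kubernetes caps total annotation size at 262144 bytes. For other site configurations, a quick size check shows whether the server-side path is needed. This is a sketch; needs_server_side_apply is a hypothetical helper, and it compares raw file size only, ignoring the small overhead of the surrounding manifest.

```shell
# Kubernetes limit on the total size of an object's annotations, in bytes.
ANNOTATION_LIMIT_BYTES=262144

# Succeed when the file is larger than the annotation limit.
# needs_server_side_apply is a hypothetical helper, not part of kubectl.
needs_server_side_apply() {
  size=$(wc -c < "$1" | tr -d ' ')
  [ "$size" -gt "$ANNOTATION_LIMIT_BYTES" ]
}

# Example usage:
#   if needs_server_side_apply opendso/configs/ieee13/topology-genesis/cim.xml; then
#     echo "Use kubectl apply --server-side for this ConfigMap"
#   fi
```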

8.4 Install the Chart

helm upgrade --install $RELEASE_NAME ./opendso \
--namespace $NAMESPACE \
--create-namespace \
-f opendso/values-gcp.yaml \
-f /tmp/values-nats-auth-generated.yaml \
--set global.domain="$DOMAIN" \
--set global.environment.apiUrl="https://api.${DOMAIN}" \
--set global.keycloak.url="https://keycloak.${DOMAIN}" \
--set global.keycloak.internalUrl="http://${RELEASE_NAME}-keycloak-svc:8080" \
--set keycloak.config.hostname="keycloak.${DOMAIN}" \
--set global.resourceProfile="production" \
--set global.tls.existingSecret="${RELEASE_NAME}-tls-secret" \
--set ingress.tls.secretName="${RELEASE_NAME}-tls-secret" \
--set nats.tls.secretName="${RELEASE_NAME}-tls-secret" \
--set mongodb.tls.existingSecret="${RELEASE_NAME}-mongodb-tls" \
--set grafana.admin.existingSecret="${RELEASE_NAME}-grafana-credentials" \
--set "grafana.envValueFrom.CITUS_PASSWORD.secretKeyRef.name=${RELEASE_NAME}-grafana-credentials" \
--set "grafana.envValueFrom.OPENDSO_APPS_DB_PASSWORD.secretKeyRef.name=${RELEASE_NAME}-grafana-credentials" \
--set "global.topology-genesis.externalConfigMap=true" \
--wait \
--timeout 15m

Resource profile options: minimal (dev/test), default, production.

If the cluster is slow or nodes are still provisioning, increase the --timeout to 20m or 30m.

8.5 Monitor the Rollout

In a separate terminal, watch pods coming up:

kubectl get pods -n $NAMESPACE --watch

Typical startup order: databases → Keycloak → NATS → API services → frontend apps. Expect 5–10 minutes for all pods to reach Running state.


9. Verify the Deployment

9.1 Check Pod Status

All pods should be 1/1 Running (or 2/2, 3/3 for multi-container pods):

kubectl get pods -n $NAMESPACE

Pods created by the init jobs (mongodb-init, citus-init, opendso-apps-db-init) will be in Completed state; this is expected.
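
On a large release it is easier to list only the pods that need attention. The sketch below filters the output of kubectl get pods --no-headers, treating Completed pods and fully ready Running pods as healthy; unhealthy_pods is a hypothetical helper name.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the
# name, READY and STATUS columns of pods that are not healthy.
# unhealthy_pods is a hypothetical helper, not part of kubectl.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 == "Completed") next
    if ($3 == "Running" && ready[1] == ready[2]) next
    print $1, $2, $3
  }'
}

# Example usage:
#   kubectl get pods -n $NAMESPACE --no-headers | unhealthy_pods
```

No output means every pod is either Completed or Running with all containers ready.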

9.2 Check Ingress

kubectl get ingress -n $NAMESPACE

All ingress rules should show the LoadBalancer IP in the ADDRESS column.

9.3 Test Key Endpoints

# Keycloak OIDC discovery (-s hides the progress meter)
curl -sk https://keycloak.${DOMAIN}/realms/oes/.well-known/openid-configuration | jq .issuer

# GMS API health (expects 401: the endpoint is auth-protected, so 401 means the API is up)
curl -s -o /dev/null -w "%{http_code}\n" https://api.${DOMAIN}/api/health

# NATS WebSocket upgrade (expects 101 Switching Protocols)
curl -o /dev/null -w "nats-ws: %{http_code}\n" \
--http1.1 \
-H "Connection: Upgrade" \
-H "Upgrade: websocket" \
-H "Sec-WebSocket-Key: SGVsbG8sIFdvcmxkIQ==" \
-H "Sec-WebSocket-Version: 13" \
https://nats.${DOMAIN}

9.4 Access the UI

Open https://<domain> in a browser. You should be redirected to the Keycloak login page.


10. Service Endpoints

All services are accessed via HTTPS subdomains of the base domain.

URL | Service | Notes
https://<domain> | Main UI (genesis-node-app) | Primary entry point
https://keycloak.<domain> | Keycloak (identity provider) | Login portal
https://api.<domain> | GMS REST API | Returns 401 without auth token
wss://nats.<domain> | NATS WebSocket | Used by browser clients
https://grafana.<domain> | Grafana dashboards | Use Grafana credentials secret
https://gis.<domain> | GIS map viewer |
https://oneline.<domain> | One-line diagram |
https://dataviewer.<domain> | Data viewer |
https://eventviewer.<domain> | Event viewer |
https://inventory.<domain> | Inventory manager |
https://historian.<domain> | Historian app |
https://device.<domain> | ESS manager |
https://esstesting.<domain> | ESS tester |
https://derdispatch.<domain> | DER dispatch |
https://scheduledispatch.<domain> | Schedule dispatch |
https://openfmb.<domain> | OpenFMB inspector |
https://openfmbeventcreator.<domain> | OpenFMB event creator |
https://docs.<domain> | Documentation |
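
The HTTPS endpoints listed above can be smoke-tested in one loop. The sketch below assumes a hypothetical helper named service_hosts that expands the subdomain list for a base domain; the NATS WebSocket endpoint is left out because it is not a plain HTTPS endpoint.

```shell
# Print the HTTPS hostnames from the table above for a given base domain.
# service_hosts is a hypothetical helper, not part of OpenDSO.
service_hosts() {
  domain=$1
  printf '%s\n' "$domain"
  for sub in keycloak api grafana gis oneline dataviewer eventviewer inventory \
             historian device esstesting derdispatch scheduledispatch \
             openfmb openfmbeventcreator docs; do
    printf '%s.%s\n' "$sub" "$domain"
  done
}

# Example usage: print the HTTP status code for each host.
#   service_hosts "$DOMAIN" | while read -r host; do
#     code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://$host/")
#     echo "$code $host"
#   done
```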

11. Troubleshooting

11.1 Useful Diagnostic Commands

# Pod status and recent events
kubectl get pods -n $NAMESPACE
kubectl describe pod <pod-name> -n $NAMESPACE

# Pod logs (last 100 lines)
kubectl logs <pod-name> -n $NAMESPACE --tail=100

# Follow logs in real time
kubectl logs -f <pod-name> -n $NAMESPACE

# Events across the namespace (sorted by time)
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp'

# Certificate status
kubectl get certificate,certificaterequest,order,challenge -n $NAMESPACE

# Ingress details
kubectl describe ingress -n $NAMESPACE

# Check a service is resolving inside the cluster
kubectl run -it --rm debug --image=busybox --restart=Never -- \
nslookup ${RELEASE_NAME}-gms-api.$NAMESPACE.svc.cluster.local

11.2 GMS API Not Starting

Symptom: gms-api pod shows Running but the /api/health endpoint is unreachable (connection refused on port 8000).

Root cause: The API silently fails to start if the MongoDB authsettings document with environmentId='default' does not exist. The HTTP server never binds.

Fix: Verify the mongodb-init job completed:

kubectl get jobs -n $NAMESPACE
kubectl logs job/${RELEASE_NAME}-mongodb-init -n $NAMESPACE

If the job failed or the document is missing, create it manually:

# Get the MongoDB root password
MONGO_PASSWORD=$(kubectl get secret ${RELEASE_NAME}-mongodb \
-n $NAMESPACE -o jsonpath='{.data.mongodb-root-password}' | base64 -d)

# Connect to MongoDB
kubectl exec -it ${RELEASE_NAME}-mongodb-0 -n $NAMESPACE -- mongosh \
-u root -p "$MONGO_PASSWORD" --authenticationDatabase admin

# Inside the shell (shell variables are not expanded here, so replace
# <release-name> with the value of $RELEASE_NAME):
use settings_api
db.authsettings.insertOne({
  environmentId: 'default',
  keycloakUrl: 'https://<release-name>-keycloak-svc:8443',
  keycloakRealm: 'oes',
  keycloakClientId: 'gms',
  createdAt: new Date(),
  updatedAt: new Date()
})
exit

Then restart the API:

kubectl rollout restart deployment/${RELEASE_NAME}-gms-api -n $NAMESPACE
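
The manual insert can also be run non-interactively, which lets the local shell substitute the release name into the Keycloak URL. The sketch below assumes a hypothetical helper named render_authsettings_js; the field values mirror the manual steps in this section.

```shell
# Print the insert statement with the release name filled in by the local shell.
# render_authsettings_js is a hypothetical helper, not part of OpenDSO.
render_authsettings_js() {
  release=$1
  cat <<EOF
db.getSiblingDB('settings_api').authsettings.insertOne({
  environmentId: 'default',
  keycloakUrl: 'https://${release}-keycloak-svc:8443',
  keycloakRealm: 'oes',
  keycloakClientId: 'gms',
  createdAt: new Date(),
  updatedAt: new Date()
})
EOF
}

# Example usage (assumes MONGO_PASSWORD was fetched as in this section):
#   kubectl exec ${RELEASE_NAME}-mongodb-0 -n $NAMESPACE -- mongosh \
#     -u root -p "$MONGO_PASSWORD" --authenticationDatabase admin \
#     --eval "$(render_authsettings_js "$RELEASE_NAME")"
```

Restart the API afterwards, as with the interactive route.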

11.3 NATS Pod CrashLoopBackOff

Symptom:

unable to load certificates

Fix: The TLS secret is missing or named incorrectly.

# Verify the secret exists
kubectl get secret ${RELEASE_NAME}-tls-secret -n $NAMESPACE

# If missing, recreate it (cert-manager should have created it)
kubectl get certificate opendso-tls -n $NAMESPACE
kubectl describe certificate opendso-tls -n $NAMESPACE

# After fixing the secret, restart NATS
kubectl rollout restart deployment/${RELEASE_NAME}-nats -n $NAMESPACE

11.4 cert-manager Stuck in kube-system Error

Symptom:

leases.coordination.k8s.io is forbidden: User 'system:serviceaccount:cert-manager:cert-manager'
cannot create resource 'leases' in the namespace 'kube-system'

Fix: Re-apply the GKE cert-manager leader election fix from section 6.2.


11.5 Certificate Stuck in Issuing State

Diagnosis steps:

# Check challenge status
kubectl get challenges -n $NAMESPACE
kubectl describe challenge -n $NAMESPACE | grep -A 20 "Status:"

# Check cert-manager controller logs
kubectl logs -n cert-manager deployment/cert-manager --tail=100

# Verify DNS resolves from outside the cluster
dig +short @8.8.8.8 $DOMAIN

Common causes:

  1. DNS not propagated — Check that dig +short $DOMAIN returns the LoadBalancer IP from outside the cluster. Wait and retry.
  2. NS records not set at registrar — Verify the registrar's NS records point to the Cloud DNS nameservers.
  3. cert-manager DNS-01 IAM issue — Verify the service account has roles/dns.admin and the secret clouddns-dns01-solver-svc-acct exists in the cert-manager namespace.
  4. Rate limited by Let's Encrypt — Check cert-manager logs for too many certificates errors. Use staging issuer while testing.

11.6 MongoDB Permission Denied on /data/db

Symptom: MongoDB pod fails with:

chown: /data/db: Operation not permitted

Fix: The pod security context must set fsGroup: 999. This should be set by the chart. If it is missing, override in values:

mongodb:
  podSecurityContext:
    runAsUser: 999
    runAsGroup: 999
    fsGroup: 999

Then redeploy and delete the stuck pod to let it reschedule with the correct context.


11.7 Database Init Scripts Did Not Run

PostgreSQL (Citus DB, OpenDSO Apps DB) only runs init scripts when the data directory is empty (first initialization). If the PVC was pre-existing and the init scripts were skipped:

# Run init scripts manually on Citus DB
kubectl exec -n $NAMESPACE ${RELEASE_NAME}-citus-db-0 -- \
psql -U citususer -f /docker-entrypoint-initdb.d/init.sql

# Run init scripts manually on OpenDSO Apps DB
kubectl exec -n $NAMESPACE ${RELEASE_NAME}-opendso-apps-db-0 -- \
psql -U essuser -d ess_tester -f /docker-entrypoint-initdb.d/10_ess_tester.sql

To force re-initialization (destructive — all data lost):

kubectl delete pvc data-${RELEASE_NAME}-opendso-apps-db-0 -n $NAMESPACE
kubectl delete pod ${RELEASE_NAME}-opendso-apps-db-0 -n $NAMESPACE

11.8 Topology-Genesis ConfigMap Too Large

Symptom:

The ConfigMap "...-topology-genesis-site-config" is invalid: metadata.annotations: Too long

Fix: The cim.xml file (367 KB) exceeds the client-side apply annotation limit of 262 KB. Pre-create the ConfigMap with server-side apply as described in section 8.3, and ensure global.topology-genesis.externalConfigMap: true is set in your values.


11.9 LoadBalancer IP Stuck in <pending>

On standard GKE this resolves within 1–2 minutes. If it stays pending:

# Check for quota issues or provisioning errors
kubectl describe svc ingress-nginx-controller -n ingress-nginx

# Check GCP quotas
gcloud compute project-info describe --project=$GCP_PROJECT | grep -A5 quota

11.10 Image Pull Errors (ImagePullBackOff)

kubectl describe pod <pod-name> -n $NAMESPACE | grep -A 10 "Failed"

Causes:

  • GKE node service account lacks roles/artifactregistry.reader — see section 2.4.
  • Wrong global.imageRegistry value in Helm values — verify the registry URL with OES.

11.11 Grafana Pods Pending

kubectl describe pod ${RELEASE_NAME}-grafana-xxx -n $NAMESPACE | grep -A5 "Events:"

If the cause is Insufficient memory or Insufficient cpu, the cluster nodes are resource-constrained. Either:

  • Scale up the node pool: gcloud container clusters resize $CLUSTER_NAME --num-nodes=4 ...
  • Use a smaller resource profile: --set global.resourceProfile=default

11.12 Keycloak Login Redirect Loop

Symptom: Logging in at https://keycloak.<domain> redirects in a loop or shows a blank page.

Check:

kubectl logs deployment/${RELEASE_NAME}-keycloak -n $NAMESPACE --tail=100 | grep -i error

Common causes:

  1. global.keycloak.url does not match the actual hostname: ensure it is https://keycloak.<domain> (no trailing slash).
  2. TLS certificate does not cover keycloak.<domain>: verify that the certificate's subject alternative names include *.<domain>.
  3. Keycloak realm was not imported: check the logs for the realm import. If the database was initialized but the realm was not imported, delete the Keycloak PVC and redeploy.

12. Backup and Restore

12.1 Database Backups

Run these from your local machine after authenticating kubectl. The commands write to ./backups, so create that directory first (mkdir -p ./backups).

MongoDB

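# Get the MongoDB root password (same secret as in section 11.2)
MONGO_PASSWORD=$(kubectl get secret ${RELEASE_NAME}-mongodb \
-n $NAMESPACE -o jsonpath='{.data.mongodb-root-password}' | base64 -d)

# Dump (the password is passed explicitly because kubectl exec has no TTY for a prompt)
kubectl exec -n $NAMESPACE ${RELEASE_NAME}-mongodb-0 -- \
mongodump --username root --password "$MONGO_PASSWORD" \
--authenticationDatabase admin \
--out /tmp/mongodump

kubectl cp $NAMESPACE/${RELEASE_NAME}-mongodb-0:/tmp/mongodump ./backups/mongodump-$(date +%Y%m%d)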

Citus DB (Historian)

kubectl exec -n $NAMESPACE ${RELEASE_NAME}-citus-db-0 -- \
pg_dump -U citususer ofmb_db > ./backups/citus-$(date +%Y%m%d).sql

OpenDSO Apps DB (ESS / Assets)

kubectl exec -n $NAMESPACE ${RELEASE_NAME}-opendso-apps-db-0 -- \
pg_dump -U essuser ess_tester > ./backups/ess-tester-$(date +%Y%m%d).sql

kubectl exec -n $NAMESPACE ${RELEASE_NAME}-opendso-apps-db-0 -- \
pg_dump -U essuser assets > ./backups/assets-$(date +%Y%m%d).sql

Redis

kubectl exec ${RELEASE_NAME}-ess-manager-redis-0 -n $NAMESPACE -- redis-cli SAVE

kubectl cp $NAMESPACE/${RELEASE_NAME}-ess-manager-redis-0:/data/dump.rdb \
./backups/redis-$(date +%Y%m%d).rdb
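
Shell redirection creates the output file even when kubectl exec or pg_dump fails, so a failed dump can leave a zero-byte file behind. As a minimal sketch, the hypothetical helper empty_backups lists such files so they are noticed before a restore is attempted.

```shell
# List zero-byte files under a backup directory.
# Succeeds when there are none; prints them and fails otherwise.
# empty_backups is a hypothetical helper, not part of OpenDSO.
empty_backups() {
  found=$(find "$1" -type f -size 0)
  if [ -z "$found" ]; then
    return 0
  fi
  printf '%s\n' "$found"
  return 1
}

# Example usage:
#   empty_backups ./backups || echo "Some backups are empty; rerun the dumps above"
```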

12.2 Secrets Backup

These secrets are not managed by Helm and persist across helm uninstall. Back them up to a secure vault:

for secret in \
${RELEASE_NAME}-nats-auth-keys \
${RELEASE_NAME}-opendso-license \
${RELEASE_NAME}-tls-secret \
${RELEASE_NAME}-mongodb-tls \
${RELEASE_NAME}-grafana-credentials \
${RELEASE_NAME}-opendso-apps-db-secret \
${RELEASE_NAME}-citus-db-secret \
root-ca server-cert server-key; do
kubectl get secret $secret -n $NAMESPACE -o yaml \
> ./backups/secret-${secret}-$(date +%Y%m%d).yaml
echo "Backed up: $secret"
done

Warning: These YAML files contain base64-encoded secrets. Store them in an encrypted location (e.g., GCP Secret Manager, Vault).

12.3 Restore Order

When restoring to a new cluster:

  1. Create the GKE cluster and configure kubectl
  2. Install ingress-nginx and cert-manager (sections 4–6)
  3. Re-create the namespace and all secrets from backups
  4. helm upgrade --install with the same values
  5. Restore database contents after pods are running
  6. Verify the deployment (section 9)
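
For step 3, the secret YAML files from section 12.2 still carry fields from the old cluster (resourceVersion, uid, creationTimestamp), and the API server may reject a create request that sets resourceVersion. One way to handle this is to filter those lines out before applying. This is a sketch: strip_cluster_fields is a hypothetical helper and relies on the indented one-line layout that kubectl get -o yaml produces.

```shell
# Drop cluster-specific metadata lines from a manifest read on stdin.
# strip_cluster_fields is a hypothetical helper, not part of kubectl.
strip_cluster_fields() {
  grep -vE '^[[:space:]]+(resourceVersion|uid|creationTimestamp):'
}

# Example usage:
#   for f in ./backups/secret-*.yaml; do
#     strip_cluster_fields < "$f" | kubectl apply -n $NAMESPACE -f -
#   done
```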

13. Teardown

Remove the Helm Release

helm uninstall $RELEASE_NAME -n $NAMESPACE

Persistent Volume Claims are not deleted by helm uninstall. Delete them explicitly if you want to free storage:

kubectl delete pvc -n $NAMESPACE --all

Delete Secrets

kubectl delete secret -n $NAMESPACE \
${RELEASE_NAME}-tls-secret \
${RELEASE_NAME}-mongodb-tls \
${RELEASE_NAME}-nats-auth-keys \
${RELEASE_NAME}-grafana-credentials \
${RELEASE_NAME}-opendso-apps-db-secret \
${RELEASE_NAME}-citus-db-secret \
root-ca server-cert server-key

Delete the Namespace

kubectl delete namespace $NAMESPACE

Remove DNS Records

gcloud dns record-sets transaction start --zone=$DNS_ZONE_NAME --project=$GCP_PROJECT

gcloud dns record-sets transaction remove $LB_IP \
--name="${DOMAIN}." --ttl=300 --type=A \
--zone=$DNS_ZONE_NAME --project=$GCP_PROJECT

gcloud dns record-sets transaction remove $LB_IP \
--name="*.${DOMAIN}." --ttl=300 --type=A \
--zone=$DNS_ZONE_NAME --project=$GCP_PROJECT

gcloud dns record-sets transaction execute --zone=$DNS_ZONE_NAME --project=$GCP_PROJECT

gcloud dns managed-zones delete $DNS_ZONE_NAME --project=$GCP_PROJECT

Delete the GKE Cluster

gcloud container clusters delete $CLUSTER_NAME \
--zone=$GCP_ZONE \
--project=$GCP_PROJECT