Production Deployment Troubleshooting (AWS/Lab/Field)

This guide covers troubleshooting issues specific to production deployments on AWS EC2 using Terraform, including Lab and Field environments.

Terraform Deployment Issues

Terraform Apply Fails

Error: Unable to Download Releases

Symptom:

./run.sh -i ./client_name.tfvars -p -s
# Error downloading release assets from GitHub

Diagnosis:

# Test GitHub token manually
curl -H "Authorization: Bearer $GITHUB_TOKEN" \
  https://api.github.com/repos/openenergysolutions/opendso-docker-compose/releases/latest

# Check token permissions
# Should have "repo" scope for private repositories

Solution:

# Generate new GitHub PAT with correct permissions
# Go to: GitHub Settings → Developer settings → Personal access tokens

# Test new token
export GITHUB_TOKEN="your-new-token"
curl -H "Authorization: Bearer $GITHUB_TOKEN" \
  https://api.github.com/repos/openenergysolutions/opendso-docker-compose/releases/latest

# Retry deployment
./run.sh -i ./client_name.tfvars -p -s
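
When the curl test above fails, the HTTP status code usually narrows down the cause. The sketch below maps the codes the GitHub REST API commonly returns to a likely cause. The function name github_status_hint is illustrative and is not part of run.sh.

```shell
# Hypothetical helper: translate an HTTP status code from api.github.com
# into a likely cause. The mapping reflects common GitHub API behaviour.
github_status_hint() {
  case "$1" in
    200) echo "ok: token can read the release" ;;
    401) echo "bad credentials: token is invalid, expired, or revoked" ;;
    403) echo "forbidden: rate limited, or blocked by an org policy such as SAML SSO" ;;
    404) echo "not found: repo name is wrong or token lacks access to the private repo" ;;
    *)   echo "unexpected status $1: inspect the response body" ;;
  esac
}

# Usage (needs network access and GITHUB_TOKEN, so it is left commented out):
# code=$(curl -s -o /dev/null -w '%{http_code}' \
#   -H "Authorization: Bearer $GITHUB_TOKEN" \
#   https://api.github.com/repos/openenergysolutions/opendso-docker-compose/releases/latest)
# github_status_hint "$code"
```

A 404 on a repository you know exists almost always means the token cannot see it, because GitHub hides private repositories from unauthorized callers rather than returning 403.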

Error: Insufficient AWS Permissions

Symptom:

terraform apply
# Error: UnauthorizedOperation: You are not authorized to perform this operation

Required IAM Permissions:

EC2: RunInstances, DescribeInstances, TerminateInstances
EC2: CreateSecurityGroup, AuthorizeSecurityGroupIngress, DeleteSecurityGroup
EC2: CreateKeyPair, DeleteKeyPair, DescribeKeyPairs
VPC: DescribeVpcs, DescribeSubnets

Solution:

# Check AWS CLI credentials
aws sts get-caller-identity

# Verify IAM permissions
aws iam get-user-policy --user-name your-username --policy-name your-policy

# Use appropriate AWS profile
export AWS_PROFILE=opendso-admin
aws configure list

# Or set credentials explicitly
aws configure

Error: VPC/Subnet Not Found

Symptom:

terraform apply
# Error: InvalidSubnetID.NotFound

Solution:

# Verify subnet exists in correct region
aws ec2 describe-subnets --region us-west-2 --subnet-ids subnet-09a992b9a40683150

# Check tfvars file
cat client_name.tfvars
# Ensure aws_region matches subnet region

# Update if needed
aws_region    = "us-west-2"
aws_subnet_id = "subnet-09a992b9a40683150"
aws_vpc_id    = "vpc-0cf5dfa618d8efd46"
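
To avoid checking the subnet against the wrong region by hand, the values can be read straight from the tfvars file. This is a minimal sketch that assumes flat, double-quoted assignments like the ones shown above. The helper name tfvars_get is hypothetical.

```shell
# Hypothetical helper: print the value of a simple `key = "value"` line
# in a tfvars file. Only flat, double-quoted string assignments are handled
# (no maps, lists, or heredocs).
tfvars_get() {
  sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$1" | head -n 1
}

# Usage (needs AWS credentials, so it is left commented out):
# region=$(tfvars_get client_name.tfvars aws_region)
# subnet=$(tfvars_get client_name.tfvars aws_subnet_id)
# aws ec2 describe-subnets --region "$region" --subnet-ids "$subnet"
```

If describe-subnets succeeds with the values taken from the file, the region and subnet agree and the error lies elsewhere.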

Provisioner Failures

Remote-Exec Fails During init.sh

Symptom:

terraform apply
# Error: remote-exec provisioner error
# Error executing script: exit status 1

Diagnosis:

# Check Terraform output for specific error
terraform apply 2>&1 | tee deploy.log
grep -A 10 "remote-exec" deploy.log

# Common issues:
# - yum update fails (network)
# - Docker install fails (permissions)
# - GitHub token not set

Solution:

# If instance was created but provisioning failed:
# 1. Get instance IP
terraform output -raw apphost_ip

# 2. SSH to instance (if you have the key)
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>

# 3. Manually run provisioning scripts
chmod +x *.sh
export GITHUB_TOKEN="your-token"
./init.sh
./setup-opendso.sh

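# 4. If successful, have Terraform recreate the instance so state and provisioning agree
#    (terraform taint is deprecated since Terraform 0.15.2; -replace does the same in one step)
terraform apply -replace="aws_instance.app_server" \
  -var-file="./client_name.tfvars" -var="github_token=$GITHUB_TOKEN"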

Script Path Issues

Symptom:

# Error: no such file or directory: ./assets/init.sh

Solution:

# Check script paths in main.tf match actual files
ls -la assets/
ls -la scipts/  # Note: typo in actual deployed main.tf

# Fix paths in main.tf
provisioner "file" {
  source      = "./assets/init.sh"  # Not ./scipts/init.sh
  destination = "init.sh"
}

# Verify all asset files exist
ls -la assets/init.sh
ls -la assets/setup-opendso.sh
ls -la assets/setup-certs.sh
ls -la assets/config.json

EC2 Instance Access Issues

Cannot SSH to Instance

Symptom

ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@10.216.0.15
# Connection timed out
# OR
# Connection refused

Diagnosis

1. Check instance is running:

terraform show | grep instance_state
# Should show: "instance_state": "running"

# Or via AWS CLI
aws ec2 describe-instances --instance-ids i-05380585c09e29881

2. Check security group:

# View security group rules
aws ec2 describe-security-groups --group-ids sg-08bafdc3e38fd6cf2

# Should show port 22 ingress rule

3. Check VPN/Network:

# Instance is in private subnet, requires VPN or jump host
# Verify VPN connection is active

# Test connectivity
ping 10.216.0.15
telnet 10.216.0.15 22

4. Check SSH key permissions:

ls -la ~/.ssh/tf_id_rsa.pem
# Should show: -rw------- (600)

chmod 600 ~/.ssh/tf_id_rsa.pem
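
The permission check in step 4 can be scripted so it fails loudly before ssh does. This is a sketch that assumes GNU stat (as on Amazon Linux). The helper name key_mode_ok is hypothetical.

```shell
# Hypothetical helper: succeed only if the key file exists and grants no
# access to group or others (mode 600 or 400), which is what ssh expects.
key_mode_ok() {
  [ -f "$1" ] || { echo "missing: $1" >&2; return 1; }
  mode=$(stat -c '%a' "$1")   # GNU stat; on macOS use: stat -f '%Lp'
  case "$mode" in
    600|400) return 0 ;;
    *) echo "mode $mode is too open for $1; run: chmod 600 $1" >&2; return 1 ;;
  esac
}

# Usage:
# key_mode_ok ~/.ssh/tf_id_rsa.pem && ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@10.216.0.15
```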

Solutions

Add your IP to security group:

# Get your current IP
MY_IP=$(curl -s ifconfig.me)

# Add to security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-08bafdc3e38fd6cf2 \
  --protocol tcp \
  --port 22 \
  --cidr ${MY_IP}/32
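
The lookup service can return an empty string or an error page when it is unreachable or proxied, and that value would then be passed to --cidr. A minimal guard, assuming an IPv4 address is expected, might look like this. The helper name is_ipv4 is hypothetical.

```shell
# Hypothetical helper: succeed only if the argument is a dotted-quad IPv4 address.
is_ipv4() {
  # Shape check: four dot-separated groups of 1-3 digits.
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
  # Range check: every octet must be 0-255.
  old_ifs=$IFS; IFS=.
  set -- $1
  IFS=$old_ifs
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
  return 0
}

# Usage:
# MY_IP=$(curl -s ifconfig.me)
# is_ipv4 "$MY_IP" || { echo "could not determine public IP: '$MY_IP'" >&2; exit 1; }
```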

Use jump host/bastion:

# SSH through jump host
ssh -i ~/.ssh/tf_id_rsa.pem -J jump-host-user@jump-host-ip ec2-user@10.216.0.15

# Or with ProxyJump in ~/.ssh/config
Host opendso-prod
  HostName 10.216.0.15
  User ec2-user
  IdentityFile ~/.ssh/tf_id_rsa.pem
  ProxyJump jump-host-user@jump-host-ip

Use AWS Systems Manager Session Manager:

# Install Session Manager plugin
# https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html

# Connect without SSH
aws ssm start-session --target i-05380585c09e29881

SSH Key Issues

Wrong Key or Permission Denied

Symptom:

ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@10.216.0.15
# Permission denied (publickey)

Solution:

# Extract key from Terraform output
terraform output -raw ssh_key > ~/.ssh/tf_id_rsa.pem
chmod 600 ~/.ssh/tf_id_rsa.pem

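# Verify key fingerprint matches
# ssh-keygen -lf prints a SHA256 fingerprint, but AWS reports MD5 (imported keys)
# or SHA-1 (keys created by AWS), so compute the fingerprint in the format AWS uses.
# For an imported RSA key:
ssh-keygen -ef ~/.ssh/tf_id_rsa.pem -m PEM | openssl rsa -RSAPublicKey_in -outform DER | openssl md5 -c
aws ec2 describe-key-pairs --key-names tf_id_rsa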

# Try with verbose output
ssh -vvv -i ~/.ssh/tf_id_rsa.pem ec2-user@10.216.0.15

DNS and Certificate Issues

DNS Not Resolving

Symptom

dig api.client_name.oesinc.dev
# NXDOMAIN or no answer

Diagnosis

# Check DNS records
dig api.client_name.oesinc.dev
dig client_name.oesinc.dev

# Check NS records
dig NS client_name.oesinc.dev

# Check from different DNS server
dig @8.8.8.8 api.client_name.oesinc.dev

Solution

Verify DNS configuration:

# For Google Cloud DNS
gcloud dns record-sets list --zone=<zone-name>

# Check A records point to correct IP
# Should point to instance private IP or jump host public IP

DNS propagation:

# DNS changes can take 5-60 minutes to propagate
# Check propagation
watch -n 10 dig api.client_name.oesinc.dev

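# Flush local DNS cache (Linux with systemd-resolved)
sudo resolvectl flush-caches          # systemd 239 and later
sudo systemd-resolve --flush-caches   # older systemd releases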
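
watch runs until interrupted, which is awkward inside a script. A bounded retry loop that returns an exit status is sketched below. The helper name wait_until is hypothetical, and the attempt count and delay in the usage line are illustrative.

```shell
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds between tries; succeed as soon as the command succeeds.
wait_until() {
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage: poll every 10 seconds for up to 60 tries until the name resolves.
# wait_until 60 10 sh -c 'dig +short api.client_name.oesinc.dev | grep -q .'
```

dig +short prints nothing for NXDOMAIN, so grep -q . succeeds only once at least one record is returned.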

Let's Encrypt Certificate Generation Fails

Error: DNS Challenge Failed

Symptom:

./setup-certs.sh
# Challenge failed for domain client_name.oesinc.dev
# DNS problem: query timed out

Diagnosis:

# Check Google Cloud DNS credentials
ls -la ~/.secrets/oes-dev-project-1d0dee6d5d4d.json

# Test credentials ($HOME is used because the shell does not expand ~ after "=")
gcloud auth activate-service-account --key-file="$HOME/.secrets/oes-dev-project-1d0dee6d5d4d.json"

# Check DNS delegation
dig NS client_name.oesinc.dev
# Should point to Google Cloud DNS nameservers

Solution:

# Ensure DNS is properly delegated to Google Cloud DNS
# At domain registrar, set NS records to:
# ns-cloud-a1.googledomains.com
# ns-cloud-a2.googledomains.com
# etc.

# Verify delegation
dig NS client_name.oesinc.dev

# Wait for propagation (up to 48 hours)
# Retry certificate generation
./setup-certs.sh

Error: Rate Limit Exceeded

Symptom:

# too many certificates already issued for: client_name.oesinc.dev

Solution:

# Let's Encrypt rate limits:
# - 50 certificates per domain per week
# - 5 duplicate certificates per week

# Use staging environment for testing
certbot certonly --staging \
  --dns-google \
  --dns-google-credentials ~/.secrets/cred.json \
  -d client_name.oesinc.dev

# Wait for rate limit to reset (7 days)
# Or use different subdomain for testing

certbot Plugin Not Installed

Symptom:

./setup-certs.sh
# Error: Could not find plugin dns-google

Solution:

# Install certbot and Google DNS plugin
sudo yum install -y certbot python3-certbot-dns-google

# Verify plugins
certbot plugins

# Should show dns-google in list

Service Accessibility Issues

Services Not Accessible

Symptom

Can't reach services via domain names (e.g., https://api.client_name.oesinc.dev)

Diagnosis

1. Check DNS:

dig api.client_name.oesinc.dev
# Should resolve to IP

2. Check containers running:

ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>
docker ps
# Should show running containers

3. Check certificates:

ls -la ~/certs/
# Should show certificate files

4. Check security group ports:

aws ec2 describe-security-groups --group-ids sg-08bafdc3e38fd6cf2
# Should show ports 443, 8080 open

Solution

Add missing security group rules:

# Add HTTPS
aws ec2 authorize-security-group-ingress \
  --group-id sg-08bafdc3e38fd6cf2 \
  --protocol tcp \
  --port 443 \
  --cidr 0.0.0.0/0

# Add HTTP Alt (8080)
aws ec2 authorize-security-group-ingress \
  --group-id sg-08bafdc3e38fd6cf2 \
  --protocol tcp \
  --port 8080 \
  --cidr 0.0.0.0/0

Update Terraform to match actual:

# In main.tf
resource "aws_security_group" "ssh_sg" {
  # Add missing ingress rules
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 8080
    to_port     = 8080
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Certificate Validation Errors

Symptom

Browser shows "NET::ERR_CERT_DATE_INVALID" or "Certificate has expired"

Solution

# SSH to instance
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>

# Check certificate expiration
sudo certbot certificates

# Reissue certificates
./setup-certs.sh

# Restart the containers in place so they reload the new certs
# (cd to wherever the compose project lives on your host; this is
#  ~/opendso/opendso-docker-compose on a standard deployment, but may
#  be ~/opendso or elsewhere depending on how the host was set up)
cd ~/opendso/opendso-docker-compose
docker compose --env-file ../config/docker/.env --profile all restart

Restarting the containers reloads the certificates without destroying anything. Use --profile all so the whole stack restarts together. Restarting only a subset of profiles fails with dependency errors, because services restarted on their own still expect the services they list under depends_on to be part of the same operation. For the full procedure, including verification and cron automation, see Certificate Renewal in the production deployment guide.

Avoid the destroy/recreate path for a routine renewal

Older notes renewed certificates with ./run.sh -p all -d followed by ./run.sh -p all -c. The destroy step runs docker compose down -v, which removes every container together with the named volumes declared in the compose file and any anonymous volumes, so it can erase in-memory state and any database data that has not been backed up. Only fall back to that approach if a restart genuinely fails to pick up the new certificates, and back up first (see Database Backup and Restore).

Infrastructure Drift Issues

Detecting Drift

Symptom: Actual infrastructure doesn't match Terraform configuration

Diagnosis:

# Check for drift
terraform plan -var-file="./client_name.tfvars" -var="github_token=$GITHUB_TOKEN"

# Should show "No changes" if no drift
# Shows changes if drift detected

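# Refresh state from AWS (terraform refresh is deprecated; -refresh-only shows the
# detected changes and asks for confirmation before writing them to state)
terraform apply -refresh-only -var-file="./client_name.tfvars" -var="github_token=$GITHUB_TOKEN"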

# View current state
terraform show

Resolving Drift

Option 1: Update Terraform to Match Reality

# Update main.tf to match actual deployed infrastructure
# Example: Add root_block_device for 20GB volume

resource "aws_instance" "app_server" {
  # ... existing config ...

  root_block_device {
    volume_size = 20  # Match actual
  }
}

# Add missing security group rules
# See production-deployment.md "Infrastructure Drift" section

Option 2: Revert Manual Changes

# Apply Terraform configuration to remove manual changes
terraform apply -var-file="./client_name.tfvars" -var="github_token=$GITHUB_TOKEN"

# WARNING: This may disrupt running services
# Backup data first!

Option 3: Accept Managed Drift

# In main.tf, ignore specific changes
resource "aws_security_group" "ssh_sg" {
  # ... config ...

  lifecycle {
    ignore_changes = [
      ingress,  # Allow manual security group rule changes
    ]
  }
}

Multi-Environment Issues

Wrong Environment Deployed

Symptom

Deployed to wrong AWS region or used wrong tfvars file

Solution

# Check current deployment
terraform show | grep region
terraform show | grep subnet_id

# Verify tfvars file
cat client_name.tfvars

# If wrong, destroy and redeploy
terraform destroy -var-file="./client_name.tfvars" -var="github_token=$GITHUB_TOKEN"
./run.sh -i ./correct-file.tfvars -p -s

Workspace Confusion

Using Terraform Workspaces

# List workspaces
terraform workspace list

# Switch workspace
terraform workspace select oes-test

# Create new workspace
terraform workspace new oes-field

# Deploy to specific workspace
terraform workspace select oes-field
./run.sh -i ./oes-field.tfvars -p -s
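
A small guard can stop a deploy when the selected workspace and the tfvars file disagree. The sketch assumes the naming convention used above, where the workspace is named after the tfvars file (oes-field.tfvars for workspace oes-field). Both helper names are hypothetical.

```shell
# Hypothetical helper: derive the workspace name expected for a tfvars file,
# assuming the convention <workspace>.tfvars (oes-field.tfvars -> oes-field).
expected_workspace() {
  basename "$1" .tfvars
}

# Succeed only when the given workspace name matches the tfvars file.
workspace_matches() {
  [ "$1" = "$(expected_workspace "$2")" ]
}

# Usage (needs an initialized Terraform directory, so it is left commented out):
# workspace_matches "$(terraform workspace show)" ./oes-field.tfvars \
#   && ./run.sh -i ./oes-field.tfvars -p -s
```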

State File Conflicts

Symptom

terraform apply
# Error: state file is locked

Solution

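# Check who has the lock: the "Error acquiring the state lock" output includes
# a Lock Info block with the lock ID, the holder (user@host), and the creation time

# If lock is stale, force unlock (careful!)
terraform force-unlock <LOCK_ID>

# Use remote state backend to avoid conflicts
# In main.tf:
terraform {
  backend "s3" {
    bucket = "opendso-terraform-state"
    key    = "client_name/terraform.tfstate"
    region = "us-west-2"
    # Locking is not automatic with S3; enable it, for example with a DynamoDB table:
    # dynamodb_table = "opendso-terraform-locks"  # hypothetical table name
  }
}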

Storage and Performance Issues

Root Volume Too Small

Symptom

df -h
# /dev/xvda shows 100% usage

Solution

Resize EBS volume:

# 1. Modify volume size via AWS Console or CLI
aws ec2 modify-volume --volume-id vol-02e75bd22ce99980d --size 40

# 2. SSH to instance
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>

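# 3. Extend the partition, then the filesystem
lsblk                      # Confirm the device name; Nitro instances use /dev/nvme0n1 instead of /dev/xvda
sudo growpart /dev/xvda 1
sudo xfs_growfs -d /       # For xfs filesystem
# OR
sudo resize2fs /dev/xvda1  # For ext4 filesystem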

# 4. Verify
df -h
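
Choosing between xfs_growfs and resize2fs depends on the root filesystem type, which findmnt can report. The sketch below maps the type to the matching command. The helper name grow_command is hypothetical.

```shell
# Hypothetical helper: print the grow command for a filesystem type.
# $1 = filesystem type, $2 = partition device (only used for ext filesystems).
grow_command() {
  case "$1" in
    xfs)            echo "xfs_growfs -d /" ;;
    ext2|ext3|ext4) echo "resize2fs $2" ;;
    *)              echo "unsupported filesystem: $1" >&2; return 1 ;;
  esac
}

# Usage on the instance, after growpart:
# fstype=$(findmnt -n -o FSTYPE /)
# source=$(findmnt -n -o SOURCE /)
# sudo $(grow_command "$fstype" "$source")
```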

Update Terraform:

# In main.tf
resource "aws_instance" "app_server" {
  # ... existing config ...

  root_block_device {
    volume_size = 40  # Match new size
  }
}

Docker Out of Space

Symptom

df -h
# Root filesystem shows 100% usage

# Operations fail with "no space left on device"
docker exec mongodb mongodump ...
# Error: no space left on device

Diagnosis

1. Check overall disk usage:

df -h
# Identify which filesystem is full

2. Check Docker's disk usage:

docker system df
# Shows usage by images, containers, volumes, and build cache

3. Identify containers with large logs:

# Check total container log size
sudo du -sh /var/lib/docker/containers/

# Find largest container logs (top 20)
sudo sh -c 'cd /var/lib/docker/containers && du -sh */*-json.log 2>/dev/null' | sort -h | tail -20

4. Identify which containers are generating logs:

# Replace <CONTAINER_ID> with the ID from the log file path
docker ps -a --no-trunc | grep <CONTAINER_ID>

Solution

Clean up container logs:

After 4+ months of operation, container logs commonly grow to several GB. The most common culprits are database containers (PostgreSQL, MongoDB, Citus) and services with verbose logging.

# Truncate specific container logs (safe, containers keep running)
# Replace <CONTAINER_ID> with the actual container ID from diagnosis step
sudo truncate -s 0 /var/lib/docker/containers/<CONTAINER_ID>/<CONTAINER_ID>-json.log

# Example: Truncate logs for multiple containers
# Find container IDs from the diagnosis output, then truncate each one
sudo truncate -s 0 /var/lib/docker/containers/<CONTAINER_ID_1>/<CONTAINER_ID_1>-json.log
sudo truncate -s 0 /var/lib/docker/containers/<CONTAINER_ID_2>/<CONTAINER_ID_2>-json.log

# Verify disk space freed
df -h
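
When many containers have oversized logs, truncating them one by one is tedious. The sketch below truncates every Docker JSON log above a size threshold in a single pass. The helper name truncate_logs_over and the threshold in the usage line are illustrative, and the function must run as root to reach /var/lib/docker.

```shell
# Hypothetical helper: truncate every *-json.log under <dir> that is larger
# than <min_kib> KiB. Truncating in place leaves the containers running.
truncate_logs_over() {
  find "$1" -type f -name '*-json.log' -size +"$2"k -exec truncate -s 0 {} +
}

# Usage (as root): truncate container logs larger than about 1 GiB.
# truncate_logs_over /var/lib/docker/containers 1048576
```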

Configure log rotation to prevent recurrence:

Add logging configuration to your docker-compose.yml file:

services:
  citus-db:
    image: citusdata/citus:12.1.2-alpine
    logging:
      driver: "json-file"
      options:
        max-size: "10m"    # Maximum 10MB per log file
        max-file: "3"      # Keep 3 rotating files (30MB total)
    # ... rest of configuration

  mongodb:
    image: mongo:latest
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
    # ... rest of configuration

  # Apply to other high-volume logging services as needed

Apply the changes:

cd ~/opendso/opendso-docker-compose

# Restart affected containers to apply log rotation
docker compose up -d <service-name>

# Or recreate all services
# WARNING: -d runs docker compose down -v, which deletes volumes; back up data first
./run.sh -p all -d
./run.sh -p all -c

Alternative: Clean up Docker system:

# Remove unused images, stopped containers, unused networks, and build cache
# WARNING: --volumes also deletes volumes that no container is using, which can
# include database data if the stack is stopped. Back up first.
docker system prune -a --volumes

# Check output directory
du -sh ~/output/*

# Clean old application logs
find ~/output -name "*.log" -mtime +30 -delete

Prevention:

Configure log rotation for all services in docker-compose.yml (recommended 10MB × 3 files)
Monitor disk usage regularly: df -h and docker system df
Set up alerts when disk usage exceeds 80%
Automate cleanup: Create a cron job to prune old logs monthly
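
The 80% alert can start as a small script run from cron. This is a sketch only: the helper names are hypothetical, and how the warning is delivered (mail, chat webhook, CloudWatch) is left open.

```shell
# Hypothetical helper: print the used percentage (integer) of the
# filesystem that holds <path>, using POSIX df output.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (and print a warning) when usage exceeds the threshold.
disk_alert() {
  if [ "$1" -gt "$2" ]; then
    echo "WARNING: disk usage ${1}% exceeds ${2}%"
    return 0
  fi
  return 1
}

# Usage:
# disk_alert "$(disk_usage_pct /)" 80
# Example cron entry (hypothetical script path), checking hourly:
# 0 * * * * /home/ec2-user/bin/disk-check.sh
```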

Monitoring and Logging Issues

How do I View Application Logs?

Solution

# SSH to instance
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>

# Check output directory
ls -la ~/output/

# View logs
tail -f ~/output/<service-name>/app.log

# View Docker logs
docker compose logs -f <service-name>

# View system logs
sudo journalctl -u docker -f

Service Database Connection Failures

Symptom: Historian Service Failing with Database Errors

docker logs historian-svc --tail 50
# FATAL historian: Broken postgres connection
# connection to server at "citus-db" failed: FATAL: database "ofmb_db" does not exist

Diagnosis

1. Check service logs for database errors:

# Check historian logs
docker logs historian-svc | grep -i "database\|error\|connection"

# Check what the service is trying to connect to
docker inspect historian-svc | grep -E "PGUSER|PGDATABASE|PGHOST"

2. List existing databases:

# Connect to the PostgreSQL container and list databases
docker exec citus-db psql -U postgres -c "\l"

3. Verify service configuration:

# Check environment variables for database credentials
cat ~/config/docker/.env | grep -i historian

Root Cause

Missing databases typically indicate one of the following:

Intentional service disablement - The service may be disabled for resource or functional reasons
Incomplete deployment - Database initialization scripts may not have run during initial setup
Configuration mismatch - Service expects a database that wasn't part of the deployment plan

Solution

DO NOT manually create the database - Simply creating an empty database will not set up the required schema, tables, indexes, or user permissions that the service expects. This will lead to additional errors.

Recommended approach:

1. Verify if the service should be running:

# Check with your team or deployment documentation
# Services like historian may be intentionally disabled in some environments

2. If the service should be running, investigate the proper initialization:

# Check if there's an initialization script for the service
ls ~/opendso/opendso-docker-compose/scripts/
ls ~/config/

# Look for database migration or setup scripts
docker exec historian-svc ls /app/ | grep -i init
docker exec historian-svc ls /app/ | grep -i migrate

3. Check if the service auto-creates its database on first run:

# Some services automatically create and initialize their database
# Stop and restart the service to trigger initialization
docker stop historian-svc
docker rm historian-svc

# Recreate the service (from the compose directory)
cd ~/opendso/opendso-docker-compose
./run.sh -p historian -c

# Monitor logs during startup
docker logs -f historian-svc

4. Contact support or consult deployment documentation:

If the service requires database setup, there should be documented procedures for:

Database schema initialization
User permissions setup
Required extensions or configurations

Common missing databases:

ofmb_db - Historian service database for OpenFMB time-series data
Other service-specific databases may vary by deployment

Temporary workaround (not recommended for production):

If you need to temporarily disable a failing service:

cd ~/opendso/opendso-docker-compose
docker stop historian-svc
# The service will remain stopped until explicitly started

Note: Always verify with your team before making database changes. Some services are intentionally disabled, and manually creating databases without proper schema can cause data integrity issues.

Backup and Recovery Issues

MongoDB Backup Fails

Symptom

./run.sh -b
# Error: MongoDB container is not running

Solution

# Check MongoDB is running
docker ps | grep mongodb

# Start if needed
./run.sh -p api -c

# Verify MongoDB connection (MongoDB 6.0+ images ship mongosh; use mongo on older images)
docker exec mongodb mongosh --eval "db.adminCommand('ping')"

# Retry backup
./run.sh -b

Can't Restore Backup

Symptom

./run.sh -r
# Error: db.dump file does not exist

Solution

# Download backup from S3 (if stored there)
aws s3 cp s3://your-backup-bucket/opendso/db-latest.dump db.dump

# Verify file exists
ls -la db.dump

# Restore
./run.sh -r

Release Archive and setup-opendso.sh Issues

The deployment host pulls versioned opendso.zip, config.zip, and models.zip archives from GitHub releases via setup-opendso.sh. Most update problems on production hosts come from one of: a missing/failed release workflow, a token without access to the release, or stale files left over from the previous deployment. This section covers each.

Note (OES internal): The setup-opendso.sh script that ships on deployed hosts is the same script that lives in the internal IaC repository under assets/setup-compose.sh. If you are working in the IaC project you'll see the setup-compose.sh filename; on every deployed client host the file is setup-opendso.sh. Treat them as the same script. All client-facing references in these docs use setup-opendso.sh.

`setup-opendso.sh` Exits with "Error GITHUB_TOKEN not set"

Symptom:

./setup-opendso.sh
# Error GITHUB_TOKEN not set; exiting

Cause: The script requires GITHUB_TOKEN to be exported in the shell — it exits with code 5 if the variable is empty.

Solution:

export GITHUB_TOKEN="ghp_xxx_your_token"
./setup-opendso.sh

# Or set it inline for a single run
GITHUB_TOKEN="ghp_xxx_your_token" ./setup-opendso.sh

The token must be a GitHub PAT with the repo scope so it can read release assets from the private OES repos.

"Unable to find release for openenergysolutions/`{repo}`"

Symptom:

./setup-opendso.sh
# Unable to find release for openenergysolutions/{client_repo_name}-config-docker-compose

Cause: setup-opendso.sh calls releases/latest and reads .assets[0].id. This message means the GitHub API returned null for the asset id — either the repo has no releases at all, or the most recent release has no attached asset (the tag-archive workflow failed or hasn't run).

Diagnosis:

# Confirm the repo actually has a release
curl -sL \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  -H "Accept: application/vnd.github+json" \
  https://api.github.com/repos/openenergysolutions/{client_repo_name}-config-docker-compose/releases/latest \
  | jq '{tag_name, name, assets: [.assets[] | {name, size}]}'

If tag_name is missing or assets is empty, the workflow didn't publish a zip.

Solution:

Open the repo's Actions tab on GitHub and find the workflow run for the tag.
If the run failed, fix the underlying error and re-tag (delete the original tag and the partial release first — see the "Tagging Versioned Updates" section of the Production Deployment guide).
If the run never started, confirm the tag was actually pushed (git ls-remote --tags origin). The workflow only triggers on push: tags: '*'.
Once a fresh release with a config.zip / models.zip asset is published, re-run ./setup-opendso.sh on the host.

Token Doesn't Have Access to the Release Asset

Symptom: The script runs without errors but the resulting config.zip / models.zip is tiny (a few hundred bytes) and unzip fails:

./setup-opendso.sh
# Archive:  config.zip
# End-of-central-directory signature not found.

Cause: The first curl to /releases/latest returned an asset id, but the second curl to /releases/assets/{id} returned a JSON error body (saved to config.zip) because the token doesn't have permission to download from a private repo.

Diagnosis:

# Check what was actually downloaded
file config.zip
head -c 500 config.zip
# If it shows JSON like {"message": "Not Found", ...} the token is the problem
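
Because the script reports no error in this case, a quick check of the file's first bytes catches the problem before unzip does. Every zip archive starts with the bytes "PK". The helper name is_zip is hypothetical.

```shell
# Hypothetical helper: succeed only if the file starts with the ZIP magic bytes "PK".
is_zip() {
  [ "$(head -c 2 "$1" 2>/dev/null)" = "PK" ]
}

# Usage:
# is_zip config.zip || { echo "config.zip is not a zip; first bytes:" >&2; head -c 200 config.zip >&2; }
```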

Solution: Generate a new PAT with repo scope and confirmed access to the OES org's private repos, then re-run.

`~/config/` or `~/models/` Has Stale Files After Update

Symptom: A new tag was deployed but a file you expected to be gone (or renamed) is still present in ~/config/ or ~/models/.

Cause: setup-opendso.sh runs unzip -o, which overwrites files but does not delete files that exist on disk and are no longer in the archive.

Solution: Wipe the directory before running the script (back it up first if you have host-local edits):

# Stop services
cd ~/opendso/opendso-docker-compose
./run.sh -p all -d

# Backup, then remove
cd ~
mv config config-backup-$(date +%Y%m%d)
mv models models-backup-$(date +%Y%m%d)

# Re-pull
export GITHUB_TOKEN="ghp_xxx"
./setup-opendso.sh

# Confirm the version stamp written by the release workflow
cat ~/config/.version
cat ~/models/.version

# Restart
cd ~/opendso/opendso-docker-compose
./run.sh -p all -c

This is the same pattern used in real client model updates: rename models → models-backup, delete the leftover models.zip, re-run the script, then cd into the new models/ directory.

Wrong Version Deployed (Need to Roll Back)

Symptom: setup-opendso.sh pulled a tag that broke the deployment, and you need the previous tag back.

Cause: The script always pulls releases/latest — there is no per-tag pinning in config.json. Whatever release is most recent on GitHub is what lands on the host.

Solution options:

Re-publish the previous tag as the latest release. On GitHub, edit the older release and select "Set as the latest release", or delete the bad release entirely. Confirm that releases/latest now returns the expected tag_name, then re-run setup-opendso.sh on the host.

Manually download the older asset. From the host, hit the GitHub API for the specific tag instead of releases/latest:

# Find the tagged release
curl -sL \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  https://api.github.com/repos/openenergysolutions/{client_repo_name}-config-docker-compose/releases/tags/1.3.9 \
  | jq '.assets[0].id'

# Download by asset id (replace ASSET_ID)
curl -sL \
  -H "Accept: application/octet-stream" \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  https://api.github.com/repos/openenergysolutions/{client_repo_name}-config-docker-compose/releases/assets/ASSET_ID \
  --output config.zip

rm -rf config && unzip -o config.zip -d config

Cut a forward-rolling fix tag (e.g. tag 1.4.2 containing the contents of 1.3.9). This keeps the version history monotonic and avoids confusing future deploys.

`config.json` Missing or Malformed

Symptom:

./setup-opendso.sh
# parse error: Invalid numeric literal at line 1, column 5
# OR
# No releases found in config.json

Cause: config.json next to the script is either missing, not valid JSON, or has an empty releases array.

Solution:

# Verify the file exists and parses
ls -la ~/config.json
jq . ~/config.json

# Expected shape
# {
#   "organization": "openenergysolutions",
#   "releases": [
#     { "repositoryName": "opendso-docker-compose",                   "displayName": "opendso" },
#     { "repositoryName": "{client_repo_name}-docker-compose",        "displayName": "models"  },
#     { "repositoryName": "{client_repo_name}-config-docker-compose", "displayName": "config"  }
#   ]
# }

If the file was lost, restore it from the IaC repo (the file ships alongside the script in the deployment provisioning assets).

Update and Upgrade Issues

Component Update Fails

Symptom

New OpenDSO release won't download or deploy

Solution

# SSH to instance
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>

# Stop services
cd ~/opendso/opendso-docker-compose
./run.sh -p all -d

# Re-run setup with valid token
export GITHUB_TOKEN="your-token"
cd ~
./setup-opendso.sh

# Check downloads succeeded
ls -la ~/opendso/
ls -la ~/config/
ls -la ~/models/

# Confirm the version stamp written by the release workflow
cat ~/config/.version
cat ~/models/.version

# Restart services
cd ~/opendso/opendso-docker-compose
./run.sh -p all -c

System Package Update Breaks Docker

Solution

# If Docker stops working after yum update
sudo systemctl status docker

# Reinstall if needed
sudo yum reinstall docker

# Restart
sudo systemctl restart docker

# Verify
docker ps

Emergency Procedures

Complete Service Outage

Recovery Steps:

# 1. Check instance status
aws ec2 describe-instances --instance-ids i-05380585c09e29881

# 2. If stopped, start it
aws ec2 start-instances --instance-ids i-05380585c09e29881

# 3. SSH to instance
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>

# 4. Check Docker
sudo systemctl status docker
sudo systemctl start docker

# 5. Check containers
docker ps -a

# 6. Restart services
cd ~/opendso/opendso-docker-compose
./run.sh -p all -c

# 7. Monitor logs
docker compose logs -f

Disaster Recovery

Full Rebuild:

# 1. Backup any critical data
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>
cd ~/opendso/opendso-docker-compose
./run.sh -b
aws s3 cp db.dump s3://backup-bucket/emergency-backup-$(date +%Y%m%d).dump

# 2. Destroy infrastructure
terraform destroy -var-file="./client_name.tfvars" -var="github_token=$GITHUB_TOKEN"

# 3. Redeploy
./run.sh -i ./client_name.tfvars -p -s

# 4. Restore data
# (SSH to new instance and restore db.dump)

Best Practices for Production

Use Remote State: Store Terraform state in S3
Regular Backups: Automate MongoDB backups to S3
Monitoring: Set up CloudWatch alarms
Documentation: Document manual changes immediately
Testing: Test in lab before deploying to field
Security: Restrict security groups to specific IPs
Updates: Keep Terraform config in sync with reality
Certificates: Automate renewal with cron
Logs: Implement log rotation
Disaster Recovery: Test recovery procedures regularly

Quick Reference

Common Commands

# Check instance status
aws ec2 describe-instances --instance-ids i-05380585c09e29881

# View Terraform outputs
terraform output
terraform output -raw apphost_ip

# SSH to instance
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@$(terraform output -raw apphost_ip)

# Check drift
terraform plan -var-file="./client_name.tfvars" -var="github_token=$GITHUB_TOKEN"

# View state
terraform show

# Check security groups
aws ec2 describe-security-groups --group-ids sg-08bafdc3e38fd6cf2

# Test DNS
dig api.client_name.oesinc.dev

# Check certificates
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<IP> "sudo certbot certificates"

Next Steps

For Docker issues, see Docker Troubleshooting
For local issues, see Local Troubleshooting
Return to Troubleshooting Overview
See the Production Deployment Guide
