Production Deployment Troubleshooting (AWS/Lab/Field)
This guide covers troubleshooting issues specific to production deployments on AWS EC2 using Terraform, including Lab and Field environments.
Terraform Deployment Issues
Terraform Apply Fails
Error: Unable to Download Releases
Symptom:
./run.sh -i ./client_name.tfvars -p -s
# Error downloading release assets from GitHub
Diagnosis:
# Test GitHub token manually
curl -H "Authorization: Bearer $GITHUB_TOKEN" \
https://api.github.com/repos/openenergysolutions/opendso-docker-compose/releases/latest
# Check token permissions
# Should have "repo" scope for private repositories
Solution:
# Generate new GitHub PAT with correct permissions
# Go to: GitHub Settings → Developer settings → Personal access tokens
# Test new token
export GITHUB_TOKEN="your-new-token"
curl -H "Authorization: Bearer $GITHUB_TOKEN" \
https://api.github.com/repos/openenergysolutions/opendso-docker-compose/releases/latest
# Retry deployment
./run.sh -i ./client_name.tfvars -p -s
Error: Insufficient AWS Permissions
Symptom:
terraform apply
# Error: UnauthorizedOperation: You are not authorized to perform this operation
Required IAM Permissions:
- EC2: RunInstances, DescribeInstances, TerminateInstances
- EC2: CreateSecurityGroup, AuthorizeSecurityGroupIngress, DeleteSecurityGroup
- EC2: CreateKeyPair, DeleteKeyPair, DescribeKeyPairs
- VPC: DescribeVpcs, DescribeSubnets
Solution:
# Check AWS CLI credentials
aws sts get-caller-identity
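# Verify IAM permissions (get-user-policy only reads inline policies)
aws iam get-user-policy --user-name your-username --policy-name your-policy
# List managed policies attached to the user
aws iam list-attached-user-policies --user-name your-username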
# Use appropriate AWS profile
export AWS_PROFILE=opendso-admin
aws configure list
# Or set credentials explicitly
aws configure
Error: VPC/Subnet Not Found
Symptom:
terraform apply
# Error: InvalidSubnetID.NotFound
Solution:
# Verify subnet exists in correct region
aws ec2 describe-subnets --region us-west-2 --subnet-ids subnet-09a992b9a40683150
# Check tfvars file
cat client_name.tfvars
# Ensure aws_region matches subnet region
# Update if needed
aws_region = "us-west-2"
aws_subnet_id = "subnet-09a992b9a40683150"
aws_vpc_id = "vpc-0cf5dfa618d8efd46"
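The region/subnet cross-check above can be scripted rather than read by eye. The sketch below is a minimal helper, assuming the tfvars file uses flat, double-quoted string assignments on a single line as in the example; the function name `tfvar_get` is hypothetical and not part of the deployment scripts.

```shell
# Print the value of a simple `key = "value"` assignment from a tfvars file.
# Assumption: flat, double-quoted string values on one line (no maps or lists).
tfvar_get() {
    # $1 = tfvars file, $2 = variable name
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*\"\\([^\"]*\\)\".*/\\1/p" "$1" | head -n 1
}

# Example use (not run here, requires AWS credentials):
#   region=$(tfvar_get client_name.tfvars aws_region)
#   subnet=$(tfvar_get client_name.tfvars aws_subnet_id)
#   aws ec2 describe-subnets --region "$region" --subnet-ids "$subnet"
```

Feeding both values from the same file into `describe-subnets` makes a region mismatch show up as an immediate InvalidSubnetID.NotFound instead of midway through `terraform apply`.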
Provisioner Failures
Remote-Exec Fails During init.sh
Symptom:
terraform apply
# Error: remote-exec provisioner error
# Error executing script: exit status 1
Diagnosis:
# Check Terraform output for specific error
terraform apply 2>&1 | tee deploy.log
grep -A 10 "remote-exec" deploy.log
# Common issues:
# - yum update fails (network)
# - Docker install fails (permissions)
# - GitHub token not set
Solution:
# If instance was created but provisioning failed:
# 1. Get instance IP
terraform output -raw apphost_ip
# 2. SSH to instance (if you have the key)
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>
# 3. Manually run provisioning scripts
chmod +x *.sh
export GITHUB_TOKEN="your-token"
./init.sh
./setup-opendso.sh
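# 4. If successful, taint and reapply so Terraform recreates the instance and re-runs the provisioners
# (on Terraform 0.15.2+, `terraform apply -replace="aws_instance.app_server"` supersedes the deprecated taint command)
terraform taint aws_instance.app_server
terraform apply -var-file="./client_name.tfvars" -var="github_token=$GITHUB_TOKEN"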
Script Path Issues
Symptom:
# Error: no such file or directory: ./assets/init.sh
Solution:
# Check script paths in main.tf match actual files
ls -la assets/
ls -la scipts/ # Note: typo in actual deployed main.tf
# Fix paths in main.tf
provisioner "file" {
source = "./assets/init.sh" # Not ./scipts/init.sh
destination = "init.sh"
}
# Verify all asset files exist
ls -la assets/init.sh
ls -la assets/setup-opendso.sh
ls -la assets/setup-certs.sh
ls -la assets/config.json
EC2 Instance Access Issues
Cannot SSH to Instance
Symptom
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@10.216.0.15
# Connection timed out
# OR
# Connection refused
Diagnosis
1. Check instance is running:
terraform show | grep instance_state
# Should show: instance_state = "running"
# Or via AWS CLI
aws ec2 describe-instances --instance-ids i-05380585c09e29881
2. Check security group:
# View security group rules
aws ec2 describe-security-groups --group-ids sg-08bafdc3e38fd6cf2
# Should show port 22 ingress rule
3. Check VPN/Network:
# Instance is in private subnet, requires VPN or jump host
# Verify VPN connection is active
# Test connectivity
ping 10.216.0.15
telnet 10.216.0.15 22
4. Check SSH key permissions:
ls -la ~/.ssh/tf_id_rsa.pem
# Should show: -rw------- (600)
chmod 600 ~/.ssh/tf_id_rsa.pem
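The permission check in step 4 can be wrapped in a small test that works on both Linux and macOS. This is a sketch only: the function name `key_mode_ok` is hypothetical, and it accepts only modes 600 and 400, which is stricter than ssh itself (ssh rejects a key when group or other have any access).

```shell
# Return 0 if the key file mode is 600 or 400, 1 for any other mode,
# 2 if the file cannot be inspected.
key_mode_ok() {
    # GNU stat uses -c, BSD/macOS stat uses -f; try both.
    mode=$(stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1" 2>/dev/null) || return 2
    case "$mode" in
        600|400) return 0 ;;
        *) echo "$1 has mode $mode; expected 600 or 400" >&2; return 1 ;;
    esac
}

# Example use:
#   key_mode_ok ~/.ssh/tf_id_rsa.pem || chmod 600 ~/.ssh/tf_id_rsa.pem
```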
Solutions
Add your IP to security group:
# Get your current IP
MY_IP=$(curl -s ifconfig.me)
# Add to security group
aws ec2 authorize-security-group-ingress \
--group-id sg-08bafdc3e38fd6cf2 \
--protocol tcp \
--port 22 \
--cidr ${MY_IP}/32
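Because the address comes from an external lookup, it is worth validating before it is written into a security group rule: the lookup may return an IPv6 address or an error page instead of a dotted-quad IPv4 address. A minimal sketch follows; `is_ipv4` is a hypothetical helper name.

```shell
# Return 0 if $1 is a dotted-quad IPv4 address with every octet in 0-255.
is_ipv4() {
    case "$1" in
        ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;   # empty, stray characters, or empty octets
    esac
    old_ifs=$IFS; IFS=.
    set -- $1                                   # split the address on dots
    IFS=$old_ifs
    [ "$#" -eq 4 ] || return 1
    for octet in "$@"; do
        [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
    done
    return 0
}

# Example use (not run here):
#   MY_IP=$(curl -s ifconfig.me)
#   is_ipv4 "$MY_IP" || { echo "unexpected lookup result: $MY_IP" >&2; exit 1; }
```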
Use jump host/bastion:
# SSH through jump host
ssh -i ~/.ssh/tf_id_rsa.pem -J jump-host-user@jump-host-ip ec2-user@10.216.0.15
# Or with ProxyJump in ~/.ssh/config
Host opendso-prod
HostName 10.216.0.15
User ec2-user
IdentityFile ~/.ssh/tf_id_rsa.pem
ProxyJump jump-host-user@jump-host-ip
Use AWS Systems Manager Session Manager:
# Install Session Manager plugin
# https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html
# Connect without SSH
aws ssm start-session --target i-05380585c09e29881
SSH Key Issues
Wrong Key or Permission Denied
Symptom:
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@10.216.0.15
# Permission denied (publickey)
Solution:
# Extract key from Terraform output
terraform output -raw ssh_key > ~/.ssh/tf_id_rsa.pem
chmod 600 ~/.ssh/tf_id_rsa.pem
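# Verify key fingerprint matches
# Note: for RSA keys AWS reports an MD5 or SHA-1 fingerprint in colon-separated hex,
# so the default SHA256 output of ssh-keygen will not match it character for character
ssh-keygen -lf ~/.ssh/tf_id_rsa.pem
aws ec2 describe-key-pairs --key-names tf_id_rsa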
# Try with verbose output
ssh -vvv -i ~/.ssh/tf_id_rsa.pem ec2-user@10.216.0.15
DNS and Certificate Issues
DNS Not Resolving
Symptom
dig api.client_name.oesinc.dev
# NXDOMAIN or no answer
Diagnosis
# Check DNS records
dig api.client_name.oesinc.dev
dig client_name.oesinc.dev
# Check NS records
dig NS client_name.oesinc.dev
# Check from different DNS server
dig @8.8.8.8 api.client_name.oesinc.dev
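When comparing answers from several resolvers, the status field in the dig header is the quickest signal: NXDOMAIN means the name does not exist, NOERROR with an empty answer means the name exists but has no record of the requested type, and SERVFAIL often points at a delegation problem. The sketch below extracts that field; `dig_status` is a hypothetical helper name.

```shell
# Read full `dig` output on stdin and print the response status
# (for example NOERROR, NXDOMAIN or SERVFAIL). Prints nothing if the
# query timed out and no header was returned.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Example use (not run here):
#   dig api.client_name.oesinc.dev | dig_status
#   dig @8.8.8.8 api.client_name.oesinc.dev | dig_status
```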
Solution
Verify DNS configuration:
# For Google Cloud DNS
gcloud dns record-sets list --zone=<zone-name>
# Check A records point to correct IP
# Should point to instance private IP or jump host public IP
DNS propagation:
# DNS changes can take 5-60 minutes to propagate
# Check propagation
watch -n 10 dig api.client_name.oesinc.dev
# Flush local DNS cache (Linux with systemd-resolved)
sudo resolvectl flush-caches # Older releases: sudo systemd-resolve --flush-caches
Let's Encrypt Certificate Generation Fails
Error: DNS Challenge Failed
Symptom:
./setup-certs.sh
# Challenge failed for domain client_name.oesinc.dev
# DNS problem: query timed out
Diagnosis:
# Check Google Cloud DNS credentials
ls -la ~/.secrets/oes-dev-project-1d0dee6d5d4d.json
# Test credentials (use $HOME: the shell does not expand ~ after "--key-file=")
gcloud auth activate-service-account --key-file="$HOME/.secrets/oes-dev-project-1d0dee6d5d4d.json"
# Check DNS delegation
dig NS client_name.oesinc.dev
# Should point to Google Cloud DNS nameservers
Solution:
# Ensure DNS is properly delegated to Google Cloud DNS
# At domain registrar, set NS records to:
# ns-cloud-a1.googledomains.com
# ns-cloud-a2.googledomains.com
# etc.
# Verify delegation
dig NS client_name.oesinc.dev
# Wait for propagation (up to 48 hours)
# Retry certificate generation
./setup-certs.sh
Error: Rate Limit Exceeded
Symptom:
# too many certificates already issued for: client_name.oesinc.dev
Solution:
# Let's Encrypt rate limits:
# - 50 certificates per registered domain per week
# - 5 duplicate certificates (same exact set of hostnames) per week
# Use staging environment for testing
certbot certonly --staging \
--dns-google \
--dns-google-credentials ~/.secrets/cred.json \
-d client_name.oesinc.dev
# Wait for rate limit to reset (7 days)
# Or use different subdomain for testing
certbot Plugin Not Installed
Symptom:
./setup-certs.sh
# Error: Could not find plugin dns-google
Solution:
# Install certbot and Google DNS plugin
sudo yum install -y certbot python3-certbot-dns-google
# Verify plugins
certbot plugins
# Should show dns-google in list
Service Accessibility Issues
Services Not Accessible
Symptom
Can't reach services via domain names (e.g., https://api.client_name.oesinc.dev)
Diagnosis
1. Check DNS:
dig api.client_name.oesinc.dev
# Should resolve to IP
2. Check containers running:
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>
docker ps
# Should show running containers
3. Check certificates:
ls -la ~/certs/
# Should show certificate files
4. Check security group ports:
aws ec2 describe-security-groups --group-ids sg-08bafdc3e38fd6cf2
# Should show ports 443, 8080 open
Solution
Add missing security group rules:
# Add HTTPS
aws ec2 authorize-security-group-ingress \
--group-id sg-08bafdc3e38fd6cf2 \
--protocol tcp \
--port 443 \
--cidr 0.0.0.0/0
# Add HTTP Alt (8080)
aws ec2 authorize-security-group-ingress \
--group-id sg-08bafdc3e38fd6cf2 \
--protocol tcp \
--port 8080 \
--cidr 0.0.0.0/0
Update Terraform to match actual:
# In main.tf
resource "aws_security_group" "ssh_sg" {
# Add missing ingress rules
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
ingress {
from_port = 8080
to_port = 8080
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
}
Certificate Validation Errors
Symptom
Browser shows "NET::ERR_CERT_DATE_INVALID" or "Certificate has expired"
Solution
# SSH to instance
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>
# Check certificate expiration
sudo certbot certificates
# Renew certificates
./setup-certs.sh
# Restart services (stop with -d, then start with -c)
cd ~/opendso/opendso-docker-compose
./run.sh -p all -d
./run.sh -p all -c
Infrastructure Drift Issues
Detecting Drift
Symptom: Actual infrastructure doesn't match Terraform configuration
Diagnosis:
# Check for drift
terraform plan -var-file="./client_name.tfvars" -var="github_token=$GITHUB_TOKEN"
# Should show "No changes" if no drift
# Shows changes if drift detected
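# Refresh state from AWS
# (on Terraform 0.15.4+, prefer `terraform apply -refresh-only`, which shows the detected changes for review before writing state)
terraform refresh -var-file="./client_name.tfvars" -var="github_token=$GITHUB_TOKEN"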
# View current state
terraform show
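For scheduled drift checks, reading the plan output by eye does not scale. `terraform plan` accepts `-detailed-exitcode`, which returns 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The sketch below maps those codes to a message; `describe_plan_status` is a hypothetical helper name, and exit code 2 can also mean unapplied configuration edits rather than drift.

```shell
# Translate the exit status of `terraform plan -detailed-exitcode` into a message.
describe_plan_status() {
    case "$1" in
        0) echo "no changes" ;;
        2) echo "changes detected" ;;
        *) echo "plan failed (exit $1)" ;;
    esac
}

# Example use (not run here):
#   terraform plan -detailed-exitcode \
#     -var-file="./client_name.tfvars" -var="github_token=$GITHUB_TOKEN" > plan.log 2>&1
#   describe_plan_status "$?"
```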
Resolving Drift
Option 1: Update Terraform to Match Reality
# Update main.tf to match actual deployed infrastructure
# Example: Add root_block_device for 20GB volume
resource "aws_instance" "app_server" {
# ... existing config ...
root_block_device {
volume_size = 20 # Match actual
}
}
# Add missing security group rules
# See production-deployment.md "Infrastructure Drift" section
Option 2: Revert Manual Changes
# Apply Terraform configuration to remove manual changes
terraform apply -var-file="./client_name.tfvars" -var="github_token=$GITHUB_TOKEN"
# WARNING: This may disrupt running services
# Backup data first!
Option 3: Accept Managed Drift
# In main.tf, ignore specific changes
resource "aws_security_group" "ssh_sg" {
# ... config ...
lifecycle {
ignore_changes = [
ingress, # Allow manual security group rule changes
]
}
}
Multi-Environment Issues
Wrong Environment Deployed
Symptom
Deployed to wrong AWS region or used wrong tfvars file
Solution
# Check current deployment
terraform show | grep region
terraform show | grep subnet_id
# Verify tfvars file
cat client_name.tfvars
# If wrong, destroy and redeploy
terraform destroy -var-file="./client_name.tfvars" -var="github_token=$GITHUB_TOKEN"
./run.sh -i ./correct-file.tfvars -p -s
Workspace Confusion
Using Terraform Workspaces
# List workspaces
terraform workspace list
# Switch workspace
terraform workspace select oes-test
# Create new workspace
terraform workspace new oes-field
# Deploy to specific workspace
terraform workspace select oes-field
./run.sh -i ./oes-field.tfvars -p -s
State File Conflicts
Symptom
terraform apply
# Error: state file is locked
Solution
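# Check who has the lock: the "Lock Info" block printed with the error
# lists the lock ID, the owner (user@host), and when the lock was created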
# If lock is stale, force unlock (careful!)
terraform force-unlock <LOCK_ID>
# Use remote state backend to avoid conflicts
# In main.tf:
terraform {
backend "s3" {
bucket = "opendso-terraform-state"
key = "client_name/terraform.tfstate"
region = "us-west-2"
}
}
Storage and Performance Issues
Root Volume Too Small
Symptom
df -h
# /dev/xvda shows 100% usage
Solution
Resize EBS volume:
# 1. Modify volume size via AWS Console or CLI
aws ec2 modify-volume --volume-id vol-02e75bd22ce99980d --size 40
# 2. SSH to instance
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>
# 3. Extend filesystem
sudo growpart /dev/xvda 1
sudo xfs_growfs -d / # For xfs filesystem
# OR
sudo resize2fs /dev/xvda1 # For ext4 filesystem
# 4. Verify
df -h
Update Terraform:
# In main.tf
resource "aws_instance" "app_server" {
# ... existing config ...
root_block_device {
volume_size = 40 # Match new size
}
}
Docker Out of Space
Symptom
df -h
# Root filesystem shows 100% usage
# Operations fail with "no space left on device"
docker exec mongodb mongodump ...
# Error: no space left on device
Diagnosis
1. Check overall disk usage:
df -h
# Identify which filesystem is full
2. Check Docker's disk usage:
docker system df
# Shows usage by images, containers, volumes, and build cache
3. Identify containers with large logs:
# Check total container log size
sudo du -sh /var/lib/docker/containers/
# Find largest container logs (top 20)
sudo sh -c 'cd /var/lib/docker/containers && du -sh */*-json.log 2>/dev/null' | sort -h | tail -20
4. Identify which containers are generating logs:
# Replace <CONTAINER_ID> with the ID from the log file path
docker ps -a --no-trunc | grep <CONTAINER_ID>
Solution
Clean up container logs:
After 4+ months of operation, container logs commonly grow to several GB. The most common culprits are database containers (PostgreSQL, MongoDB, Citus) and services with verbose logging.
# Truncate specific container logs (safe, containers keep running)
# Replace <CONTAINER_ID> with the actual container ID from diagnosis step
sudo truncate -s 0 /var/lib/docker/containers/<CONTAINER_ID>/<CONTAINER_ID>-json.log
# Example: Truncate logs for multiple containers
# Find container IDs from the diagnosis output, then truncate each one
sudo truncate -s 0 /var/lib/docker/containers/<CONTAINER_ID_1>/<CONTAINER_ID_1>-json.log
sudo truncate -s 0 /var/lib/docker/containers/<CONTAINER_ID_2>/<CONTAINER_ID_2>-json.log
# Verify disk space freed
df -h
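When many containers are involved, selecting logs by size is less error-prone than copying container IDs by hand. The following sketch lists only the log files above a size threshold; the function name `large_container_logs` and the 100 MiB default are assumptions, and it needs to run from a root shell to read /var/lib/docker.

```shell
# List *-json.log files under a Docker containers directory that are larger
# than a threshold in MiB. Output: "<size in KiB> <path>", largest last.
large_container_logs() {
    dir=${1:-/var/lib/docker/containers}
    min_mib=${2:-100}
    find "$dir" -type f -name '*-json.log' -size +"${min_mib}"M -exec du -k {} + | sort -n
}

# Example use from a root shell (not run here):
#   large_container_logs /var/lib/docker/containers 500
#   large_container_logs | while read -r _ path; do truncate -s 0 "$path"; done
```

Review the list before piping it into `truncate`; the second example empties every log the first command would print.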
Configure log rotation to prevent recurrence:
Add logging configuration to your docker-compose.yml file:
services:
citus-db:
image: citusdata/citus:12.1.2-alpine
logging:
driver: "json-file"
options:
max-size: "10m" # Maximum 10MB per log file
max-file: "3" # Keep 3 rotating files (30MB total)
# ... rest of configuration
mongodb:
image: mongo:latest
logging:
driver: "json-file"
options:
max-size: "10m"
max-file: "3"
# ... rest of configuration
# Apply to other high-volume logging services as needed
Apply the changes:
cd ~/opendso/opendso-docker-compose
# Restart affected containers to apply log rotation
docker compose up -d <service-name>
# Or restart all services
./run.sh -p all -d
./run.sh -p all -c
Alternative: Clean up Docker system:
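# Remove unused images, containers, networks, volumes, and build cache
# WARNING: This removes stopped containers, unused images, and unused volumes.
# Data in a volume that no container references is deleted; drop --volumes to keep volumes.
docker system prune -a --volumes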
# Check output directory
du -sh ~/output/*
# Clean old application logs
find ~/output -name "*.log" -mtime +30 -delete
Prevention:
- Configure log rotation for all services in docker-compose.yml (recommended 10MB × 3 files)
- Monitor disk usage regularly: df -h and docker system df
- Set up alerts when disk usage exceeds 80%
- Automate cleanup: Create a cron job to prune old logs monthly
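The 80% alert in the list above can start as a simple filter over `df` output before a full monitoring stack is in place. This is a minimal sketch, assuming POSIX-format output from `df -P` and mount points without spaces; `disk_over_threshold` is a hypothetical helper name.

```shell
# Read `df -P` output on stdin and print "<mount point> <use>%" for every
# filesystem whose usage exceeds the threshold (default 80).
# Returns 1 if at least one filesystem is over the threshold, 0 otherwise.
disk_over_threshold() {
    awk -v limit="${1:-80}" '
        NR > 1 {
            use = $5; sub(/%/, "", use)          # strip the percent sign
            if (use + 0 > limit + 0) { print $6 " " use "%"; over = 1 }
        }
        END { exit over ? 1 : 0 }
    '
}

# Example use (not run here):
#   df -P | disk_over_threshold 80 || echo "disk usage alert on $(hostname)"
```

The non-zero return status makes the function usable as a condition in a wrapper script that sends the alert by whatever channel the team already uses.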
Monitoring and Logging Issues
How do I View Application Logs?
Solution
# SSH to instance
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>
# Check output directory
ls -la ~/output/
# View logs
tail -f ~/output/<service-name>/app.log
# View Docker logs
docker compose logs -f <service-name>
# View system logs
sudo journalctl -u docker -f
Service Database Connection Failures
Symptom: Historian Service Failing with Database Errors
docker logs historian-svc --tail 50
# FATAL historian: Broken postgres connection
# connection to server at "citus-db" failed: FATAL: database "ofmb_db" does not exist
Diagnosis
1. Check service logs for database errors:
# Check historian logs
docker logs historian-svc | grep -i "database\|error\|connection"
# Check what the service is trying to connect to
docker inspect historian-svc | grep -E "PGUSER|PGDATABASE|PGHOST"
2. List existing databases:
# Connect to the PostgreSQL container and list databases
docker exec citus-db psql -U postgres -c "\l"
3. Verify service configuration:
# Check environment variables for database credentials
cat ~/config/docker/.env | grep -i historian
Root Cause
Missing databases typically indicate one of the following:
- Intentional service disablement - The service may be disabled for resource or functional reasons
- Incomplete deployment - Database initialization scripts may not have run during initial setup
- Configuration mismatch - Service expects a database that wasn't part of the deployment plan
Solution
DO NOT manually create the database - Simply creating an empty database will not set up the required schema, tables, indexes, or user permissions that the service expects. This will lead to additional errors.
Recommended approach:
1. Verify if the service should be running:
# Check with your team or deployment documentation
# Services like historian may be intentionally disabled in some environments
2. If the service should be running, investigate the proper initialization:
# Check if there's an initialization script for the service
ls ~/opendso/opendso-docker-compose/scripts/
ls ~/config/
# Look for database migration or setup scripts
docker exec historian-svc ls /app/ | grep -i init
docker exec historian-svc ls /app/ | grep -i migrate
3. Check if the service auto-creates its database on first run:
# Some services automatically create and initialize their database
# Stop and restart the service to trigger initialization
docker stop historian-svc
docker rm historian-svc
# Recreate the service (from the compose directory)
cd ~/opendso/opendso-docker-compose
./run.sh -p historian -c
# Monitor logs during startup
docker logs -f historian-svc
4. Contact support or consult deployment documentation:
If the service requires database setup, there should be documented procedures for:
- Database schema initialization
- User permissions setup
- Required extensions or configurations
Common missing databases:
- ofmb_db - Historian service database for OpenFMB time-series data
- Other service-specific databases may vary by deployment
Temporary workaround (not recommended for production):
If you need to temporarily disable a failing service:
cd ~/opendso/opendso-docker-compose
docker stop historian-svc
# The service will remain stopped until explicitly started
Note: Always verify with your team before making database changes. Some services are intentionally disabled, and manually creating databases without proper schema can cause data integrity issues.
Backup and Recovery Issues
MongoDB Backup Fails
Symptom
./run.sh -b
# Error: MongoDB container is not running
Solution
# Check MongoDB is running
docker ps | grep mongodb
# Start if needed
./run.sh -p api -c
# Verify MongoDB connection (mongosh ships with MongoDB 5.0+ images; the legacy mongo shell was removed in 6.0)
docker exec mongodb mongosh --eval "db.adminCommand('ping')"
# Retry backup
./run.sh -b
Can't Restore Backup
Symptom
./run.sh -r
# Error: db.dump file does not exist
Solution
# Download backup from S3 (if stored there)
aws s3 cp s3://your-backup-bucket/opendso/db-latest.dump db.dump
# Verify file exists
ls -la db.dump
# Restore
./run.sh -r
Release Archive and setup-opendso.sh Issues
The deployment host pulls versioned opendso.zip, config.zip, and models.zip archives from GitHub releases via setup-opendso.sh. Most update problems on production hosts come from one of: a missing/failed release workflow, a token without access to the release, or stale files left over from the previous deployment. This section covers each.
Note (OES internal): The setup-opendso.sh script that ships on deployed hosts is the same script that lives in the internal IaC repository under assets/setup-compose.sh. If you are working in the IaC project you'll see the setup-compose.sh filename; on every deployed client host the file is setup-opendso.sh. Treat them as the same script. All client-facing references in these docs use setup-opendso.sh.
setup-opendso.sh Exits with "Error GITHUB_TOKEN not set"
Symptom:
./setup-opendso.sh
# Error GITHUB_TOKEN not set; exiting
Cause: The script requires GITHUB_TOKEN to be exported in the shell — it exits with code 5 if the variable is empty.
Solution:
export GITHUB_TOKEN="ghp_xxx_your_token"
./setup-opendso.sh
# Or set it inline for a single run
GITHUB_TOKEN="ghp_xxx_your_token" ./setup-opendso.sh
The token must be a GitHub PAT with the repo scope so it can read release assets from the private OES repos.
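Wrapper scripts that call setup-opendso.sh (a cron job or an update script, for example) can fail early with the same status instead of waiting for the script to start. The guard below is a hypothetical sketch that mirrors the documented behaviour; it is not the script's actual code, and `require_github_token` is an assumed name.

```shell
# Return status 5 with the documented message when GITHUB_TOKEN is unset or empty.
require_github_token() {
    if [ -z "${GITHUB_TOKEN:-}" ]; then
        echo "Error GITHUB_TOKEN not set; exiting" >&2
        return 5
    fi
}

# Example use in a wrapper script (not run here):
#   require_github_token || exit $?
#   ./setup-opendso.sh
```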
"Unable to find release for openenergysolutions/{repo}"
Symptom:
./setup-opendso.sh
# Unable to find release for openenergysolutions/{client_repo_name}-config-docker-compose
Cause: setup-opendso.sh calls releases/latest and reads .assets[0].id. This message means the GitHub API returned null for the asset id — either the repo has no releases at all, or the most recent release has no attached asset (the tag-archive workflow failed or hasn't run).
Diagnosis:
# Confirm the repo actually has a release
curl -sL \
-H "Authorization: Bearer $GITHUB_TOKEN" \
-H "Accept: application/vnd.github+json" \
https://api.github.com/repos/openenergysolutions/{client_repo_name}-config-docker-compose/releases/latest \
| jq '{tag_name, name, assets: [.assets[] | {name, size}]}'
If tag_name is missing or assets is empty, the workflow didn't publish a zip.
Solution:
- Open the repo's Actions tab on GitHub and find the workflow run for the tag.
- If the run failed, fix the underlying error and re-tag (delete the original tag and the partial release first — see the "Tagging Versioned Updates" section of the Production Deployment guide).
- If the run never started, confirm the tag was actually pushed (git ls-remote --tags origin). The workflow only triggers on push: tags: '*'.
- Once a fresh release with a config.zip / models.zip asset is published, re-run ./setup-opendso.sh on the host.
Token Doesn't Have Access to the Release Asset
Symptom: The script runs without errors but the resulting config.zip / models.zip is tiny (a few hundred bytes) and unzip fails:
./setup-opendso.sh
# Archive: config.zip
# End-of-central-directory signature not found.
Cause: The first curl to /releases/latest returned an asset id, but the second curl to /releases/assets/{id} returned a JSON error body (saved to config.zip) because the token doesn't have permission to download from a private repo.
Diagnosis:
# Check what was actually downloaded
file config.zip
head -c 500 config.zip
# If it shows JSON like {"message": "Not Found", ...} the token is the problem
Solution: Generate a new PAT with repo scope and confirmed access to the OES org's private repos, then re-run.
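The manual `file` / `head` inspection can be turned into a check that runs before `unzip`. A non-empty ZIP archive starts with the bytes `PK\003\004`, while a saved API error starts with `{`. The sketch below tests for that signature; `looks_like_zip` is a hypothetical helper name.

```shell
# Return 0 if the file starts with the ZIP local-file-header signature
# (hex 50 4b 03 04), non-zero otherwise.
looks_like_zip() {
    [ "$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n')" = "504b0304" ]
}

# Example use on the host (not run here):
#   looks_like_zip config.zip || { echo "config.zip is not an archive:"; head -c 500 config.zip; }
```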
~/config/ or ~/models/ Has Stale Files After Update
Symptom: A new tag was deployed but a file you expected to be gone (or renamed) is still present in ~/config/ or ~/models/.
Cause: setup-opendso.sh runs unzip -o, which overwrites files but does not delete files that exist on disk and are no longer in the archive.
Solution: Wipe the directory before running the script (back it up first if you have host-local edits):
# Stop services
cd ~/opendso/opendso-docker-compose
./run.sh -p all -d
# Backup, then remove
cd ~
mv config config-backup-$(date +%Y%m%d)
mv models models-backup-$(date +%Y%m%d)
# Re-pull
export GITHUB_TOKEN="ghp_xxx"
./setup-opendso.sh
# Confirm the version stamp written by the release workflow
cat ~/config/.version
cat ~/models/.version
# Restart
cd ~/opendso/opendso-docker-compose
./run.sh -p all -c
This is the same pattern used in real client model updates: rename models → models-backup, delete the leftover models.zip, re-run the script, then cd into the new models/ directory.
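If wiping the whole directory is too disruptive, the leftover files can be listed first by comparing the directory against the archive's file list. The sketch below assumes the archive paths are relative to the target directory (as with `unzip -d config`) and that the manifest was produced with `unzip -Z1`; `list_stale_files` is a hypothetical helper name.

```shell
# Print files that exist under a directory but are not named in a manifest
# (one relative path per line; directory entries ending in "/" are ignored).
list_stale_files() {
    dir=$1
    manifest=$2
    on_disk=$(mktemp)
    in_archive=$(mktemp)
    ( cd "$dir" && find . -type f | sed 's|^\./||' | sort ) > "$on_disk"
    grep -v '/$' "$manifest" | sort > "$in_archive"
    comm -23 "$on_disk" "$in_archive"      # lines only in the on-disk list
    rm -f "$on_disk" "$in_archive"
}

# Example use on the host (not run here):
#   unzip -Z1 config.zip > /tmp/config.manifest
#   list_stale_files ~/config /tmp/config.manifest
```

Anything it prints is a file that `unzip -o` will leave behind, whether it is a stale release file or a host-local edit, so review the list before deleting.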
Wrong Version Deployed (Need to Roll Back)
Symptom: setup-opendso.sh pulled a tag that broke the deployment, and you need the previous tag back.
Cause: The script always pulls releases/latest — there is no per-tag pinning in config.json. Whatever release is most recent on GitHub is what lands on the host.
Solution options:
- Re-publish the previous tag as the latest release. On GitHub, edit the older release and either re-publish it (which updates its published_at timestamp and makes it latest) or delete the bad release entirely. Then re-run setup-opendso.sh on the host.
- Manually download the older asset. From the host, hit the GitHub API for the specific tag instead of releases/latest:
# Find the tagged release
curl -sL \
-H "Authorization: Bearer $GITHUB_TOKEN" \
https://api.github.com/repos/openenergysolutions/{client_repo_name}-config-docker-compose/releases/tags/1.3.9 \
| jq '.assets[0].id'
# Download by asset id (replace ASSET_ID)
curl -sL \
-H "Accept: application/octet-stream" \
-H "Authorization: Bearer $GITHUB_TOKEN" \
https://api.github.com/repos/openenergysolutions/{client_repo_name}-config-docker-compose/releases/assets/ASSET_ID \
--output config.zip
rm -rf config && unzip -o config.zip -d config
- Cut a forward-rolling fix tag (e.g. tag 1.4.2 containing the contents of 1.3.9). This keeps the version history monotonic and avoids confusing future deploys.
config.json Missing or Malformed
Symptom:
./setup-opendso.sh
# parse error: Invalid numeric literal at line 1, column 5
# OR
# No releases found in config.json
Cause: config.json next to the script is either missing, not valid JSON, or has an empty releases array.
Solution:
# Verify the file exists and parses
ls -la ~/config.json
jq . ~/config.json
# Expected shape
# {
# "organization": "openenergysolutions",
# "releases": [
# { "repositoryName": "opendso-docker-compose", "displayName": "opendso" },
# { "repositoryName": "{client_repo_name}-docker-compose", "displayName": "models" },
# { "repositoryName": "{client_repo_name}-config-docker-compose", "displayName": "config" }
# ]
# }
If the file was lost, restore it from the IaC repo (the file ships alongside the script in the deployment provisioning assets).
Update and Upgrade Issues
Component Update Fails
Symptom
New OpenDSO release won't download or deploy
Solution
# SSH to instance
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>
# Stop services
cd ~/opendso/opendso-docker-compose
./run.sh -p all -d
# Re-run setup with valid token
export GITHUB_TOKEN="your-token"
cd ~
./setup-opendso.sh
# Check downloads succeeded
ls -la ~/opendso/
ls -la ~/config/
ls -la ~/models/
# Confirm the version stamp written by the release workflow
cat ~/config/.version
cat ~/models/.version
# Restart services
cd ~/opendso/opendso-docker-compose
./run.sh -p all -c
System Package Update Breaks Docker
Solution
# If Docker stops working after yum update
sudo systemctl status docker
# Reinstall if needed
sudo yum reinstall docker
# Restart
sudo systemctl restart docker
# Verify
docker ps
Emergency Procedures
Complete Service Outage
Recovery Steps:
# 1. Check instance status
aws ec2 describe-instances --instance-ids i-05380585c09e29881
# 2. If stopped, start it
aws ec2 start-instances --instance-ids i-05380585c09e29881
# 3. SSH to instance
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>
# 4. Check Docker
sudo systemctl status docker
sudo systemctl start docker
# 5. Check containers
docker ps -a
# 6. Restart services
cd ~/opendso/opendso-docker-compose
./run.sh -p all -c
# 7. Monitor logs
docker compose logs -f
Disaster Recovery
Full Rebuild:
# 1. Backup any critical data
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<PRIVATE_IP>
cd ~/opendso/opendso-docker-compose
./run.sh -b
aws s3 cp db.dump s3://backup-bucket/emergency-backup-$(date +%Y%m%d).dump
# 2. Destroy infrastructure
terraform destroy -var-file="./client_name.tfvars" -var="github_token=$GITHUB_TOKEN"
# 3. Redeploy
./run.sh -i ./client_name.tfvars -p -s
# 4. Restore data
# (SSH to new instance and restore db.dump)
Best Practices for Production
- Use Remote State: Store Terraform state in S3
- Regular Backups: Automate MongoDB backups to S3
- Monitoring: Set up CloudWatch alarms
- Documentation: Document manual changes immediately
- Testing: Test in lab before deploying to field
- Security: Restrict security groups to specific IPs
- Updates: Keep Terraform config in sync with reality
- Certificates: Automate renewal with cron
- Logs: Implement log rotation
- Disaster Recovery: Test recovery procedures regularly
Quick Reference
Common Commands
# Check instance status
aws ec2 describe-instances --instance-ids i-05380585c09e29881
# View Terraform outputs
terraform output
terraform output -raw apphost_ip
# SSH to instance
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@$(terraform output -raw apphost_ip)
# Check drift
terraform plan -var-file="./client_name.tfvars" -var="github_token=$GITHUB_TOKEN"
# View state
terraform show
# Check security groups
aws ec2 describe-security-groups --group-ids sg-08bafdc3e38fd6cf2
# Test DNS
dig api.client_name.oesinc.dev
# Check certificates
ssh -i ~/.ssh/tf_id_rsa.pem ec2-user@<IP> "sudo certbot certificates"
Next Steps
- For Docker issues, see Docker Troubleshooting
- For local issues, see Local Troubleshooting
- Return to Troubleshooting Overview
- See the Production Deployment Guide