
SMP Environment Matrix · dev / staging / prod

Audience: DevOps, Backend, Security


1. Environment overview

| Env | Purpose | Data | Access | URL |
|---|---|---|---|---|
| local | Developer's machine | Seed only | localhost | http://localhost:8080 |
| dev | Shared dev integration | Pseudo data | All eng team | https://dev-api.smp.vn |
| staging | Pre-prod testing | Pseudo + sample | QC, BA, PM | https://staging-api.smp.vn |
| prod | Production | Real data | Customer-facing | https://api.smp.vn |

2. Infrastructure per env

| Component | local | dev | staging | prod |
|---|---|---|---|---|
| K8s nodes | docker-compose | 1 | 2 | 3+ |
| MySQL | Docker | 1 instance | Managed (replica) | Managed (HA + 2 replicas) |
| Redis | Docker | 1 node | Cluster 3 nodes | Cluster 6 nodes |
| MongoDB | Docker | 1 instance | Atlas M10 | Atlas M30 |
| Kafka | Docker | 1 broker | 3 brokers | 5 brokers |
| Object storage | Local FS | Cloudflare R2 dev bucket | R2 staging | R2 prod |
| CDN | None | Cloudflare | Cloudflare | Cloudflare |

3. URLs matrix

Frontend

| App | dev | staging | prod |
|---|---|---|---|
| Mobile (web) | smp-mobile-dev.pages.dev | smp-mobile-staging.pages.dev | smp-mobile.pages.dev |
| Admin | smp-admin-dev.pages.dev | smp-admin-staging.pages.dev | smp-admin.pages.dev |

Backend

| Service | dev | staging | prod |
|---|---|---|---|
| API Gateway | dev-api.smp.vn | staging-api.smp.vn | api.smp.vn |
| WebSocket | dev-ws.smp.vn | staging-ws.smp.vn | ws.smp.vn |
| Internal services | <svc>.dev.smp.local (k8s internal) | <svc>.staging.smp.local | <svc>.prod.smp.local |

External integrations

| System | dev/staging | prod |
|---|---|---|
| inside | inside-staging.local | inside.local |
| wms | wms-staging.local | wms.local |

4. Resource sizing

Per service replicas

| Service | dev | staging | prod | Auto-scale |
|---|---|---|---|---|
| api-gateway | 1 | 2 | 3-10 | CPU > 70% |
| order-svc | 1 | 2 | 3-8 | CPU > 70% |
| dispatch-engine | 1 | 2 | 2-5 | Custom (queue depth) |
| catalog-svc | 1 | 2 | 2-4 | CPU > 70% |
| agent-svc | 1 | 1 | 2-4 | CPU > 70% |
| partner-svc | 1 | 1 | 2-3 | CPU > 70% |
| finance-svc | 1 | 1 | 2 | None |
| quality-svc | 1 | 1 | 2 | CPU > 70% |
| integration-svc | 1 | 2 | 3-6 | Queue lag |
| notification-svc | 1 | 1 | 2-4 | Queue depth |
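For the CPU-based rows, the replica count can be reasoned about with the standard Kubernetes Horizontal Pod Autoscaler formula, desired = ceil(current × currentUtilization / targetUtilization), clamped to the min-max range in the prod column. The sketch below is illustrative only: the function name is hypothetical, and it ignores the HPA tolerance band and stabilization window.

```go
package main

import (
	"fmt"
	"math"
)

// DesiredReplicas applies the HPA scaling formula and clamps the result to
// [minReplicas, maxReplicas]. Name and signature are illustrative, not SMP code.
func DesiredReplicas(current int, currentUtil, targetUtil float64, minReplicas, maxReplicas int) int {
	want := int(math.Ceil(float64(current) * currentUtil / targetUtil))
	if want < minReplicas {
		return minReplicas
	}
	if want > maxReplicas {
		return maxReplicas
	}
	return want
}

func main() {
	// api-gateway in prod: 3-10 replicas, target CPU 70%.
	fmt.Println(DesiredReplicas(3, 90, 70, 3, 10))  // 3 * 90/70 = 3.86, rounded up to 4
	fmt.Println(DesiredReplicas(8, 140, 70, 3, 10)) // 16, clamped to the max of 10
	fmt.Println(DesiredReplicas(5, 20, 70, 3, 10))  // 1.43 rounds to 2, raised to the min of 3
}
```

Services scaled on queue depth or lag (dispatch-engine, integration-svc, notification-svc) would need a custom or external metric in place of CPU utilization.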

Pod resources

| Service | CPU req | CPU lim | RAM req | RAM lim |
|---|---|---|---|---|
| api-gateway | 250m | 500m | 256Mi | 512Mi |
| order-svc | 500m | 1000m | 512Mi | 1Gi |
| dispatch-engine | 1000m | 2000m | 1Gi | 2Gi |
| Others | 250m | 500m | 256Mi | 512Mi |

5. Secrets management

5.1 Secret types

| Secret type | Storage | Rotation |
|---|---|---|
| JWT signing key (RS256) | HashiCorp Vault | Annual + on compromise |
| DB passwords | Vault | Quarterly |
| Redis password | Vault | Quarterly |
| MongoDB connection string | Vault | Quarterly |
| Kafka SASL credentials | Vault | Quarterly |
| inside/wms API tokens | Vault | Per partner agreement |
| Webhook signing secrets | Vault | Annual |
| SSL/TLS certs | Cloudflare Origin CA | Auto-renew |
| Cloudflare API tokens | Vault | Annual |
| OAuth client secrets (inside) | Vault | Annual |

5.2 Vault setup

Vault path structure:
  secret/smp/<env>/<service>/<key>

Examples:
  secret/smp/prod/order-svc/mysql_password
  secret/smp/prod/order-svc/redis_password
  secret/smp/prod/integration-svc/inside_api_token
  secret/smp/prod/api-gateway/jwt_signing_key
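
A small helper can keep services from hand-assembling these paths. The sketch below builds secret/smp/<env>/<service>/<key> and rejects empty or slash-containing segments. SecretPath is a hypothetical name, not an existing SMP function, and the sketch does not talk to Vault.

```go
package main

import (
	"fmt"
	"strings"
)

// SecretPath builds a path of the form secret/smp/<env>/<service>/<key>.
// It is an illustrative helper; the name is an assumption.
func SecretPath(env, service, key string) (string, error) {
	for _, part := range []string{env, service, key} {
		// A slash inside a segment would silently change the path depth.
		if part == "" || strings.Contains(part, "/") {
			return "", fmt.Errorf("invalid path segment %q", part)
		}
	}
	return strings.Join([]string{"secret", "smp", env, service, key}, "/"), nil
}

func main() {
	p, err := SecretPath("prod", "order-svc", "mysql_password")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(p) // secret/smp/prod/order-svc/mysql_password
}
```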

K8s pods access Vault via Vault Agent Injector (sidecar):

metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "order-svc"
    vault.hashicorp.com/agent-inject-secret-mysql: "secret/smp/prod/order-svc"

5.3 Local dev secrets

Developers use a .env file (gitignored). These secrets are NOT stored in Vault.

.env.example (committed) contains placeholders only:

MYSQL_PASSWORD=<dev_password>
INSIDE_API_TOKEN=<get_from_team_lead>

Onboarding a new developer: the team lead shares .env.dev through the password manager (1Password vault "SMP Engineering").

5.4 NEVER commit secrets

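The pre-commit hook runs gitleaks against staged changes:

brew install gitleaks
# gitleaks has no "install" subcommand; call it from .git/hooks/pre-commit:
gitleaks protect --staged

CI also runs gitleaks as a PR check.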

6. Configuration matrix

6.1 Per-env config

Most configuration lives in environment variables. Defaults are defined in internal/config/config.go and overridden per environment.

type Config struct {
    Server   ServerConfig
    MySQL    MySQLConfig
    Redis    RedisConfig
    Kafka    KafkaConfig
    Inside   IntegrationConfig
    WMS      IntegrationConfig
    Logging  LoggingConfig
    Tracing  TracingConfig
}
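
The default-then-override pattern can be sketched as follows. DispatchConfig is a hypothetical subset of the Config struct, the variable names come from the per-env values matrix in 6.3, and the default values are assumptions for illustration. The lookup function is injected so the loader can be tested without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// DispatchConfig is a hypothetical subset of Config, used only for this sketch.
type DispatchConfig struct {
	LogLevel        string
	RoundTimeoutSec int
	MaxRounds       int
}

// lookupFunc matches os.LookupEnv so tests can substitute a map.
type lookupFunc func(key string) (string, bool)

// intOr returns the integer value of key, or def when the variable is unset.
func intOr(lookup lookupFunc, key string, def int) (int, error) {
	raw, ok := lookup(key)
	if !ok || raw == "" {
		return def, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("%s=%q is not an integer: %w", key, raw, err)
	}
	return n, nil
}

// Load starts from compiled-in defaults (assumed values) and applies
// per-env overrides found through lookup.
func Load(lookup lookupFunc) (DispatchConfig, error) {
	cfg := DispatchConfig{LogLevel: "info"}
	if v, ok := lookup("LOG_LEVEL"); ok && v != "" {
		cfg.LogLevel = v
	}
	var err error
	if cfg.RoundTimeoutSec, err = intOr(lookup, "DISPATCH_ROUND_TIMEOUT_SEC", 60); err != nil {
		return DispatchConfig{}, err
	}
	if cfg.MaxRounds, err = intOr(lookup, "DISPATCH_MAX_ROUNDS", 3); err != nil {
		return DispatchConfig{}, err
	}
	return cfg, nil
}

func main() {
	cfg, err := Load(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Failing fast on a malformed integer, rather than silently using the default, makes a bad per-env override visible at startup.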

6.2 Feature flags

Per-env feature flags live in a ConfigMap (k8s) or a dedicated service (Phase 2 uses LaunchDarkly/Unleash).

Phase 1 uses a ConfigMap:

# k8s/prod/configmap-flags.yaml
data:
  ENABLE_PARTNER_PRIVATE_DISPATCH: "true"
  ENABLE_AUTO_VOUCHER_SUGGEST: "true"
  ENABLE_MULTI_AGENT_FLOW: "true"
  MATERIAL_VERIFY_AUTO_APPROVE_THRESHOLD_PCT: "5"
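
On the service side, these string values need to be parsed into typed flags. The sketch below assumes the ConfigMap data reaches the service as a key/value map (for example through envFrom); the Flags struct and ParseFlags are hypothetical names. Absent boolean flags stay disabled, and malformed values are rejected.

```go
package main

import (
	"fmt"
	"strconv"
)

// Flags is a hypothetical typed view of the ConfigMap keys shown above.
type Flags struct {
	PartnerPrivateDispatch  bool
	AutoVoucherSuggest      bool
	MultiAgentFlow          bool
	AutoApproveThresholdPct int
}

// ParseFlags converts ConfigMap string values into typed flags.
func ParseFlags(data map[string]string) (Flags, error) {
	var f Flags
	bools := map[string]*bool{
		"ENABLE_PARTNER_PRIVATE_DISPATCH": &f.PartnerPrivateDispatch,
		"ENABLE_AUTO_VOUCHER_SUGGEST":     &f.AutoVoucherSuggest,
		"ENABLE_MULTI_AGENT_FLOW":         &f.MultiAgentFlow,
	}
	for key, dst := range bools {
		raw, ok := data[key]
		if !ok {
			continue // absent flags stay disabled
		}
		v, err := strconv.ParseBool(raw)
		if err != nil {
			return Flags{}, fmt.Errorf("%s: %w", key, err)
		}
		*dst = v
	}
	const pctKey = "MATERIAL_VERIFY_AUTO_APPROVE_THRESHOLD_PCT"
	if raw, ok := data[pctKey]; ok {
		pct, err := strconv.Atoi(raw)
		if err != nil || pct < 0 || pct > 100 {
			return Flags{}, fmt.Errorf("%s: invalid percentage %q", pctKey, raw)
		}
		f.AutoApproveThresholdPct = pct
	}
	return f, nil
}

func main() {
	f, err := ParseFlags(map[string]string{
		"ENABLE_PARTNER_PRIVATE_DISPATCH":            "true",
		"MATERIAL_VERIFY_AUTO_APPROVE_THRESHOLD_PCT": "5",
	})
	fmt.Printf("%+v err=%v\n", f, err)
}
```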

6.3 Per-env values matrix

| Setting | dev | staging | prod |
|---|---|---|---|
| LOG_LEVEL | debug | info | info |
| DISPATCH_ROUND_TIMEOUT_SEC | 30 | 60 | 60 |
| DISPATCH_MAX_ROUNDS | 3 | 3 | 5 |
| ORDER_TIMEOUT_PAYMENT_MIN | 5 | 15 | 30 |
| WMS_RESERVATION_TTL_HOURS | 1 | 24 | 24 |
| CACHE_TTL_CUSTOMER_SEC | 10 | 30 | 30 |
| CACHE_TTL_WMS_STOCK_SEC | 10 | 30 | 30 |
| RATE_LIMIT_REQ_PER_MIN | 1000 | 200 | 60 |
| ENABLE_MOCK_PAYMENT | true | false | false |

6.5 Rules Engine deployment (v4.0)

Process for deploying rules_engine.yaml through a Kubernetes ConfigMap, mounted into pods with hot-reload.

Repo separation

github.com/trungnguyenchanh/smp-rules-config/   ← separate repo
├── environments/
│   ├── dev/
│   │   └── rules_engine.yaml
│   ├── staging/
│   │   └── rules_engine.yaml
│   └── prod/
│       └── rules_engine.yaml
├── tests/
│   └── rules_test.yaml
├── .github/workflows/
│   ├── validate.yml     # YAML syntax + expression compile check
│   └── deploy.yml       # kubectl apply per env
└── README.md

ConfigMap manifest

# environments/prod/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: smp-rules-engine
  namespace: smp-prod
  labels:
    app: smp
    component: rules-engine
    version: "4.0.0"
data:
  rules_engine.yaml: |
    version: "4.0.0"
    # ... full content ...

Pod volume mount

# helm chart dispatch-engine/templates/deployment.yaml
spec:
  containers:
    - name: dispatch-engine
      volumeMounts:
        - name: rules-config
          mountPath: /etc/smp/rules
          readOnly: true
  volumes:
    - name: rules-config
      configMap:
        name: smp-rules-engine
        defaultMode: 0444   # read-only

Per-env override matrix

| Setting | dev | staging | prod |
|---|---|---|---|
| File path in pod | /etc/smp/rules/rules_engine.yaml | same | same |
| ConfigMap source | local dev: file mount | smp-rules-config/staging | smp-rules-config/prod |
| Auto-reload latency | <5s (file watch on host) | ~60s (k8s ConfigMap sync) | ~60s |
| Approval to update | Anyone | Tech Lead | BA Lead + Tech Lead |
| Test required | None | Run make rules-test | Required + sign-off |
| Rollback | Edit file | kubectl apply of the previous ConfigMap (ConfigMaps have no rollout history) | Git revert + ArgoCD sync |

Deploy flow

1. BA edits rules_engine.yaml in the smp-rules-config repo
2. Create PR · CI runs:
   - YAML syntax validate (yamllint)
   - All expressions compile check (go run cmd/rule-lint)
   - Run rules_test.yaml fixtures (assert decisions)
3. Reviewers approve (BA Lead + Tech Lead for prod)
4. Merge main
5. GitHub Actions deploy.yml:
   - kubectl create configmap smp-rules-engine \
     --from-file=rules_engine.yaml=environments/${ENV}/rules_engine.yaml \
     --dry-run=client -o yaml | kubectl apply -f -
6. ArgoCD detects ConfigMap change → reconcile
7. Kubelet syncs ConfigMap to pods (~60s)
8. Pods detect file change via fsnotify → reload rules
9. Verify via metrics dashboard:
   - rules_loaded_total (gauge)
   - rules_last_reload_timestamp (gauge)
   - rules_reload_errors_total (counter)
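
Steps 7 to 9 can be sketched as a reload loop. The flow above names fsnotify; this sketch deliberately swaps in content-hash polling so it stays within the standard library. Polling also behaves predictably when the kubelet refreshes a ConfigMap volume by swapping a symlink instead of rewriting the file in place. Reloader, validateRules, and the Errors field (a stand-in for rules_reload_errors_total) are hypothetical names, and the validation is a placeholder for real YAML parsing and expression compilation.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"errors"
	"fmt"
	"os"
	"sync"
)

// Reloader keeps the last successfully validated rules content.
type Reloader struct {
	read     func() ([]byte, error)
	validate func([]byte) error

	mu     sync.Mutex
	sum    [sha256.Size]byte
	rules  []byte
	Errors int // stand-in for rules_reload_errors_total
}

func NewReloader(read func() ([]byte, error), validate func([]byte) error) *Reloader {
	return &Reloader{read: read, validate: validate}
}

// Poll re-reads the source and swaps in the new rules only when the content
// changed and passes validation. On failure the previous rules stay active.
func (r *Reloader) Poll() (bool, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	data, err := r.read()
	if err != nil {
		r.Errors++
		return false, err
	}
	sum := sha256.Sum256(data)
	if sum == r.sum {
		return false, nil // unchanged since the last good load
	}
	if err := r.validate(data); err != nil {
		r.Errors++
		return false, err
	}
	r.sum, r.rules = sum, data
	return true, nil
}

// Rules returns the last good rules content.
func (r *Reloader) Rules() []byte {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.rules
}

// validateRules is a placeholder check, not the real rule linter.
func validateRules(data []byte) error {
	if !bytes.Contains(data, []byte("version:")) {
		return errors.New("rules file has no version field")
	}
	return nil
}

func main() {
	r := NewReloader(func() ([]byte, error) {
		return os.ReadFile("/etc/smp/rules/rules_engine.yaml")
	}, validateRules)
	// A service would call Poll on a ticker; one call is enough for the sketch.
	changed, err := r.Poll()
	fmt.Println("changed:", changed, "err:", err, "errors:", r.Errors)
}
```

Keeping the previous rules active when validation fails means a bad ConfigMap shows up in the error counter without taking the service down.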

Rollback procedure

Quick rollback (last good version still in Git):

cd smp-rules-config
git revert HEAD            # or git checkout <good-sha> -- environments/prod/
git push origin main
# CD auto-deploys revert

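Emergency rollback (skip the Git/CD pipeline, direct kubectl):

# ConfigMaps have no revision history, so kubectl rollout history/undo
# does not apply to them. Re-apply the last known-good file instead:
git show <good-sha>:environments/prod/rules_engine.yaml > /tmp/rules_engine.yaml
kubectl create configmap smp-rules-engine -n smp-prod \
  --from-file=rules_engine.yaml=/tmp/rules_engine.yaml \
  --dry-run=client -o yaml | kubectl apply -f -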

Disaster fallback (rules engine itself broken):

  • Services ship an embedded default-rules.yaml inside the Docker image
  • If the ConfigMap mount fails, the service falls back to the embedded defaults
  • Default rules = conservative VN defaults
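
The fallback decision can be sketched as below. The defaultRules constant stands in for the default-rules.yaml shipped in the image (a real service might embed that file with go:embed), and LoadRules is a hypothetical name. Treating an empty file the same as a failed mount is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// defaultRules stands in for the default-rules.yaml shipped in the image.
const defaultRules = "version: \"default\"\n"

// LoadRules returns the mounted rules, or the embedded defaults when the
// mount cannot be read or is empty. The second value names the source used.
func LoadRules(read func() ([]byte, error)) ([]byte, string) {
	data, err := read()
	if err != nil || len(data) == 0 {
		return []byte(defaultRules), "embedded-default"
	}
	return data, "configmap"
}

func main() {
	rules, source := LoadRules(func() ([]byte, error) {
		return os.ReadFile("/etc/smp/rules/rules_engine.yaml")
	})
	fmt.Printf("loaded %d bytes from %s\n", len(rules), source)
}
```

Returning the source label makes it straightforward to log or export which rule set is active.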

Monitoring

Add Prometheus alerts:

- alert: RulesEngineReloadFailed
  expr: rate(rules_reload_errors_total[5m]) > 0
  for: 1m
  annotations:
    summary: "Rules engine reload failing on {{ $labels.pod }}"
    runbook: "https://docs.smp.vn/runbooks/rules-engine-reload-failed"

- alert: RulesEngineStale
  expr: time() - rules_last_reload_timestamp > 86400
  for: 5m
  annotations:
    summary: "Rules engine on {{ $labels.pod }} hasn't reloaded in 24h"

7. Data seeding

dev

Master data v3.3 loaded from JSON. Fake customer data (Vietnamese names library). 100 random orders.

staging

Master data v3.3. Customer + agent + partner: 1000 synthetic records. Order history: 10000 orders over the last 90 days.

prod

Master data only. Real customer/order data from launch.

Refresh procedure

  • dev: re-seedable at any time via make seed
  • staging: monthly refresh from synthetic generator
  • prod: NO seeding, real data only

8. Deployment pipeline summary

Developer push to feature branch
GitHub Actions CI runs
Open PR, peer review
Merge to main
Auto deploy to dev (within 5 min)
Manual promote to staging (PM approval)
QC regression test in staging
Manual promote to prod (release manager approval)
Deploy via blue/green to prod
Smoke test + monitor 30 min
Stable / rollback

Details: see the CI/CD pipeline doc.

9. Access control

dev

  • All engineers: read + write (DB query, log access)
  • No PII (pseudo data only)

staging

  • Engineers: read-only DB
  • QC: read-only DB
  • Ops manager: read + manual UAT actions

prod

  • Engineers: NO direct DB access (except SRE on-call via bastion)
  • Customer support: read-only via admin portal
  • Ops manager: admin portal access only
  • Finance: read-only finance domain
  • All access logged in audit log

Bastion access

# Engineer needs to debug a prod issue
ssh bastion.smp.vn       # MFA required
sudo -u smp-readonly mysql -h prod-mysql.local
# Session recorded, auto-terminate after 30 min idle

10. Environment promotion

| From | To | Trigger | Approval |
|---|---|---|---|
| main branch | dev | Auto on merge | None |
| dev | staging | Manual via UI | PM + tech lead |
| staging | prod | Manual via UI | Release manager + on-call SRE |

Rollback: 1-click revert to previous deployment image in K8s.

11. DNS & domains

| Domain | Owner | Purpose |
|---|---|---|
| smp.vn | Company | Marketing site |
| api.smp.vn | DevOps | Prod API |
| app.smp.vn | DevOps | Prod app shortcut |
| dev-api.smp.vn | DevOps | Dev API |
| staging-api.smp.vn | DevOps | Staging API |
| admin.smp.vn | DevOps | Admin portal |
| partner.smp.vn | DevOps | Partner portal landing (v3.3 redirect to admin) |
| docs.smp.vn | DevOps | Public API docs |
| status.smp.vn | DevOps | Status page (statuspage.io) |

All TLS via Cloudflare Origin CA + WAF rules.

12. Monitoring per env

| Tool | dev | staging | prod |
|---|---|---|---|
| Logs | stdout + Loki | Loki | Loki + S3 cold storage |
| Metrics | Prometheus | Prometheus | Prometheus + Mimir long-term |
| Traces | Jaeger local | Jaeger | Jaeger + Tempo |
| Alerts | None | Slack #dev-alerts | PagerDuty + Slack #prod-alerts |
| Uptime | None | Pingdom | Pingdom (1-min checks) |

13. Cost budgets (monthly)

| Env | Budget (USD) | Tracking |
|---|---|---|
| dev | $300 | Cloudflare + minimal cluster |
| staging | $800 | 2 nodes + small DBs |
| prod | $3000-5000 | Scale by traffic |

DevOps review monthly, alert if > 110% budget.
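
The 110% rule is simple enough to check in code. The sketch below uses whole-dollar integers to avoid floating-point edge cases at the threshold. Using the upper bound of the prod range ($5000) as the prod budget is an assumption, and OverBudget is a hypothetical name.

```go
package main

import "fmt"

// monthlyBudgetUSD holds the budgets from the table above. For prod, the
// upper bound of the $3000-5000 range is assumed.
var monthlyBudgetUSD = map[string]int{"dev": 300, "staging": 800, "prod": 5000}

// OverBudget reports whether spend exceeds 110% of the env budget.
// Comparing spend*100 against budget*110 keeps the arithmetic exact.
func OverBudget(env string, spendUSD int) (bool, error) {
	budget, ok := monthlyBudgetUSD[env]
	if !ok {
		return false, fmt.Errorf("unknown env %q", env)
	}
	return spendUSD*100 > budget*110, nil
}

func main() {
	for _, spend := range []int{330, 331} {
		over, _ := OverBudget("dev", spend)
		fmt.Printf("dev spend $%d over budget: %v\n", spend, over)
	}
}
```

With a $300 budget the threshold is $330, so $330 does not alert and $331 does.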


14. Multi-region deployment (v4.0 sovereignty)

For the global launch, each region needs its own cluster to comply with data sovereignty laws. This section documents the deployment topology, routing, and costs.

14.1 Cluster topology

                       ┌────────────────────────────────┐
                       │  Global Control Plane          │
                       │  (Observability + CI/CD only,  │
                       │   no PII data)                 │
                       │   Region: us-east-1 (Virginia) │
                       └────────────────────────────────┘
        ┌──────────────────────────────┼─────────────────────────────┐
        │                              │                              │
        ▼                              ▼                              ▼
┌───────────────┐             ┌───────────────┐              ┌───────────────┐
│ smp-asia      │             │ smp-china     │              │ smp-us        │
│ cluster       │             │ cluster       │              │ cluster       │
│               │             │               │              │               │
│ Region: SG    │             │ Region: BJ    │              │ Region: VA    │
│ ap-southeast-1│             │ cn-beijing    │              │ us-east-1     │
│               │             │ (AliCloud)    │              │ (AWS)         │
│               │             │               │              │               │
│ Countries:    │             │ Countries:    │              │ Countries:    │
│  VN, TH, SG,  │             │  CN (only)    │              │  US (only)    │
│  ID, MY, PH   │             │               │              │               │
└───────────────┘             └───────────────┘              └───────────────┘

14.2 Cluster sizing per region

| Cluster | Region | Provider | Estimated DAU | K8s nodes | DB tier | Est. cost/month |
|---|---|---|---|---|---|---|
| smp-asia | ap-southeast-1 (SG) | AWS | 100k | 6× m5.large | RDS MySQL db.r6g.large | $4000 |
| smp-china | cn-beijing | Alibaba Cloud | TBD (launch Q3 2026) | 3× ecs.c6.large | RDS MySQL 2-node | $1500 |
| smp-us | us-east-1 (Virginia) | AWS | TBD (launch Q4 2026) | 3× m5.large | RDS MySQL db.r6g.large | $2500 |
| Global control plane | us-east-1 | AWS | n/a | 2× t3.large | none (just observability) | $400 |

Total at full scale: ~$8400/month (vs $3000-5000 single-region today).

14.3 Cross-region data flows

| Data type | Direction | Mechanism | Latency | Compliance |
|---|---|---|---|---|
| Master data (smp_global: countries, currencies, rates, tax, i18n) | Global → All regions | Kafka MirrorMaker pull | < 5min | OK (no PII) |
| Aggregated analytics (KPI metrics, anonymized) | All regions → Global | Kafka MirrorMaker push | < 15min | Anonymized only |
| Audit log (immutable, encrypted) | All regions → Global archive | Async batch (daily) | 24h | Encrypted + key rotation |
| PII data (orders, customers, agents, partners) | NEVER cross region | n/a | n/a | Strict |
| Application metrics (Prometheus) | All regions → Global | Mimir long-term | Real-time | No PII |
| Application logs (Loki) | Each region local + sanitized index → Global | Async filtered | < 1h | PII redacted before egress |

14.4 Latency map (typical)

| From → To | Latency p50 | Latency p99 |
|---|---|---|
| smp-asia (SG) ↔ VN users | 30-50ms | 80ms |
| smp-asia (SG) ↔ TH users | 25ms | 60ms |
| smp-asia (SG) ↔ ID users | 40ms | 90ms |
| smp-china (BJ) ↔ CN users | 20-40ms | 70ms |
| smp-us (VA) ↔ US East users | 20ms | 50ms |
| smp-us (VA) ↔ US West users | 65ms | 100ms |
| Inter-region (SG ↔ BJ) | 200ms | 400ms |
| Inter-region (SG ↔ VA) | 250ms | 500ms |

14.5 Routing & client resolution

API Gateway at edge (Cloudflare Workers) determines target cluster:

// Pseudo-code
function resolveCluster(request) {
  // 1. Check JWT (authenticated user)
  const jwt = getJWT(request);
  if (jwt?.country_code) {
    return clusterFor(jwt.country_code);
  }

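  // 2. Geolocation from Cloudflare CF-IPCountry header.
  //    Workers expose headers through Headers.get(). "XX" (unknown) and
  //    "T1" (Tor) carry no usable country, so fall through to step 3.
  const country = request.headers.get('CF-IPCountry');
  if (country && country !== 'XX' && country !== 'T1') {
    return clusterFor(country);
  }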

  // 3. Default to nearest cluster by datacenter
  return defaultClusterByDC(request.cf.colo);
}

function clusterFor(country) {
  const map = {
    'CN': 'smp-china',
    'US': 'smp-us',
    // VN, TH, SG, ID, MY, PH, etc → smp-asia
  };
  return map[country] || 'smp-asia';
}

Critical: the routing decision is sticky per session. Once a user has logged in, all subsequent requests go to the same cluster (this avoids data hopping between regions).
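
The precedence implied by the sticky rule can be written as a small pure function, shown here in Go for illustration even though the edge runs on Cloudflare Workers. The pinned parameter (a cluster recorded at login, for example in the session) is an assumption, as are the function names. The final default of smp-asia simplifies the nearest-datacenter step in the pseudo-code.

```go
package main

import "fmt"

// clusterByCountry lists the countries with a dedicated cluster; every
// other country maps to smp-asia, as in the pseudo-code above.
var clusterByCountry = map[string]string{"CN": "smp-china", "US": "smp-us"}

func clusterFor(country string) string {
	if c, ok := clusterByCountry[country]; ok {
		return c
	}
	return "smp-asia"
}

// ResolveCluster applies the precedence: cluster pinned at login first,
// then the JWT country, then the IP geolocation country.
func ResolveCluster(pinned, jwtCountry, ipCountry string) string {
	switch {
	case pinned != "":
		return pinned
	case jwtCountry != "":
		return clusterFor(jwtCountry)
	case ipCountry != "":
		return clusterFor(ipCountry)
	}
	return "smp-asia" // simplification of the nearest-datacenter default
}

func main() {
	// A VN user pinned to smp-asia who travels to the US stays on smp-asia.
	fmt.Println(ResolveCluster("smp-asia", "VN", "US"))
	// An anonymous request from a CN IP address goes to smp-china.
	fmt.Println(ResolveCluster("", "", "CN"))
}
```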

14.6 Cross-region failover

Each cluster is independent — failure of 1 cluster does NOT affect others.

| Cluster down | Impact | Mitigation |
|---|---|---|
| smp-asia | VN/TH/SG/ID/MY/PH users see error | Cloudflare maintenance page · cannot serve from other clusters (sovereignty) |
| smp-china | CN users see error | Same: cannot fall back to other regions |
| smp-us | US users see error | Same |
| Global control plane | Observability/CI/CD down, but apps OK | Apps continue running with last-known config |

Implication: each cluster needs full HA (3 AZ minimum). NO cross-region failover for user data.

14.7 Deployment sequence (per release)

1. Engineer merges PR to main
2. CI builds Docker image (single image for all regions)
3. Push image to:
   - ECR ap-southeast-1 (smp-asia)
   - ACR cn-beijing (smp-china) - via VPN/proxy
   - ECR us-east-1 (smp-us)
4. ArgoCD deploys to each cluster sequentially:
   - dev (single region) → smoke test
   - staging (smp-asia only) → integration test
   - prod smp-asia → 10% traffic canary (24h soak)
   - prod smp-asia → 100% rollout
   - prod smp-us → canary (if applicable)
   - prod smp-china → canary (manual approval required)

China deployment is manual: because of firewall and ICP filing requirements, smp-china deploys are a separate process owned by the local DevOps team in CN.

14.8 Per-region cost monitoring

Tag every cloud resource with Cluster=<cluster-id> for cost allocation:

# Terraform tags
tags = {
  Environment   = "prod"
  Cluster       = "smp-asia"
  CountryGroup  = "VN-TH-SG-ID-MY-PH"
  CostCenter    = "engineering"
  Compliance    = "PDPA"
}

Cost dashboard split by Cluster tag. Alert if any cluster > 110% budget.