
CI Job False Positives: OPAM Install Errors Not Failing Build

Problem Summary

Jobs in our CI pipeline log errors during the OPAM install phase but still finish successfully, so the build is reported as passed. These false positives are dangerous: we believe builds are healthy when they are actually broken.

Evidence

Discovered in MR !14. CI job that passed despite errors: https://git.haley.io/john/melange-mvp-template/-/jobs/1154#L51

Example Error (Job Shows Green but Contains Errors)

The CI job shows as "passed" but contains errors like:

[ERROR] Package odoc.3.1.0 is not available in lock files, but is present in opam files
[ERROR] Package odoc-parser.3.1.0 is not available in lock files, but is present in opam files

Root Cause Analysis

  1. OPAM Lock File Mismatches: The .opam files contain odoc {with-doc} dependencies but the corresponding .opam.locked files don't include these packages
  2. CI Error Handling: The CI jobs are not properly failing when OPAM install steps encounter errors
  3. Silent Failures: This allows broken dependency configurations to pass CI, potentially reaching production
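
The issue has not yet pinned down which mechanism hides the failure. As a minimal sketch, here are two common shell patterns that swallow a non-zero exit status. `fake_install` is a hypothetical stand-in for the install step, not the real opam invocation from our pipeline, and the function names are illustrative only.

```shell
#!/bin/sh
# Hypothetical stand-in for a failing install step (NOT the real opam command):
# prints an opam-style error line and exits with status 1.
fake_install() {
  echo "[ERROR] Package odoc.3.1.0 is not available in lock files, but is present in opam files"
  return 1
}

# Pattern A: piping into another command. When `pipefail` is not enabled, the
# pipeline's status is that of the last command (cat), so the failure is lost.
masked_by_pipe() {
  fake_install | cat >/dev/null
}

# Pattern B: `|| true` explicitly replaces any failure with success.
masked_by_or_true() {
  fake_install >/dev/null || true
}

# Safer: redirect instead of piping, so the step's own status is returned.
status_preserved() {
  fake_install >/dev/null
}
```

A third possibility is that the tool itself exits 0 after printing `[ERROR]` lines. In that case exit-code handling alone cannot catch the problem and the job log has to be inspected as well.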

Impact

  • False Confidence: Developers think builds are passing when they're failing
  • Broken Deployments: Code with dependency issues could be deployed
  • Debugging Overhead: Time wasted investigating "working" code that actually has issues
  • Technical Debt: Accumulation of dependency mismatches over time

Investigation Done in MR !14

While trying to fix the odoc lock file errors shown above, we:

  1. Identified that all .opam files contain "odoc" {with-doc} dependencies
  2. Attempted to regenerate .opam.locked files with OPAMWITHDOC=1 opam lock
  3. Successfully added odoc to lock files but discovered the CI was hiding errors
  4. Added OPAM management guidelines to CLAUDE.md including:
    • Never manually edit .opam files (they're generated from dune-project)
    • Never manually edit .opam.locked files (regenerate with opam lock)
    • Proper procedures for dependency management
  5. Rolled back changes to avoid masking the underlying CI issue

CLAUDE.md Memory Updates

Added comprehensive OPAM file management guidelines to prevent future manual editing errors:

### OPAM File Management

- **NEVER modify `.opam` files directly** - they are generated from `dune-project`
- **To modify dependencies**: Edit `dune-project` and regenerate opam files with `dune build @all`
- **NEVER manually edit `.opam.locked` files** - regenerate them using `OPAMWITHDOC=1 opam lock ./package.opam`
- **For CI warnings about missing dependencies in lock files**: regenerate lock files rather than manually editing them
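
The mismatch described in these guidelines can also be detected before CI runs. The sketch below assumes each `package.opam` has a sibling `package.opam.locked` file and that dependencies appear as quoted names (for example `"odoc" {with-doc}`); the function name `lock_has_package` is hypothetical.

```shell
#!/bin/sh
# lock_has_package OPAM_FILE PACKAGE
# Succeeds if PACKAGE is not mentioned in OPAM_FILE, or if it is mentioned in
# both OPAM_FILE and the sibling OPAM_FILE.locked. Fails otherwise.
lock_has_package() {
  opam_file=$1
  pkg=$2
  locked_file="$opam_file.locked"

  # -F matches the quoted name literally, so dots in package names are safe.
  if ! grep -qF "\"$pkg\"" "$opam_file"; then
    return 0  # not a dependency of this package, nothing to check
  fi
  if [ ! -f "$locked_file" ]; then
    echo "missing lock file: $locked_file" >&2
    return 1
  fi
  if grep -qF "\"$pkg\"" "$locked_file"; then
    return 0
  fi
  echo "$pkg is in $opam_file but not in $locked_file" >&2
  return 1
}
```

One way to use it is a loop over the repository root, such as `for f in ./*.opam; do lock_has_package "$f" odoc || exit 1; done`. A failure here means the lock file should be regenerated, not edited by hand.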

Required Fixes

Immediate (Critical)

  1. Fix CI Error Handling: Ensure OPAM install errors cause job failure
  2. Add Exit Code Validation: Verify that non-zero exit codes from OPAM commands fail the build
  3. Review All CI Jobs: Check other jobs for similar silent failure patterns
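
Exit code validation could look roughly like the wrapper below. It is a sketch only: `run_step` is a hypothetical helper name, and how it would be wired into our job scripts is left open.

```shell
#!/bin/sh
# run_step NAME CMD [ARGS...]
# Runs CMD, prints its full log, and returns CMD's own exit status with a clear
# failure message. Output is redirected to a file and not piped, so no other
# command in a pipeline can replace the status.
run_step() {
  step_name=$1
  shift
  log_file=$(mktemp)

  if "$@" >"$log_file" 2>&1; then
    status=0
  else
    status=$?
  fi

  cat "$log_file"
  rm -f "$log_file"

  if [ "$status" -ne 0 ]; then
    echo "FAILED: step '$step_name' exited with status $status" >&2
    return "$status"
  fi
  echo "OK: step '$step_name'"
}
```

A job script might then call something like `run_step "opam install" opam install . --deps-only --locked`. That exact opam invocation is an assumption and should be replaced with whatever the job currently runs.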

Follow-up (Important)

  1. Fix OPAM Lock Dependencies: Create separate MR to properly resolve odoc lock file issues
  2. Add CI Monitoring: Implement checks to catch when jobs pass with errors
  3. Documentation: Update CI setup docs to prevent similar issues
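
For the monitoring item, one simple check is to scan a step's captured log for `[ERROR]` lines and fail even when the exit status was 0. This is a sketch: `check_log_for_errors` is a hypothetical helper, and it assumes the step's output has already been saved to a file.

```shell
#!/bin/sh
# check_log_for_errors LOG_FILE
# Fails if LOG_FILE contains a literal "[ERROR]" marker, and prints the
# offending lines (with line numbers) to stderr.
check_log_for_errors() {
  log_file=$1
  # -F treats "[ERROR]" as a fixed string, not a bracket expression.
  if grep -nF '[ERROR]' "$log_file" >&2; then
    echo "FAILED: $log_file contains [ERROR] lines" >&2
    return 1
  fi
  return 0
}
```

The same function could be pointed at downloaded logs from earlier "passing" jobs to find builds that need to be re-evaluated.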

Acceptance Criteria

  • CI jobs fail when OPAM install encounters errors
  • Job logs clearly show why builds failed
  • No job is reported as passed while its log contains OPAM dependency errors
  • All existing "passing" builds re-evaluated for hidden errors

Labels

  • bug - This is a critical bug in our CI system
  • priority::critical - False positives are dangerous for production
  • type::ci - CI/CD infrastructure issue
  • phase::foundation - Blocking foundation development work

Related

  • MR !14: Database Foundation PostgreSQL Setup (where this was discovered)
  • Future MR: Separate fix for OPAM lock file odoc dependencies
Edited by John Haley