When COBOL Fails: Real-Time Error Management with S3 and JSON

Introduction

In any system that processes large volumes of data, failure is inevitable. And when you're running legacy COBOL applications as part of a modern pipeline, error handling becomes even more critical. COBOL programs weren’t built to emit structured error logs or integrate with cloud-native monitoring tools. So when something breaks—bad input, missing fields, logic bugs—you need a system that captures, logs, and routes those failures in a way that’s actionable.

In this article, we’ll dive deep into the error handling strategy behind our eks_cobol project. You’ll learn how we built a fault-tolerant process that logs COBOL errors to Amazon S3 in structured JSON format, enabling real-time observability and downstream integration with ML services for predictive insights. This is a story of wrapping an ancient workhorse in battle armor and plugging it into the cloud.

Why Error Handling Needs to Be Rethought for Legacy Code

COBOL’s default behavior when encountering errors is to crash, print something vaguely helpful to STDOUT, or continue silently failing in ways that break downstream logic. That might have flown in the 1980s, but today’s systems demand traceability, alerts, and remediation.

The goal isn’t just to capture when a COBOL job fails, but why it failed, and then to package that information for:

  • Debugging by engineers
  • Reruns with fixed data
  • Machine learning analysis
  • Visual dashboards or ticket automation

Error Capture Strategy: Let COBOL Do Its Thing, Then Intercept

We don’t modify the COBOL program much—instead, we catch its output and behavior externally. Here's how it works:

  1. The COBOL job runs inside a Kubernetes pod using a shell wrapper script.
  2. STDOUT is redirected to an output JSON file, and STDERR is captured in a raw error log.
  3. The exit code and log contents are evaluated at the end of the job to determine success or failure.
  4. If a failure is detected, the error log is parsed into a structured .error.json file and uploaded to a dedicated S3 bucket using the AWS CLI or SDK.

The shell script looks like this:

#!/bin/bash
set -e

# Compile the COBOL source; a compile failure aborts the job immediately.
cobc -x -free TransformCSV.cbl -o TransformCSV

# Run the job, capturing STDOUT as the output JSON and STDERR as the raw error log.
if ./TransformCSV > /mnt/data/output/output.json 2> /mnt/data/output/error.log; then
    echo "Job completed successfully."
else
    echo "Job failed. Parsing errors..."
    # Convert the raw STDERR log into the structured error.json contract.
    python3 parse_error.py /mnt/data/output/error.log /mnt/data/output/error.json
    # Ship the structured error to the dedicated S3 error bucket.
    aws s3 cp /mnt/data/output/error.json s3://my-cobol-errors-bucket/errors/
    exit 1
fi

This keeps the actual COBOL code clean and lets the outer logic handle the complexity of error interpretation and routing.

Structured Error Files: JSON as a Contract

One of the key improvements we made was transforming COBOL errors into structured JSON. This creates a consistent contract for downstream consumers. Here’s an example of what one of those error files looks like:

{
  "jobId": "e8f3d9d4-1c9b-4c7b-b9e2-f2345a3a9c92",
  "timestamp": "2025-04-03T19:32:10Z",
  "status": "failed",
  "errorType": "DataFormatError",
  "message": "Invalid date format in field 6",
  "inputFile": "customers_202504.csv",
  "line": 42,
  "rawRecord": "A123,John,Doe,04/35/2024,ACTIVE"
}

These JSON error logs are easier to search, visualize, and feed into automated systems than raw console logs. We’ve built tooling to parse STDERR into this format using a small Python script (parse_error.py) with regex patterns customized for our COBOL compiler output.
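
For reference, here is a minimal sketch of what parse_error.py could look like. The regex pattern, fallback values, and omission of fields such as rawRecord are illustrative simplifications rather than the project's actual implementation, which uses patterns tuned to our compiler's STDERR output.

#!/usr/bin/env python3
"""Sketch of parse_error.py: convert raw COBOL STDERR into the JSON error contract.
The regex below is an illustrative assumption; tune it to your compiler's output."""
import json
import re
import sys
import uuid
from datetime import datetime, timezone

def parse(log_path, out_path):
    with open(log_path, "r", encoding="utf-8", errors="replace") as f:
        raw = f.read()

    # Hypothetical pattern: "ERROR DataFormatError: Invalid date format in field 6 (file=x.csv, line=42)"
    match = re.search(r"ERROR\s+(\w+):\s+(.+?)\s+\(file=(.+?),\s*line=(\d+)\)", raw)

    record = {
        "jobId": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "status": "failed",
        "errorType": match.group(1) if match else "UnknownError",
        "message": match.group(2) if match else raw.strip()[:500],
        "inputFile": match.group(3) if match else None,
        "line": int(match.group(4)) if match else None,
    }

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)

if __name__ == "__main__":
    parse(sys.argv[1], sys.argv[2])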

S3 as a Scalable, Searchable Error Store

S3 gives us durability, versioning, and cost-effective long-term storage. Error logs are pushed into paths that follow this structure:

s3://my-cobol-errors-bucket/errors/yyyy/mm/dd/job-id-error.json
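
A small boto3 sketch of how a dated key in that layout could be built and uploaded; the helper name and its arguments are illustrative, not part of the project:

import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def upload_error(error_path, bucket="my-cobol-errors-bucket"):
    """Upload a structured error file under the dated errors/yyyy/mm/dd/ prefix."""
    with open(error_path, "r", encoding="utf-8") as f:
        error = json.load(f)

    now = datetime.now(timezone.utc)
    key = f"errors/{now:%Y/%m/%d}/{error['jobId']}-error.json"
    s3.upload_file(error_path, bucket, key)
    return key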

With S3 events and Lambda, we can even trigger workflows when new errors are detected—such as:

  • Notifying a Slack channel
  • Creating a Jira ticket
  • Invoking a SageMaker pipeline to retrain our error prediction model
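
As an illustration of the Slack notification path, a minimal Lambda handler subscribed to the bucket's ObjectCreated events might look like the sketch below. The webhook environment variable and message format are assumptions, not the project's actual code.

import json
import os
import urllib.request

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Triggered by S3 ObjectCreated events on the errors/ prefix."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Fetch and parse the structured error document.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        error = json.loads(body)

        # Post a short summary to Slack (webhook URL supplied via environment variable).
        message = f"COBOL job {error['jobId']} failed: {error['errorType']} - {error['message']}"
        request = urllib.request.Request(
            os.environ["SLACK_WEBHOOK_URL"],
            data=json.dumps({"text": message}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request)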

We also periodically batch-query this data using Amazon Athena or AWS Glue to produce metrics like:

  • Top 10 most common errors
  • Failures by input file type
  • Average failure rate by job type
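
For example, assuming a Glue table named cobol_errors has been defined over the JSON files (the table, database, and results location below are assumptions), the top-10 report can be kicked off from Python like this:

import boto3

athena = boto3.client("athena")

# Assumed Glue catalog objects; a crawler or DDL must define them over the errors/ prefix first.
QUERY = """
SELECT errorType, COUNT(*) AS occurrences
FROM cobol_errors
GROUP BY errorType
ORDER BY occurrences DESC
LIMIT 10
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "cobol_observability"},
    ResultConfiguration={"OutputLocation": "s3://my-cobol-errors-bucket/athena-results/"},
)
print("Query started:", response["QueryExecutionId"])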

Connecting Error Data to Amazon SageMaker

The structured error logs we store in S3 serve double duty. Beyond observability, we use them as labeled training data for a SageMaker model that predicts whether a COBOL job is likely to fail, based on characteristics of the input file (filename patterns, content, size, date range, etc.).

When a new file hits the ingestion queue, it’s first evaluated by this model. If the model flags it as high risk, the system can:

  • Route it to a special validation lane
  • Run a “dry run” job with stricter logging
  • Alert the data owner for review
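
A minimal sketch of that pre-flight check, assuming the model is deployed behind a real-time endpoint named cobol-failure-predictor (the endpoint name, feature payload, and response shape are all assumptions):

import json

import boto3

runtime = boto3.client("sagemaker-runtime")

def predict_failure_risk(filename, size_bytes, record_count):
    """Ask the failure-prediction endpoint how risky an incoming file looks."""
    payload = {
        "filename": filename,
        "sizeBytes": size_bytes,
        "recordCount": record_count,
    }
    response = runtime.invoke_endpoint(
        EndpointName="cobol-failure-predictor",  # assumed endpoint name
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    result = json.loads(response["Body"].read())
    return result["failureProbability"]  # assumed response field

# Example gate before queuing the COBOL job.
if predict_failure_risk("customers_202504.csv", 1_048_576, 25_000) > 0.8:
    print("High risk: routing to the validation lane.")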

This proactive capability didn’t exist in the mainframe era—but it’s made possible by converting COBOL’s black box behavior into structured, analyzable events.

Observability and Tracing

Once errors are stored in JSON and surfaced through S3, we can plug them into:

  • CloudWatch Metrics: Tracking success/failure over time
  • QuickSight Dashboards: Showing trends per job type or region
  • Prometheus/Grafana: For real-time job status visualization
  • OpenTelemetry: For tracing execution from data ingestion to COBOL run to S3 upload
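
As one concrete example of the CloudWatch integration, a small sketch that emits a failure count per job type whenever an error record lands (the namespace and dimension names are assumptions):

import boto3

cloudwatch = boto3.client("cloudwatch")

def record_failure(job_type):
    """Emit a custom metric so failure rates can be graphed over time."""
    cloudwatch.put_metric_data(
        Namespace="CobolPipeline",  # assumed custom namespace
        MetricData=[
            {
                "MetricName": "JobFailures",
                "Dimensions": [{"Name": "JobType", "Value": job_type}],
                "Value": 1,
                "Unit": "Count",
            }
        ],
    )

record_failure("TransformCSV")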

Each error record contains a jobId that links all logs, inputs, outputs, and metrics together. This gives engineers full end-to-end traceability for debugging or audits.

Conclusion

Legacy COBOL systems don't have to be black boxes. By wrapping them in smart containers, capturing their errors in structured JSON, and offloading that data to S3, we've created a system that is observable, maintainable, and even trainable.

This error handling architecture is key to unlocking modernization. It helps developers respond faster, empowers data teams to improve quality, and provides ML teams with real-world data to build predictive systems. Even when COBOL fails, it’s now part of a smarter system that learns and improves over time.
