Building a Smart Feedback Loop: Real-Time Inference on COBOL Logs

Introduction

Modern data pipelines don't stop at processing—they evolve. With our eks_cobol system running legacy COBOL code on Kubernetes and logging structured outputs, we’ve laid the foundation for a smarter system. Now it’s time to close the loop.

In this article, we show how to integrate the SageMaker model from Article 5 into a real-time feedback loop. Instead of just reacting to COBOL job results, we proactively intercept bad inputs before they cause failures. We'll cover how inference is triggered pre-execution, how results are logged and acted upon, and how this closes the loop between legacy batch logic and modern ML-based automation.

The Loop: From Prediction to Action

Here’s the basic feedback loop:

  1. File is ingested and analyzed.
  2. Metadata is extracted (size, record count, filename, etc.).
  3. Metadata is sent to the SageMaker inference endpoint.
  4. If the predicted probability of failure > threshold:
    • File is flagged or quarantined.
    • User is alerted.
    • File is optionally skipped from COBOL execution.
  5. Otherwise, the file proceeds to COBOL job processing.

We use the exact SageMaker endpoint created in Article 5 to power the loop.

Trigger Point: Right After File Ingest

The feedback loop starts after a file lands in the mounted EFS directory. Our ingestion service performs lightweight analysis—no full record parsing, just enough metadata for inference.

Example features (extraction sketched after the list):

  • Byte size (os.path.getsize)
  • Filename pattern (date, region)
  • Number of records (quick line count)
  • Known anomalies (e.g., blank lines)
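
A minimal sketch of that lightweight scan, assuming a hypothetical extract_metadata() helper (the exact fields must match whatever feature set the Article 5 model was trained on):

import os

def extract_metadata(input_file_path):
    # Cheap, parse-free scan: byte size, filename, record count, and blank-line anomalies
    metadata = {
        'filename': os.path.basename(input_file_path),
        'byte_size': os.path.getsize(input_file_path),
    }

    record_count = 0
    blank_lines = 0
    with open(input_file_path, 'rb') as f:
        for line in f:
            record_count += 1
            if not line.strip():
                blank_lines += 1

    metadata['record_count'] = record_count
    metadata['blank_lines'] = blank_lines
    return metadata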

We wrap this logic in a predict_failure_risk() function that calls the SageMaker endpoint.

import os

import boto3

def predict_failure_risk(input_file_path):
    size = os.path.getsize(input_file_path)
    name = os.path.basename(input_file_path)

    # Create simple one-hot encoding for file extension
    extension = name.split('.')[-1]
    ext_flags = [1 if extension == 'csv' else 0]  # Extend for more types as needed

    # Assemble the feature vector in the same order the model was trained on
    features = [size] + ext_flags
    payload = ','.join(map(str, features))

    # Call the real-time SageMaker endpoint from Article 5 with a CSV payload
    response = boto3.client('sagemaker-runtime').invoke_endpoint(
        EndpointName='cobol-failure-predictor',
        ContentType='text/csv',
        Body=payload
    )

    # The endpoint returns a single failure probability as plain text
    score = float(response['Body'].read().decode())
    return score

If the returned score exceeds our threshold (0.8 for high confidence), we act.

Risk Routing: High vs. Low Confidence Paths

We define 3 potential paths based on model confidence:

  1. Low Risk (< 0.5): File is processed normally.
  2. Medium Risk (0.5–0.8): File is tagged but proceeds; alerts may be logged.
  3. High Risk (> 0.8): File is moved to /mnt/data/quarantine/, skipped from execution, and flagged for review.

These thresholds are tunable based on model accuracy, job cost, and risk tolerance.

The routing logic is embedded into the controller script before the COBOL job kicks off:

score = predict_failure_risk('/mnt/data/input/job123.csv')

if score > 0.8:
    # High risk: quarantine the file and skip COBOL execution entirely
    print("High failure risk. Skipping COBOL execution.")
    move_to_quarantine('/mnt/data/input/job123.csv')
elif score > 0.5:
    # Medium risk: log the warning, but the file still proceeds to the job
    print("Medium risk. Proceeding with caution.")
    run_cobol('/mnt/data/input/job123.csv')
else:
    print("Low risk. Running job.")
    run_cobol('/mnt/data/input/job123.csv')
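
move_to_quarantine() and run_cobol() are thin helpers around plumbing we already have. Here's a minimal sketch, assuming the quarantine directory above; the run_cobol_job.sh shell call is purely illustrative and stands in for however the Kubernetes COBOL Job is actually launched in eks_cobol:

import os
import shutil
import subprocess

QUARANTINE_DIR = '/mnt/data/quarantine'

def move_to_quarantine(input_file_path):
    # Move the risky file out of the input directory so the batch runner never sees it
    os.makedirs(QUARANTINE_DIR, exist_ok=True)
    shutil.move(input_file_path, os.path.join(QUARANTINE_DIR, os.path.basename(input_file_path)))

def run_cobol(input_file_path):
    # Hand the file to the COBOL job launcher; replace with the real Kubernetes Job submission
    subprocess.run(['./run_cobol_job.sh', input_file_path], check=True)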

Logging and Traceability

For every prediction, we log:

  • Job ID
  • Score
  • Action taken
  • Timestamp

These logs are sent to CloudWatch and optionally to a DynamoDB "job decisions" table for auditing.

{
  "jobId": "job123",
  "score": 0.91,
  "decision": "quarantined",
  "timestamp": "2025-04-03T18:12:30Z"
}
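
A minimal sketch of how that record could be emitted and persisted, assuming a DynamoDB table named job_decisions (both the table name and the log_decision() helper are illustrative):

import json
from datetime import datetime, timezone
from decimal import Decimal

import boto3

def log_decision(job_id, score, decision):
    record = {
        'jobId': job_id,
        'score': score,
        'decision': decision,
        'timestamp': datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'),
    }

    # Emit to stdout; the cluster's log agent ships container output to CloudWatch
    print(json.dumps(record))

    # Optional audit trail in DynamoDB, which stores numbers as Decimal rather than float
    record['score'] = Decimal(str(score))
    boto3.resource('dynamodb').Table('job_decisions').put_item(Item=record)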

This gives us full traceability from ingestion through prediction to final action.

Feedback into the Model

To keep the loop smart, we must keep evolving the model. So, for every prediction:

  • A correct decision is reinforced via the logs.
  • A wrong decision is flagged for retraining.

A Lambda function watches the quarantine bucket. If a file in quarantine is later processed successfully by an engineer, it’s tagged as a false positive and fed into the retraining dataset. This self-healing process makes the model more precise over time.
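
A sketch of that Lambda, under the assumption that quarantined files are mirrored to S3 and an engineer marks a successful manual rerun by copying the file to a reprocessed/ prefix (the bucket layout and prefixes are illustrative):

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # Triggered by S3 ObjectCreated events on the reprocessed/ prefix: a quarantined
    # file that later ran cleanly is a false positive for the model
    for rec in event['Records']:
        bucket = rec['s3']['bucket']['name']
        key = rec['s3']['object']['key']
        filename = key.split('/')[-1]

        # Copy the file into the retraining dataset, labeled as a non-failure
        s3.copy_object(
            Bucket=bucket,
            CopySource={'Bucket': bucket, 'Key': key},
            Key=f'retraining/false_positives/{filename}'
        )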

Business Impact

Before this feedback loop, bad jobs would:

  • Run anyway, wasting CPU time.
  • Cause cascading failures in downstream services.
  • Require postmortem triage.

Now, we proactively flag risky inputs. Engineers focus only on edge cases. Overall job success rates improve, and so does trust in the system.

This loop also enables us to A/B test different models, thresholds, and routing logic—giving us a lab for optimization without interrupting the production flow.
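
One way to run that experiment, sketched under assumptions: hash each job ID into a stable bucket and send a small fraction of jobs to a candidate endpoint or threshold (the endpoint names and the 10% split are illustrative).

import hashlib

def choose_variant(job_id, experiment_fraction=0.1):
    # Deterministic assignment: the same job ID always gets the same variant
    bucket = int(hashlib.sha256(job_id.encode()).hexdigest(), 16) % 100
    if bucket < experiment_fraction * 100:
        return {'endpoint': 'cobol-failure-predictor-v2', 'threshold': 0.75}  # candidate
    return {'endpoint': 'cobol-failure-predictor', 'threshold': 0.8}          # control

The decision log from the previous section then records which variant handled each job, so outcomes can be compared side by side.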

Conclusion

COBOL jobs don’t have to be dumb. By wrapping them in modern ML pipelines, we get real-time intelligence that prevents failures before they happen. SageMaker gives us prediction. Kubernetes gives us orchestration. And a simple controller gives us the glue to wire it all together.

With a smart feedback loop in place, eks_cobol becomes more than a modernization play—it becomes a self-improving system that learns from its own failures.

