
Introduction
Modern data pipelines don't stop at processing; they evolve. With our eks_cobol system running legacy COBOL code on Kubernetes and logging structured outputs, we've laid the foundation for a smarter system. Now it's time to close the loop.
In this article, we show how to integrate the SageMaker model from Article 5 into a real-time feedback loop. Instead of merely reacting to COBOL job results, we proactively intercept bad inputs before they cause failures. We'll cover how inference is triggered pre-execution, how results are logged and acted upon, and how this closes the loop between legacy batch logic and modern ML-based automation.
The Loop: From Prediction to Action
Here’s the basic feedback loop:
- File is ingested and analyzed.
- Metadata is extracted (size, record count, filename, etc.).
- Metadata is sent to the SageMaker inference endpoint.
- If the predicted probability of failure > threshold:
  - File is flagged or quarantined.
  - User is alerted.
  - Optionally skipped from COBOL execution.
- Otherwise, the file proceeds to COBOL job processing.
We use the same SageMaker endpoint created in Article 5 to power the loop.
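Put together, the controller logic is just a thin wrapper around these steps. Here's a hypothetical end-to-end sketch; predict_failure_risk, move_to_quarantine, run_cobol, and log_decision are defined or sketched later in this article, and the threshold defaults mirror the ones we use below.
def handle_new_file(path, high=0.8, medium=0.5):
    # Hypothetical top-level controller for the feedback loop
    job_id = path.rsplit('/', 1)[-1]   # e.g. 'job123.csv'
    score = predict_failure_risk(path)
    if score > high:
        move_to_quarantine(path)       # skip COBOL execution entirely
        decision = 'quarantined'
    elif score > medium:
        run_cobol(path)                # proceed, but flag for closer review
        decision = 'flagged'
    else:
        run_cobol(path)
        decision = 'processed'
    log_decision(job_id, score, decision)
    return decision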
Trigger Point: Right After File Ingest
The feedback loop starts after a file lands in the mounted EFS directory. Our ingestion service performs lightweight analysis—no full record parsing, just enough metadata for inference.
Example features:
- Byte size (os.path.getsize)
- Filename pattern (date, region)
- Number of records (quick line count)
- Known anomalies (e.g., blank lines)
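A minimal sketch of that metadata pass, assuming plain-text input files (the extract_metadata helper name and the blank-line anomaly check are illustrative, not the production ingestion service):
import os

def extract_metadata(input_file_path):
    # Cheap, single-pass metadata: no full record parsing
    size = os.path.getsize(input_file_path)
    name = os.path.basename(input_file_path)
    records = 0
    blank_lines = 0
    with open(input_file_path, 'r', errors='replace') as f:
        for line in f:
            records += 1
            if not line.strip():
                blank_lines += 1
    return {
        'size_bytes': size,
        'filename': name,
        'record_count': records,
        'blank_lines': blank_lines,
    }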
We wrap this logic in a predict_failure_risk() function that calls the SageMaker endpoint.
import os
import boto3

# Reuse one client across calls
sagemaker_runtime = boto3.client('sagemaker-runtime')

def predict_failure_risk(input_file_path):
    size = os.path.getsize(input_file_path)
    name = os.path.basename(input_file_path)
    # Create a simple one-hot encoding for the file extension
    extension = name.split('.')[-1]
    ext_flags = [1 if extension == 'csv' else 0]  # Extend for more types as needed
    # Other metadata features (record count, anomaly flags) would be appended here
    features = [size] + ext_flags
    payload = ','.join(map(str, features))
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName='cobol-failure-predictor',
        ContentType='text/csv',
        Body=payload
    )
    # The endpoint returns a single probability-of-failure score
    score = float(response['Body'].read().decode())
    return score
If the returned score exceeds our threshold (0.8 for high confidence), we act.
Risk Routing: High vs. Low Confidence Paths
We define three potential paths based on model confidence:
- Low Risk (< 0.5): File is processed normally.
- Medium Risk (0.5–0.8): File is tagged but proceeds; alerts may be logged.
- High Risk (> 0.8): File is moved to /mnt/data/quarantine/, skipped from execution, and flagged for review.
These thresholds are tunable based on model accuracy, job cost, and risk tolerance.
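One lightweight way to keep them tunable is to read the thresholds from the environment instead of hard-coding them; the variable names here are illustrative:
import os

# Illustrative: override per deployment without code changes
HIGH_RISK_THRESHOLD = float(os.environ.get('HIGH_RISK_THRESHOLD', '0.8'))
MEDIUM_RISK_THRESHOLD = float(os.environ.get('MEDIUM_RISK_THRESHOLD', '0.5'))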
The routing logic is embedded into the controller script before the COBOL job kicks off:
input_path = '/mnt/data/input/job123.csv'
score = predict_failure_risk(input_path)

if score > 0.8:
    print("High failure risk. Skipping COBOL execution.")
    move_to_quarantine(input_path)
else:
    if score > 0.5:
        print("Medium risk. Proceeding with caution.")
    else:
        print("Low risk. Running job.")
    run_cobol(input_path)
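The move_to_quarantine() helper isn't shown above; a minimal version, assuming the quarantine directory from earlier, could look like this:
import os
import shutil

QUARANTINE_DIR = '/mnt/data/quarantine/'

def move_to_quarantine(input_file_path):
    # Move the file out of the input directory so the job runner never sees it
    os.makedirs(QUARANTINE_DIR, exist_ok=True)
    dest = os.path.join(QUARANTINE_DIR, os.path.basename(input_file_path))
    shutil.move(input_file_path, dest)
    return dest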
Logging and Traceability
For every prediction, we log:
- Job ID
- Score
- Action taken
- Timestamp
These logs are sent to CloudWatch and optionally to a DynamoDB "job decisions" table for auditing.
{
  "jobId": "job123",
  "score": 0.91,
  "decision": "quarantined",
  "timestamp": "2025-04-03T18:12:30Z"
}
This gives us full traceability from ingestion through prediction to final action.
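As a sketch, the DynamoDB write for that decision record could look like the following; the job_decisions table name and its schema are assumptions for illustration:
import boto3
from datetime import datetime, timezone
from decimal import Decimal

dynamodb = boto3.resource('dynamodb')
decisions_table = dynamodb.Table('job_decisions')  # hypothetical table name

def log_decision(job_id, score, decision):
    decisions_table.put_item(Item={
        'jobId': job_id,
        'score': Decimal(str(score)),  # DynamoDB requires Decimal, not float
        'decision': decision,
        'timestamp': datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'),
    })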
Feedback into the Model
To keep the loop smart, we must evolve the model. For every prediction that results in:
- A correct decision → reinforce it via logs.
- A wrong decision → flag it for retraining.
A Lambda function watches the quarantine bucket. If a file in quarantine is later processed successfully by an engineer, it’s tagged as a false positive and fed into the retraining dataset. This self-healing process makes the model more precise over time.
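As a sketch, that watcher could be an S3-triggered Lambda, assuming quarantined files are mirrored to a bucket; the bucket and prefix names below are hypothetical:
import boto3

s3 = boto3.client('s3')

RETRAIN_BUCKET = 'eks-cobol-retraining'  # hypothetical
RETRAIN_PREFIX = 'false-positives/'

def lambda_handler(event, context):
    # Fired when an engineer successfully reprocesses a quarantined file
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        # Tag the object as a false positive for auditability
        s3.put_object_tagging(
            Bucket=bucket, Key=key,
            Tagging={'TagSet': [{'Key': 'label', 'Value': 'false_positive'}]}
        )
        # Copy it into the retraining dataset
        s3.copy_object(
            Bucket=RETRAIN_BUCKET,
            Key=RETRAIN_PREFIX + key.split('/')[-1],
            CopySource={'Bucket': bucket, 'Key': key}
        )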
Business Impact
Before this feedback loop, bad jobs would:
- Run anyway, wasting CPU time.
- Cause cascading failures in downstream services.
- Require postmortem triage.
Now, we proactively flag risky inputs. Engineers focus only on edge cases. Overall job success rates improve, and so does trust in the system.
This loop also enables us to A/B test different models, thresholds, and routing logic—giving us a lab for optimization without interrupting the production flow.
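For instance, a stable hash of the job ID can split traffic between two endpoint versions so each job always hits the same variant; both endpoint names here are hypothetical:
import hashlib

VARIANTS = {
    'control': 'cobol-failure-predictor',       # the Article 5 endpoint
    'candidate': 'cobol-failure-predictor-v2',  # hypothetical new model
}

def pick_endpoint(job_id, candidate_share=0.1):
    # Stable assignment: the same job always maps to the same variant
    bucket = int(hashlib.sha256(job_id.encode()).hexdigest(), 16) % 100
    return VARIANTS['candidate'] if bucket < candidate_share * 100 else VARIANTS['control']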
Conclusion
COBOL jobs don’t have to be dumb. By wrapping them in modern ML pipelines, we get real-time intelligence that prevents failures before they happen. SageMaker gives us prediction. Kubernetes gives us orchestration. And a simple controller gives us the glue to wire it all together.
With a smart feedback loop in place, eks_cobol becomes more than a modernization play; it becomes a self-improving system that learns from its own failures.