Teaching AI to Behave: Structured Output with Amazon Bedrock in Production

The moment AI becomes useful isn’t when it produces impressive text. It’s when it becomes predictable enough to behave like a system component.

There’s a moment when generative AI moves from interesting to useful. It usually happens the first time you integrate an LLM into a real system. Not a demo. Not a playground. An actual workflow with downstream dependencies.

That’s when the problem appears. The model talks too much, too loosely, too creatively. Suddenly your architecture depends on parsing paragraphs.

That’s where structured output changes everything.


The problem most builders hit

When we first introduce AI into an application, the natural instinct is simple.

Call the model.
Get text.
Extract what we need.

It works. Until it doesn’t.

To explore this properly, consider a platform that processes unstructured input, evaluates it using AI, and produces structured signals that feed downstream workflows.

In that platform, AI isn’t decorative. It sits inside a workflow that turns unstructured input into structured signals the system can trust.

In other words, the AI is part of a pipeline. It is not the destination.

Free-form text becomes a liability.

  • You end up writing brittle regex.
  • You add defensive parsing.
  • You introduce silent failures.

And worst of all, you lose trust in the result.
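A minimal sketch of what that brittle parsing tends to look like. The model output and the regex here are invented for illustration:

```python
import re

# Illustrative free-form model output
raw = "Overall I would score this an 82 out of 100. Key strengths include clarity."

# Brittle: depends on the model phrasing the score exactly this way
match = re.search(r"score this an? (\d+)", raw)
score = int(match.group(1)) if match else None
```

One wording change from the model and `score` silently becomes `None`, which is exactly the kind of silent failure that erodes trust.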

So the goal wasn’t “use AI”.

The goal was:

Make AI behave like a component.


Why “respond with JSON” isn’t enough

Most builders start by telling the model to return JSON.

It feels structured. It looks structured. It often works.

Until it doesn’t.

You’ll see:

  • Missing fields
  • Extra fields
  • Invalid JSON
  • Fields renamed subtly
  • Narrative leaking into structured sections

That creates a new class of reliability work: AI repair logic.

Ironically, the more important the workflow becomes, the more defensive code you write around the AI.

Which is backwards.

The system shouldn’t adapt to the model.
The model should adapt to the system.


The reliability problem behind “just return JSON”

There’s a deeper issue that most builders eventually discover.

LLMs don’t generate JSON.
They generate text that looks like JSON.

That distinction matters.

Even with careful prompting you will eventually see:

  • Token truncation producing partial objects
  • Valid JSON with missing required fields
  • Subtle type drift (strings vs numbers)
  • Extra properties the system didn’t ask for
  • Structurally valid output that is semantically wrong
  • Streaming responses cut mid-structure

None of these are unusual. They are a natural side effect of probabilistic generation.

Which means asking for JSON is not a reliability strategy.

It is a hope.

Production systems shouldn’t depend on hope.
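The first two failure modes are easy to reproduce by hand. The payloads below are invented for illustration:

```python
import json

# Truncated mid-structure: parsing fails outright
truncated = '{"overall_score": 87, "strengths": ["clear arch'
try:
    parsed = json.loads(truncated)
except json.JSONDecodeError:
    parsed = None

# Valid JSON with type drift: parses fine, then breaks downstream
drifted = json.loads('{"overall_score": "87"}')
score_is_numeric = isinstance(drifted["overall_score"], (int, float))
```

The second case is the more dangerous one: nothing fails at the parsing boundary, so the bad value travels deeper into the system before anything notices.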


Enter Amazon Bedrock Structured Outputs

Amazon Bedrock’s structured output capability gives you a way to solve this properly.

Instead of asking the model to return JSON, you define a schema. Bedrock enforces that schema at generation time.

The shift is subtle but important.

You stop hoping the model follows instructions.
You give the model a contract.

That contract becomes architecture.

If the model cannot produce valid output, the call fails early. Production systems prefer predictable failure.
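In code, predictable failure looks like a hard boundary rather than repair logic. A minimal sketch, assuming the response shape used in the Converse example later in this article:

```python
def extract_structured(resp: dict) -> dict:
    """Fail fast when the structured payload is missing, instead of
    falling back to scraping free text out of the response."""
    structured = resp.get("output", {}).get("structuredOutput", {}).get("data")
    if structured is None:
        raise ValueError("no structured output in response; failing early")
    return structured
```

The point is what this function does not do: there is no fallback parser, no retry-with-a-nicer-prompt. The caller decides what a failed evaluation means.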


Where it fits in the system

Inside the platform, structured output sits within an automated evaluation workflow.

At a high level:

  1. Unstructured content is submitted via an upload or API
  2. The backend extracts relevant text or signals
  3. Bedrock evaluates the content against defined criteria
  4. The result is persisted as a structured record

Before structured output, the evaluation step was fragile.

After structured output, the shape of the result became guaranteed, even when the content varies.

The AI now returns a predictable object that the system can trust.
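The steps above can be sketched as a single pipeline boundary. Everything here is a stand-in: the helper names are hypothetical and the Bedrock call is stubbed out.

```python
def extract_text(raw: str) -> str:
    # Step 2: in a real system this might strip markup or run OCR
    return raw.strip()

def evaluate_with_bedrock(text: str, criteria: str) -> dict:
    # Step 3: stand-in for the schema-enforced Bedrock call
    return {"overall_score": 75.0, "confidence": 0.8, "strengths": [], "gaps": []}

def persist(result: dict) -> dict:
    # Step 4: stand-in for writing the structured record
    return {"id": "rec-1", **result}

def run_evaluation(raw_content: str, criteria: str) -> dict:
    """Steps 2-4 behind one function, so callers only ever see a structured record."""
    text = extract_text(raw_content)
    result = evaluate_with_bedrock(text, criteria)
    return persist(result)
```

The design choice worth noting: downstream code depends on `run_evaluation` returning a record, not on how the model phrased anything.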


Designing the schema (this is the real work)

The biggest learning wasn’t technical.

It was architectural.

A schema forces clarity.

You have to decide:

  • What the system actually needs
  • What the model should not invent
  • Which fields drive decisions versus explanation
  • Where uncertainty must be visible

In practice, the schema focuses on signals:

  • Overall match score
  • Criteria alignment breakdown
  • Strengths
  • Gaps
  • Confidence
  • Recommendation summary

The schema intentionally biases toward signals, not prose.


Code sample 1: define a compact schema contract

This is a simplified schema that captures decision signals without turning the model into a report writer.

{
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "overall_score": { "type": "number", "minimum": 0, "maximum": 100 },
    "confidence": { "type": "number", "minimum": 0, "maximum": 1 },
    "strengths": { "type": "array", "items": { "type": "string" } },
    "gaps": { "type": "array", "items": { "type": "string" } },
    "recommendation": { "type": "string" }
  },
  "required": ["overall_score", "confidence", "strengths", "gaps"]
}

A couple of intentional choices:

  • additionalProperties: false keeps the output tight and predictable.
  • required is small. You can add fields later, but start with what the system truly needs.

Code sample 2: invoke Bedrock with structured output

This example uses the Bedrock Runtime Converse API. The important part is that the schema is sent as part of the request, and you treat the structured output as the primary payload.

import boto3

bedrock = boto3.client("bedrock-runtime")

schema = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "overall_score": {"type": "number", "minimum": 0, "maximum": 100},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "strengths": {"type": "array", "items": {"type": "string"}},
        "gaps": {"type": "array", "items": {"type": "string"}},
        "recommendation": {"type": "string"},
    },
    "required": ["overall_score", "confidence", "strengths", "gaps"],
}

# input_text and criteria_text are assumed to be produced by earlier pipeline steps
messages = [
    {
        "role": "user",
        "content": [
            {"text": "Evaluate the input against the criteria."},
            {"text": "Return ONLY structured output that matches the schema."},
            {"text": f"INPUT:\n{input_text}"},
            {"text": f"CRITERIA:\n{criteria_text}"},
        ],
    }
]

resp = bedrock.converse(
    modelId="YOUR_MODEL_ID",
    messages=messages,
    inferenceConfig={
        "temperature": 0,
        "maxTokens": 1200,
    },
    outputConfig={
        "structuredOutput": {
            "schema": schema
        }
    },
)

structured = resp["output"]["structuredOutput"]["data"]
print(structured["overall_score"], structured["confidence"])

Two practical notes:

  • Keep temperature low for decision workflows.
  • Keep your schema small so it is easy to satisfy and easy to test.

Thoughtworks context: why this mattered internally

Within Thoughtworks, much of the platform work around delivery readiness, capability intelligence, and internal decision support depends on consistent signals rather than narrative interpretation.

Structured output allowed us to standardise how AI contributes to internal decision-making without introducing additional manual validation steps.

Instead of reviewing long AI summaries, teams can consume structured signals directly inside workflows. This reduces friction in essential procedures.

That shift is subtle, but operationally significant.


Code sample 3: validate and version the contract

Structured output reduces drift, but you still want a clean contract boundary inside your own code. In practice that means validating the response and versioning the schema.

Here is a minimal validation pattern using Pydantic. It gives you a single place to enforce types and defaults.

from pydantic import BaseModel, Field

class EvaluationV1(BaseModel):
    overall_score: float = Field(ge=0, le=100)
    confidence: float = Field(ge=0, le=1)
    strengths: list[str]  # required, matching the schema's "required" list
    gaps: list[str]
    recommendation: str | None = None

def parse_evaluation(payload: dict) -> EvaluationV1:
    return EvaluationV1.model_validate(payload)

# Example usage:
# evaluation = parse_evaluation(structured)
# store(evaluation.model_dump(), schema_version="v1")

This buys you two things:

  • Your downstream code stays stable even if your prompts evolve.
  • You can change the contract intentionally by introducing EvaluationV2.
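To make the versioning concrete, dispatch can be as small as a lookup table. This sketch uses dataclasses as simplified stand-ins for the Pydantic models above, and EvaluationV2's renamed field is a hypothetical migration:

```python
from dataclasses import dataclass

@dataclass
class EvaluationV1:
    overall_score: float
    confidence: float

@dataclass
class EvaluationV2:
    score: float       # hypothetical v2: renamed and normalised to 0-1
    confidence: float

PARSERS = {"v1": EvaluationV1, "v2": EvaluationV2}

def parse_versioned(payload: dict, schema_version: str):
    """Dispatch on the stored schema_version so old records stay readable."""
    return PARSERS[schema_version](**payload)
```

Because each record carries its schema version, old records keep parsing under the old contract while new ones adopt the new contract deliberately.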

Why this matters beyond AI

Structured output creates a boundary.

It turns AI from a conversation into a service dependency.

Once that boundary exists:

  • Validation shifts left
  • Persistence becomes straightforward
  • Observability improves
  • Testing becomes possible

You can now write tests against the shape of AI output.

That is a meaningful step toward production readiness.
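For instance, a contract test needs no live model call; a recorded fixture is enough. The fixture below is invented, and `check_shape` is a hypothetical helper:

```python
def check_shape(payload: dict) -> list[str]:
    """Return a list of contract violations; empty means the shape is valid."""
    errors = []
    for field, kind in [("overall_score", (int, float)),
                        ("confidence", (int, float)),
                        ("strengths", list),
                        ("gaps", list)]:
        if not isinstance(payload.get(field), kind):
            errors.append(field)
    return errors

# A recorded fixture stands in for a live model call
fixture = {"overall_score": 82, "confidence": 0.9, "strengths": ["x"], "gaps": []}
assert check_shape(fixture) == []
```

Tests like this run in milliseconds and in CI, which is exactly what a free-form text response could never offer.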


Practical lessons learned

  • Smaller schemas work better.
  • Separate signals from explanation.
  • Confidence changes automation.
  • Treat the schema as a versioned contract.
  • Failure modes improve.


What changed in the platform

Introducing structured output didn’t add complexity.

It removed it.

The workflow became:

  • more predictable
  • easier to debug
  • safer to evolve
  • easier to test
  • cheaper to maintain

AI began to feel less magical and more like infrastructure.


Zooming out

We are moving from prompt engineering toward interface design. The question is no longer what you should ask the model. The question is what contract should exist between the model and the system. Structured output makes that explicit and nudges architects to design AI the same way we design APIs.

In early AI integrations, the hardest part is not calling the model. It is trusting the result. Structured output helps earn that trust, not because it makes the model smarter, but because it makes the system clearer. And clarity scales. For builders, that is the real value.