Optimizing AI Performance: The Role of Evaluations in Digital Twins and Expert Review
Be patient with your prompting and work with your AI like a new engineer.
Understanding AI Evaluations
The term "eval" (or evaluation) refers to the process of assessing an AI model's performance and effectiveness against specified criteria. If you want better results from your AI, you need to "evaluate" its answers against a bounded dataset within a digital twin.
Those answers then need to be "evaluated" by a domain expert, such as a consulting engineer.
The Two-Layer Evaluation Process
Layer 1: Digital Twin Evaluation
The first evaluation happens automatically within the digital twin framework, checking:
Asset Accuracy:
- Which assets were referenced in the response
- Whether the correct assets were identified
- Accuracy score (0-100)
Data Accuracy:
- Which metrics were used
- Current values vs. historical baseline
- Deviation score (0-100)
Rule Compliance:
- Which operational rules were applied
- Whether the rules were applied correctly
- Compliance score (0-100)
Overall Evaluation:
- Overall score (0-100)
- Confidence level (high, medium, or low)
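As a concrete (if simplified) sketch, the Layer 1 result could be captured in a small record like the one below. The field names and thresholds are illustrative assumptions, not an actual Syyclops schema:

```python
from dataclasses import dataclass

@dataclass
class DigitalTwinEvaluation:
    """Automated Layer 1 scores produced by the digital twin."""
    referenced_assets: list[str]   # assets the AI response mentions
    expected_assets: list[str]     # assets the correct answer should reference
    asset_accuracy: int            # 0-100
    data_accuracy: int             # 0-100, deviation from the historical baseline
    rule_compliance: int           # 0-100

    def overall(self) -> tuple[int, str]:
        # Unweighted average of the three component scores; a real system
        # would likely weight and threshold these differently.
        score = round((self.asset_accuracy + self.data_accuracy + self.rule_compliance) / 3)
        confidence = "high" if score >= 85 else "medium" if score >= 60 else "low"
        return score, confidence
```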
Layer 2: Expert Evaluation
The second evaluation involves human domain expertise:
Evaluator Information:
- Name and credentials
- Years of experience in the field
Assessment Criteria (1-5 scale):
- Technical accuracy
- Practical viability
- Safety considerations
- Cost effectiveness
- Code compliance
Expert Feedback:
- What was correct in the AI response
- What was incorrect or missing
- Missed considerations
- Suggestions for improvement
Final Verdict: Approve, revise, or reject
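The Layer 2 review can be stored in a similar record so it stays attached to the same response. Again, this is a hypothetical sketch of the data shape, not a prescribed format:

```python
from dataclasses import dataclass

@dataclass
class ExpertEvaluation:
    """Layer 2 human review of an AI response."""
    evaluator: str                  # name and credentials, e.g. "John Smith, PE"
    years_experience: int
    scores: dict[str, int]          # 1-5 per criterion: technical accuracy, practical
                                    # viability, safety, cost effectiveness, code compliance
    correct: str                    # what the AI got right
    incorrect_or_missing: str       # what was wrong or missing
    missed_considerations: str
    suggestions: str
    verdict: str                    # "approve", "revise", or "reject"
```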
Why This Two-Layer Approach Works
The AI as a Junior Engineer
Think of your AI assistant as a bright but inexperienced junior engineer. They have access to a lot of information and can make connections quickly, but they:
- Lack real-world experience - Haven't seen what actually works in the field
- Need guidance - Benefit from correction and feedback
- Improve with practice - Get better as they learn from evaluations
- Require supervision - Should have their work reviewed by experts
Just as you wouldn't let a brand-new engineer make critical decisions without review, you shouldn't blindly trust AI outputs without evaluation.
Implementing the Evaluation Framework
Step 1: Define Evaluation Criteria
For building operations, key criteria include:
Technical Criteria:
- Equipment compatibility
- Capacity matching
- Energy efficiency
- System integration
Practical Criteria:
- Installation feasibility
- Maintenance requirements
- Spare parts availability
- Operator training needs
Economic Criteria:
- First costs
- Operating costs
- Lifecycle costs
- Return on investment
Regulatory Criteria:
- Building codes
- Energy codes
- Safety standards
- Environmental regulations
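One way to apply these criteria consistently is to keep them in a weighted rubric that both reviewers and evaluation code can reference. The grouping below mirrors the lists above; the weights are placeholders to tune for your own portfolio:

```python
# Hypothetical rubric: category -> (criteria, weight in the overall score)
EVALUATION_RUBRIC = {
    "technical": (["equipment compatibility", "capacity matching",
                   "energy efficiency", "system integration"], 0.35),
    "practical": (["installation feasibility", "maintenance requirements",
                   "spare parts availability", "operator training needs"], 0.25),
    "economic": (["first costs", "operating costs",
                  "lifecycle costs", "return on investment"], 0.20),
    "regulatory": (["building codes", "energy codes",
                    "safety standards", "environmental regulations"], 0.20),
}

# The weights should sum to 1.0 so category scores roll up to a single 0-100 number.
assert abs(sum(weight for _, weight in EVALUATION_RUBRIC.values()) - 1.0) < 1e-9
```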
Step 2: Create Bounded Test Cases
Build a library of known scenarios with verified correct answers:
Example Test Case: Office Building Cooling Load
Scenario:
- Description: Office building cooling load increase
- Building: Office-A
- Chiller: Chiller-01
- Current load: 450 tons
- Design load: 500 tons
- Question: "Our cooling load increased to 480 tons. Do we need a new chiller?"
Expected Answer:
- Recommendation: No new chiller required. Current unit can handle 480 tons.
- Reasoning:
  - Current chiller (Chiller-01) design capacity: 500 tons
  - Current load: 480 tons (96% of capacity)
  - Operating within design parameters
  - Efficiency curve still optimal at this load
- Alternatives to consider:
  - Monitor for further load increases
  - Consider load reduction measures
  - Plan for future expansion if load trends continue
- Warnings:
  - Limited capacity remaining for future growth
  - No redundancy if unit fails during peak load
Expert Verification:
- Verified by: John Smith, PE
- Date: January 15, 2025
- Notes: Verified chiller performance curves and load projections
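Storing each test case as structured data makes it straightforward for the automated checks in Step 3 to run against it. Here is a sketch of what the chiller scenario above might look like; the schema is illustrative:

```python
chiller_test_case = {
    "scenario": {
        "description": "Office building cooling load increase",
        "building": "Office-A",
        "chiller": "Chiller-01",
        "current_load_tons": 450,
        "design_load_tons": 500,
        "question": "Our cooling load increased to 480 tons. Do we need a new chiller?",
    },
    "expected_answer": {
        "recommendation": "No new chiller required; Chiller-01 can handle 480 tons",
        "reasoning": [
            "Chiller-01 design capacity is 500 tons",
            "480 tons is 96% of capacity, within design parameters",
            "Efficiency curve remains acceptable at this load",
        ],
        "alternatives": [
            "Monitor for further load increases",
            "Consider load reduction measures",
            "Plan for future expansion if load trends continue",
        ],
        "warnings": [
            "Limited capacity remaining for future growth",
            "No redundancy if the unit fails during peak load",
        ],
    },
    "expert_verification": {
        "verified_by": "John Smith, PE",
        "date": "2025-01-15",
        "notes": "Verified chiller performance curves and load projections",
    },
}
```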
Step 3: Run Automated Evaluations
Test AI responses against the bounded dataset:
Evaluation Process:
1. Parse the AI Response - Extract key elements from the answer
2. Compare to Expected Answer:
   - Does the recommendation match?
   - Is the reasoning sound and complete?
   - Are alternatives provided?
   - Are warnings included?
3. Calculate Score - Based on the comparison results
4. Determine Pass/Fail - The score must be 75% or higher to pass
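Here is a minimal sketch of that scoring step, using the test-case structure from Step 2. The keyword matching and equal weights are deliberate simplifications; a production system would use more robust parsing (or a model-based judge) and tuned weights:

```python
def evaluate_response(ai_response: str, test_case: dict, pass_threshold: float = 0.75) -> dict:
    """Score an AI response against a bounded test case (rough keyword-matching sketch)."""
    expected = test_case["expected_answer"]
    text = ai_response.lower()

    def coverage(points: list[str]) -> float:
        # Fraction of expected points at least partially echoed in the response.
        def mentioned(point: str) -> bool:
            words = [w for w in point.lower().split() if len(w) > 3]
            return sum(w in text for w in words) >= max(1, len(words) // 2)
        return sum(mentioned(p) for p in points) / len(points) if points else 1.0

    checks = {
        "recommendation": coverage([expected["recommendation"]]),
        "reasoning": coverage(expected["reasoning"]),
        "alternatives": coverage(expected["alternatives"]),
        "warnings": coverage(expected["warnings"]),
    }
    score = sum(checks.values()) / len(checks)
    return {"checks": checks, "score": round(score * 100), "passed": score >= pass_threshold}
```

Running this against the chiller test case from Step 2 yields a 0-100 score and a pass/fail flag at the 75% threshold described above.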
The Feedback Loop
The real power comes from the feedback loop:
1. AI Generates Response
The AI answers a query based on its current understanding and bounds.
2. Digital Twin Evaluation
The system automatically checks the response against known data and rules.
3. Expert Review
A domain expert reviews the response and provides feedback.
4. Learning and Improvement
The feedback is incorporated to improve future responses:
Improvement Cycle:
Before Changes:
- Initial accuracy level
- List of common errors
Changes Made:
- Adjusted bounds and parameters
- Added new rules
- Incorporated additional examples
After Changes:
- New accuracy level
- Errors that were fixed
- Remaining issues to address
Next Steps:
- Planned improvements for the next iteration
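Logging each cycle in a consistent shape makes it easy to see whether the loop is actually moving accuracy in the right direction. A hypothetical entry (all values illustrative):

```python
improvement_cycle = {
    "before": {
        "accuracy": 0.62,  # illustrative numbers only
        "common_errors": ["generic responses", "wrong asset references"],
    },
    "changes": [
        "Tightened asset bounds in the digital twin",
        "Added rule: flag VAV dampers overdue for calibration",
        "Added three verified test cases for chilled-water plant scenarios",
    ],
    "after": {
        "accuracy": 0.71,
        "fixed": ["wrong asset references"],
        "remaining": ["weak cost estimates"],
    },
    "next_steps": ["Incorporate maintenance cost history into the bounded dataset"],
}
```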
Real-World Example: HVAC Troubleshooting
Initial Query:
"Why is Zone 3B running warm?"
AI Response (First Attempt - No Evaluation):
"The zone might be warm due to several reasons: thermostat issues, ductwork problems, refrigerant levels, or compressor malfunction."
Digital Twin Eval Score: 25% - Generic, not specific to the building
AI Response (After Evaluation Training):
"Zone 3B is running warm (current: 76°F, setpoint: 72°F) due to:
1. Primary Issue: VAV damper for Zone 3B (VAV-12) is stuck at 45% open (should modulate to 85% based on current demand)
2. Contributing Factor: Supply air temperature from AHU-02 is 58°F (2°F higher than setpoint of 56°F)
3. Root Cause: Damper actuator VAV-12-ACT last calibrated 18 months ago (maintenance interval: 12 months)
Recommended Actions:
- Immediate: Manually verify VAV-12 damper operation
- Short-term: Recalibrate or replace actuator VAV-12-ACT
- Long-term: Review AHU-02 chilled water valve control

Expected resolution time: 2-4 hours
Estimated cost: $450-$800 (parts + labor)"
Digital Twin Eval Score: 92% - Specific, actionable, bounded to actual assets
Expert Eval: Approved - "Correct diagnosis, appropriate recommendations"
Best Practices for Evaluation
1. Start Simple
Begin with straightforward scenarios before moving to complex edge cases.
2. Document Everything
Keep detailed records of evaluations, feedback, and improvements.
3. Regular Review Cycles
Schedule periodic reviews with domain experts (weekly or monthly).
4. Measure Progress
Track accuracy improvements over time:
| Evaluation Cycle | Accuracy Score | Common Error Types |
|---|---|---|
| Initial Baseline | 45% | Generic responses, missing context |
| After 1 Month | 62% | Better asset identification |
| After 3 Months | 78% | Improved reasoning chains |
| After 6 Months | 87% | Reliable recommendations |
5. Celebrate Wins, Learn from Failures
Both successful and failed evaluations provide valuable learning opportunities.
The Path to AI Maturity
Stage 1: Unbounded AI (0-30% accuracy)
Generic responses, not useful for real decisions
Stage 2: Bounded AI (30-60% accuracy)
Asset-aware, but still needs significant human review
Stage 3: Evaluated AI (60-80% accuracy)
Reliable for routine scenarios, expert review for complex cases
Stage 4: Expert-Level AI (80-95% accuracy)
Trustworthy recommendations, minimal review needed
Note: We'll never reach 100% because building operations involve judgment calls where multiple valid approaches exist.
Conclusion
Improving AI performance in building operations requires patience and a systematic evaluation process:
- Bound your AI within a digital twin framework
- Evaluate responses against known scenarios and real-time data
- Involve experts to validate and provide feedback
- Iterate continuously based on evaluation results
- Treat AI as a junior engineer that's learning and improving
Remember: Good AI is not about getting perfect answers immediately—it's about building a system that gets better every day through rigorous evaluation.
"Be patient with your prompting and work with your AI like a new engineer." - Raj Setty
Interested in implementing an AI evaluation framework for your operations? Contact us to discuss how Syyclops can help.
Read the original LinkedIn post here.