Optimizing AI Performance: The Role of Evaluations in Digital Twins and Expert Review
Be patient with your prompting and work with your AI like a new engineer.
Understanding AI Evaluations
The term "eval" (or evaluation) refers to the process of assessing an AI model's performance and effectiveness against specified criteria. If you want better results from your AI, you need to "evaluate" its answers against a bounded dataset within a digital twin.
Those answers then need to be "evaluated" by a domain expert, such as a consulting engineer.
The Two-Layer Evaluation Process
Layer 1: Digital Twin Evaluation
The first evaluation happens automatically within the digital twin framework, checking:
Asset Accuracy:
- Which assets were referenced in the response
- Whether the correct assets were identified
- Accuracy score (0-100)
Data Accuracy:
- Which metrics were used
- Current values vs. historical baseline
- Deviation score (0-100)
Rule Compliance:
- Which operational rules were applied
- Whether the rules were applied correctly
- Compliance score (0-100)
Overall Evaluation:
- Overall score (0-100)
- Confidence level (high, medium, or low)
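As a concrete (if simplified) sketch, the Layer 1 result could be captured in a small record like the one below. The field names and thresholds are illustrative assumptions, not an actual Syyclops schema:

```python
from dataclasses import dataclass

@dataclass
class DigitalTwinEvaluation:
    """Automated Layer 1 scores produced by the digital twin."""
    referenced_assets: list[str]   # assets the AI response mentions
    expected_assets: list[str]     # assets the correct answer should reference
    asset_accuracy: int            # 0-100
    data_accuracy: int             # 0-100, deviation from the historical baseline
    rule_compliance: int           # 0-100

    def overall(self) -> tuple[int, str]:
        # Unweighted average of the three component scores; a real system
        # would likely weight and threshold these differently.
        score = round((self.asset_accuracy + self.data_accuracy + self.rule_compliance) / 3)
        confidence = "high" if score >= 85 else "medium" if score >= 60 else "low"
        return score, confidence
```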
Layer 2: Expert Evaluation
The second evaluation involves human domain expertise:
Evaluator Information:
- Name and credentials
- Years of experience in the field
Assessment Criteria (1-5 scale):
- Technical accuracy
- Practical viability
- Safety considerations
- Cost effectiveness
- Code compliance
Expert Feedback:
- What was correct in the AI response
- What was incorrect or missing
- Missed considerations
- Suggestions for improvement
Final Verdict: Approve, revise, or reject
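The Layer 2 review can be stored in a similar record so it stays attached to the same response. Again, this is a hypothetical sketch of the data shape, not a prescribed format:

```python
from dataclasses import dataclass

@dataclass
class ExpertEvaluation:
    """Layer 2 human review of an AI response."""
    evaluator: str                  # name and credentials, e.g. "John Smith, PE"
    years_experience: int
    scores: dict[str, int]          # 1-5 per criterion: technical accuracy, practical
                                    # viability, safety, cost effectiveness, code compliance
    correct: str                    # what the AI got right
    incorrect_or_missing: str       # what was wrong or missing
    missed_considerations: str
    suggestions: str
    verdict: str                    # "approve", "revise", or "reject"
```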
Why This Two-Layer Approach Works
The AI as a Junior Engineer
Think of your AI assistant as a bright but inexperienced junior engineer. They have access to a lot of information and can make connections quickly, but they:
- Lack real-world experience - Haven't seen what actually works in the field
- Need guidance - Benefit from correction and feedback
- Improve with practice - Get better as they learn from evaluations
- Require supervision - Should have their work reviewed by experts
Just as you wouldn't let a brand-new engineer make critical decisions without review, you shouldn't blindly trust AI outputs without evaluation.
Implementing the Evaluation Framework
Step 1: Define Evaluation Criteria
For building operations, key criteria include:
Technical Criteria:
- Equipment compatibility
- Capacity matching
- Energy efficiency
- System integration
Practical Criteria:
- Installation feasibility
- Maintenance requirements
- Spare parts availability
- Operator training needs
Economic Criteria:
- First costs
- Operating costs
- Lifecycle costs
- Return on investment
Regulatory Criteria:
- Building codes
- Energy codes
- Safety standards
- Environmental regulations
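One way to apply these criteria consistently is to keep them in a weighted rubric that both reviewers and evaluation code can reference. The grouping below mirrors the lists above; the weights are placeholders to tune for your own portfolio:

```python
# Hypothetical rubric: category -> (criteria, weight in the overall score)
EVALUATION_RUBRIC = {
    "technical": (["equipment compatibility", "capacity matching",
                   "energy efficiency", "system integration"], 0.35),
    "practical": (["installation feasibility", "maintenance requirements",
                   "spare parts availability", "operator training needs"], 0.25),
    "economic": (["first costs", "operating costs",
                  "lifecycle costs", "return on investment"], 0.20),
    "regulatory": (["building codes", "energy codes",
                    "safety standards", "environmental regulations"], 0.20),
}

# The weights should sum to 1.0 so category scores roll up to a single 0-100 number.
assert abs(sum(weight for _, weight in EVALUATION_RUBRIC.values()) - 1.0) < 1e-9
```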
Step 2: Create Bounded Test Cases
Build a library of known scenarios with verified correct answers:
Example Test Case: Office Building Cooling Load
Scenario:
- Description: Office building cooling load increase
- Building: Office-A
- Chiller: Chiller-01
- Current load: 450 tons
- Design load: 500 tons
- Question: "Our cooling load increased to 480 tons. Do we need a new chiller?"
Expected Answer:
- Recommendation: No new chiller required. Current unit can handle 480 tons.
- Reasoning:
  - Current chiller (Chiller-01) design capacity: 500 tons
  - Current load: 480 tons (96% of capacity)
  - Operating within design parameters
  - Efficiency curve still optimal at this load
- Alternatives to consider:
  - Monitor for further load increases
  - Consider load reduction measures
  - Plan for future expansion if load trends continue
- Warnings:
  - Limited capacity remaining for future growth
  - No redundancy if unit fails during peak load
Expert Verification:
- Verified by: John Smith, PE
- Date: January 15, 2025
- Notes: Verified chiller performance curves and load projections
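Storing each test case as structured data makes it straightforward for the automated checks in Step 3 to run against it. Here is a sketch of what the chiller scenario above might look like; the schema is illustrative:

```python
chiller_test_case = {
    "scenario": {
        "description": "Office building cooling load increase",
        "building": "Office-A",
        "chiller": "Chiller-01",
        "current_load_tons": 450,
        "design_load_tons": 500,
        "question": "Our cooling load increased to 480 tons. Do we need a new chiller?",
    },
    "expected_answer": {
        "recommendation": "No new chiller required; Chiller-01 can handle 480 tons",
        "reasoning": [
            "Chiller-01 design capacity is 500 tons",
            "480 tons is 96% of capacity, within design parameters",
            "Efficiency curve remains acceptable at this load",
        ],
        "alternatives": [
            "Monitor for further load increases",
            "Consider load reduction measures",
            "Plan for future expansion if load trends continue",
        ],
        "warnings": [
            "Limited capacity remaining for future growth",
            "No redundancy if the unit fails during peak load",
        ],
    },
    "expert_verification": {
        "verified_by": "John Smith, PE",
        "date": "2025-01-15",
        "notes": "Verified chiller performance curves and load projections",
    },
}
```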
Step 3: Run Automated Evaluations
Test AI responses against the bounded dataset:
Evaluation Process:
1. Parse the AI Response - Extract key elements from the answer
2. Compare to Expected Answer:
   - Does the recommendation match?
   - Is the reasoning sound and complete?
   - Are alternatives provided?
   - Are warnings included?
3. Calculate Score - Based on the comparison results
4. Determine Pass/Fail - The score must be 75% or higher to pass
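Here is a minimal sketch of that scoring step, using the test-case structure from Step 2. The keyword matching and equal weights are deliberate simplifications; a production system would use more robust parsing (or a model-based judge) and tuned weights:

```python
def evaluate_response(ai_response: str, test_case: dict, pass_threshold: float = 0.75) -> dict:
    """Score an AI response against a bounded test case (rough keyword-matching sketch)."""
    expected = test_case["expected_answer"]
    text = ai_response.lower()

    def coverage(points: list[str]) -> float:
        # Fraction of expected points at least partially echoed in the response.
        def mentioned(point: str) -> bool:
            words = [w for w in point.lower().split() if len(w) > 3]
            return sum(w in text for w in words) >= max(1, len(words) // 2)
        return sum(mentioned(p) for p in points) / len(points) if points else 1.0

    checks = {
        "recommendation": coverage([expected["recommendation"]]),
        "reasoning": coverage(expected["reasoning"]),
        "alternatives": coverage(expected["alternatives"]),
        "warnings": coverage(expected["warnings"]),
    }
    score = sum(checks.values()) / len(checks)
    return {"checks": checks, "score": round(score * 100), "passed": score >= pass_threshold}
```

Running this against the chiller test case from Step 2 yields a 0-100 score and a pass/fail flag at the 75% threshold described above.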
The Feedback Loop
The real power comes from the feedback loop:
1. AI Generates Response
The AI answers a query based on its current understanding and bounds.
2. Digital Twin Evaluation
The system automatically checks the response against known data and rules.
3. Expert Review
A domain expert reviews the response and provides feedback.
4. Learning and Improvement
The feedback is incorporated to improve future responses:
Improvement Cycle:
Before Changes:
- Initial accuracy level
- List of common errors
Changes Made:
- Adjusted bounds and parameters
- Added new rules
- Incorporated additional examples
After Changes:
- New accuracy level
- Errors that were fixed
- Remaining issues to address
Next Steps:
- Planned improvements for the next iteration
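Logging each cycle in a consistent shape makes it easy to see whether the loop is actually moving accuracy in the right direction. A hypothetical entry (all values illustrative):

```python
improvement_cycle = {
    "before": {
        "accuracy": 0.62,  # illustrative numbers only
        "common_errors": ["generic responses", "wrong asset references"],
    },
    "changes": [
        "Tightened asset bounds in the digital twin",
        "Added rule: flag VAV dampers overdue for calibration",
        "Added three verified test cases for chilled-water plant scenarios",
    ],
    "after": {
        "accuracy": 0.71,
        "fixed": ["wrong asset references"],
        "remaining": ["weak cost estimates"],
    },
    "next_steps": ["Incorporate maintenance cost history into the bounded dataset"],
}
```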
Real-World Example: HVAC Troubleshooting
Initial Query:
"Why is Zone 3B running warm?"
AI Response (First Attempt - No Evaluation):
"The zone might be warm due to several reasons: thermostat issues, ductwork problems, refrigerant levels, or compressor malfunction."
Digital Twin Eval Score: 25% - Generic, not specific to the building
AI Response (After Evaluation Training):
"Zone 3B is running warm (current: 76°F, setpoint: 72°F) due to:
1. Primary Issue: VAV damper for Zone 3B (VAV-12) is stuck at 45% open (should modulate to 85% based on current demand)
2. Contributing Factor: Supply air temperature from AHU-02 is 58°F (2°F higher than setpoint of 56°F)
3. Root Cause: Damper actuator VAV-12-ACT last calibrated 18 months ago (maintenance interval: 12 months)
Recommended Actions:
- Immediate: Manually verify VAV-12 damper operation
- Short-term: Recalibrate or replace actuator VAV-12-ACT
- Long-term: Review AHU-02 chilled water valve control

Expected resolution time: 2-4 hours
Estimated cost: $450-$800 (parts + labor)"
Digital Twin Eval Score: 92% - Specific, actionable, bounded to actual assets
Expert Eval: Approved - "Correct diagnosis, appropriate recommendations"
Best Practices for Evaluation
1. Start Simple
Begin with straightforward scenarios before moving to complex edge cases.
2. Document Everything
Keep detailed records of evaluations, feedback, and improvements.
3. Regular Review Cycles
Schedule periodic reviews with domain experts (weekly or monthly).
4. Measure Progress
Track accuracy improvements over time:
| Evaluation Cycle | Accuracy Score | Common Error Types |
|---|---|---|
| Initial Baseline | 45% | Generic responses, missing context |
| After 1 Month | 62% | Better asset identification |
| After 3 Months | 78% | Improved reasoning chains |
| After 6 Months | 87% | Reliable recommendations |
5. Celebrate Wins, Learn from Failures
Both successful and failed evaluations provide valuable learning opportunities.
The Path to AI Maturity
Stage 1: Unbounded AI (0-30% accuracy)
Generic responses, not useful for real decisions
Stage 2: Bounded AI (30-60% accuracy)
Asset-aware, but still needs significant human review
Stage 3: Evaluated AI (60-80% accuracy)
Reliable for routine scenarios, expert review for complex cases
Stage 4: Expert-Level AI (80-95% accuracy)
Trustworthy recommendations, minimal review needed
Note: We'll never reach 100% because building operations involve judgment calls where multiple valid approaches exist.
Conclusion
Improving AI performance in building operations requires patience and a systematic evaluation process:
- Bound your AI within a digital twin framework
- Evaluate responses against known scenarios and real-time data
- Involve experts to validate and provide feedback
- Iterate continuously based on evaluation results
- Treat AI as a junior engineer that's learning and improving
Remember: Good AI is not about getting perfect answers immediately—it's about building a system that gets better every day through rigorous evaluation.
"Be patient with your prompting and work with your AI like a new engineer." - Raj Setty
Interested in implementing an AI evaluation framework for your operations? Contact us to discuss how Syyclops can help.
Read the original LinkedIn post here.