AI & ML

Optimizing AI Performance: The Role of Evaluations in Digital Twins and Expert Review

Raj Setty
April 2, 2025
5 min read

Be patient with your prompting and work with your AI like a new engineer.

Understanding AI Evaluations

The term eval, or evaluation, refers to the process of assessing an AI model's performance and effectiveness against specified criteria. If you want better results from your AI, you need to "evaluate" its answers against a bounded data set within a digital twin.

These answers then need to be "evaluated" by a domain expert, like a consulting engineer.

The Two-Layer Evaluation Process

Layer 1: Digital Twin Evaluation

The first evaluation happens automatically within the digital twin framework, checking:

Asset Accuracy:

  • Which assets were referenced in the response
  • Whether the correct assets were identified
  • Accuracy score (0-100)

Data Accuracy:

  • Which metrics were used
  • Current values vs. historical baseline
  • Deviation score (0-100)

Rule Compliance:

  • Which operational rules were applied
  • Whether the rules were applied correctly
  • Compliance score (0-100)

Overall Evaluation:

  • Overall score (0-100)
  • Confidence level (high, medium, or low)
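
To make the automated layer concrete, here is a minimal sketch of what such a scorecard could look like in code. The field names, equal weighting, and confidence thresholds are illustrative assumptions, not an actual schema.

```python
from dataclasses import dataclass

@dataclass
class DigitalTwinEvaluation:
    """Automated scorecard from the digital twin layer (illustrative structure)."""
    referenced_assets: list[str]  # assets the AI cited in its answer
    asset_accuracy: float         # 0-100: were the correct assets identified?
    data_accuracy: float          # 0-100: metric values vs. historical baseline
    rule_compliance: float        # 0-100: were operational rules applied correctly?

    @property
    def overall_score(self) -> float:
        # Equal weighting is an assumption; a real system would tune these weights.
        return (self.asset_accuracy + self.data_accuracy + self.rule_compliance) / 3

    @property
    def confidence(self) -> str:
        # Example thresholds only.
        score = self.overall_score
        return "high" if score >= 80 else "medium" if score >= 60 else "low"
```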

Layer 2: Expert Evaluation

The second evaluation involves human domain expertise:

Evaluator Information:

  • Name and credentials
  • Years of experience in the field

Assessment Criteria (1-5 scale):

  • Technical accuracy
  • Practical viability
  • Safety considerations
  • Cost effectiveness
  • Code compliance

Expert Feedback:

  • What was correct in the AI response
  • What was incorrect or missing
  • Missed considerations
  • Suggestions for improvement

Final Verdict: Approve, revise, or reject
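
A matching record for the human layer might look like the sketch below. The fields mirror the criteria above, but the exact names and types are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    APPROVE = "approve"
    REVISE = "revise"
    REJECT = "reject"

@dataclass
class ExpertEvaluation:
    """Human expert review of an AI response (illustrative structure)."""
    evaluator_name: str
    credentials: str              # e.g. "PE"
    years_experience: int
    # Assessment criteria, each scored 1-5
    technical_accuracy: int
    practical_viability: int
    safety_considerations: int
    cost_effectiveness: int
    code_compliance: int
    # Free-form feedback
    what_was_correct: str
    what_was_incorrect: str
    missed_considerations: str
    suggestions: str
    verdict: Verdict
```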

Why This Two-Layer Approach Works

The AI as a Junior Engineer

Think of your AI assistant as a bright but inexperienced junior engineer. They have access to a lot of information and can make connections quickly, but they:

  • Lack real-world experience - Haven't seen what actually works in the field
  • Need guidance - Benefit from correction and feedback
  • Improve with practice - Get better as they learn from evaluations
  • Require supervision - Should have their work reviewed by experts

Just as you wouldn't let a brand-new engineer make critical decisions without review, you shouldn't blindly trust AI outputs without evaluation.

Implementing the Evaluation Framework

Step 1: Define Evaluation Criteria

For building operations, key criteria include:

Technical Criteria:

  • Equipment compatibility
  • Capacity matching
  • Energy efficiency
  • System integration

Practical Criteria:

  • Installation feasibility
  • Maintenance requirements
  • Spare parts availability
  • Operator training needs

Economic Criteria:

  • First costs
  • Operating costs
  • Lifecycle costs
  • Return on investment

Regulatory Criteria:

  • Building codes
  • Energy codes
  • Safety standards
  • Environmental regulations
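
One lightweight way to capture these criteria is a plain configuration object that the evaluation code can loop over. The category weights and check names below are hypothetical, shown only to illustrate the idea.

```python
# Hypothetical evaluation criteria; weights and check names are assumptions.
EVALUATION_CRITERIA = {
    "technical": {
        "weight": 0.35,
        "checks": ["equipment_compatibility", "capacity_matching",
                   "energy_efficiency", "system_integration"],
    },
    "practical": {
        "weight": 0.25,
        "checks": ["installation_feasibility", "maintenance_requirements",
                   "spare_parts_availability", "operator_training"],
    },
    "economic": {
        "weight": 0.20,
        "checks": ["first_cost", "operating_cost", "lifecycle_cost", "roi"],
    },
    "regulatory": {
        "weight": 0.20,
        "checks": ["building_codes", "energy_codes",
                   "safety_standards", "environmental_regulations"],
    },
}
```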

Step 2: Create Bounded Test Cases

Build a library of known scenarios with verified correct answers:

Example Test Case: Office Building Cooling Load

Scenario:

  • Description: Office building cooling load increase
  • Building: Office-A
  • Chiller: Chiller-01
  • Current load: 450 tons
  • Design load: 500 tons
  • Question: "Our cooling load increased to 480 tons. Do we need a new chiller?"

Expected Answer:

  • Recommendation: No new chiller required. Current unit can handle 480 tons.
  • Reasoning:
    • Current chiller (Chiller-01) design capacity: 500 tons
    • Current load: 480 tons (96% of capacity)
    • Operating within design parameters
    • Efficiency curve still optimal at this load
  • Alternatives to consider:
    • Monitor for further load increases
    • Consider load reduction measures
    • Plan for future expansion if load trends continue
  • Warnings:
    • Limited capacity remaining for future growth
    • No redundancy if unit fails during peak load

Expert Verification:

  • Verified by: John Smith, PE
  • Date: January 15, 2025
  • Notes: Verified chiller performance curves and load projections
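
Captured as data, the same test case might look something like this. The schema is a sketch; the keys and structure are assumptions rather than a prescribed format.

```python
# Illustrative bounded test case; keys and structure are assumptions.
chiller_test_case = {
    "id": "office-a-cooling-load-001",
    "scenario": {
        "description": "Office building cooling load increase",
        "building": "Office-A",
        "chiller": "Chiller-01",
        "current_load_tons": 450,
        "design_load_tons": 500,
        "question": "Our cooling load increased to 480 tons. Do we need a new chiller?",
    },
    "expected_answer": {
        "recommendation": "No new chiller required; Chiller-01 can handle 480 tons",
        "reasoning_must_mention": ["design capacity", "96% of capacity"],
        "alternatives": ["monitor", "load reduction", "future expansion"],
        "warnings": ["limited capacity", "no redundancy"],
    },
    "verification": {
        "verified_by": "John Smith, PE",
        "date": "2025-01-15",
        "notes": "Verified chiller performance curves and load projections",
    },
}
```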

Step 3: Run Automated Evaluations

Test AI responses against the bounded dataset:

Evaluation Process:

  1. Parse the AI Response - Extract key elements from the answer
  2. Compare to Expected Answer:
    • Does the recommendation match?
    • Is the reasoning sound and complete?
    • Are alternatives provided?
    • Are warnings included?
  3. Calculate Score - Based on comparison results
  4. Determine Pass/Fail - Score must be 75% or higher to pass
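
A minimal scoring routine along these lines could compare an AI answer to the expected answer in the test case above. The keyword-matching checks and the 75% threshold follow the steps listed, but the implementation details are assumptions; a production system would more likely use semantic similarity or an LLM grader rather than literal string matching.

```python
def evaluate_response(ai_response: str, expected: dict, pass_threshold: float = 75.0) -> dict:
    """Score an AI response against a bounded test case (illustrative keyword checks)."""
    text = ai_response.lower()
    checks = {
        # 1. Does the recommendation match? (crude containment check on the key phrase)
        "recommendation": expected["recommendation"].split(";")[0].strip().lower() in text,
        # 2. Is the reasoning sound and complete?
        "reasoning": all(term.lower() in text for term in expected["reasoning_must_mention"]),
        # 3. Are alternatives provided?
        "alternatives": any(alt.lower() in text for alt in expected["alternatives"]),
        # 4. Are warnings included?
        "warnings": any(w.lower() in text for w in expected["warnings"]),
    }
    score = 100.0 * sum(checks.values()) / len(checks)
    return {"score": score, "passed": score >= pass_threshold, "checks": checks}

# Example usage: evaluate_response(ai_answer, chiller_test_case["expected_answer"])
```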

The Feedback Loop

The real power comes from the feedback loop:

1. AI Generates Response

The AI answers a query based on its current understanding and bounds.

2. Digital Twin Evaluation

The system automatically checks the response against known data and rules.

3. Expert Review

A domain expert reviews the response and provides feedback.

4. Learning and Improvement

The feedback is incorporated to improve future responses:

Improvement Cycle:

Before Changes:

  • Initial accuracy level
  • List of common errors

Changes Made:

  • Adjusted bounds and parameters
  • Added new rules
  • Incorporated additional examples

After Changes:

  • New accuracy level
  • Errors that were fixed
  • Remaining issues to address

Next Steps:

  • Planned improvements for the next iteration
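
Keeping each cycle as a structured record makes the before/after trend easy to track. The layout below is one possible shape; the values and field names are illustrative.

```python
# Illustrative improvement-cycle record; values and field names are assumptions.
improvement_cycle = {
    "cycle": 3,
    "before": {"accuracy": 62, "common_errors": ["wrong asset references", "missing warnings"]},
    "changes": ["tightened asset bounds", "added redundancy rule", "added new test cases"],
    "after": {"accuracy": 78, "fixed": ["asset references"], "remaining": ["cost estimates"]},
    "next_steps": ["add lifecycle-cost examples", "expand fault scenarios"],
}
```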

Real-World Example: HVAC Troubleshooting

Initial Query:

"Why is Zone 3B running warm?"

AI Response (First Attempt - No Evaluation):

"The zone might be warm due to several reasons: thermostat issues, ductwork problems, refrigerant levels, or compressor malfunction."

Digital Twin Eval Score: 25% - Generic, not specific to the building

AI Response (After Evaluation Training):

"Zone 3B is running warm (current: 76°F, setpoint: 72°F) due to:

1. Primary Issue: VAV damper for Zone 3B (VAV-12) is stuck at 45% open (should modulate to 85% based on current demand)

2. Contributing Factor: Supply air temperature from AHU-02 is 58°F (2°F higher than setpoint of 56°F)

3. Root Cause: Damper actuator VAV-12-ACT last calibrated 18 months ago (maintenance interval: 12 months)

Recommended Actions:

  • Immediate: Manually verify VAV-12 damper operation
  • Short-term: Recalibrate or replace actuator VAV-12-ACT
  • Long-term: Review AHU-02 chilled water valve control

Expected resolution time: 2-4 hours
Estimated cost: $450-$800 (parts + labor)"

Digital Twin Eval Score: 92% - Specific, actionable, bounded to actual assets

Expert Eval: Approved - "Correct diagnosis, appropriate recommendations"

Best Practices for Evaluation

1. Start Simple

Begin with straightforward scenarios before moving to complex edge cases.

2. Document Everything

Keep detailed records of evaluations, feedback, and improvements.

3. Regular Review Cycles

Schedule periodic reviews with domain experts (weekly or monthly).

4. Measure Progress

Track accuracy improvements over time:

| Evaluation Cycle | Accuracy Score | Common Error Types |
|------------------|----------------|---------------------|
| Initial Baseline | 45% | Generic responses, missing context |
| After 1 Month | 62% | Better asset identification |
| After 3 Months | 78% | Improved reasoning chains |
| After 6 Months | 87% | Reliable recommendations |

5. Celebrate Wins, Learn from Failures

Both successful and failed evaluations provide valuable learning opportunities.

The Path to AI Maturity

Stage 1: Unbounded AI (0-30% accuracy)

Generic responses, not useful for real decisions

Stage 2: Bounded AI (30-60% accuracy)

Asset-aware, but still needs significant human review

Stage 3: Evaluated AI (60-80% accuracy)

Reliable for routine scenarios, expert review for complex cases

Stage 4: Expert-Level AI (80-95% accuracy)

Trustworthy recommendations, minimal review needed

Note: We'll never reach 100% because building operations involve judgment calls where multiple valid approaches exist.

Conclusion

Improving AI performance in building operations requires patience and a systematic evaluation process:

  1. Bound your AI within a digital twin framework
  2. Evaluate responses against known scenarios and real-time data
  3. Involve experts to validate and provide feedback
  4. Iterate continuously based on evaluation results
  5. Treat AI as a junior engineer that's learning and improving

Remember: Good AI is not about getting perfect answers immediately—it's about building a system that gets better every day through rigorous evaluation.

"Be patient with your prompting and work with your AI like a new engineer." - Raj Setty


Interested in implementing an AI evaluation framework for your operations? Contact us to discuss how Syyclops can help.

Read the original LinkedIn post here.
