Evaluations and Approval

Purpose of Evaluations

Evaluating GenAI pipelines is key to making sure they generate reliable, high-quality responses. Without a solid evaluation process, it is hard to tell if changes—like tweaking prompts, removing LLMs, adjusting parameters, or refining retrieval steps—are actually improving performance or breaking something. By measuring factors like relevance, hallucination rates, and latency, teams can make informed decisions about how to optimize their pipelines. Integrating evaluations into CI/CD pipelines ensures that every update is tested, so performance stays consistent and issues are caught early.

The quality of an evaluation depends on having a well-rounded dataset and metrics. If test cases are too limited, models might appear to perform well but fail in real-world scenarios. A diverse dataset ensures that evaluation metrics truly reflect how the model will behave in production. At the end of the day, evaluations help build trust, keeping LLM applications accurate, scalable, and dependable across different use cases.

Choosing the Right Evaluation Method

There are two main types of evaluators for assessing LLM performance: automated evaluation (using LLMs or code) and human annotations. Each method serves a different purpose depending on the type of assessment needed.

Method	How it Works	Best For
Automated (LLM or Code-Based)	Uses an LLM to evaluate another LLM’s output or code to measure accuracy, performance, or behavior.	- Fast, scalable qualitative evaluation. - Reducing cost and latency. - Automating evaluations with hard-coded criteria (e.g., code generation).
Human Annotations	Experts manually review and label LLM outputs.	- Evaluating automated evaluation methods. - Applying subject matter expertise for nuanced assessments. - Providing real-world application feedback.

Automated evaluation is efficient for objective assessments and large-scale testing, while human annotations provide deeper insights at a higher cost. Combining both methods can ensure a balanced and reliable evaluation process.

GenAi Evaluation Framework

Gen AI pipelines can introduce significant risks, making a robust evaluation framework essential for comprehensive testing and validation. Follow these structured steps to ensure effective evaluation:

Step	Description
Identify Risks & Define Evaluation Scope	- Assess potential risks across the pipeline. - Create a comprehensive list of necessary evaluations.
Define Evaluation Metrics	- Align metrics with business objectives, MRM, and FL requirements. - Implement custom metrics as needed. - Use GGX evaluation reports curated by GenAI and risk experts.
Prepare Evaluation Datasets	- Ensure datasets accurately represent the use case. - Cover all critical business scenarios for thorough validation. - Utilize GGX datasets curated by experts to evaluate risks.
Run Evaluations	- Use standard/custom reports and dashboards for structured testing. - Interpret evaluation results to refine and enhance the pipeline.
Compare with Challengers	- Establish alternative components and pipelines for comparison.

By following these steps, teams can systematically evaluate their Gen AI pipelines, mitigate risks, and enhance performance with data-driven insights.
Corridor provides the ability to run evaluation jobs for registered GenAI components.

Introduction to Jobs

Corridor provides the ability to perform evaluations by running jobs on the registered objects and generating standardised and customized reports/metrics. Evaluations are controlled by Corridor and run in a dedicated locked environment to ensure reproducibility of results.

The platform supports batch evaluation of objects through simulation jobs or comparisons with similar objects on given datasets. Reports and metrics for these evaluations can be customized within Corridor under Resources → Reports Section. For more details, refer to the Reporting section.

Once the job is completed, Corridor records all the details about the specific steps of the jobs, logs resource usages and publishes the dashboard containing all the selected reports which can be shared across the team on Corridor for feedback and approval process. The results can be exported outside Corridor or used for automated documentation.

Approvals post Evaluations

The approval process is a key governance capability of the platform. After the object is fully evaluated using automated dashboard or manual tests, all the evaluation results can be shared with predefined roles within an Approval Workflow, ensuring structured reviews, feedback, and approvals for production use.

Once approved, the object is locked, preventing any modifications within the system. This guarantees the artifact remains unchanged before being exported directly to the production system.

Standardized GGX Reports

Apart from the ability to create customized reports, Corridor already has a suite of tests registered for evaluation and validation of different components of Corridor. GGX Reports are crafted by a team of generative AI risk experts following thorough research and analysis. The list of reports is expanding with all the latest developments in the GenAI industry. All these reports can be used if applicable and can be maintained or forked for a specific use case.

Component	Test Name	Risk Attribute	Description	Type	Classification	Response
Pipeline	Accuracy	Inaccurate Classification or Response	Evaluate ability to correctly classify utterances or provide accurate responses to user queries.	MRM	Yes	Yes
Pipeline	Stability Repeated Utterances	Output Variability	Evaluate the ability to produce the same classification or nearly similar responses for repetitions of the same utterance.	MRM	Yes	Yes
Pipeline	Stability Perturbed Utterances	Output Variability	Evaluate the ability to produce the same classification or similar responses across minor variations of an utterance (e.g., synonyms, grammatical mistakes).	MRM	Yes	Yes
Pipeline	Bias (Comparative Prompt Analysis)	Implicit / Explicit Bias	Evaluate bias in outputs (e.g., classification accuracy or response accuracy) based on inferred segments for gender, race, and age.	Fair Lending	Yes	Yes
Pipeline	Vulnerability	Prompt Injection & Prompt Leakage	Evaluate the LLM pipeline’s resilience against jailbreak methods for out-of-context or ambiguous inputs.	MRM / Fair Lending	No	Yes
Pipeline	Toxicity	Hate Speech, Toxic Language, Sarcasm	Evaluate the LLM’s avoidance of generating or repeating toxic language (e.g., inappropriate, offensive, or harmful language).	MRM / Fair Lending	No	Yes
Pipeline	Faithfulness	Hallucination, Inaccurate Facts	Evaluate adherence to provided context without hallucination.	MRM / Fair Lending	No	Yes
Prompt	Prompt Trust Score	Prompt Quality, Prompt Vulnerability and Leakage	Evaluate the prompt quality in different dimensions like Grammar, Logical Coherence, Toxicity, Bias, etc., and generate a Trust Score.	MRM / Fair Lending	N/A	N/A
Prompt	Prompt Classification Accuracy	Inaccurate Classification	Evaluate the classification accuracy of the prompt on sample data to help in the hill-climbing process.	MRM / Fair Lending	Yes	No
LLM	Vocabulary Understanding	Misinterpretation, Lack of Context Awareness	Determine which LLM is best at defining financial services-specific terminology across seven context categories.	MRM / Fair Lending	N/A	N/A
LLM	Subject Understanding	Context Misalignment, Incorrect Reasoning	General subject understanding across Math, Science, etc.	MRM / Fair Lending	N/A	N/A
LLM	Reasoning Capability	Logical Fallacies, Incorrect Inferences	Assess LLM’s capabilities such as complex reasoning, knowledge utilization, language generation, etc.	MRM / Fair Lending	N/A	N/A
LLM	Toxicity Understanding	Failure to Detect Harmful Requests	Evaluate a model’s ability to identify and classify text statements that could be considered toxic across six toxicity labels.	MRM / Fair Lending	N/A	N/A
LLM	Toxicity Evaluation	Hate Speech, Toxic Language, Sarcasm	Evaluating the tendency of LLM generating toxic replies.	MRM / Fair Lending	N/A	N/A
LLM	Dialect Bias	Underrepresentation of Linguistic Variants	Assess which LLM responds most consistently to general queries phrased in different language dialects.	MRM / Fair Lending	N/A	N/A
LLM	Gender Bias with Income as Proxy	Fair Lending Risk	Evaluation focuses on understanding if there is any systematic bias in assigning job titles to different genders and profiles.	MRM / Fair Lending	N/A	N/A
LLM	Model Latency	Scalability Issues, User Frustration	Determine which LLM has the best response latency performance when varying prompt and response length.	Business / Tech	N/A	N/A
RAG	Retrieval Accuracy	Incorrect context retrieval, hallucination	Accuracy of the retrieved documents by the RAG system.	MRM	N/A	N/A
RAG	Knowledge Evaluation	Lack of Coverage	Coverage of different scenarios and business flows in Knowledge Data.	Business	N/A	N/A
RAG	Validation Data Evaluation	Lack of Coverage, Lack of Potential Scenarios	Coverage of different scenarios and business flows in Evaluation Data.	MRM	N/A	N/A