
Evaluating AI – Clarilis asks, what about drafting tools?

Over the last few weeks, we’ve seen a growing focus on how we evaluate legal AI tools. The latest contribution is the Vals Legal AI Report, which gives an in-depth study of four popular legal AI tools across tasks like data extraction, summarisation, and chronology creation.

This is very welcome: the pace of development of AI tools shows no signs of slowing down and many law firms are grappling with which solutions to spend time, energy, and money on.

However, less has been written about evaluating AI legal drafting. With that in mind, we wanted to explore the challenges involved and share some practical steps for effective evaluation of AI drafting tools.

Why do we evaluate AI?

Before we start, it’s worth setting out the different elements of generative AI systems and, more importantly, why you might want to evaluate them:

  1. Models. AI providers (like OpenAI, Anthropic or DeepSeek) frequently release new AI models (LLMs) and capabilities. These LLMs are the workhorses of most AI legal tech tools, and many platforms give you a choice of which LLM you want to use. Therefore, understanding how these different models perform is crucial. Inevitably, new models will be released with a fanfare of claims about improved accuracy and performance, but that might not always translate into better results for your task. For example, is GPT-4o better at text generation than GPT-4, or does it just cover a wider range of use cases? Which model is right for you? And will that change over time?
  2. Prompts. Working with generative AI tools often involves developing prompts (instructions) to make the tool produce the desired output. For example, deploying a generative AI review tool might involve using provided prompts and/or setting up your own library of prompts tailored to a particular practice area’s work. In this context, you’ll want to evaluate how different prompts perform: which prompts should you share within a team? Does changing the prompt make the output better? Are the built-in prompts suitable?
  3. Products. At the highest level, you will want to evaluate whether a product helps you achieve your business objective: does it save you time? Or enable a new way of working? Will the team engage with the product? Every product demo is designed to look amazing, so getting a true end-to-end view of a product usually requires a pilot process. Since most teams can only conduct a few pilots, maximising the value of each one is crucial.

We’ll come back to these different elements as we explore both the challenges of evaluating AI and how you can practically approach the task.

Evaluation challenges

If you’ve spent time looking at generative AI tools, you will know that evaluating the aspects above can be difficult and time-consuming. Whether it’s a new model or an entire product, it’s easy to spend days or even weeks trying it out without reaching a firm conclusion.

Much of the challenge comes from the way generative AI works. Let’s consider some of the common characteristics of generative AI tools:

  • Output variability: When you evaluate a tool, you typically aim to test and change only one variable at a time. With a generative AI tool, that will often be the input prompt. However, LLMs can produce different outputs from exactly the same input, making it difficult to evaluate the cause of a change. You can mitigate this by setting the “temperature” of the LLM to zero (if you have access to it), but that doesn’t eliminate the variability entirely. If you want to have confidence in how an LLM will behave, you might need to test it multiple times on the exact same task, in contrast to how you might test other software (a short sketch of this repeat-run check follows this list).
  • Open-ended inputs: Many AI tools allow the user to provide open-ended inputs – for example, a paragraph of instructions about a clause to be generated, or an arbitrary document to be analysed. Given that a user can input almost anything they like, it is impossible to test every input. Instead, you could aim to test a representative set of inputs or tasks. Consider constraining a tool’s use to particular practice areas or use cases and carrying out further evaluation before introducing it to a new team.
  • Unstructured outputs: AI tools are often used to generate new, novel text (e.g. new drafting or a research memo). This is in contrast to traditional tools which, for example, might extract existing text from a document or provide drafting using explicit rules and approved language. Evaluating the quality and accuracy of AI outputs might require human judgement and, in legal contexts, specialist expertise. Evaluation can therefore become a time-consuming and expensive process.
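
To make the variability point concrete, here is a minimal sketch, assuming the OpenAI Python SDK and an illustrative model name and prompt, which sends the same instruction several times at temperature zero and counts how many distinct outputs come back:

```python
# A minimal sketch, assuming the OpenAI Python SDK and an illustrative model
# name and prompt: run the same instruction several times at temperature 0 and
# count how many distinct outputs come back.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
PROMPT = "Draft a short confidentiality clause for a UK consultancy agreement."

outputs = []
for _ in range(5):
    response = client.chat.completions.create(
        model="gpt-4o",     # illustrative model choice
        temperature=0,      # reduces, but does not eliminate, variability
        messages=[{"role": "user", "content": PROMPT}],
    )
    outputs.append(response.choices[0].message.content)

# If generation were fully deterministic this would print 1; in practice it may not.
print(f"{len(set(outputs))} distinct outputs from {len(outputs)} identical runs")
```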

Taken together, we can see that evaluating generative AI tools is quite different to other legal tech. These challenges are compounded by the way we currently work and build with generative AI. As we touched on earlier:

  • Models change: The performance of a generative AI tool will vary depending on: (i) the choice of LLM provider (e.g. OpenAI vs Anthropic); (ii) the model version (e.g. GPT-4 vs GPT-4o); and (iii) the model’s other parameters (e.g. “temperature”, which controls how variable the outputs are). In principle, any testing and evaluation you do with an AI tool needs to be repeated if you change the model or configuration.
  • Prompts change: Working with generative AI tools often involves experimenting with different prompts to achieve the desired output. For teams deploying AI tools, these prompts are typically re-usable templates which can then be applied in different circumstances by end users. For example, you might write a set of instructions that captures your firm’s house style, which users can include when interacting with the AI tool. At the outset of a project, a wide range of approaches can be tried, and it is easy to tell which is best (“Use professional language” vs “Use English (UK) spelling and grammar, be professional and avoid legalese”). However, as the prompt is refined, changes become more marginal, and it becomes harder to judge whether the new prompt is an improvement. Is the output always better? Or is it a fluke of the examples you are testing? This can be challenging to assess even for subject matter experts – a sketch of this kind of evaluation grid follows below.
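
As a rough illustration, here is a minimal sketch of such an evaluation grid. The model names, prompt versions, and tasks are illustrative, and the generate function is a placeholder standing in for whichever provider SDK you actually use:

```python
# A minimal sketch of a regression grid: every model / prompt-version combination is
# run against the same small set of representative tasks, so the whole grid can be
# re-generated whenever a model or prompt changes. All names are illustrative, and
# `generate` is a placeholder for whichever provider SDK you actually use.
from itertools import product

MODELS = ["gpt-4o", "claude-3-5-sonnet"]   # illustrative model names
STYLE_PROMPTS = {
    "v1": "Use professional language.",
    "v2": "Use English (UK) spelling and grammar, be professional and avoid legalese.",
}
TASKS = [
    "Draft software warranties for use in a share purchase agreement.",
    "Draft an arbitration clause for a cross-border supply contract.",
]

def generate(model: str, style: str, task: str) -> str:
    # Placeholder: call your chosen provider here (OpenAI, Anthropic, etc.).
    return f"[output from {model} with style prompt {style!r} for task {task!r}]"

results = []
for model, (style_id, style), task in product(MODELS, STYLE_PROMPTS.items(), TASKS):
    results.append({
        "model": model,
        "prompt_version": style_id,
        "task": task,
        "output": generate(model, style, task),
    })

# Even this tiny grid produces 2 x 2 x 2 = 8 outputs to review; real grids grow quickly.
print(f"Collected {len(results)} outputs to evaluate")
```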

In short, choosing models, writing prompts, and committing to new generative AI products requires lots of experimentation – but generative AI is particularly difficult to test in a timely and efficient manner.

Public and Community AI Benchmarks

We’ve explored the challenges of evaluating generative AI. How can you start to tackle them?

Various public and community efforts are underway to help with evaluation. At the model level, vals.ai publishes a range of benchmarks which aim to test popular models against legal tasks and, more recently, an in-depth study of four popular legal AI tools. A number of firms have also published their findings, often focusing on a particular legal task, like due diligence or legal question answering.

It makes sense to leverage this work wherever possible, and it’s particularly helpful for informing a wider strategy (as discussed in our 10 principles, we think it’s important for firms and vendors to stay flexible in which providers they use). However, for the time being, firms will still need to come to their own views on which products are right for them and how they should be configured.

Establishing an evaluation framework 

So, what are some approaches to carrying out your own generative AI evaluation? The key is to leverage expert legal knowledge and to aim for an objective measure.

Relevance, accuracy, and completeness

A common starting point when building an evaluation framework is to think about AI outputs in terms of their relevance, accuracy, and completeness:

  • Relevance: Is the output relevant to the target task? Or does it miss the point?
  • Accuracy: How accurate is the output? Are there any material mistakes?
  • Completeness: How complete is the output? Does it miss anything that you would expect to be covered?

(You might also capture things like the appropriateness of the style and tone.)

Using this scheme, you could ask expert lawyers to review a set of AI outputs and score them against these criteria.
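
Here is a minimal sketch of how those scores might be captured and compared; the 1–5 scale and field names are illustrative assumptions, not a prescribed standard:

```python
# A minimal sketch of a reviewer scoring sheet for the relevance / accuracy /
# completeness criteria above. The 1-5 scale and field names are illustrative.
from dataclasses import dataclass
from statistics import mean

@dataclass
class ReviewScore:
    output_id: str
    reviewer: str
    relevance: int      # 1 = misses the point, 5 = squarely on task
    accuracy: int       # 1 = material mistakes, 5 = no material mistakes
    completeness: int   # 1 = major gaps, 5 = covers everything expected
    comments: str = ""

scores = [
    ReviewScore("arbitration-clause-001", "PSL-A", relevance=5, accuracy=4, completeness=3),
    ReviewScore("arbitration-clause-001", "PSL-B", relevance=4, accuracy=4, completeness=2),
]

# Averaging across reviewers gives a figure you can compare across tools or practice areas.
for criterion in ("relevance", "accuracy", "completeness"):
    print(f"{criterion}: {mean(getattr(s, criterion) for s in scores):.1f}")
```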

This provides an objective measure (you can compare the scores across practice areas or between different tools) but note that the scoring itself is often subjective. Different lawyers might give different scores to the same output depending on their interpretation, their general disposition, and what they had for breakfast that morning. Therefore, this approach works best where the same lawyers can be called on to consistently evaluate future outputs.

What does good look like?

Rather than jumping into a review of AI outputs, you can cut through some of this subjectivity by asking yourself “What does good look like?”.

Take, for example, a generative AI tool which provides lawyers with draft language for use in contracts. You would start by picking some tasks across a range of practice areas and levels of complexity (e.g. draft some software warranties for use in a share purchase agreement, or draft a premises licence condition for a commercial real estate transaction).

Having identified the tasks, you can then set out for each the key points you would expect to see in the output.

For example, in an arbitration clause you would expect the drafting to specify the rules of arbitration and the number of arbitrators etc. In a note of advice on directors’ duties, you would expect to see reference to section 172 of the Companies Act.

The goal is not to create an exhaustive, granular list, but to establish a rough marking scheme and to take as much subjectivity out of the process as possible. What are the legal points which, if missing, would raise a red flag? What sources would you expect to see cited? If you made other content available to the product, how would you expect to see it re-used?
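
One way to hold such a marking scheme is as a simple checklist per task. The sketch below is illustrative only – the tasks, expected points, and coverage function are examples, not an exhaustive or prescribed scheme:

```python
# A minimal sketch of a "what does good look like" marking scheme: each task maps to
# the key points a good output should cover. Tasks and points are illustrative only.
MARKING_SCHEME = {
    "arbitration clause": [
        "specifies the arbitration rules (and which edition)",
        "states the number of arbitrators",
        "identifies the seat of the arbitration",
        "states the language of the proceedings",
    ],
    "directors' duties advice note": [
        "refers to section 172 of the Companies Act 2006",
        "explains the duty to promote the success of the company",
    ],
}

def coverage(task: str, point_results: dict[str, bool]) -> float:
    # point_results holds a reviewer's (or an AI judge's) pass/fail call per expected point.
    expected = MARKING_SCHEME[task]
    return sum(point_results.get(point, False) for point in expected) / len(expected)

# Example: a reviewer records which expected points the generated clause covers.
reviewer_calls = {
    "specifies the arbitration rules (and which edition)": True,
    "states the number of arbitrators": True,
}
print(f"Coverage: {coverage('arbitration clause', reviewer_calls):.0%}")  # -> Coverage: 50%
```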

Clearly, setting up these examples requires time and focused attention from lawyers with relevant expertise (we’re fortunate to have a dedicated PSL team at Clarilis). But by giving some thought to this up front, you start to create a structured framework which, we think, has several advantages:

  1. It focuses on value. Explicitly considering what a user would want helps keep you focused on the workflow and the problem you are trying to solve. You are evaluating against your needs and expectations, not an abstract standard. Note that you might not necessarily expect a perfect output – it could be a good first cut to get the user started.

This is particularly powerful when you frame the requirements for specific types of user. Will the output be used by someone with the expertise to critically evaluate it? If it is to be used by less experienced users, you might set higher standards and expect more in terms of guidance.

  2. It highlights what’s not there. Even for experts, one of the hardest things to do when reviewing work is to identify what’s not there. This is especially relevant for AI generated content, where the quality and confidence of the prose can hide other inadequacies. A pre-defined evaluation scheme can help avoid tunnel vision.
  3. It promotes consistency. If you later evaluate alternative tools, you can apply the same criteria to ensure comparisons are more meaningful. A structured framework also makes it easier to articulate why an output is or isn’t useful, which leads to more productive discussions within a team.

With your set of tasks and expected outcomes in place, you can then evaluate the AI outputs against the scheme: Does the generated arbitration clause include all the points you anticipated? The result is an objective measure of the model, prompt or tool’s performance which you can reason about and act on.

The snake eating its tail

So far, the processes we’ve discussed have all involved a significant time commitment from experienced lawyers. One idea to help with this is to use AI to carry out some of the evaluation. On the face of it, this might sound foolish; like asking a student to mark their own homework. But it is more plausible than it first seems, particularly if you adopt the “what does good look like” framework. In practice, the steps look like this:

  1. An expert lawyer sets out the task and the marking scheme (what you would expect a good output to include).
  2. You generate an output from the prompt or tool you are evaluating.
  3. You give the output and the marking scheme to another AI, and have it carry out the evaluation. This is a plausible thing to do because evaluating an output against a marking scheme is generally simpler than producing the output itself.

It’s easy to see how this approach can support the evaluation process. Whether you’re a law firm or a legal tech provider, if you have access to both legal and technical expertise you can run this evaluation automatically across thousands of inputs. This allows you to quickly and efficiently assess the consistency of a tool’s output, or whether a new prompting approach improves performance.
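
Here is a minimal sketch of that evaluation step, assuming the OpenAI Python SDK; the model name, prompt wording, and judge function are illustrative:

```python
# A minimal sketch of the AI-as-evaluator step: the generated drafting and the expert's
# marking scheme go to a second model, which reports which expected points are covered.
# Assumes the OpenAI Python SDK; the model name and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()

def judge(drafting: str, expected_points: list[str], model: str = "gpt-4o") -> str:
    criteria = "\n".join(f"- {point}" for point in expected_points)
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "You are checking a piece of legal drafting against a marking scheme.\n\n"
                f"Drafting:\n{drafting}\n\n"
                f"Marking scheme:\n{criteria}\n\n"
                "For each point, answer 'covered' or 'missing' with a one-line reason."
            ),
        }],
    )
    return response.choices[0].message.content

# Run this across many generated outputs, then have an expert spot-check the judge's
# answers - as discussed below, it will miss finer points of construction and detail.
```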

Of course, this is not a fool-proof approach. It works well for highlighting whether something is present in the output, but may struggle with finer points of construction and detail. For example, suppose our AI tool has suggested an arbitration clause along the following lines:

“1. Arbitration

1.1 Any dispute arising out of or in connection with this contract, including any question regarding its existence, validity or termination, shall be referred to and finally resolved by arbitration under the LCIA 2014 Arbitration Rules.

…”

An AI evaluation would likely “pass” this clause against the requirement of “specifying the rules of arbitration”. However, an expert reviewer might reject the choice of rules: why the 2014 rules and not the 2020 rules? Is the LCIA an appropriate choice in this context?

So, AI evaluation needs both technical expertise to implement, and legal expertise to use critically and appropriately. It can tell you if you are in the right ballpark but won’t wholly replace the role of an expert reviewer. This is part of the blended, expert-led approach we adopt at Clarilis.

One-shot vs Interactivity

The other limitation of these evaluation approaches is that it’s harder to capture the wider behaviour of a tool. Many tools are designed to be used interactively: they expect a user to build and refine outputs within the system. For example, the clause generation product might not be designed to get to the output in one shot. Instead, it might start by surfacing a range of issues and sources before asking the user which they want to develop further.

Getting a picture of a product’s end-to-end performance requires us to go back to the wider problem you are trying to solve (does it help the user complete the task more efficiently? Or to a higher quality?). Often, the only way to form that view will be to run a side-by-side comparison. As we noted at the start, given that most teams will only have the bandwidth to conduct a few of these comparisons, leaning on benchmarks can help ensure you are picking the most promising products to take forward.

Final thoughts 

Evaluating AI outputs is a key step in making decisions about which tools to deploy and how to configure them. It’s harder to evaluate AI outputs than it is to assess traditional legal tech, but putting some structure around the process is a more effective use of time and helps support more robust conclusions. In a world where it is too easy to generate plausible-sounding legal content, asking “What does good look like?” before you start can help to ground your evaluation.

At Clarilis, we focused on expert-led evaluation in the development of Clarilis AI Draft. We’d love to hear more about your experiences of AI evaluation – please feel free to reach out!
