Copy evaluation

LLMs in creating marketing communication
In recent discussions across the AI and marketing communities (https://every.to/also-true-for-humans/ai-focus-groups-are-a-step-on-the-path-to-superhuman-advertising), a recurring theme emerged: the text generated by large language models (LLMs) often lacks the clarity or persuasive power needed for high-performing website copy. However, it is impossible to make progress here without specific instructions for LLMs on how to produce good copy and, most importantly, without a way to evaluate the results with precision.
At Pagent, we specialize in creating website variations designed to significantly outperform a customer’s existing site. To achieve the highest possible quality, we follow a structured approach to hypothesis testing in every aspect of our system and give instructions to the LLM according to well-defined logic and criteria. The same rigorous approach applies to the generation and evaluation of the final copies produced:
- We must ensure that the quality of the generated texts does not deteriorate compared to those on the existing website in terms of grammatical correctness, sentence structure complexity and overall readability.
- All important terminology and facts critical to the customer must be preserved perfectly.
- The style and tone of the newly generated copies must be consistent with the customer’s style and tone, unless a deliberate change is required according to the chosen marketing strategy.
Methodology for copy evaluation: traditional NLP vs LLMs
To evaluate copy quality, we can utilise multiple technologies. In this section we will focus on the two methods that we use for the evaluation of texts: traditional natural language processing (NLP) and large language models (LLMs).
Traditional NLP systems provide many useful techniques for text evaluation. However, they often rely heavily on rules, heuristics and pre-trained machine learning models, which can be costly in time and effort for our constantly evolving evaluation system.
Therefore, we only employ traditional NLP methods to produce a set of quality metrics that are universal across domains and use cases and are generally indicative of the quality and readability of the text. This allows us to assess how well the texts follow common best practices such as information density (the amount of “filler” words), sentence complexity, and so on.
We then use large language models (LLMs) to compare a set of original texts with the version that we created. This allows us to tackle more subjective and variable evaluation criteria, such as preservation of tone and terminology and the effectiveness of calls-to-action. At Pagent we also often make use of the LLM’s ability to produce reasoning for its responses: if the LLM flags a degradation of quality that we should pay attention to, we always know exactly which part of the text it refers to.
General copy quality metrics using NLP
In this section, we provide a non-exhaustive list of the metrics we use to evaluate each set of copies. These metrics are computed using traditional natural language processing techniques and do not rely too heavily on the specifics of the domain or use case. They also have the advantage of being produced consistently, unlike LLM outputs, which contain a certain amount of randomness.
The main factors underlying the metrics we focus on are:
- Simplicity (we prefer copies that use simpler, shorter sentences)
- Diversity (we prefer sets of texts with higher diversity within the set, so as to avoid duplication and repetitive thoughts)
- Information density (we prefer copies that convey more meaning in shorter passages of text).
We utilise the following set of metrics to assess the aforementioned factors:
1. Complex sentence ratio
We count the number of subordinate clauses and complex conjunctions in a text to determine the number of complex sentences it contains. The idea is that texts with fewer complex sentences are easier to read and thus preferable. We call this metric the ‘simple sentence ratio’ – the higher it is, the better.
- Good Copy (Simple sentence structure): “Manage projects. Keep your team aligned. Communicate without friction.” (Clear, direct, easy to read)
- Bad Copy (Overly complex sentence): “With our platform, you can manage projects efficiently while also ensuring that your team remains on track and communication flows smoothly across departments.” (Convoluted, uses multiple clauses)
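To make this concrete, here is a minimal sketch of how such a ratio could be computed. The keyword heuristic and the list of subordinating markers are simplifying assumptions for illustration; a production system would typically rely on a dependency parse rather than substring matching.

```python
# Illustrative sketch: flag sentences containing subordinate clauses or
# complex conjunctions via a keyword heuristic, then report the share of
# "simple" sentences. The marker list below is an assumption for the example.
import re

SUBORDINATORS = {
    "while", "although", "because", "whereas", "so that",
    "in order to", "which", "that", "unless", "despite",
}

def simple_sentence_ratio(text: str) -> float:
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 1.0
    simple = sum(
        1 for s in sentences
        if not any(marker in s.lower() for marker in SUBORDINATORS)
    )
    return simple / len(sentences)

print(simple_sentence_ratio("Manage projects. Keep your team aligned."))  # 1.0
print(simple_sentence_ratio(
    "With our platform, you can manage projects efficiently while also "
    "ensuring that your team remains on track."
))  # 0.0
```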
2. Adverb ratio
This metric is inspired by one of the best practices of creative writing in general: avoid an abundance of adverbs. This is especially relevant for landing pages, where we want to convey the information in the most concise way possible. We use the non-adverb ratio (its simple inverse) in our evaluation - the higher this metric, the better.
- Good Copy (Minimal adverbs): “Streamline daily operations with powerful automation.”
- Bad Copy (Adverb-heavy): “Our tool dramatically improves how easily you can effortlessly manage daily operations.” (Bloated with unnecessary adverbs)
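A similar sketch for the non-adverb ratio, here using spaCy part-of-speech tags; the model name and the choice to ignore punctuation are assumptions made for the example.

```python
# Illustrative sketch: share of non-adverb tokens among all word tokens.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small English model

def non_adverb_ratio(text: str) -> float:
    tokens = [t for t in nlp(text) if not t.is_punct and not t.is_space]
    if not tokens:
        return 1.0
    adverbs = sum(1 for t in tokens if t.pos_ == "ADV")
    return 1.0 - adverbs / len(tokens)

print(non_adverb_ratio("Streamline daily operations with powerful automation."))
print(non_adverb_ratio(
    "Our tool dramatically improves how easily you can effortlessly manage daily operations."
))
```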
3. Pairwise semantic diversity
This metric allows us to assess the diversity within a given set of texts. The reasoning behind it is that texts which are more semantically diverse from one another are more effective, as they avoid repetitive thoughts and convey more information. We use semantic embeddings to calculate this metric.
- Good Copy Set (Diverse messages):
- “Plan sprints and assign tasks in seconds.”
- “Visualize your progress with interactive boards.”
- “Sync deadlines across teams automatically.”
- “Get real-time notifications for key updates.”
- “Export custom reports with one click.”
(Each sentence introduces a distinct functional benefit)
- Bad Copy Set (Semantically redundant):
- “Work smarter with our intuitive tools.”
- “Do your best work, every day.”
- “Smarter tools for better work.”
- “Help your team do great work.”
- “Achieve better results, faster.”
(All hover around the same vague idea of “working better”)
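One way to compute this metric, sketched below, is the average pairwise cosine distance between sentence embeddings. The specific embedding model is an assumption; any sentence-level encoder would do.

```python
# Illustrative sketch: mean pairwise cosine distance as a diversity score.
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def pairwise_semantic_diversity(copies: list[str]) -> float:
    embeddings = model.encode(copies, normalize_embeddings=True)
    distances = [
        1.0 - float(np.dot(embeddings[i], embeddings[j]))
        for i, j in combinations(range(len(copies)), 2)
    ]
    return float(np.mean(distances))

diverse = [
    "Plan sprints and assign tasks in seconds.",
    "Visualize your progress with interactive boards.",
    "Sync deadlines across teams automatically.",
]
redundant = [
    "Work smarter with our intuitive tools.",
    "Smarter tools for better work.",
    "Help your team do great work.",
]
print(pairwise_semantic_diversity(diverse))    # noticeably higher
print(pairwise_semantic_diversity(redundant))  # noticeably lower
```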
4. Information density
Related to the adverb ratio, this metric measures the proportion of content words - such as nouns, verbs, and adjectives - relative to the total word count. The reasoning is similar to the previous metrics: we prefer copies that deliver denser information in a shorter amount of text.
- Good Copy (High density): “Time tracking, invoicing, and project analytics—designed for consultants” (Content-rich with nouns and verbs)
- Bad Copy (Low density): “Our software really helps you get things done in an easier and more efficient way, no matter what kind of work you do” (Vague, light on content words)
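A minimal sketch of this metric, counting nouns, proper nouns, verbs and adjectives as content words; the exact tag set is an assumption for the example.

```python
# Illustrative sketch: share of content-word tokens among all word tokens.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small English model
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ"}

def information_density(text: str) -> float:
    tokens = [t for t in nlp(text) if not t.is_punct and not t.is_space]
    if not tokens:
        return 0.0
    content = sum(1 for t in tokens if t.pos_ in CONTENT_POS)
    return content / len(tokens)

print(information_density(
    "Time tracking, invoicing, and project analytics designed for consultants"))
print(information_density(
    "Our software really helps you get things done in an easier and more efficient way"))
```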
General metrics - example
Now that we have looked at the main metrics, let us demonstrate how the copy evaluation system works in practice.
Consider the following set of copies (good):
- “Start your free trial—no credit card needed.”
- “Track billable hours with one-click timers.”
- “Generate reports, invoices, and client summaries instantly.”
- “Stay on schedule with automated reminders.”
- “Join over 50,000 professionals using our platform.”
(Each is concise, adverb-light, semantically distinct, and information-dense.)
The second set (worse):
- “Sign up today and see what our platform can do for you.”
- “We help you manage your work more easily and efficiently.”
- “Our tools are designed to help teams of all sizes do their best work.”
- “Make your workflow better with features that improve productivity.”
- “Join thousands of happy users who are getting more done every day.”
And the third set (really bad):
- “If you are someone who happens to be looking for a way to maybe do better work, then perhaps you might want to check us out.”
- “We really, truly help you work faster in many really awesome ways.”
- “Work better, work faster, improve work, do more work, and get work done.”
- “Our service is kind of like a thing that you use to do things.”
- “Experience extremely helpful features that help you do tasks quite easily and quickly.”
(Wordy, vague, adverb-heavy, semantically redundant, and lacking density.)
Our copy evaluation system produces a score of 59% for the first set, 50% for the second, and 36% for the third.
| Metric | First set (good) | Second set (worse) | Third set (bad) |
|---|---|---|---|
| non-adverb density | 98% | 93% | 79% |
| information density | 72% | 50% | 46% |
| simple sentence ratio | 40% | 40% | 20% |
| semantic diversity | 72% | 60% | 60% |
The lower scores of the worse variations are driven primarily by the non-adverb density, simple sentence ratio and information density metrics. We can use this evaluation to catch quality degradation: if we observe such a deviation, we do not pass the set of copies to the A/B test.
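For illustration, the per-metric scores can be aggregated into a single copy-set score roughly as follows. The unweighted mean shown here is an assumption for the example; the published scores above come from our own internal weighting, which is not reproduced in this sketch.

```python
# Illustrative sketch: aggregate per-metric scores into one copy-set score.
# A plain unweighted mean is an assumption; the production weighting differs.
def copy_set_score(metrics: dict[str, float]) -> float:
    return sum(metrics.values()) / len(metrics)

first_set = {
    "non_adverb_density": 0.98,
    "information_density": 0.72,
    "simple_sentence_ratio": 0.40,
    "semantic_diversity": 0.72,
}
print(round(copy_set_score(first_set), 2))  # toy aggregation, not the published 59%
```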
Copy evaluation using LLM
The metrics described above help us estimate how concise, diverse and informationally dense the produced copies are. Besides this independent evaluation, we also want to make sure that more subjective - but equally important - qualities of the original copy are preserved or improved. These aspects mainly include:
- produced copies should not alter the style and tone of the customer’s page
- produced copies should not oversimplify or damage page-specific terminology
- produced copies should not reduce the efficacy of the call-to-action elements
Large language models are particularly useful for such an evaluation, not least because the rules for each assessment can be defined in prompts. This approach is also flexible, as we can add new rules easily. An example evaluation rule (for the Tone & Voice consistency metric) looks like this:
- Does the tone shift? Is the rewrite too casual, too formal, or too inconsistent with brand voice? Note mismatch in tone, excessive hype, or unnecessary changes in formality.
We experimented with different ways of asking the LLM to produce the evaluation to find out which one works best.
The first approach is to ask the LLM to produce numerical scores for each evaluation. However, this does not work well: the scores feel random and are not explainable. This is a common problem when asking an LLM to produce any kind of numerical assessment.
We found that the most practical approach is to ask the LLM to output a set of errors of different severity, where the definitions of the severity levels for each metric are also provided. One of the main advantages of this approach is that each error is associated with a particular pair of copies (original/rewrite), so we can easily interpret the LLM’s results and, no less important, act on them.
Below is an example of a high-severity error, flagged because an important piece of terminology was altered during the rewriting:
Original copy: ”…trusted by thousands of SEOs and SEO agencies worldwide for technical SEO audits.”
Rewritten copy: ”…relied upon by countless SEOs and agencies globally for technical SEO evaluations.”
Detected high-severity error: The term ‘technical SEO audits’ was changed to ‘technical SEO evaluations’, which may weaken the SEO relevance.
As you can see, we have a direct reference to the element, as well as the LLM’s reasoning for why it decided to highlight this error.
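As a rough sketch, such a structured evaluation request could look like the following. Here `call_llm` is a hypothetical stand-in for whatever chat-completion client is in use, and the prompt wording and severity definitions are illustrative, not our production prompt.

```python
# Illustrative sketch: ask an LLM for structured, severity-tagged errors
# for each original/rewritten copy pair, then parse them into objects.
import json
from dataclasses import dataclass

@dataclass
class CopyError:
    original: str
    rewritten: str
    severity: str   # "low" | "medium" | "high"
    reason: str

EVALUATION_PROMPT = """\
Compare each original/rewritten copy pair below.
Report only concrete problems as a JSON list of objects with keys
"original", "rewritten", "severity" (low/medium/high) and "reason".
High severity: altered terminology, changed facts, weakened call-to-action.
Medium severity: tone shift or unnecessary change in formality.
Low severity: minor stylistic issues.

Pairs:
{pairs}
"""

def evaluate_pairs(pairs: list[tuple[str, str]]) -> list[CopyError]:
    formatted = "\n".join(f"- ORIGINAL: {o}\n  REWRITTEN: {r}" for o, r in pairs)
    raw = call_llm(EVALUATION_PROMPT.format(pairs=formatted))  # hypothetical client call
    return [CopyError(**item) for item in json.loads(raw)]
```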
Using evaluation results: from monitoring to self-correction
There are multiple avenues where we can utilise the results of both NLP-based and LLM-based copy evaluation. In this section we will highlight the most important of them.
Ongoing quality monitoring
We utilise the scores produced by the NLP-based evaluation and the number of severe errors highlighted by the LLM-based evaluation for ongoing monitoring of copy quality: the more we improve our system, the higher the scores and the fewer the errors we expect.
Guardrails
At Pagent we strive to guarantee that the variations produced by the system are of high quality and neither violate any customer-specific rules nor degrade the original quality of the customer’s website. When producing a variation, we perform the copy assessment and raise an issue if the quality score degrades too much.
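A minimal sketch of this guardrail check; the 10% maximum score drop is an illustrative threshold, not our production value.

```python
# Illustrative sketch: block a variation if its aggregate score drops too far
# below the original page, or if any high-severity error was flagged.
def passes_guardrail(original_score: float,
                     variation_score: float,
                     high_severity_errors: int,
                     max_score_drop: float = 0.10) -> bool:
    """Return True if the variation may proceed to an A/B test."""
    if variation_score < original_score - max_score_drop:
        return False
    if high_severity_errors > 0:
        return False
    return True
```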
Self-correction
Related to the previous use case, we not only detect errors and score degradations but also use this information to iteratively produce a better variation. For example, if a high-severity error is introduced, we instruct our system to correct the corresponding element.
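A rough sketch of such a correction loop. It reuses the `CopyError` type and `evaluate_pairs` helper from the earlier LLM-evaluation sketch; `regenerate_element` is a hypothetical generation call, and the round limit is an assumption.

```python
# Illustrative sketch: regenerate the elements tied to high-severity errors,
# re-evaluate, and repeat a bounded number of times.
def self_correct(elements: dict[str, tuple[str, str]],
                 max_rounds: int = 3) -> dict[str, tuple[str, str]]:
    """elements maps an element id to its (original, rewritten) copy pair."""
    for _ in range(max_rounds):
        errors = evaluate_pairs(list(elements.values()))
        high_severity = [e for e in errors if e.severity == "high"]
        if not high_severity:
            break
        for error in high_severity:
            for element_id, (original, rewritten) in elements.items():
                if rewritten == error.rewritten:
                    # regenerate_element is a hypothetical call that receives the
                    # LLM's reasoning as a correction instruction
                    corrected = regenerate_element(original, error.reason)
                    elements[element_id] = (original, corrected)
    return elements
```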
Customer page assessment
Last but not least, we use the evaluation metrics for an independent assessment of the original customer page. As our metrics are based on common best practices, the produced scores are translated into specific recommendations and sent to the user as part of our page assessment reports.
Transparent AI Optimization: laying the groundwork
The area of LLM text evaluation is still in its early stages, and we think that the modern AI community can leverage many methods from traditional natural language processing to perform unbiased assessments that are reproducible and explainable. This is exactly what we are striving to build with Pagent: a mutually beneficial combination of powerful LLM generation and traditional, explainable methods, with the latter bringing control and stability to the former. In the next article we will explore how we use the same principles in the ongoing page optimization process.