Gen AI and Measurement (2024)


Published in Bootcamp · 4 min read · Apr 30, 2024

2024 is the year of Taylor Swift and Gen AI/ML: it’s mainstream, it’s popular, and everyone is talking about it. I can’t claim to be a Swiftie* so I will stick to the latter topic.


Machine Learning models and principles are finding their way into a range of products, from consumer-focused conversational modules (for example, AI-based chat for customer service) to enterprise use cases (data classification and search/retrieval). While I see numerous articles and papers covering the fundamentals of what Gen AI or LLM models can and cannot do, I have been digging into the next obvious topic: Measurement.

Investing in ML-based product solutions typically consumes significant resources, in part because ML-forward approaches are not business critical (yet) and are still considered a “good to have” in most cases. For example, consider this use case: developers spend time troubleshooting bugs, parsing through help center articles or code libraries; an LLM-based Gen AI module could generate the relevant information in seconds, saving developers time and leading to faster diagnostics. Is this a more efficient, more relevant path to diagnosis? Yes. But is it business critical? Probably not, because there are alternative, manual ways of achieving the same outcome.

Which is why any AI-based product approach needs a clear story to justify the investment in product, engineering, and computing resources:

  • Why do we need an AI approach? Are there alternate ways to achieve the same outcome? Another way to think about this: is there an actual use case or benefit, or is this a vanity/resume-building project?
  • What are the success metrics post implementation? Is it time saved? Is it directly tied to business metrics, such as revenue generated or new-user acquisition?
  • Is there a strategic goal or benefit to implementing an ML solution?

Fundamentally, the quality of the GenAI, LLM, or ML technique implemented is directly tied to the end-user experience. For example: a paralegal needs to consume hundreds of pages of text and manually file documents into themes (class-action lawsuits, appeals, etc.). The ML use case here is building a data classifier which would (a) parse text from uploaded documents and (b) categorize and label/theme the documents without manual intervention. The end-user experience depends on how accurately the model sorts the documents into the correct categories. If the model isn’t able to categorize accurately, the paralegal would need to validate and re-label the documents, leading to a less-than-ideal user experience.
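As a toy illustration of the paralegal use case, here is a minimal keyword-based classifier sketch in Python. The theme labels and keyword lists are hypothetical stand-ins; a production system would use a trained model or an LLM rather than hand-picked keywords:

```python
# Hypothetical themes and hand-picked keywords (illustration only).
THEME_KEYWORDS = {
    "class_action": {"plaintiffs", "class", "settlement", "certification"},
    "appeal": {"appellant", "appellee", "reversed", "remanded"},
}

def classify(text: str) -> str:
    """Label a document with the theme whose keywords it mentions most."""
    words = set(text.lower().split())
    scores = {theme: len(words & kw) for theme, kw in THEME_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Route to a human when no theme matches at all -- this is exactly the
    # "paralegal has to validate and re-label" failure mode from the text.
    return best if scores[best] > 0 else "needs_manual_review"
```

The fallback branch matters for the user experience argument above: a classifier that admits uncertainty and routes to manual review is less frustrating than one that silently mislabels.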

The quality of Gen AI applications can be evaluated broadly by using the following parameters:

a. Accuracy: Consider the input “Who are the US presidential candidates in 2024?” The output needs to be 100% factual for the response to be relevant and accurate. (No, Kanye West is not a factual response to this query.)

b. Anticipatory or Conversational: A good-quality response will typically factor in the intent of the query and the conversation history, producing a “human-like” conversational dialog. A less-than-ideal response does not factor in context from past queries.

c. Useful: Does the response provide any value to the user? For example, if a developer wants to quickly find the relevant source code for debugging, does the system accurately source and display the relevant code? Or does the developer have to parse through code libraries to find the information they are looking for?

d. Speed (Performance): I recently tested a crypto chatbot; while it was fun to read the crypto-specific responses, it quickly got a little tiring because response generation was taking more than 60 seconds per query. Speed of response is critical for establishing a good user experience.

e. Safe: Every model needs to operate within established guard-rails. If a consumer-facing Gen AI product generates images from text inputs, the images need to be not only brand safe for the business but also compliant with age/sensitive-category guidelines. Another example: any information deemed confidential should be parsed out or treated as out of bounds by models.
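A minimal sketch of the guard-rail idea, assuming a toy topic blocklist and a redaction pattern (both hypothetical; real products rely on trained safety classifiers and policy engines, not static keyword lists):

```python
import re

# Hypothetical guard-rail rules (illustration only).
BLOCKED_TOPICS = {"violence", "gambling"}
# Crude stand-in for "confidential information": 20-char uppercase tokens,
# e.g. something shaped like a raw API key.
CONFIDENTIAL_PATTERN = re.compile(r"\b[A-Z0-9]{20}\b")

def apply_guardrails(text: str) -> str:
    """Decline out-of-bounds requests; redact confidential-looking tokens."""
    lowered = text.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return "[request declined: outside safety guidelines]"
    return CONFIDENTIAL_PATTERN.sub("[REDACTED]", text)
```

The design point is that safety checks sit both before the model (decline the request) and after it (scrub the output), regardless of how sophisticated the checks themselves are.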

Human evaluation is necessary to assess the quality of the query as well as the generated response. What you feed in as a query shapes the response generated, which is why it’s critical to evaluate the quality of the input or user query:

  • Is the user intent or input well defined and clear? (Example: “How do I troubleshoot my device? It isn’t working with error message ‘400 input cable’.”)
  • Does the user input need refinement? (Example: “My device isn’t working”)

Measuring the quality of response generation can be done by:

A. Evaluating the responses on:

  • Factuality (indicated by Yes/No), based on whether the response is supported by key facts/evidence
  • Relevance (indicated by Yes relevant/Partially relevant/Not relevant)
  • Usefulness, or whether it answers the question (indicated by Yes/Partial/No, i.e. missing critical information)

B. Using custom metrics to score responses, for example: “Quality score = weighted scores based on factuality, relevance, accuracy, and speed of response.”
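A sketch of what such a custom metric could look like. The dimensions and weights below are hypothetical; any real weighting would be tuned per product:

```python
# Hypothetical rubric weights (must sum to 1.0); tune per product.
WEIGHTS = {"factuality": 0.4, "relevance": 0.3, "usefulness": 0.2, "speed": 0.1}

def quality_score(ratings: dict) -> float:
    """Combine per-dimension ratings (each 0.0-1.0) into one 0-100 score.

    Categorical ratings map naturally onto the scale, e.g.
    Yes=1.0, Partial=0.5, No=0.0.
    """
    assert set(ratings) == set(WEIGHTS), "rate every dimension"
    return round(100 * sum(WEIGHTS[k] * v for k, v in ratings.items()), 1)
```

For instance, a response rated fully factual and useful, partially relevant, and a bit slow (`speed=0.8`) would score `100 * (0.4 + 0.15 + 0.2 + 0.08) = 83.0`.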

The How to Measure piece is, IMO, the area still in flux, sans any industry standards or benchmarks. In my mind, there is a lot to be defined in this space, and I expect it to come together about as quickly as Taylor Swift keeps releasing double albums.

*Swiftie: loyal Taylor Swift fans who can make or break the internet

