Beyond the hype: building LLM applications for production

It’s been nearly four months since we launched the first LLM-based feature in Talkdesk: call summarization. It was an evident first candidate, given the relatively simple nature of the use case and the expected impact on agent productivity. While beta testing this new feature with a few select customers, we moved on to other use cases: topic extraction, to provide customers with a list of the topics discussed in their calls or chats, and question answering, to support agents handling customer queries.

Many other use cases are already lined up, from automated agent evaluation to message writing helpers. But as a product manager’s mind navigates the sea of possibilities, it’s also important to take a step back and think about the lessons we’re learning along the way.

The ambiguity trap – how users write instructions

I’ve seen many posts arguing that LLMs will revolutionize user experience for the better. With users able to express their intent in their own language, we can eliminate the learning curve of interacting with an application. I understand this claim, but I think it is somewhat naive to assume that people are always able to express themselves clearly and concisely. Just think of how many times you, as a human, have received instructions from other humans that were unclear or incomplete. As I read in another post about the challenges of LLMs, just doing what someone asks for isn’t always the right thing. Designing interfaces where user input is in natural language may increase accessibility, but it will likely reduce the quality of the output, and therefore increase frustration.

We’ve seen this happen with one particular tool we made available, where users can instruct the system to generate model training phrases by describing a specific user intent in text. This used to be a daunting manual task, so the automation is welcome. However, past the initial excitement, some users started asking the hard questions: what are the best practices for instructing the model to generate phrases? What sort of information needs to be included in the description? Ambiguity generates anxiety, and we don’t want to turn users into prompt engineers. Interacting with a clean UI with clear affordances can carry a much lower cognitive load than writing a thorough description of the outcome you’re looking for. This article by Davis Treybig is great and describes these design challenges in detail.

The ambiguity trap II – how LLMs respond to instructions

Downstream applications relying on LLMs expect outputs that conform to a particular structure for effective parsing. We can tailor our prompts to explicitly outline the desired output format, but compliance is never guaranteed. We are seeing this with question answering: we prompt the model not to return an answer when it isn’t sure of one, but, being a prolific chatter, it will sometimes still respond with an “I don’t know” type of answer. How do we deal with this in the UI? The application’s frontend cannot tell the difference between this answer and a “valid” one with meaningful content.
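To make the problem concrete, here is a minimal sketch of the kind of guard a backend can place between the model and the frontend. The JSON shape and the refusal patterns are assumptions for illustration only, not our actual implementation.

```python
import json
import re

# Hypothetical refusal patterns; in practice this list would be tuned
# against real model outputs and would never be exhaustive.
REFUSAL_PATTERNS = [
    r"\bi don'?t know\b",
    r"\bi'?m not sure\b",
    r"\bcannot (find|answer)\b",
]

def parse_answer(raw_model_output: str) -> dict:
    """Return {"status": "ok", ...} only when the output both parses as the
    JSON structure we asked for and doesn't look like a refusal dressed up
    as an answer."""
    try:
        payload = json.loads(raw_model_output)
        answer = payload.get("answer", "")
    except (json.JSONDecodeError, AttributeError):
        # The model ignored the requested format entirely.
        return {"status": "unparseable", "answer": None}

    if not answer or any(re.search(p, answer, re.IGNORECASE) for p in REFUSAL_PATTERNS):
        # Treat "I don't know"-style replies as no answer, so the frontend
        # can show its own empty state instead of a useless response.
        return {"status": "no_answer", "answer": None}

    return {"status": "ok", "answer": answer}
```

Even a crude filter like this lets the UI decide between showing an answer and showing nothing, rather than relaying whatever the model felt like saying.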

This is a nuisance, but for someone using a helper tool there is one thing worse than not getting an answer: getting a wrong one. This is not a new problem, as poor search engines can also produce noisy, irrelevant results with high confidence. But dealing with hallucination is a whole different challenge. Conservative prompt engineering can be effective at tackling it, but in the end we can never be 100% sure that the model will comply with instructions.

It’s still unclear to me how we can deal with this lack of predictability. For now, the only feasible option seems to be prompt engineering. We need to make it a systematic task, with version control – essentially yet another function that needs to be integrated into the product development lifecycle.
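As a rough illustration of what that could look like, here is a sketch of prompts treated as versioned artifacts. The identifiers, version scheme and prompt wording are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptTemplate:
    id: str
    version: str
    template: str

# Hypothetical prompt for the question answering feature; in practice the
# template would live in version control next to the code that uses it.
QA_PROMPT_V3 = PromptTemplate(
    id="agent-assist-qa",
    version="3.2.0",
    template=(
        "Answer strictly from the context below. "
        "If the context does not contain the answer, reply with the single "
        "token NO_ANSWER and nothing else.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
)

def render(prompt: PromptTemplate, **fields: str) -> str:
    # Log prompt.id and prompt.version with every request, so a quality
    # regression can be traced back to a specific prompt change.
    return prompt.template.format(**fields)
```

The point is less the data structure and more the discipline: every output in production can be tied to the exact prompt version that produced it.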

Working with context is hard

Building a question answering solution for business requires informing the LLM of each customer’s knowledge context. Answers need to be strictly based on company-vetted data. However, LLMs have context windows, that is, limits on the number of tokens the model can access when generating responses, and some businesses have hundreds of thousands of documents with relevant information. We use embeddings to measure content relatedness and select the right data snippets – ultimately, the success of this operation will dictate the overall success of the feature. It’s just good old search – if that doesn’t work, LLMs won’t save you when it comes to knowledge retrieval.
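For the curious, a minimal sketch of that selection step might look like the following, assuming a generic embed function and documents already chunked into snippets. The function names and the token budget heuristic are illustrative, not our production pipeline.

```python
import numpy as np

def top_snippets(question: str, snippets: list[str], embed, max_tokens: int = 3000) -> list[str]:
    """Pick the snippets most related to the question, within a token budget."""
    q_vec = np.array(embed(question))
    scored = []
    for snippet in snippets:
        s_vec = np.array(embed(snippet))
        # Cosine similarity as the "relatedness" measure.
        score = float(q_vec @ s_vec / (np.linalg.norm(q_vec) * np.linalg.norm(s_vec)))
        scored.append((score, snippet))

    # Take the most related snippets until the context budget is spent.
    # A very rough ~4 characters per token heuristic stands in for a real tokenizer.
    selected, used = [], 0
    for _, snippet in sorted(scored, reverse=True):
        cost = len(snippet) // 4
        if used + cost > max_tokens:
            break
        selected.append(snippet)
        used += cost
    return selected
```

Everything downstream depends on this step: if the right snippet isn’t in the selected set, no amount of prompting will produce a grounded answer.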

Cost estimation: mission impossible

The more context you give the model, the better the performance – or at least that’s what we hope for. However, more context also increases latency… and costs, since OpenAI charges for both input and output tokens. Factor in the “natural” unpredictability of the output, the variation in context from customer to customer, and constant prompt improvements, and making a sound prediction becomes a difficult task. The LLM world is also moving so fast that any prediction is bound to become outdated quickly – hopefully the trend is for costs to keep falling as competition grows.
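Here is the kind of back-of-the-envelope arithmetic involved. The prices and token counts below are made-up placeholders, since both vary by model and change frequently.

```python
# Hypothetical per-token prices; check the provider's current price list.
PRICE_PER_1K_INPUT = 0.003    # USD per 1K prompt tokens (placeholder)
PRICE_PER_1K_OUTPUT = 0.004   # USD per 1K completion tokens (placeholder)

def monthly_cost(calls_per_month: int,
                 avg_input_tokens: int,
                 avg_output_tokens: int) -> float:
    """Estimate monthly spend from average token counts per request."""
    per_call = (avg_input_tokens / 1000) * PRICE_PER_1K_INPUT \
             + (avg_output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    return calls_per_month * per_call

# Example: 100k summarizations/month, ~2,500 prompt tokens (transcript plus
# instructions) and ~150 output tokens each.
print(monthly_cost(100_000, 2_500, 150))  # -> 810.0 USD under these assumptions
```

The fragile inputs are the averages: transcript length, retrieved context size and output verbosity all drift, so any single number is at best a range.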

Conclusion

“It was the best of times, it was the worst of times” – said Dickens on LLMs. 

It’s impossible not to be excited by the quantum leap in machines’ conversational abilities. You can have so much fun experimenting, and demos can be mind-blowing. However, if an incredible demo isn’t followed by actual usage – and, more importantly, by a productivity gain (or at least a genuinely pleasant user experience) – the result will sooner or later be churn. The market seems to be moving fast from the initial excitement phase to converge on a handful of low-risk use cases, particularly in the B2B space – as with any other new tech. Some will argue that we cannot afford to be conservative; I argue that we cannot afford to ship products that don’t solve the customer’s job. Be bold, but learn fast!