Learn Before
Inter-token Latency (ITL)
Inter-token Latency (ITL) is an efficiency metric that measures the average time to generate each token after the first. It is a key indicator of the model's decoding performance and of how smoothly a streamed response appears to the user.
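A minimal sketch of how ITL might be measured from per-token arrival timestamps captured during streaming decoding; the function name and sample timestamps are illustrative assumptions, not part of the course material. With N tokens arriving at times t_1..t_N, the mean ITL is (t_N - t_1) / (N - 1).

```python
# Hypothetical sketch: compute ITL from token arrival timestamps.
# timestamps[0] is the first token's arrival (the TTFT reference point);
# ITL is the mean gap between consecutive tokens after the first.

def inter_token_latency(timestamps: list[float]) -> float:
    """Mean seconds per token after the first, given arrival times."""
    if len(timestamps) < 2:
        raise ValueError("Need at least two tokens to measure ITL")
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return sum(gaps) / len(gaps)

# Example: 5 tokens arriving at these times (seconds since request start).
times = [0.30, 0.35, 0.41, 0.46, 0.52]
print(f"ITL: {inter_token_latency(times) * 1000:.1f} ms/token")  # 55.0 ms/token
```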
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Request Latency
Throughput
Time to First Token (TTFT)
Tokens Per Second (TPS)
Resource Utilization in LLM Inference
Energy Efficiency in LLM Inference
Cost Efficiency in LLM Inference
A startup is building a real-time, interactive chatbot to help customers troubleshoot technical issues. Their engineering team evaluates two different language models, 'Model X' and 'Model Y'. The team's final report concludes that Model X is superior because its responses are consistently more accurate and helpful across a wide range of test queries. Based on this report, the company decides to deploy Model X. Which of the following statements identifies the most critical potential weakness in the team's evaluation process for this specific use case?
LLM Selection for a High-Volume Chatbot
A team is evaluating a large language model for deployment. Match each evaluation goal below to the primary category of metric it represents: 'Output Quality' or 'Efficiency'.
You are evaluating two candidate long-context LLMs...
You lead evaluation for an internal eDiscovery ass...
Your team is writing an internal evaluation checkl...
Your team is selecting an LLM for an internal "pol...
Selecting a Long-Context LLM for a Cost-Constrained Enterprise Document Assistant
Choosing Long-Context Evaluation Evidence for a High-Volume Contract Review Feature
Designing an Evaluation Plan for a Long-Context Compliance Copilot Under Latency and Cost Constraints
Reconciling Long-Context Retrieval Quality with Inference Efficiency for a Meeting-Transcript Copilot
Evaluating a Long-Context LLM for Audit-Ready Evidence Retrieval Under Throughput Constraints
Diagnosing Conflicting Long-Context Evaluation Signals for an Internal Knowledge Assistant
Learn After
A company is developing a real-time, interactive chatbot for customer support. The primary goal for user experience is that once the chatbot starts replying, the rest of its message appears to stream smoothly and continuously, creating a fluid conversational flow. The team is evaluating two different language models:
- Model Alpha: Responds almost instantly after the user sends a message, but each subsequent word appears with a noticeable, consistent pause.
- Model Beta: Takes a moment longer to begin its response, but once it starts, the entire rest of the message is generated very rapidly with no perceptible delay between words.
Which model should the company choose to best achieve its primary user experience goal, and why?
LLM Performance Analysis for Code Completion
A team is optimizing a language model for a real-time, streaming chatbot. They are focused on two distinct aspects of the user's perception of speed. Match each performance characteristic with the user experience it directly impacts.