Essay

Critiquing a Document Similarity System

A legal tech company is developing a feature to find similar documents within a large database of contracts. Their current method uses a pre-trained, general-purpose language model. To get a single vector representation for each contract, they run the text through the model and then average the output vectors of all the tokens. This approach has proven unreliable: it often fails to capture nuanced legal arguments and instead matches documents that merely share keywords.
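To make the averaging step concrete, here is a minimal sketch of mean pooling followed by a cosine-similarity comparison. The three-dimensional vectors in `TOY_VECTORS` are hypothetical stand-ins for the model's output vectors, and the names `mean_pool` and `cosine` are illustrative, not taken from the company's system. Because the toy vectors are static, the example exaggerates the effect: a real contextual model produces token vectors that depend on word order, so the averaged vectors would not be identical, although they may still be dominated by shared vocabulary.

```python
import math

# Hypothetical token vectors; a real system would take these from the
# pre-trained model's output layer instead of a fixed lookup table.
TOY_VECTORS = {
    "the": [0.1, 0.1, 0.1],
    "buyer": [1.0, 0.0, 0.0],
    "seller": [0.0, 1.0, 0.0],
    "indemnifies": [0.0, 0.0, 1.0],
}


def mean_pool(tokens):
    """Average the per-token vectors into one document vector."""
    vectors = [TOY_VECTORS[token] for token in tokens]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two clauses with opposite legal meaning but the same vocabulary.
clause_a = "the buyer indemnifies the seller".split()
clause_b = "the seller indemnifies the buyer".split()

vec_a = mean_pool(clause_a)
vec_b = mean_pool(clause_b)
similarity = cosine(vec_a, vec_b)

print(f"cosine similarity: {similarity:.6f}")
```

In this toy setting the two clauses assign the indemnification duty to opposite parties, yet their pooled vectors are the same up to floating-point rounding, because averaging discards which token filled which role.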

Critique this averaging-based approach. Explain why it is likely failing and propose a more effective strategy that involves adapting the pre-trained model to specialize in this task. Justify why your proposed strategy would lead to more meaningful document representations.

Updated 2025-10-06

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science