Code

Preprocessing the MTFraEng Dataset

The preprocessing steps for the English-French dataset can be implemented in a _preprocess method added to the MTFraEng class. The method first replaces the two non-breaking space characters found in the raw text, the narrow no-break space (\u202f) and the no-break space (\xa0), with regular spaces. It then lowercases the text and walks through it character by character, inserting a space before each punctuation mark (, . ! ?) that is not already preceded by one. The first character of the text is never given a leading space.

@d2l.add_to_class(MTFraEng)
def _preprocess(self, text):
    # Replace non-breaking space with space
    text = text.replace('\u202f', ' ').replace('\xa0', ' ')
    # Insert space between words and punctuation marks
    no_space = lambda char, prev_char: char in ',.!?' and prev_char != ' '
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text.lower())]
    return ''.join(out)
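To see the effect of these steps without loading d2l or the dataset, the same logic can be sketched as a standalone function. The name preprocess and the sample string below are illustrative assumptions, not part of the MTFraEng class. As a small variation on the method above, this sketch lowercases the text once up front so that iteration and the previous-character lookup use the same string.

```python
def preprocess(text: str) -> str:
    """Standalone sketch of the preprocessing steps (hypothetical helper)."""
    # Replace narrow no-break space (U+202F) and no-break space (U+00A0)
    # with a regular space.
    text = text.replace('\u202f', ' ').replace('\xa0', ' ')
    # Lowercase once so iteration and indexing refer to the same string.
    text = text.lower()
    out = []
    for i, char in enumerate(text):
        # Insert a space before , . ! ? unless one already precedes it.
        if i > 0 and char in ',.!?' and text[i - 1] != ' ':
            out.append(' ')
        out.append(char)
    return ''.join(out)


# Made-up sample in a tab-separated "English<TAB>French" layout.
sample = 'Go.\tVa\u202f!\nHi.\tSalut\xa0!'
print(repr(preprocess(sample)))
# prints 'go .\tva !\nhi .\tsalut !'
```

In the sample, the periods gain a leading space because they directly follow a letter, while the exclamation marks are left alone because the replaced non-breaking space already separates them from the preceding word.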

Updated 2026-05-14

Tags

D2L

Dive into Deep Learning @ D2L