1Cademy - Preprocessing the MTFraEng Dataset

Learn Before

Downloading the MTFraEng Dataset

Code

Preprocessing the MTFraEng Dataset

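The preprocessing steps for the English-French dataset can be implemented in a _preprocess method added to the MTFraEng class. The method standardizes the raw text in three steps. First, it replaces the non-breaking spaces \u202f (narrow no-break space) and \xa0 (no-break space) with regular spaces. Second, it converts the text to lowercase. Third, it walks through the characters and inserts a space before each punctuation mark (comma, period, exclamation mark, or question mark) that is not already preceded by a space and is not the first character of the text. Separating punctuation from words in this way allows a later whitespace split to treat each punctuation mark as its own token.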

@d2l.add_to_class(MTFraEng)
def _preprocess(self, text):
    # Replace non-breaking space with space
    text = text.replace('\u202f', ' ').replace('\xa0', ' ')
    # Insert space between words and punctuation marks
    no_space = lambda char, prev_char: char in ',.!?' and prev_char != ' '
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text.lower())]
    return ''.join(out)
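
To make the effect of these steps concrete, the following is a minimal standalone sketch of the same logic that runs without d2l or the MTFraEng class. The function name preprocess_text and the sample string are assumptions made for illustration; they are not part of the original code. Unlike the method above, the sketch lowercases the text once up front so that the iteration and the previous-character lookup use the same string.

```python
def preprocess_text(text):
    """Standalone mirror of the _preprocess logic (hypothetical helper name)."""
    # Replace narrow no-break space and no-break space with a regular space
    text = text.replace('\u202f', ' ').replace('\xa0', ' ')
    # Lowercase once so indexing and iteration refer to the same string
    text = text.lower()
    out = []
    for i, char in enumerate(text):
        # Insert a space before punctuation unless one already precedes it
        # or the punctuation mark is the very first character
        if i > 0 and char in ',.!?' and text[i - 1] != ' ':
            out.append(' ' + char)
        else:
            out.append(char)
    return ''.join(out)


# Made-up sample in the tab-separated "English<TAB>French" layout
raw = "Go.\tVa !\nHi.\tSalut\u202f!\n"
processed = preprocess_text(raw)
print(processed)
```

In this sample, "Go." becomes "go ." because the period directly follows a letter, while "Va !" only changes case because a space already precedes the exclamation mark. The narrow no-break space in "Salut\u202f!" becomes a regular space, so no additional space is inserted there.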

Updated 2026-05-14

References

Dive into Deep Learning

Learn After

Machine Translation Dataset Iterator
