Building a Textfile Translator

Machine-learning-based text translation has always fascinated me, because it saves users the countless hours they would otherwise spend learning another language.

Online translation services, Google Translate in particular, have always been useful when I needed to translate a few sentences or paragraphs.

However, they became a problem when I needed to translate large, verbose PDFs written in a language other than English, because Google Translate has a 5,000-character limit.

While not an exact science, a non-fiction book generally runs between 80,000 and 90,000 words, although another source puts the range at 30,000 to 70,000 words.

On average, text contains between 5 and 6.5 characters per word including spaces and punctuation.
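
To make the estimate concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the word counts and characters-per-word figures quoted above, plus the 5,000-character limit. The function name estimate_characters is my own and purely illustrative.

```python
import math

# Figures quoted in the article; treat them as rough estimates.
CHARS_PER_WORD = (5.0, 6.5)   # low and high, including spaces and punctuation
PASTE_LIMIT = 5000            # Google Translate character limit per paste


def estimate_characters(words):
    """Return the (low, high) character estimate for a given word count."""
    low, high = CHARS_PER_WORD
    return int(words * low), int(words * high)


if __name__ == "__main__":
    for words in (30_000, 70_000, 90_000):
        low, high = estimate_characters(words)
        # Minimum number of copy-and-paste rounds at the low estimate
        pastes = math.ceil(low / PASTE_LIMIT)
        print(f"{words:,} words: {low:,} to {high:,} characters, "
              f"at least {pastes} pastes of {PASTE_LIMIT:,} characters")
```

Even the shortest book in that range would need dozens of manual pastes, which is what makes an automated approach attractive.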

Value Proposition

Doing the maths, translating this type of work can easily exceed 100,000 characters. Human translators are not cheap either: at 10 cents a word, one of the lower-end rates, a non-fiction book of 30,000 to 70,000 words would cost roughly $3,000 to $7,000.

To solve this, I first converted the PDF into a TXT file using this OCR converter site. I then wrote a simple translator, in fewer than 50 lines of code, that uses Google's Cloud Translation API.

API Key Translator Setup

The main reason I built this program on Google’s API was that the company offered a free allowance of 500,000 characters per month, along with an ease of use for beginners that I could not find in other APIs at the time of writing.
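
Before sending a whole book through the API, it can help to check how much of that monthly allowance a file would use. The sketch below is a minimal example, assuming the allowance counts every character, whitespace and newlines included. The helper names count_characters and fits_free_tier are hypothetical and not part of any Google library.

```python
FREE_MONTHLY_CHARACTERS = 500_000  # free allowance quoted in the article


def count_characters(path):
    """Count every character in a UTF-8 text file, newlines included."""
    with open(path, encoding="utf-8") as handle:
        return sum(len(line) for line in handle)


def fits_free_tier(path, limit=FREE_MONTHLY_CHARACTERS):
    """Return True if the file's character count is within the allowance."""
    return count_characters(path) <= limit
```

Calling fits_free_tier on the TXT file produced by the OCR step gives a quick yes or no before any request is made.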

Code

Here are the key excerpts from my code.

As you can see, the main algorithm is a very simple for loop: it iterates over the text file to translate, translates each line, and writes the result to another text file. This script runs in O(n) time, where n is the number of lines in the input file.

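from google.cloud import translate_v2 as translate

# Target must be an ISO 639-1 language code.
# http://g.co/cloud/translate/v2/translate-reference#supported_languages
target = "en"  # example value

# Placeholder file names; substitute your own input and output paths.
source_path = "input.txt"
output_path = "output.txt"

# Client() uses Application Default Credentials, for example the key file
# named by the GOOGLE_APPLICATION_CREDENTIALS environment variable.
translate_client = translate.Client()

with open(source_path, encoding="utf-8") as f1, \
        open(output_path, "w", encoding="utf-8") as f2:
    for line in f1:
        text = line.rstrip("\n")
        if not text.strip():
            # Keep blank lines as they are instead of sending them to the API
            f2.write("\n")
            continue
        result = translate_client.translate(text, target_language=target)
        # Stores translated line in another txt file
        f2.write(result['translatedText'] + '\n')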
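
The loop above sends one request per line. As a possible variation, and not what my script does, lines can be grouped into small batches so that fewer requests are made. The sketch below shows the grouping logic only. The names batch_lines and translate_lines and the default limits are illustrative assumptions, not documented quotas, and the translator is passed in as a plain function so the example runs without credentials.

```python
def batch_lines(lines, max_lines=50, max_chars=4000):
    """Group lines into lists that stay under both limits.

    The limits are illustrative assumptions, not documented quotas.
    A single line longer than max_chars still gets a batch of its own.
    """
    batch, size = [], 0
    for line in lines:
        text = line.rstrip("\n")
        # Start a new batch when adding this line would break a limit
        if batch and (len(batch) >= max_lines or size + len(text) > max_chars):
            yield batch
            batch, size = [], 0
        batch.append(text)
        size += len(text)
    if batch:
        yield batch


def translate_lines(lines, translate_batch, **limits):
    """Translate lines batch by batch, preserving their order."""
    for batch in batch_lines(lines, **limits):
        yield from translate_batch(batch)


if __name__ == "__main__":
    # Stand-in translator so the sketch runs without credentials.
    fake = lambda batch: [text.upper() for text in batch]
    print(list(translate_lines(["one\n", "two\n", "three\n"], fake, max_lines=2)))
```

With the client from the excerpt, translate_batch could be something like lambda batch: [r['translatedText'] for r in translate_client.translate(batch, target_language=target)], since translate also accepts a list of strings and then returns a list of results.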

My text-translator project is hosted here on GitHub.

Running Locally

The code is quite straightforward to run locally: after downloading the repo, just make sure to install the required packages.

Command Prompt

Results

Here is the output from the script below. The input was a file I had previously translated from English into Greek; here I translate the Greek back into English. Back translation is a useful way to check the fidelity of a translation, and the result shows the power of the Google API along with its ease of use.
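
A back-translation check like this can also be given a rough number. The sketch below compares the original text with its round-trip version using Python's standard difflib module. The function name similarity is my own, and the score only measures word overlap, so it is a crude signal and not a real measure of translation quality.

```python
import difflib


def similarity(original, back_translated):
    """Return a 0.0 to 1.0 word-level similarity ratio between two texts."""
    a = original.lower().split()
    b = back_translated.lower().split()
    return difflib.SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    # Made-up sentences for illustration, not output from my script.
    source = "The quick brown fox jumps over the lazy dog"
    round_trip = "The fast brown fox jumps over the lazy dog"
    print(f"similarity: {similarity(source, round_trip):.2f}")
```

A score close to 1.0 suggests the round trip preserved most of the wording, while a low score is a hint to read that passage more carefully.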

Conclusion

Thanks for reading. Feel free to reach out on LinkedIn if you have any questions about the project or anything else.