Anki Flashcard ETL Pipeline

Project Summary

The beauty of having a CS degree is that you can use it to make your life easier.

I like learning languages (200+ day Duolingo streak, nbd). What ways are there to get better at a language? Reading, writing, speaking, and practicing.

For practice, I generally use a flashcard application called Anki that is popular among medical students. The premise of the application is interval-based learning, also known as spaced repetition. You do flashcards every day, and each time you get a card right, the interval before you see that card again increases.

If you have a new card and get it right, it'll give it to you in 1 day. If you get it right in 1 day, it'll give it to you in 4 days. If you get it right in 4 days, it'll give it to you in 10 days, etc. If you get the card wrong, the interval resets to 1 day and the intervals for that card start over.
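As a rough sketch of that schedule: the 1, 4, and 10 day steps come from the description above, while the growth rule past 10 days (ASSUMED_GROWTH) is a made-up placeholder and not Anki's actual scheduling algorithm.

```python
# Interval ladder taken from the description above. Anything past 10 days
# uses an assumed growth rule, not Anki's real scheduler.
INTERVALS_DAYS = [1, 4, 10]
ASSUMED_GROWTH = 2.5  # hypothetical multiplier once the ladder runs out


def next_interval(current_days: int, correct: bool) -> int:
    """Return the number of days until the card is shown again."""
    if not correct:
        return INTERVALS_DAYS[0]  # a miss resets the card to 1 day
    for step in INTERVALS_DAYS:
        if step > current_days:
            return step  # climb to the next rung of the ladder
    return round(current_days * ASSUMED_GROWTH)


interval = 0  # a brand-new card has no interval yet
for answer in [True, True, True, False]:
    interval = next_interval(interval, answer)
    print(interval)  # prints 1, 4, 10, then 1 after the miss
```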

I use this for language vocabulary. I found the process of manually adding cards cumbersome, though. I also wanted to add audio files for my vocab words so that I could hear their proper pronunciation, and I wanted to be able to add content from different formats: csv files with words and their translations, Duolingo vocab words, and txt files with just English words to translate.

Luckily, Anki has an API. So I made a simple ETL (Extract, Transform, Load) pipeline to automate my card creation and upload for me.

(System Diagram)

System Components

Anki Application: Locally hosted Anki application accessed through the AnkiConnect API
Python Script: extracts words then coordinates word transformations and uploads to Anki.
Google gTTS: API that converts text to speech
Google Translate: API that translates text
OpenAI API: used to correct word formatting and add articles

Inputs

Language Choice: Language selected by the user and passed in as a script parameter
Duolingo Words (html): HTML saved from the Practice page of the Duolingo website
English Words (txt): List of English words to be translated
Manual Translations (txt): Foreign words with English translation separated by a delimiter
ChatGPT Files (csv): Foreign words and English translations generated with prompting
Existing Words (pkl): Dictionary of existing words to avoid repeats stored as a pickle file
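
One way the existing-words check could look is sketched below. The function names and the dictionary layout (foreign word mapped to English translation) are assumptions made for illustration, not necessarily what the script does.

```python
import pickle
from pathlib import Path


def load_existing(path: Path) -> dict[str, str]:
    """Load previously uploaded words, or start empty on the first run."""
    if not path.exists():
        return {}
    with path.open("rb") as fh:
        return pickle.load(fh)


def drop_repeats(pairs: dict[str, str], existing: dict[str, str]) -> dict[str, str]:
    """Keep only the foreign words that are not already stored."""
    return {word: english for word, english in pairs.items() if word not in existing}


def save_existing(path: Path, existing: dict[str, str]) -> None:
    """Write the updated dictionary back so the next run can skip these words."""
    with path.open("wb") as fh:
        pickle.dump(existing, fh)
```

Pickle files should only be loaded when they come from a trusted source, which is the case here because the script writes the file itself.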

Outputs

Flashcards (json): Cards formatted and uploaded to Anki Application via AnkiConnect API
Audio Files (mp3): MP3 files generated using gTTS and moved to local Anki media folder
Execution Logs (txt): Generated on each run, documenting which word uploads succeeded or failed and why
File Archives (txt, csv, html): Stores input files just processed

Dependencies

gTTS: Python library for making calls to Google Text To Speech
requests: Python library used for API calls
openai: Python library for making calls to the OpenAI API
beautifulsoup4: Python library for parsing HTML
google-cloud-translate: Python library for making calls to Google Translate

Code + High Level Program Description

Code Accessible at: https://github.com/jbelshe/anki-language-learner

To run the script, I activate my Python environment and run it locally from my terminal. On the command line, I pass the desired language (example: “python3 translator.py German”).
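
A minimal sketch of reading that language argument with argparse follows. The LANGUAGE_CODES table and the parse_args helper are hypothetical names used for illustration. The real script may handle its arguments differently.

```python
import argparse

# Hypothetical mapping from the language name typed on the command line to
# the short code used by translation and speech services.
LANGUAGE_CODES = {"German": "de", "Spanish": "es", "French": "fr"}


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the single positional argument: the language to process."""
    parser = argparse.ArgumentParser(description="Build Anki flashcards for one language.")
    parser.add_argument("language", choices=sorted(LANGUAGE_CODES), help="language to process")
    return parser.parse_args(argv)


# Passing a list simulates "python3 translator.py German"; passing nothing
# would read the real command line instead.
args = parse_args(["German"])
print(args.language, LANGUAGE_CODES[args.language])  # German de
```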

Once the script is called, the working directory is set based on the language the script was called for (ex. German). The script checks the “Input” folder to see which files exist. Any files that exist and are not empty are individually parsed to extract the words.
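
The folder check could be sketched like this, assuming each language has its own directory containing an “Input” folder. The find_inputs name is made up for the example.

```python
from pathlib import Path


def find_inputs(language_dir: Path) -> list[Path]:
    """Return the non-empty files in <language_dir>/Input, sorted by name."""
    input_dir = language_dir / "Input"
    if not input_dir.is_dir():
        return []  # nothing to process for this language
    return sorted(
        path for path in input_dir.iterdir()
        if path.is_file() and path.stat().st_size > 0
    )
```

Calling find_inputs(Path("German")) would then give the list of files to hand to the individual parsers.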

Each input file provides words in a different format, so it is important to extract all relevant words and convert them to a consistent format. See the image below for the generalized flow of data and its transformations as it goes through the system.

(Input file transformation flow)

Each input file requires different handling. The manual translations file is already in the correct format, since each word was entered with its desired translation, so it just needs to be parsed according to the configured delimiter.
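
Parsing that file might look roughly like the sketch below. The “;” default delimiter, the foreign-word-first ordering, and the function name are assumptions.

```python
def parse_manual_translations(text: str, delimiter: str = ";") -> dict[str, str]:
    """Map each foreign word to its English translation."""
    pairs = {}
    for line in text.splitlines():
        if delimiter not in line:
            continue  # skip blank or malformed lines
        foreign, english = line.split(delimiter, 1)
        if foreign.strip() and english.strip():
            pairs[foreign.strip()] = english.strip()
    return pairs
```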

Similarly, the ChatGPT-generated vocab files (example prompt: give me a csv file of different foods in German with the German word first and the English word second) come out in a predetermined format, so they simply need to be parsed.
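
For the csv case, Python's csv module handles quoted entries that contain commas. This sketch assumes two columns (foreign word, then English word) and no header row. Both are guesses about what the generated file looks like.

```python
import csv
import io


def parse_vocab_csv(text: str) -> list[tuple[str, str]]:
    """Return (foreign, english) pairs from two-column csv text."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # skip blank or incomplete rows
        foreign, english = row[0].strip(), row[1].strip()
        if foreign and english:
            rows.append((foreign, english))
    return rows
```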

For the HTML file pulled from the Duolingo website, the HTML is parsed using the Python library Beautiful Soup. Duolingo provides words + definitions, so these are safe to add to our list of words as is. The English words file needs to be sent to the Google Translate API to get the appropriate translations. Once the English words have their translations, they are sent along with the Duolingo translations to the ChatGPT API. The purpose of this is to ensure that nouns have the proper articles attached to them (ex. “Tree - Baum” becomes “the tree - der Baum”).
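
The ordering of those steps can be sketched without tying it to any particular API. Here translate and add_articles are placeholder callables standing in for the Google Translate and ChatGPT calls. Their names and signatures are invented for the example, and the real requests are not shown.

```python
from typing import Callable


def build_pairs(
    english_words: list[str],
    duolingo_pairs: dict[str, str],
    translate: Callable[[str], str],
    add_articles: Callable[[str, str], tuple[str, str]],
) -> dict[str, str]:
    """Merge both sources, then run every pair through the article step."""
    pairs = dict(duolingo_pairs)  # foreign -> English, already translated
    for english in english_words:
        pairs[translate(english)] = english  # translation step for plain English words
    fixed = {}
    for foreign, english in pairs.items():
        foreign, english = add_articles(foreign, english)
        fixed[foreign] = english
    return fixed


# Stand-in functions so the flow can be tried without network access.
fake_translate = {"tree": "Baum"}.get
fake_articles = lambda foreign, english: (f"der {foreign}", f"the {english}")
print(build_pairs(["tree"], {"Hund": "dog"}, fake_translate, fake_articles))
```

Passing the two services in as functions keeps the flow testable offline and makes it easy to swap either one out.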

Once the input files have all been parsed and all the foreign word/English translation pairs have been extracted and given the proper articles, we have our list of words to add. We then send all of the foreign words to the gTTS API, which takes the requested language key (example: German = “de”) and returns a pronunciation of each foreign word.

We send the API each of the words and receive an MP3 file for each with a specified name. We save these files to the output folder and record each file name.
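
A sketch of the audio step using the gTTS library is below. The file naming scheme and both function names are assumptions. Calling synthesize needs the gTTS package installed and a network connection.

```python
import re
from pathlib import Path


def audio_filename(word: str, lang: str) -> str:
    """Build a filesystem-safe MP3 name such as 'de_der_baum.mp3'."""
    slug = re.sub(r"[^\w]+", "_", word.lower()).strip("_")
    return f"{lang}_{slug}.mp3"


def synthesize(words: list[str], lang: str, out_dir: Path) -> dict[str, str]:
    """Save one MP3 per word and return a word -> file name mapping."""
    # Imported here so the naming helper above works even without gTTS installed.
    from gtts import gTTS

    out_dir.mkdir(parents=True, exist_ok=True)
    names = {}
    for word in words:
        name = audio_filename(word, lang)
        gTTS(text=word, lang=lang).save(str(out_dir / name))
        names[word] = name
    return names
```

For example, synthesize(["der Baum"], "de", Path("Output")) would write Output/de_der_baum.mp3 and return the name for use on the flashcard.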

Now we have all the information that we need to create our flashcards. Using the foreign word, the English word, and the MP3 file name, we can build each flashcard in the JSON format that Anki requires for uploads. Once created, we upload all of the cards to our local Anki application using the AnkiConnect API and move the MP3 files to Anki’s media folder.
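
As an illustration of what that upload can look like, here is a sketch using AnkiConnect's addNotes action. The deck name, the “Basic” note type, the choice of English on the front, and the tag are assumptions. The sketch uses the standard library instead of requests so that it has no third-party dependencies.

```python
import json
import urllib.request

ANKI_CONNECT_URL = "http://127.0.0.1:8765"  # AnkiConnect's default address


def build_note(foreign: str, english: str, audio_name: str, deck: str) -> dict:
    """Build one note. [sound:...] tells Anki to play a file from its media folder."""
    return {
        "deckName": deck,
        "modelName": "Basic",  # assumed note type with Front and Back fields
        "fields": {
            "Front": english,
            "Back": f"{foreign} [sound:{audio_name}]",
        },
        "tags": ["etl-pipeline"],  # hypothetical tag
    }


def add_notes(notes: list[dict]) -> list:
    """Send the notes to a running Anki instance and return the new note ids."""
    payload = {"action": "addNotes", "version": 6, "params": {"notes": notes}}
    request = urllib.request.Request(
        ANKI_CONNECT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.load(response)
    if reply.get("error"):
        raise RuntimeError(reply["error"])
    return reply["result"]
```

Anki has to be open with the AnkiConnect add-on installed for add_notes to succeed. build_note can be checked on its own without Anki running.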

We now have all of our flashcards created and added with appropriate MP3 audio files available to play whenever selected on the app. Ta-Da!