Generating Anki decks with audio from the Tatoeba Project

tl;dr: If you look for the precompiled decks, here is a list.

A Hebrew version of this page is available here.

1. Background

The most effective Anki decks I know for language acquisition share three features:

The Tatoeba Project has millions of sentences, many of which were recorded by native speakers. As far as I am aware, there are no pre-made tools for importing Tatoeba sentences with audio to Anki, so I made such a tool and created decks for all of the languages that have recordings, incrementally sorted using the MorphMan addon. In order to save you the trouble of creating the decks on your own, I’ve already compiled decks for all of the languages that have audio sentences in Tatoeba.

Tatoeba is an amazing public and open collaborative project, but as all such projects it is not free of mistakes and low-quality entries. The volunteers do their best to minimize them, but they still exists. Having an audio recording can be seen as a filter for better quality sentences: if someone took their time to record a sentence, it is presumably of good quality.

2. Deck generation process

I made the decks on a Linux machine, but I guess you could do the same on macOS or Windows (with Cygwin or something similar). The tools we will use are: sh, Wget, tar, bzip2, sed, uniq, SQLite and Python.

If for some reason the pre-made decks do not suit your needs, don’t hesitate to contact me if you need help with generating a deck the do.

2.1. Downloading the data

Tatoeba exports its data as downloadable CSV files. The files we need are: sentences.csv, links.csv, tags.csv and sentences_with_audio.csv:

We can download, unpack and prepare them for later use by running this script:

download_and_prepare_csv.sh
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
#!/usr/bin/sh

mkdir -p csv
pushd csv

# Download
wget https://downloads.tatoeba.org/exports/sentences.tar.bz2
wget https://downloads.tatoeba.org/exports/links.tar.bz2
wget https://downloads.tatoeba.org/exports/tags.tar.bz2
wget https://downloads.tatoeba.org/exports/sentences_with_audio.tar.bz2

# Decompress and untar
for f in *.tar.bz2; do
        tar jxf $f
done

# Prepare
sed 's/"/""/g;s/[^\t]*/"&"/g' sentences.csv > sentences.escaped_quotes.csv
sed 's/"/""/g;s/[^\t]*/"&"/g' tags.csv > tags.escaped_quotes.csv
uniq sentences_with_audio.csv > sentences_with_audio.uniq.csv

# Remove compressed files
rm -i {sentences, links, tags, sentences_with_audio}.tar.bz

popd

2.2. Making a local database

Now we want to make a database so we will be able to run queries. I chose SQLite because it is local, fast, convenient and easy to setup. By running this SQL file (sqlite3 -init create_db.sql) we create a database with the tables we need and import the data:

create_db.sql
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
.open tatoeba.sqlite3

-- Tatoeba’s database has many deleted entries in `sentences` which are referenced from `sentences_with_audio` and `links`; expect tons of warning messages
PRAGMA foreign_keys = ON;

CREATE TABLE sentences (
        sentence_id INTEGER PRIMARY KEY,
        lang TEXT,
        text TEXT
);
CREATE TABLE sentences_with_audio (
        sentence_id INTEGER PRIMARY KEY,
        username TEXT,
        license TEXT,
        attribution_url TEXT,
        FOREIGN KEY (sentence_id) REFERENCES sentences(sentence_id)
);
CREATE TABLE links (
        sentence_id INTEGER,
        translation_id INTEGER,
        FOREIGN KEY (sentence_id) REFERENCES sentences(sentence_id),
        FOREIGN KEY (translation_id) REFERENCES sentences(sentence_id)
);
CREATE TABLE tags (
        sentence_id INTEGER,
        tag_name TEXT,
        FOREIGN KEY (sentence_id) REFERENCES sentences(sentence_id)
);

CREATE INDEX links_index ON links(sentence_id, translation_id);
CREATE INDEX tags_index ON tags(sentence_id, tag_name);

.separator "\t"
.import csv/sentences.escaped_quotes.csv sentences
.import csv/sentences_with_audio.uniq.csv sentences_with_audio
.import csv/links.csv links
.import csv/tags.escaped_quotes.csv tags

The CSV files has many references to deleted sentences, so many warnings will be shown.

2.3. Writing the main query

Now we want to write a query that will output an Anki-importable file with this data in each row:

For this purpose I wrote this patchy Python script:

query.py
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
#!/usr/bin/env python3

import argparse
import csv
import os
import sqlite3

output_dir = 'output'
native_langs = []


def native_lang_columns():
        def native_lang_column(lang):
                return f"""
                "<ul class=""translations""><li>" ||
                (
                                SELECT group_concat(sentences.text, "</li><li>")
                                FROM links JOIN sentences
                                ON
                                                links.translation_id = sentences.sentence_id
                                WHERE
                                                links.sentence_id = target_sentences.sentence_id
                                                AND
                                                sentences.lang = '{lang}'
                                                )
                || "</li></ul>"
                """
        result = ""
        for lang in native_langs[:-1]:
                result += native_lang_column(lang) + ", "
        result += native_lang_column(native_langs[-1])
        return result


def main():
        parser = argparse.ArgumentParser(
                description="Make a CSV files of sentences from the Tatoeba Project that have audio, along with their translations into selected languages")
        parser.add_argument("-t", "--target", type=str,
                                                help="target language",
                                                required=True)
        parser.add_argument("-n", "--native", type=str,
                                                help="native languages (space-delimited, within quotes)",
                                                required=True)
        parser.add_argument("-d", "--database", type=str,
                                                help="database file",
                                                default = "tatoeba.sqlite3")
        args = parser.parse_args()
        global native_langs
        native_langs = args.native.split(" ")

        conn = sqlite3.connect(args.database)
        c = conn.cursor()

        query = f"""
        SELECT
                        target_sentences.sentence_id,
                        target_sentences.text,
                        "[sound:tatoeba_" || "{args.target}" || "_" || target_sentences.sentence_id || ".mp3]",
                        "<ul class=""tags""><li>" ||
                        (
                                SELECT group_concat(tag_name, "</li><li>")
                                FROM tags
                                WHERE tags.sentence_id = target_sentences.sentence_id
                        )
                        || "</li></ul>",
                        {native_lang_columns()}
        FROM
                        sentences AS target_sentences
        WHERE
                        target_sentences.lang = "{args.target}" AND
                        target_sentences.sentence_id IN (SELECT sentence_id FROM sentences_with_audio)
        ;
        """
        if not os.path.exists('output'):
                os.makedirs('output')
        with open(f'{os.path.join(output_dir, args.target)} → {args.native}.csv', 'w', newline='') as csvfile:
                out = csv.writer(csvfile, delimiter='\t',
                                quotechar='|', quoting=csv.QUOTE_MINIMAL)
                for row in c.execute(query):
                        out.writerow(row)

        conn.close()


if __name__ == '__main__':
        main()

In order to produce, for example, a deck of Finnish sentences with audio along with translations into English, Russian, Spanish, Italian and Japanese (whenever available), we use it like this (with ISO 639-3 codes):

1
./query.py -t fin -n "eng rus spa ita jpn"

Running the script will produce a CSV file in the output subdirectory.

2.4. Downloading the audio files

Each audio file is available from this URL: https://audio.tatoeba.org/sentences/sentence_id. In order to make a list of all of the files to be downloaded we use this Python script:

audio_urls.py
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
#!/usr/bin/env python3

import argparse
import csv
import sqlite3


def main():
        parser = argparse.ArgumentParser(
                description="Make a list of URLs of audio files for a specific language from the Tatoeba Project")
        parser.add_argument("-t", "--target", type=str,
                                                help="target language",
                                                required=True)
        parser.add_argument("-d", "--database", type=str,
                                                help="database file",
                                                default = "tatoeba.sqlite3")
        args = parser.parse_args()

        conn = sqlite3.connect(args.database)
        c = conn.cursor()

        query = f"""
SELECT
        sentence_id
FROM
        sentences
WHERE
        lang = '{args.target}' AND
        sentence_id IN (SELECT sentence_id FROM sentences_with_audio)
        """

        for row in c.execute(query):
                print("https://audio.tatoeba.org/sentences/" + args.target + "/" + str(row[0]) + ".mp3")


if __name__ == '__main__':
        main()

This shell script downloads the files and renames them properly:

download_and_prepare_audio
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
#!/usr/bin/sh

mkdir -p output/audio

# Source: https://stackoverflow.com/a/11850469
./audio_urls.py -t $1 -d $2 | xargs -n 1 -P 2 wget --directory-prefix=output/audio/ --continue

for f in output/audio/*; do
        mv "$f" "$(echo $f | sed 's/^output\/audio\//output\/audio\/tatoeba_'$1'_/g')";
done

2.5. Creating a proper note type

Now that we have the data exported to an importable CSV files we want to create a proper note type for it (Tools → Manage Note Types → Add). We need the following fields: sentence_id, target (e.g. fin), audio, tags, and a field for each of the languages we are familiar with. In addition, we need these fields for MorphMan: MorphMan_FocusMorph, MorphMan_Index, MorphMan_Unmatures, MorphMan_UnmatureMorphCount, MorphMan_Unknowns, MorphMan_UnknownMorphCount, MorphMan_UnknownFreq. Delete the default Front and Back fields and close the window.

Now, pressing the Cards button will open a window similar to this:

Figure 1: Card editing window.

We create three cards: Reading, Listening and Production. For Reading in our Finnish exemple we write

1
<p>fin</p>

in the front template and

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
{{FrontSide}}

<hr id="answer">

<p id="tags">{{tags}}</p>

<p>{{eng}}</p>
<p>{{rus}}</p>
<p>{{spa}}</p>
<p>{{ita}}</p>
<p>{{jpn}}</p>

<p>{{audio}}</p>

<p id="tatoeba"><a href="https://tatoeba.org/eng/sentences/show/{{sentence_id}}"><img src="_tatoeba.svg" /></a></p>

in the back template.

The shared styling should be something like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
.card {
        font-family: sans;
        font-size: 20px;
        text-align: center;
        color: #111;
        background-color: #fffff8;
}
.card.night_mode {
        background-color: #2E3440;
        color: #D8DEE9;
}

#lookup a, #tatoeba {
        color: inherit;
        text-decoration: inherit;
}

#tatoeba img {
        width: 1em;
        height: auto;
}

#tags {
        font-size: small;
}

.hebrew, .arabic {
        direction: rtl;
}

.translations, .tags {
        list-style-type: none;
        margin: 0;
        padding: 0;
}
.translations li, .tags li {
        display: inline;
}
.translations li:after, .tags li:after {
        content: " · "
}
.translations li:last-child:after, .tags li:last-child:after {
        content: ""
}

Listening’s front template:

1
<p>{{audio}}</p>

Listening’s back template:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
{{FrontSide}}

<hr id="answer">

<p id="lookup">{{fin}}</p>

<p id="tags">{{tags}}</p>

<p>{{eng}}</p>
<p>{{rus}}</p>
<p>{{spa}}</p>
<p>{{ita}}</p>
<p>{{jpn}}</p>

<p id="tatoeba"><a href="https://tatoeba.org/eng/sentences/show/{{sentence_id}}"><img src="_tatoeba.svg" /></a></p>

Production’s front template:

1
2
3
4
5
<p>{{eng}}</p>
<p>{{rus}}</p>
<p>{{spa}}</p>
<p>{{ita}}</p>
<p>{{jpn}}</p>

Production’s back template:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
{{FrontSide}}

<hr id="answer">

<p id="lookup">{{fin}}</p>

<p id="tags">{{tags}}</p>

<p>{{audio}}</p>

<p id="tatoeba"><a href="https://tatoeba.org/eng/sentences/show/{{sentence_id}}"><img src="_tatoeba.svg" /></a></p>

2.6. Importing into Anki

Import the CSV file (File → Import). Check Allow HTML in fields. The fields in the CSV file and our note type should match.

Copy the MP3 files to your media collection directory. Its location depends on your operating system; read more here. Download Tatoeba’s logo to the same directory, renaming it _tatoeba.svg.

Now we have a working deck and it’s time to check it using the preview option of the Browse window.

2.7. Sorting the cards

One last thing you might want to do is to sort the cards so new words occur incrementally. MorphMan is an Anki addon that does just this. Read about it in the wiki and/or watch YouTube videos about it.

That’s it. Enjoy learning whatever language you want to learn :-)

If you benefit from Tatoeba please consider joining the project and contribute sentences, translations and audio recordings (see this guide), or donating money.

3. Precompiled decks

I made decks for all of the language in Tatoeba that have audio recordings (except English and Spanish, which have too many sentences) so people will not have to go through the pain (and pleasure…) of making them.

In order to obtain a list of the relevant languages we can run this query:

languages_with_audio.sql
1
2
3
4
5
6
7
.open tatoeba.sqlite3

SELECT lang, COUNT (sentences.sentence_id) AS audio_sentences_no
FROM sentences_with_audio JOIN sentences
ON sentences_with_audio.sentence_id = sentences.sentence_id
GROUP BY lang
ORDER BY audio_sentences_no DESC;

I chose to include translation for the five languages that share most co-translated sentences with the target language. In order to check what these languages are we can run this query:

statistics_for_translated_audio_sentences.sql
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
.open tatoeba.sqlite3

SELECT lang, COUNT (sentences.sentence_id) AS counter
FROM sentences JOIN links
ON sentences.sentence_id = links.translation_id
WHERE
        links.sentence_id IN
        (
                SELECT sentence_id
                FROM sentences
                WHERE
                        lang = 'fin'
                        AND
                        sentence_id in (SELECT sentence_id FROM sentences_with_audio)
        )
GROUP BY lang
ORDER BY counter DESC
LIMIT 5;

3.1. Downloading the decks

The result of the whole process was uploaded to AnkiWeb. The decks have names following this format, conforming to the 60 character limit:

1
All LANGUAGE sentences with recorded audio from Tatoeba

If you find the decks useful, please leave positive feedback on AnkiWeb: it will make me happy and will others find the decks.

For backup purposes you can download the decks from this website’s server; the link is designated by ⭳. I prefer you download the decks from AnkiWeb (), as decks that don’t get enough downloads are removed from AnkiWeb.

A temporary note: one can share only 10 decks a week on AnkiWeb. This is the reason not all decks have AnkiWeb links. I hope soon I will be able to complete sharing all decks there.

19582German 1/2English (18733)Esperanto (9873)French (8322)Russian (7717)Spanish (6917)2019/10/12
German 2/2
11019PortugueseEnglish (10163)Spanish (2282)Esperanto (1704)French (1080)Russian (653)2019/10/12
8181FrenchEnglish (7956)Esperanto (6280)Russian (5051)German (2921)Ukrainian (2204)2019/10/12
6720HungarianEnglish (6048)German (1263)Esperanto (643)Italian (529)French (350)2019/10/12
4690RussianEnglish (3294)Japanese (1874)French (1662)German (1268)Ukrainian (1246)2019/10/12
4598BerberEnglish (4494)Spanish (320)French (315)Kabyle (60)Arabic (46)2019/10/12
4601EsperantoEnglish (3901)French (1340)German (1172)Dutch (748)Spanish (570)2019/10/12
4057FinnishEnglish (4017)Russian (1174)Spanish (1087)Italian (766)Japanese (286)2019/10/12
2491Wu ChineseMandarin (2489)French (633)English (427)Spanish (34)Yue Chinese (32)2019/10/12
1961DutchEsperanto (1935)English (1698)Ukrainian (1460)German (1321)Spanish (1231)2019/10/12
1678Mandarin ChineseFrench (1317)German (1280)English (1260)Wu Chinese (700)Spanish (605)2019/10/12
1283JapaneseEnglish (1278)Russian (1249)Finnish (1050)German (1025)French (658)2019/10/12
1086HebrewEnglish (1086)Esperanto (125)Polish (120)Russian (89)French (83)2019/10/12
1067LatinEnglish (974)Portugeuse (430)Spanish (375)French (293)Esperanto (199)2019/10/12
480Central DusunEnglish (401)Japanese (45)Coastal Kadazan (31)2019/10/12
376MarathiEnglish (376)Hindi (142)2019/10/12
363UkrainianEnglish (363)French (28)German (20)Italian (17)Spanish (15)2019/10/12
224PolishEnglish (222)Dutch (98)German (30)Ukrainian (27)Russian (21)2019/10/12
134ThaiEnglish (87)Esperanto (39)German (38)French (38)Russian (32)2019/10/12
112CatalanEnglish (111)Spanish (41)Ukrainian (31)French (19)Esperanto (17)2019/10/12
60ChavacanoEnglish (53)2019/10/12
53RomanianEnglish (51)Esperanto (51)Dutch (50)German (34)Spanish (34)2019/10/12
37TurkishEnglish (37)German (13)Esperanto (8)Spanish (5)Swedish (3)2019/10/12
28Naga (Tangshang)English (28)2019/10/12