Supported languages

Automatically generated JSON with supported language codes.

We use BCP-47-like language codes. They can serve as labels for Taskcluster, are easy to parse in various tools, and accommodate specialized use cases such as training models for region-specific and low-resource languages, multilingual models, etc.

The language codes should be supported by ICU.

Examples:

Code          Language variant
en            English
ru            Russian
nb            Norwegian Bokmål
sq            Albanian
zh            Chinese in Simplified script (ICU default)
zh_hant       Chinese in Traditional script
sr_cyrl       Serbian in Cyrillic
pt_br         Brazilian Portuguese
ca_valencia   Catalan Valencia
hbs           Macro language for Serbo-Croatian group
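
To illustrate how codes of this shape can be parsed, here is a minimal Python sketch that splits a code into language, script, region, and variant subtags and renders it in conventional BCP-47 casing. The names LangCode, parse_lang_code, and to_bcp47 are hypothetical and are not part of pipeline/langs. The subtag rules are a simplified reading of BCP-47 (for example, numeric region codes are not handled), so treat this as an illustration rather than the pipeline's actual logic.

```python
from typing import NamedTuple, Optional


class LangCode(NamedTuple):
    """Hypothetical container for the subtags of a pipeline-style code."""
    language: str
    script: Optional[str] = None
    region: Optional[str] = None
    variant: Optional[str] = None


def _is_ascii_alpha(text: str) -> bool:
    return text.isascii() and text.isalpha()


def parse_lang_code(code: str) -> LangCode:
    """Split a lowercase, underscore-separated code into subtags.

    Simplified rules: 2-3 letters = language, 4 letters = script,
    2 letters = region, 5-8 alphanumerics = variant, in that order.
    """
    parts = code.lower().split("_")
    language, rest = parts[0], parts[1:]
    if not (_is_ascii_alpha(language) and 2 <= len(language) <= 3):
        raise ValueError(f"Unexpected language subtag in {code!r}")

    script = region = variant = None
    for part in rest:
        nothing_later = region is None and variant is None
        if len(part) == 4 and _is_ascii_alpha(part) and script is None and nothing_later:
            script = part
        elif len(part) == 2 and _is_ascii_alpha(part) and region is None and variant is None:
            region = part
        elif 5 <= len(part) <= 8 and part.isascii() and part.isalnum() and variant is None:
            variant = part
        else:
            raise ValueError(f"Unexpected subtag {part!r} in {code!r}")
    return LangCode(language, script, region, variant)


def to_bcp47(code: str) -> str:
    """Render a code with hyphens and conventional BCP-47 casing."""
    parsed = parse_lang_code(code)
    tags = [parsed.language]
    if parsed.script:
        tags.append(parsed.script.title())  # e.g. hant -> Hant
    if parsed.region:
        tags.append(parsed.region.upper())  # e.g. br -> BR
    if parsed.variant:
        tags.append(parsed.variant)
    return "-".join(tags)
```

For instance, to_bcp47("zh_hant") yields "zh-Hant" and to_bcp47("pt_br") yields "pt-BR", while a bare code such as "hbs" is returned unchanged.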

The code for language support and mappings to various tools are encapsulated in the pipeline/langs module.

The currently supported languages are defined in pipeline/langs/maps.py.

We generate a JSON file with all language codes supported by the pipeline: pipeline/langs/all_autogenerated.json. It is useful for verifying language codes for specific tools. To regenerate it, run:

task generate-langs-map
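
After regenerating, one way to check where a code appears in the file is a generic search over the parsed JSON. The sketch below makes no assumption about the file's layout: it reports every key or string value equal to the code. The function names find_code and load_and_find are hypothetical helpers, not part of the pipeline.

```python
import json
from pathlib import Path
from typing import Any, Iterator, List


def find_code(node: Any, code: str, path: str = "$") -> Iterator[str]:
    """Yield a JSON-path-like string for every key or value equal to `code`."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == code:
                yield f"{path}.{key}"
            # Keep descending so nested occurrences are reported too.
            yield from find_code(value, code, f"{path}.{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_code(item, code, f"{path}[{index}]")
    elif node == code:
        yield path


def load_and_find(json_path: str, code: str) -> List[str]:
    """Load a JSON file and list every location where `code` occurs."""
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    return list(find_code(data, code))
```

Calling load_and_find("pipeline/langs/all_autogenerated.json", "pt_br") from the repository root would then list each location of pt_br in the file; an empty list means the code does not appear.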

To add support for a new code, add it to PIPELINE_SUPPORT, regenerate the JSON and adjust mappings for various tools if needed.