Supported languages

Automatically generated JSON with supported language codes.

We use BCP-47-like language codes. They can serve as Taskcluster labels, are easy to parse in various tools, and accommodate specialized use cases such as training models for region-specific and low-resource languages, multilingual models, etc.

The language codes should be supported by ICU.

Examples:

Code	Language variant
en	English
ru	Russian
nb	Norwegian Bokmål
sq	Albanian
zh	Chinese in Simplified script (ICU default)
zh_hant	Chinese in Traditional script
sr_cyrl	Serbian in Cyrillic
pt_br	Brazilian Portuguese
ca_valencia	Catalan (Valencian)
hbs	Macrolanguage for the Serbo-Croatian group
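The codes above follow BCP-47 structure but use lowercase subtags joined by underscores. As a rough illustration of how such a code could be mapped to a conventionally cased BCP-47 tag, here is a minimal sketch. The helper `to_bcp47` is hypothetical and is not part of the pipeline/langs module; it only applies the usual casing conventions (title case for four-letter script subtags, upper case for region subtags).

```python
import re


def to_bcp47(code: str) -> str:
    """Convert a pipeline-style code such as "zh_hant" to "zh-Hant".

    Hypothetical helper for illustration only; not the pipeline's own API.
    """
    parts = code.split("_")
    # The first subtag is the primary language and stays lowercase.
    result = [parts[0].lower()]
    for part in parts[1:]:
        if re.fullmatch(r"[A-Za-z]{4}", part):
            # Four letters: script subtag, title case (e.g. Hant, Cyrl).
            result.append(part.title())
        elif re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", part):
            # Two letters or three digits: region subtag, upper case (e.g. BR).
            result.append(part.upper())
        else:
            # Anything else is treated as a variant and kept lowercase.
            result.append(part.lower())
    return "-".join(result)


for code in ["en", "zh_hant", "sr_cyrl", "pt_br", "ca_valencia", "hbs"]:
    print(code, "->", to_bcp47(code))
```

The sketch does not validate that a subtag is a registered language, script, or region; it only normalizes separators and casing.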

The code for language support and the mappings to various tools is encapsulated in the pipeline/langs module.

The currently supported languages are defined in pipeline/langs/maps.py.

We generate a JSON file for all language codes supported by the pipeline: pipeline/langs/all_autogenerated.json. It is useful for verifying language codes for specific tools. To regenerate it, run:

task generate-langs-map
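To check a code against the generated file, something like the following sketch may help. It assumes the JSON is a top-level object keyed by pipeline language code; that layout is an assumption, not something this page documents, so adjust the lookup to the actual structure. The function name `lookup_code` is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Optional


def lookup_code(json_path: str, code: str) -> Optional[Any]:
    """Return the entry stored for `code`, or None if it is not present.

    Assumes (unverified) that the file is a JSON object keyed by language code.
    """
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        return data.get(code)
    # Any other top-level layout is not handled by this sketch.
    return None
```

For example, `lookup_code("pipeline/langs/all_autogenerated.json", "zh_hant")` would return whatever the file stores for that code, or `None` if the code is missing.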

To add support for a new code, add it to PIPELINE_SUPPORT, regenerate the JSON and adjust mappings for various tools if needed.
