Supported languages
Automatically generated JSON with supported language codes.
We use BCP-47-like language codes. They can serve as labels in Taskcluster, can be parsed by various tools, and accommodate specialized use cases such as region-specific variants, low-resource languages, and multilingual models.
The language codes should be supported by ICU.
Examples:
| Code | Language variant |
|---|---|
| en | English |
| ru | Russian |
| nb | Norwegian Bokmål |
| sq | Albanian |
| zh | Chinese in Simplified script (ICU default) |
| zh_hant | Chinese in Traditional script |
| sr_cyrl | Serbian in Cyrillic |
| pt_br | Brazilian Portuguese |
| ca_valencia | Catalan Valencia |
| hbs | Macrolanguage for the Serbo-Croatian group |
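
To illustrate how the codes in the table relate to standard BCP-47 tags, here is a minimal sketch that converts a lowercase, underscore-separated code into conventional BCP-47 casing. The function name `to_bcp47` is hypothetical and is not part of pipeline/langs; the subtag rules are a simplification of BCP-47 (extensions, private-use subtags, and four-character variants are not handled).

```python
import re


def to_bcp47(code: str) -> str:
    """Convert a code such as 'zh_hant' into BCP-47 casing ('zh-Hant').

    Hypothetical helper for illustration only; it is not the pipeline's
    own implementation and covers only language, script, region, and
    simple variant subtags.
    """
    parts = code.split("_")
    language = parts[0].lower()
    if not re.fullmatch(r"[a-z]{2,3}", language):
        raise ValueError(f"Unexpected language subtag: {parts[0]!r}")

    result = [language]
    for subtag in parts[1:]:
        if re.fullmatch(r"[A-Za-z]{4}", subtag):
            # Script subtags are four letters, written in title case.
            result.append(subtag.title())
        elif re.fullmatch(r"[A-Za-z]{2}|\d{3}", subtag):
            # Region subtags are two letters (upper case) or three digits.
            result.append(subtag.upper())
        elif re.fullmatch(r"[A-Za-z0-9]{5,8}", subtag):
            # Variant subtags of five to eight characters stay lower case.
            result.append(subtag.lower())
        else:
            raise ValueError(f"Unexpected subtag: {subtag!r}")
    return "-".join(result)


if __name__ == "__main__":
    for example in ["en", "zh_hant", "sr_cyrl", "pt_br", "ca_valencia", "hbs"]:
        print(example, "->", to_bcp47(example))
```

A tag in this form can then be passed to any ICU-based tool that accepts BCP-47 input, which is one way to check that a new code is recognized.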
The code for language support and the mappings to various tools are encapsulated in the pipeline/langs module.
The currently supported languages are defined in pipeline/langs/maps.py.
We generate a JSON file with all language codes supported by the pipeline: pipeline/langs/all_autogenerated.json. It is useful for verifying the language codes used by specific tools. To regenerate it, run:
task generate-langs-map
To add support for a new code, add it to PIPELINE_SUPPORT, regenerate the JSON, and adjust the mappings for various tools if needed.
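
After regenerating the JSON, a quick membership check can confirm that a new code was picked up. The sketch below is an assumption-laden example: the helper `is_supported` is hypothetical, and it assumes the top level of the generated file is either a list of codes or an object keyed by code. Inspect the actual file and adapt the lookup if its structure differs.

```python
import json
from pathlib import Path

# Path taken from the text above; run from the repository root.
DEFAULT_JSON = "pipeline/langs/all_autogenerated.json"


def is_supported(code: str, json_path: str = DEFAULT_JSON) -> bool:
    """Return True if `code` appears at the top level of the generated JSON.

    Assumption: the top level is a list of codes or a dict keyed by code.
    The `in` operator covers both cases.
    """
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    return code in data
```

For example, `is_supported("zh_hant")` would be expected to return `True` if that code is present at the top level of the file.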