Bicleaner
Bicleaner AI is a tool that aims at detecting noisy sentence pairs in a parallel corpus. The classifier scores parallel sentences from 0 to 1 where 0 means a very noisy translation and 1 is a good translation. If a specialized model for a language pair is not available it will fallback to downloading a multilingual en-xx model.
For supported languages see:
How to configure for training
The configuration specifies a default threshold and a per-dataset threshold. A sentence pair will be kept if its score is above the given threshold.
0.5
should be a good default value.- Increase the threshold for noisier datasets.
- Set the threshold to
0
to skip cleaning entirely.
Recommendations for specific datasets
Data set | Threshold | Reason |
---|---|---|
OpenSubtitles | 0.8 | This is a noiser dataset |
ParaCrawl | 0 | This dataset has already been cleaned by bicleaner. See Bicleaner AI: Bicleaner Goes Neural, section 4.2.2 |
Example config:
bicleaner:
default-threshold: 0.5
dataset-thresholds:
opus_CCAligned/v1: 0.7
opus_OpenSubtitles/v2018: 0.8
opus_ParaCrawl/v9: 0
...
Models and caching
In the current implementation an appropriate model is downloaded for a language pair on demand and then cached. For example for en-ru
a multilingual en-xx
will be downloaded since a dedicated model is not available for this pair. For pt-en
en-pt
will be downloaded since all the models have English as the first language.
The downloaded model will be cached in Taskcluster under the requested language pair (en-ru
, pt-en
). If a new model is added to Hugging Face repo it would be a good idea to invalidate the caches manually by editing pipeline/bicleaner/download_pack.py
. We do not do this automatically in the current implementation. We will rethink this strategy if this happens often.