Python Detect Language of String

September 30, 2018

langdetect is simple and lightweight, but in my tests (mostly on Chinese text) it was not very accurate.

NOTE: For something more accurate, you probably need something like TextBlob (uses NLTK) or polyglot (which requires NumPy). I didn’t test these, though.

from langdetect import detect

# Correct results

detect('Hello World')
# 'en'

# 'zh-tw'

# Incorrect results
# 'ko', should be 'zh'

detect(u'福岡住宿推薦 博多車站前竺紫口 天然溫泉 Super hotel Lohas')
# 'et', should be 'zh'

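Since I mostly need to tell English from Chinese, I wrote the following wrapper, which works pretty well for me. It keeps langdetect's answer unless the Chinese characters outnumber half of the English letters and digits, in which case it returns "zh".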

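import re

from langdetect import detect


def detect_lang(sample):
    # Start from langdetect's guess, then override it for mostly-Chinese text.
    lang = detect(sample)

    # Characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF).
    chinese_sample = "".join(re.findall(r'[\u4e00-\u9fff]+', sample))
    # ASCII letters, digits and underscores only. On Python 3, re.ASCII
    # stops \w from also matching Chinese characters.
    eng_sample = re.sub(r'[^\w]+', '', sample, flags=re.ASCII)

    # Call it Chinese when Chinese characters exceed half the ASCII count.
    if len(chinese_sample) > len(eng_sample) * 0.5:
        lang = "zh"

    return lang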
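
To see why the override fires on the hotel example above, here is a minimal sketch of just the character-counting part, with no langdetect call. The helper names `char_counts` and `looks_chinese` and the `ratio` parameter are my own for illustration; they are not part of langdetect. The sketch assumes Python 3.

```python
import re

# Runs of characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF).
CJK_RUN = re.compile(r'[\u4e00-\u9fff]+')
# Anything that is not an ASCII letter, digit or underscore.
NON_ASCII_WORD = re.compile(r'[^\w]+', re.ASCII)


def char_counts(sample):
    """Return (number of Chinese characters, number of ASCII word characters)."""
    chinese = len("".join(CJK_RUN.findall(sample)))
    ascii_word = len(NON_ASCII_WORD.sub("", sample))
    return chinese, ascii_word


def looks_chinese(sample, ratio=0.5):
    """True when Chinese characters outnumber `ratio` times the ASCII ones."""
    chinese, ascii_word = char_counts(sample)
    return chinese > ascii_word * ratio


mixed = '福岡住宿推薦 博多車站前竺紫口 天然溫泉 Super hotel Lohas'
print(char_counts(mixed))            # (18, 15)
print(looks_chinese(mixed))          # True, since 18 > 15 * 0.5
print(looks_chinese('Hello World'))  # False, since 0 > 10 * 0.5 does not hold
```

The 0.5 threshold means a string still counts as Chinese when it has up to twice as many English letters as Chinese characters, which suits titles that mix a Chinese description with a romanized brand name.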
This work is licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License.