Python Detect Language of String

Sep 30, 2018

python

langdetect is simple and lightweight, but in my tests (mostly on Chinese text) it was not very accurate.

NOTE: For something more accurate, you probably need something like TextBlob (which uses NLTK) or polyglot (which requires NumPy). I haven't tested these, though.

from langdetect import detect

# Correct results
detect('Hello World')  # 'en'
detect(u'大阪燒肉吃到飽')  # 'zh-tw'

# Incorrect results
detect(u'太空教育')  # 'ko', should be 'zh'
detect(u'福岡住宿推薦 博多車站前竺紫口 天然溫泉 Super hotel Lohas')  # 'et', should be 'zh'

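Since I mostly need to tell English from Chinese, I wrote the following function, which works pretty well. It takes langdetect's result but overrides it with "zh" when the string has more Chinese characters than half its count of ASCII letters, digits, and underscores.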

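import re

from langdetect import detect


def detect_lang(sample):
    # Start from langdetect's guess.
    lang = detect(sample)
    # Keep only CJK Unified Ideographs (U+4E00 to U+9FFF).
    chinese_sample = "".join(re.findall(r'[\u4e00-\u9fff]+', sample))
    # Keep only ASCII word characters; re.ASCII stops \w from matching Chinese on Python 3.
    eng_sample = re.sub(r'[^\w]+', '', sample, flags=re.ASCII)
    if len(chinese_sample) > len(eng_sample) * 0.5:
        lang = "zh"
    return lang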
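
To see why the override fires on the strings langdetect got wrong, here is a minimal sketch of just the counting step, using only the standard library. The names script_counts and is_mostly_chinese and the ratio parameter are my own for illustration; they are not part of langdetect. It assumes Python 3, where re.ASCII is needed to keep \w from matching Chinese characters.

```python
import re

# CJK Unified Ideographs block, the same range used in detect_lang above
CHINESE_RE = re.compile(r'[\u4e00-\u9fff]')
# ASCII letters, digits and underscore; re.ASCII keeps \w from matching Chinese
ASCII_WORD_RE = re.compile(r'\w', flags=re.ASCII)


def script_counts(sample):
    """Return (chinese_chars, ascii_word_chars) for a string."""
    return len(CHINESE_RE.findall(sample)), len(ASCII_WORD_RE.findall(sample))


def is_mostly_chinese(sample, ratio=0.5):
    """True when Chinese characters outnumber ratio * ASCII word characters."""
    chinese, ascii_word = script_counts(sample)
    return chinese > ascii_word * ratio


for text in ['Hello World',
             '太空教育',
             '福岡住宿推薦 博多車站前竺紫口 天然溫泉 Super hotel Lohas']:
    print(script_counts(text), is_mostly_chinese(text))
# (0, 10) False
# (4, 0) True
# (18, 15) True
```

The hotel string has 18 Chinese characters against 15 ASCII ones, so it counts as Chinese even though it contains three English words. Raising ratio makes the check stricter for mixed text.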

✨ By Desmond Lua

A dream boy who enjoys making apps, travelling and making youtube videos. Follow me on @d_luaz
