langdetect is simple and lightweight, but it is not very accurate (I tested mostly on Chinese characters).
NOTE: For something more accurate, you probably need something like TextBlob (which uses NLTK) or polyglot (which requires NumPy). I didn't test these, though.
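For reference, here is a minimal, untested sketch of what polyglot-based detection looks like, based on the `Detector` API from polyglot's documentation. It assumes polyglot and its native dependencies (pyicu, pycld2) are installed:

```python
# Untested sketch, assuming polyglot is installed along with
# its dependencies (roughly: pip install polyglot pyicu pycld2).
from polyglot.detect import Detector

detector = Detector(u"大阪燒肉吃到飽")
print(detector.language.code)        # language code, e.g. 'zh'
print(detector.language.confidence)  # confidence score from the underlying detector
```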
```python
from langdetect import detect

# Correct results
detect('Hello World')    # 'en'
detect(u'大阪燒肉吃到飽')  # 'zh-tw'

# Incorrect results
detect(u'太空教育')  # 'ko', should be 'zh'
detect(u'福岡住宿推薦 博多車站前竺紫口 天然溫泉 Super hotel Lohas')  # 'et', should be 'zh'
```
Since I mostly need to detect whether a given string is English or Chinese, I wrote the following code, which works pretty well.
```python
import re

from langdetect import detect

def detect_lang(sample):
    lang = detect(sample)
    # Keep only CJK unified ideographs (the common Chinese character range).
    chinese_sample = u"".join(re.findall(r'[\u4e00-\u9fff]+', sample))
    # Keep only the remaining word characters; the CJK range is stripped too,
    # since Python 3's \w matches Chinese characters as well.
    eng_sample = re.sub(r'[^\w]+|[\u4e00-\u9fff]+', '', sample)
    # If Chinese characters dominate, override langdetect's guess.
    if len(chinese_sample) > len(eng_sample) * 0.5:
        lang = "zh"
    return lang
```
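For example, on the strings from above, the 'zh' results follow directly from the character-count heuristic, regardless of what langdetect guessed first. Note that langdetect's guesses can vary between runs; its documented `DetectorFactory.seed` setting makes them reproducible:

```python
from langdetect import DetectorFactory

# Documented langdetect setting for reproducible results.
DetectorFactory.seed = 0

print(detect_lang('Hello World'))  # 'en' (no Chinese characters, langdetect's guess stands)
print(detect_lang(u'太空教育'))     # 'zh' (all four characters are Chinese, so the override kicks in)
print(detect_lang(u'福岡住宿推薦 博多車站前竺紫口 天然溫泉 Super hotel Lohas'))  # 'zh'
```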