Guess what language a text is in. Check if a document is in English, French, Swahili, Turkish, Japanese, Chinese (and is it simplified or traditional?), and so on.
Contains a command-line tool, library, and language data.
Use langmatch to figure out which language(s) a given text is most likely to be in, out of the ones you have statistical data for. You can also use it to generate language data based on sample texts, and langmatch can read the “language map” (or “language fingerprint”) files used by other programs such as mguesser.
The language data sets need not all represent different languages either; they could be the same language in different styles, or from different eras, and so on. Nor are you limited to a single “best guess”: langmatch will give you any number of best guesses, with a confidence level for each.
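To make the idea of statistical language data and ranked guesses concrete, here is a minimal Python 3 sketch of one common approach: rank character n-grams by frequency and compare the rankings with an "out of place" distance. This is an illustration only, not langmatch's actual code or API. The function names (`ngram_profile`, `out_of_place`, `rank_guesses`) and the parameters `max_n`, `top`, and `limit` are hypothetical, and the sketch returns raw distances (lower is closer) rather than the confidence levels langmatch reports.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the most frequent character n-grams of `text`, most common first."""
    counts = Counter()
    # Mark word boundaries with underscores so word starts and ends count too.
    padded = "_" + "_".join(text.lower().split()) + "_"
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(sample_profile, reference_profile):
    """Sum rank differences; n-grams missing from the reference get a fixed penalty."""
    ranks = {gram: rank for rank, gram in enumerate(reference_profile)}
    penalty = len(reference_profile)
    return sum(
        abs(rank - ranks[gram]) if gram in ranks else penalty
        for rank, gram in enumerate(sample_profile)
    )


def rank_guesses(text, references, limit=3):
    """Return up to `limit` (label, distance) pairs, closest match first."""
    sample = ngram_profile(text)
    scored = [
        (label, out_of_place(sample, profile))
        for label, profile in references.items()
    ]
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))[:limit]


# Hypothetical reference data built from tiny sample texts.
references = {
    "english": ngram_profile("the quick brown fox jumps over the lazy dog"),
    "french": ngram_profile("le renard brun saute par-dessus le chien paresseux"),
}
print(rank_guesses("the lazy dog and the quick fox", references, limit=2))
```

The labels are arbitrary strings, so nothing stops you from using keys such as "english-formal" and "english-chat" to tell apart styles of one language. Real profiles would be built from far larger sample texts than these one-liners.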
Call langmatch from the command line or from a script, or import it directly into your own Python program. Langmatch works in Python 2.6, 2.7, and 3 and is fully Unicode-aware. It opens files (compressed or regular), but also URLs so you can read files directly off the internet.
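As a rough illustration of what opening "compressed or regular" files and URLs can look like, the following Python 3 sketch picks an opener based on the name it is given. It is not langmatch's own code: `open_text` is a hypothetical helper, the choice of gzip and bzip2 as the supported compression formats is an assumption, and content fetched from a URL is read as-is without decompression.

```python
import bz2
import gzip
import io
import urllib.request


def open_text(source, encoding="utf-8"):
    """Open a URL, a .bz2 file, a .gz file, or a plain file as a text stream."""
    if source.startswith(("http://", "https://")):
        # Wrap the binary HTTP response so that reads return decoded text.
        return io.TextIOWrapper(urllib.request.urlopen(source), encoding=encoding)
    if source.endswith(".bz2"):
        return bz2.open(source, "rt", encoding=encoding)
    if source.endswith(".gz"):
        return gzip.open(source, "rt", encoding=encoding)
    return open(source, "r", encoding=encoding)
```

A caller can then treat every source the same way, for example `with open_text(path) as stream: text = stream.read()`, where `path` is whatever file name or URL you have at hand.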
The trunk series is the current focus of development.
Latest bugs reported:
Bug #1087991: Test failure: test_ignores_oversized_grams
Bug #1015861: A map of insufficient gram length should produce a warning
Bug #1009290: error loading language-maps/ca-gutenberg.lm.bz2
Bug #985372: Arrays may be faster
Bug #982957: Module boilerplate code is in the wrong order