langmatch in Launchpad

langmatch

Registered 2012-01-07 by Jeroen T. Vermeulen

Guess what language a text is in. Check if a document is in English, French, Swahili, Turkish, Japanese, Chinese (and is it simplified or traditional?), and so on.

Contains a command-line tool, library, and language data.

Use langmatch to figure out which language(s) a given text is most likely written in, out of the ones you have statistical data for. You can also use it to generate language data from sample texts, and langmatch can read the “language map” (or “language fingerprint”) files used by other programs such as mguesser.

The candidates need not all be different languages either; they could be the same language in different styles, from different eras, and so on. Nor are you limited to a single “best guess”: langmatch will give you any number of best guesses, with a confidence level for each.
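
To illustrate the general idea of ranking several candidates with a score for each, here is a minimal Python 3 sketch. It is not langmatch's actual code or API: the function names, the use of character trigrams, the cosine-similarity score, and the tiny sample texts are all assumptions made for this example.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Count overlapping character trigrams (hypothetical 'fingerprint')."""
    # Lower-case and collapse whitespace so spacing differences do not matter.
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_languages(text, profiles, limit=3):
    """Return up to `limit` (name, score) pairs, best match first."""
    target = trigram_profile(text)
    scored = [(cosine_similarity(target, profile), name)
              for name, profile in profiles.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:limit]]


# Toy training samples; real language data would come from far larger texts.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "french": "le renard brun saute par-dessus le chien paresseux et puis le chien dort",
    "dutch": "de snelle bruine vos springt over de luie hond en dan slaapt de hond",
}
PROFILES = {name: trigram_profile(sample) for name, sample in SAMPLES.items()}

guesses = rank_languages("the dog jumps over the fox", PROFILES)
for name, score in guesses:
    print(name, round(score, 3))
```

The scores here are plain similarity values, not calibrated confidence levels, and the profiles could just as well represent styles or eras of one language as different languages.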

Call langmatch from the command line or from a script, or import it directly into your own Python program. Langmatch works in Python 2.6, 2.7, and 3 and is fully Unicode-aware. It opens files (compressed or regular), and also URLs, so you can read files directly off the internet.
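
As a rough illustration of reading text from a regular file, a compressed file, or a URL through one entry point, a Python 3 sketch might look like the following. The helper name `open_text`, the choice of gzip and bz2, and the detection by file extension are assumptions for this example; they do not describe how langmatch itself does it.

```python
import bz2
import gzip
import io
from urllib.request import urlopen


def open_text(source, encoding="utf-8"):
    """Open a URL, a .gz/.bz2 file, or a plain file as decoded text.

    Hypothetical helper: the source type is guessed from its prefix or
    file extension only.
    """
    if source.startswith(("http://", "https://")):
        # Wrap the binary HTTP response so it yields decoded text.
        return io.TextIOWrapper(urlopen(source), encoding=encoding)
    if source.endswith(".gz"):
        return gzip.open(source, "rt", encoding=encoding)
    if source.endswith(".bz2"):
        return bz2.open(source, "rt", encoding=encoding)
    return open(source, "r", encoding=encoding)
```

A caller would then use it the same way for every source type, for example `with open_text("sample.txt.gz") as f: text = f.read()`, where `sample.txt.gz` is a placeholder file name.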