How to detect the language of a string?

What's the best way to detect the language of a string?

If your code runs somewhere with internet access, you can try the Google API for language detection: http://code.google.com/apis/ajaxlanguage/documentation/

var text = "¿Dónde está el baño?";
google.language.detect(text, function(result) {
  if (!result.error) {
    // Map the returned language code back to its name.
    var language = 'unknown';
    for (var l in google.language.Languages) {
      if (google.language.Languages[l] == result.language) {
        language = l;
        break;
      }
    }
    var container = document.getElementById("detection");
    container.innerHTML = text + " is: " + language;
  }
});

And, since you are using C#, take a look at this article on how to call the API from C#.

UPDATE: That C# link is gone, so here's a cached copy of the core of it:

string s = TextBoxTranslateEnglishToHebrew.Text;
string key = "YOUR GOOGLE AJAX API KEY";
GoogleLangaugeDetector detector =
   new GoogleLangaugeDetector(s, VERSION.ONE_POINT_ZERO, key);

GoogleTranslator gTranslator = new GoogleTranslator(s, VERSION.ONE_POINT_ZERO,
   detector.LanguageDetected.Equals("iw") ? LANGUAGE.HEBREW : LANGUAGE.ENGLISH,
   detector.LanguageDetected.Equals("iw") ? LANGUAGE.ENGLISH : LANGUAGE.HEBREW,
   key);

TextBoxTranslation.Text = gTranslator.Translation;

Basically, you need to build a URI like the following and send it to Google:

http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&q=hello%20world&langpair=en%7ciw&key=your_google_api_key_goes_here

This tells the API that you want to translate "hello world" from English to Hebrew. Google's JSON response would look like this:

{"responseData": {"translatedText":"שלום העולם"}, "responseDetails": null, "responseStatus": 200}
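As a rough illustration of the request/response shape described above, here is a minimal sketch in Python (chosen only for brevity; the answer itself targets C#). The helper names `build_translate_url` and `parse_translation` are hypothetical, and the sketch makes no network call, since the comments below suggest the service is no longer freely available.

```python
import json
from urllib.parse import quote, urlencode

# Endpoint taken from the example URI above.
BASE_URL = "http://ajax.googleapis.com/ajax/services/language/translate"


def build_translate_url(text, source, target, key, version="1.0"):
    """Build the request URI (hypothetical helper, not part of any library)."""
    params = {
        "v": version,
        "q": text,
        "langpair": source + "|" + target,
        "key": key,
    }
    # quote_via=quote encodes spaces as %20, matching the example URI.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


def parse_translation(body):
    """Extract translatedText from a JSON body shaped like the sample response."""
    data = json.loads(body)
    if data.get("responseStatus") != 200:
        raise ValueError("request failed: %s" % data.get("responseDetails"))
    return data["responseData"]["translatedText"]
```

Building the URI from a dictionary of parameters keeps the encoding of each value in one place, which is the same job the `HttpUtility` calls do in the C# class below.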

I chose to make a base class that represents a typical Google JSON response:

[Serializable]
public class JSONResponse
{
   public string responseDetails = null;
   public string responseStatus = null;
}

Then, a Translation object that inherits from this class:

[Serializable]
public class Translation: JSONResponse
{
   public TranslationResponseData responseData = 
    new TranslationResponseData();
}

This Translation class has a TranslationResponseData object that looks like this:

[Serializable]
public class TranslationResponseData
{
   public string translatedText;
}

Finally, we can make the GoogleTranslator class:

using System;
using System.Collections.Generic;
using System.Text;

using System.Web;
using System.Net;
using System.IO;
using System.Runtime.Serialization.Json;

namespace GoogleTranslationAPI
{

   public class GoogleTranslator
   {
      private string _q = "";
      private string _v = "";
      private string _key = "";
      private string _langPair = "";
      private string _requestUrl = "";
      private string _translation = "";

      public GoogleTranslator(string queryTerm, VERSION version, LANGUAGE languageFrom,
         LANGUAGE languageTo, string key)
      {
         // UrlEncode (not UrlPathEncode) so characters such as '&' in the query text are escaped.
         _q = HttpUtility.UrlEncode(queryTerm);
         _v = HttpUtility.UrlEncode(EnumStringUtil.GetStringValue(version));
         _langPair =
            HttpUtility.UrlEncode(EnumStringUtil.GetStringValue(languageFrom) +
            "|" + EnumStringUtil.GetStringValue(languageTo));
         _key = HttpUtility.UrlEncode(key);

         string encodedRequestUrlFragment =
            string.Format("?v={0}&q={1}&langpair={2}&key={3}",
            _v, _q, _langPair, _key);

         _requestUrl = EnumStringUtil.GetStringValue(BASEURL.TRANSLATE) + encodedRequestUrlFragment;

         GetTranslation();
      }

      public string Translation
      {
         get { return _translation; }
         private set { _translation = value; }
      }

      private void GetTranslation()
      {
         try
         {
            WebRequest request = WebRequest.Create(_requestUrl);

            // Dispose the response and reader even if parsing fails.
            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
               // Read the whole body; the JSON may span more than one line.
               string json = reader.ReadToEnd();
               using (MemoryStream ms = new MemoryStream(Encoding.Unicode.GetBytes(json)))
               {
                  DataContractJsonSerializer ser =
                     new DataContractJsonSerializer(typeof(Translation));
                  Translation translation = ser.ReadObject(ms) as Translation;

                  _translation = translation.responseData.translatedText;
               }
            }
         }
         // Best effort: on any failure the translation stays empty.
         catch (Exception) { }
      }
   }
}

You can use the whatlanguage gem to detect the language of a Ruby string.

Fast answer: NTextCat (NuGet, Online Demo)

Long answer:

Currently the best way seems to be to use classifiers trained to assign a piece of text to one (or more) languages from a predefined set.

There is a Perl tool called TextCat. It has language models for the 74 most popular languages. There are a huge number of ports of this tool to different programming languages.

There were no ports to .NET, so I wrote one: NTextCat on GitHub.

It is a pure .NET Framework DLL plus a command-line interface to it. By default, it uses a profile of 14 languages.

Any feedback is much appreciated! New ideas and feature requests are welcome too :)

An alternative is to use one of the numerous online services (e.g. the one from Google mentioned above, detectlanguage.com, langid.net, etc.).

A statistical approach using digraphs or trigraphs is a very good indicator. For example, here are the most common digraphs in English in order: http://www.letterfrequency.org/#digraph-frequency (one can find better or more complete lists). This method may have a better success rate than word analysis for short snippets of text because there are more digraphs in text than there are complete words.
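To make the digraph idea concrete, here is a minimal sketch in Python. The two digraph lists are short, hand-picked illustrations (an assumption, not an authoritative frequency table), and the function names are hypothetical. A real detector would use much longer lists derived from large corpora.

```python
from collections import Counter

# Illustrative, hand-picked digraph lists (assumptions, not measured data).
COMMON_DIGRAPHS = {
    "english": ["th", "he", "in", "er", "an", "re", "on", "at", "en", "nd"],
    "german": ["en", "er", "ch", "de", "ei", "nd", "te", "in", "ie", "ge"],
}


def digraphs(text):
    """Count adjacent letter pairs inside each word of the text."""
    counts = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - 1):
            counts[letters[i:i + 2]] += 1
    return counts


def guess_by_digraphs(text):
    """Return (best language, scores); score = hits on that language's list."""
    counts = digraphs(text)
    scores = {
        lang: sum(counts[d] for d in common)
        for lang, common in COMMON_DIGRAPHS.items()
    }
    return max(scores, key=scores.get), scores
```

Even a five-word snippet yields well over a dozen digraphs, which is why this kind of scoring can work on inputs too short for word lookups.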

The NSLinguisticTagger class has dedicated code to help you identify the dominant language of a text string.

Method 1: Language models. A language model gives us the probability of a sequence of words. This is important because it allows us to robustly detect the language of a text, even when the text contains words in other languages (e.g. "'Hola' means 'hello' in Spanish"). You can use N language models (one per language) to score your text.
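The scoring idea can be sketched with the simplest possible language model, a unigram model with add-one smoothing. Everything here is an assumption for illustration: the two toy corpora, the function names, and the choice of a unigram model (real systems use far more text and usually longer n-grams).

```python
import math
import re
from collections import Counter

# Tiny toy corpora (assumptions; real models need far more text).
CORPORA = {
    "english": "hello means hi in english how are you today i am fine thank you",
    "spanish": "hola como estas hoy yo estoy bien gracias que significa esta palabra en español",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def train(corpus):
    """Return (word counts, total token count) for one corpus."""
    counts = Counter(tokenize(corpus))
    return counts, sum(counts.values())


MODELS = {lang: train(corpus) for lang, corpus in CORPORA.items()}
VOCAB_SIZE = len({word for counts, _ in MODELS.values() for word in counts})


def log_prob(text, lang):
    """Log-probability of the text under one unigram model (add-one smoothing)."""
    counts, total = MODELS[lang]
    # The extra +1 in the denominator reserves mass for unseen words.
    return sum(
        math.log((counts[word] + 1) / (total + VOCAB_SIZE + 1))
        for word in tokenize(text)
    )


def detect(text):
    """Return the language whose model gives the text the highest score."""
    return max(MODELS, key=lambda lang: log_prob(text, lang))
```

Because every word contributes to the score, one foreign word ("Hola") is outweighed by the rest of the sentence, which is the robustness the paragraph above describes.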

If you mean the natural (ie human) language, this is in general a Hard Problem. What language is "server" - English or Turkish? What language is "chat" - English or French? What language is "uno" - Italian or Spanish (or Latin!) ?

Without paying attention to context, and doing some hard natural language processing (<----- this is the phrase to google for) you haven't got a chance.

You might enjoy a look at Frengly - it's a nice UI onto the Google Translate service which attempts to guess the language of the input text...

In Python, options include langdetect, textblob, and langid. Method 1: Using the langdetect library. This module is a port of Google's language-detection library that supports 55 languages.

In this trivial example the language detection works perfectly.

select id, detect_language(text) language from unknown_language order by id;

ID LANGUAGE
-- ------------------
 1 ENGLISH
 2 SPANISH
 3 SIMPLIFIED CHINESE
 4 GERMAN
 5 RUSSIAN

Do a statistical analysis of the string: split it into words, get a dictionary for every language you want to test for, and then pick the language whose dictionary matches the most words.

In C#, every string in memory is Unicode, so the string itself carries no encoding information. Text files do not store their encoding either (sometimes there is only an indication of 8-bit or 16-bit).

If you only need to distinguish between two languages, you might find some simple tricks. For example, to tell English from Dutch, a string that contains a "y" is most likely English. (Unreliable but fast.)
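A minimal sketch of the dictionary approach in Python follows. The word sets are tiny hand-picked samples (an assumption for illustration); a usable version would load full word lists for each language, and the function name is hypothetical.

```python
import re

# Tiny illustrative word lists (assumptions; use real dictionaries in practice).
WORDLISTS = {
    "english": {"the", "and", "is", "of", "to", "in", "it", "you"},
    "dutch": {"de", "het", "en", "een", "is", "van", "ik", "niet"},
}


def detect_by_dictionary(text):
    """Return (best language, hit counts) based on dictionary word matches."""
    words = re.findall(r"\w+", text.lower())
    hits = {
        lang: sum(1 for word in words if word in wordlist)
        for lang, wordlist in WORDLISTS.items()
    }
    return max(hits, key=hits.get), hits
```

Note that "is" appears in both lists, which shows why single shared words cannot decide the result on their own.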

The language can't be detected from the character type. There are other ways, but they don't guarantee complete accuracy. This Language Detection Library for Java should give more than 99% accuracy for 53 languages. Alternatively, there is Apache Tika, a library for content analysis that offers much more than just language detection.

patrickschur/language-detection is a language detection library for PHP that detects the language of a given text string. It can parse training text in many different languages into a sequence of N-grams and build a database file in JSON format to be used in the detection phase. It can then take a given text and detect its language using the database generated in the training phase.

Compact Language Detector 3 (package 'cld3'): the function detect_language() is vectorised and guesses the language of each string in text. Identifying the language used will typically be the first step in most natural language processing tasks. Among the wide variety of language identification methods discussed in the literature, the ones employing the Cavnar and Trenkle (1994) approach to text categorization based on character n-gram frequencies have been particularly successful.

You can check which characters are present in the string and try to infer the language from them, for example by testing for characters in the Unicode range 0600 to 06FF.
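A character-range check of this kind might look like the sketch below. The table of ranges is an illustrative subset (U+0600 to U+06FF is the Arabic block, U+0400 to U+04FF Cyrillic, U+0590 to U+05FF Hebrew), and the function name is hypothetical. Keep in mind that this identifies a script rather than a language: several languages share each of these scripts.

```python
from collections import Counter

# Illustrative subset of Unicode blocks (start and end code points, inclusive).
SCRIPT_RANGES = {
    "arabic": (0x0600, 0x06FF),
    "cyrillic": (0x0400, 0x04FF),
    "hebrew": (0x0590, 0x05FF),
}


def dominant_script(text):
    """Return the script with the most characters in text, or None if no match."""
    counts = Counter()
    for ch in text:
        code = ord(ch)
        for script, (start, end) in SCRIPT_RANGES.items():
            if start <= code <= end:
                counts[script] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

This works only as a first filter: Latin-script text such as English, French, or Dutch falls outside every range here and needs one of the statistical methods instead.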

Comments
  • True, and I used this too. But they are pulling support for its use.
  • It seems that this functionality is currently part of the Google Translate API and offered as a paid service. developers.google.com/translate/v2/pricing
  • Awesome work Ivan. I just browsed through your OSS code on Codeplex. I'd be willing to help you with this project if you need it.
  • What license does your library use? I don't see it specified in Github or in the README.
  • Please find the NTextcat implementation with demo application and source code here: codecanyon.net/item/language-detect/23356008?ref=intelliwins
  • Are you saying there's no "y" in Dutch? I can give you 100 Dutch words with a "y" straight away.
  • This might be suitable for a beginning programming class, but is far from a real solution to the problem.
  • But there is no 100% reliable language detection. If you want a fast distinction, unreliable between Dutch and English, counting the y's will perform very nice (that's what the "mostly" means).
  • There are also Python bindings (pip install cld3 but you might need Cython) and Ruby bindings
  • Could you explain how can I add this package to my existing C# windows form application?