
The basic PHP N-gram Functions

To implement your own language detection, download the LangDetect class together with the ready-to-use library of fingerprints.

These short functions, combined with an extensive set of finger-prints, do all the work of statistical language detection. A simple script that uses them is at the bottom of this page (see this demo).

  1. Reads the n-grams of all finger-prints into a multidimensional array. The first dimension is keyed by the language, taken from the filename of the *.lm file, which follows the naming convention <language-encoding>.lm (e.g. polish-iso8859_2.lm). The second dimension holds the n-grams of that language, 400 each (every finger-print available for download contains 400 n-grams of 1 to 4 letters).
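  2. <?php
    function getFingerprint($dir, $nb_grams = 400) {
        $lm_ng = array();
        $pattern = "*.lm";
        chdir($dir);
        $files = glob($pattern);
        foreach ($files as $readfile) {
            if (is_file($readfile)) {
                //language key = filename without the .lm extension
                $bsnm = basename($readfile, ".lm");
                $handle = fopen($readfile, 'r');
                for ($i = 0; $i < $nb_grams; $i++) {
                    $line = fgets($handle);
                    if ($line === false) {
                        break; //file has fewer than $nb_grams lines
                    }
                    //the n-gram is the first space-separated field of the line
                    $part = explode(" ", $line);
                    $lm_ng[$bsnm][] = trim($part[0]);
                }
                fclose($handle);
            }
        }
        return $lm_ng;
    }
    ?>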
  3. Creates an array of the n-grams of the string to be analyzed ($string). The array contains at most 350 of the most frequent n-grams ($ng_number), starting with the most frequent one; the key is the rank (0 for the most frequent n-gram) and the value is the n-gram. The optional third parameter ($ng_max_chars) sets the maximum number of characters per n-gram. If you use the finger-prints downloaded from this website, leave this parameter at its default of 4. This function can also serve as a basis for generating a finger-print for a new language (see also: PHP Class LangDetect).
  4. <?php
    function createNGrams($string, $ng_number = 350, $ng_max_chars = 4) {
        $array_ngram = array();
        $sub_ng = array();
        $array_words = explode(" ", $string);
        //iterate over each word, each character, all possible n-grams
        foreach ($array_words as $word) {
            $word = "_" . $word . "_";
            $length = strlen($word);
            for ($pos = 0; $pos < $length; $pos++) { //pos within word
                for ($chars = 0; $chars < $ng_max_chars; $chars++) { //ng-length
                    if (($pos + $chars) < $length) { //not beyond end of word
                        $array_ngram[] = substr($word, $pos, $chars + 1);
                    }
                }
            }
        }
        //count-> value(frequency, int)... key(ngram, string)
        $ng_frequency = array_count_values($array_ngram);
        //sort array by value(frequency) desc
        arsort($ng_frequency);
        //use only top frequent ngrams (keys preserved so numeric ngrams survive)
        $most_frequent = array_slice($ng_frequency, 0, $ng_number, true);
        foreach ($most_frequent as $ng => $number_frequency) {
            $sub_ng[] = (string) $ng;
        }
        return $sub_ng;
    }
    ?>
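  5. Compares the n-gram array of the submitted string ($sub_ng, created by createNGrams()) to the multidimensional finger-print array ($lm_ng, created by getFingerprint()) and assigns a total number of ranking-deviation points (Δ-points) to each finger-print/language: a matching n-gram adds the absolute difference between its two ranks, and an n-gram missing from the finger-print adds 400. Returns an array sorted in ascending order, with key -> basename of the .lm file (language) and value -> number of Δ-points. The optional third parameter $max_delta may reduce evaluation time when set lower than 140,000 (350*400), because a language is dropped as soon as its Δ-points exceed that value.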
  6. <?php
    function compareNGrams($sub_ng, $lm_ng, $max_delta = 140000) {
        $result = array();
        foreach ($lm_ng as $lm_basename => $language) {
            $delta = 0;
            //compare each ngram of input text to current lm-array
            foreach ($sub_ng as $key => $existing_ngram) {
                //rank of this ngram in the finger-print, or false if absent
                $pos = array_search((string) $existing_ngram, $language, true);
                //match
                if ($pos !== false) {
                    $delta += abs($key - $pos);
                //no match
                } else {
                    $delta += 400;
                }
                //abort: this language already differs too much
                if ($delta > $max_delta) {
                    break;
                }
            } //end comparison with current language
            //include only non-aborted languages in result array
            if ($delta < ($max_delta - 400)) {
                $result[$lm_basename] = $delta;
            }
        } //end comparison all languages
        //best match (fewest delta points) first; empty array if nothing matched
        asort($result);
        return $result;
    }
    ?>
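  7. Utility function for language recognition in text files: reads a text file and returns its non-empty lines joined into a single string. The optional second parameter ($limit_lines) limits how many lines are read; the default of -1 reads all lines.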
  8. <?php
    function extractText($readfile, $limit_lines = -1) {
        $string = '';
        if (is_file($readfile)) {
            $handle = fopen($readfile, 'r');
            $line_num = 0; //number of non-empty lines read so far
            while (!feof($handle)) {
                //default -1 (read all lines)
                if ($limit_lines == $line_num) {
                    break;
                }
                //line with max length of 2^19
                $line = trim(fgets($handle, 524288));
                if ($line != "") {
                    $string .= " " . $line;
                    $line_num++;
                }
            }
            fclose($handle);
        } else {
            echo "*** Text file NOT FOUND<br>";
        }
        return $string;
    }
    ?>
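
How createNGrams() cuts a word into n-grams is easiest to see on a single short word. The following is a minimal sketch, not part of the original functions: wordNGrams() is a hypothetical helper that applies the same "_" padding and the same windows of 1 to 4 characters to one word.

```php
<?php
// Hypothetical helper (an assumption for illustration only): lists all
// n-grams of one word the same way createNGrams() does.
function wordNGrams($word, $ng_max_chars = 4) {
    $word = "_" . $word . "_";              // mark start and end of the word
    $length = strlen($word);
    $ngrams = array();
    for ($pos = 0; $pos < $length; $pos++) {                   // start position
        for ($chars = 1; $chars <= $ng_max_chars; $chars++) {  // n-gram length
            if ($pos + $chars <= $length) {                    // stay inside the word
                $ngrams[] = substr($word, $pos, $chars);
            }
        }
    }
    return $ngrams;
}

// prints: _ _a _ab _ab_ a ab ab_ b b_ _
echo implode(" ", wordNGrams("ab")), "\n";
```

The word "ab" yields ten n-grams. The "_" padding lets the counts distinguish letters at the start or end of a word from the same letters in the middle.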

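The Δ-point scoring in compareNGrams() can be traced by hand on tiny arrays. The sketch below is an illustration under assumptions: rankDelta() is a hypothetical helper, the two mini finger-prints are made up, and the lookup uses array_flip() instead of in_array()/array_search() (it behaves the same as long as a finger-print lists each n-gram only once).

```php
<?php
// Hypothetical helper (an assumption for illustration only): rank-deviation
// score of one input ranking against one finger-print. Both arrays are
// ordered lists of n-grams, most frequent first.
function rankDelta(array $sub_ng, array $fingerprint, $penalty = 400) {
    $rank_of = array_flip($fingerprint);    // n-gram => rank in the finger-print
    $delta = 0;
    foreach ($sub_ng as $rank => $ngram) {
        if (isset($rank_of[$ngram])) {
            $delta += abs($rank - $rank_of[$ngram]);  // distance between the ranks
        } else {
            $delta += $penalty;                       // n-gram unknown to this language
        }
    }
    return $delta;
}

$input   = array("_", "e", "th");
$english = array("_", "e", "t", "th");   // made-up mini finger-print
$other   = array("a", "_", "o", "e");    // made-up mini finger-print

echo rankDelta($input, $english), "\n";  // 0 + 0 + 1 = 1
echo rankDelta($input, $other), "\n";    // 1 + 2 + 400 = 403
```

The finger-print with the fewest Δ-points is the best match. With 350 input n-grams and a penalty of 400 for each missing one, the worst possible score is 350*400 = 140,000, which is why that value is the default for $max_delta.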

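As noted for createNGrams(), the same counting can serve as a basis for a finger-print of a new language. The following is only a rough sketch under stated assumptions: buildFingerprintLines() is a hypothetical helper, and the line format "<n-gram> <count>" is an assumption chosen because getFingerprint() keeps just the part of each line before the first space. The format of the downloadable finger-prints may differ in detail.

```php
<?php
// Hypothetical helper (an assumption for illustration only): builds the
// lines of a finger-print from a training text, most frequent n-gram first.
// Assumed line format: "<n-gram> <count>".
function buildFingerprintLines($text, $size = 400, $ng_max_chars = 4) {
    $counts = array();
    foreach (explode(" ", $text) as $word) {
        if ($word === "") {
            continue;                       // skip gaps left by repeated spaces
        }
        $word = "_" . $word . "_";
        $length = strlen($word);
        for ($pos = 0; $pos < $length; $pos++) {
            for ($chars = 1; $chars <= $ng_max_chars && $pos + $chars <= $length; $chars++) {
                $ngram = substr($word, $pos, $chars);
                $counts[$ngram] = isset($counts[$ngram]) ? $counts[$ngram] + 1 : 1;
            }
        }
    }
    arsort($counts);                        // highest count first
    $lines = array();
    // preserve keys so that numeric n-grams such as "1" are not renumbered
    foreach (array_slice($counts, 0, $size, true) as $ngram => $count) {
        $lines[] = $ngram . " " . $count;
    }
    return $lines;
}

$lines = buildFingerprintLines("the cat and the hat", 3);
echo implode("\n", $lines), "\n";
```

A real finger-print needs a much longer training text than this one-line example. The resulting lines could then be saved with file_put_contents() under a name that follows the <language-encoding>.lm convention.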

Simple Demo Example

To test your own PHP language-detection implementation, copy and paste the code below, download the finger-prints (zip), and change the path at the beginning of the script to the directory where you saved them. Take a quick look at the Demo.

<?php

//path to your finger-print directory
define('FINGERPRINT', $_SERVER['DOCUMENT_ROOT'] . '/path/to/fingerprints/');

//set the value of $max_delta to 80000 and/or reduce the number of
//fingerprints in your directory if you want to shorten the evaluation time
//if you set the value of $max_delta too low, no language will be recognized
$max_delta = 100000;  //(best evaluation is 140000 with original definitions)

//************* The 3 basic N-Gram functions  *********************************//
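function getFingerprint($dir, $nb_grams = 400) {
    $lm_ng = array();
    $pattern = "*.lm";
    chdir($dir);
    $files = glob($pattern);
    foreach ($files as $readfile) {
        if (is_file($readfile)) {
            //language key = filename without the .lm extension
            $bsnm = basename($readfile, ".lm");
            $handle = fopen($readfile, 'r');
            for ($i = 0; $i < $nb_grams; $i++) {
                $line = fgets($handle);
                if ($line === false) {
                    break; //file has fewer than $nb_grams lines
                }
                //the n-gram is the first space-separated field of the line
                $part = explode(" ", $line);
                $lm_ng[$bsnm][] = trim($part[0]);
            }
            fclose($handle);
        }
    }
    return $lm_ng;
}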
function createNGrams($string, $ng_number = 350, $ng_max_chars = 4) {
    $array_ngram = array();
    $sub_ng = array();
    $array_words = explode(" ", $string);
    //iterate over each word, each character, all possible n-grams
    foreach ($array_words as $word) {
        $word = "_" . $word . "_";
        $length = strlen($word);
        for ($pos = 0; $pos < $length; $pos++) { //start position within word
            for ($chars = 0; $chars < $ng_max_chars; $chars++) { //length of ngram
                if (($pos + $chars) < $length) { //if not beyond end of word
                    $array_ngram[] = substr($word, $pos, $chars + 1);
                }
            }
        }
    }
    //count-> value(frequency, int)... key(ngram, string)
    $ng_frequency = array_count_values($array_ngram);
    //sort array by value(frequency) desc
    arsort($ng_frequency);
    //use only top frequent ngrams (keys preserved so numeric ngrams survive)
    $most_frequent = array_slice($ng_frequency, 0, $ng_number, true);
    foreach ($most_frequent as $ng => $number_frequency) {
        $sub_ng[] = (string) $ng;
    }
    return $sub_ng;
}
function compareNGrams($sub_ng, $lm_ng, $max_delta = 140000) {
    $result = array();
    foreach ($lm_ng as $lm_basename => $language) {
        $delta = 0;
        //compare each ngram of input text to current lm-array
        foreach ($sub_ng as $key => $existing_ngram) {
            //rank of this ngram in the finger-print, or false if absent
            $pos = array_search((string) $existing_ngram, $language, true);
            //match
            if ($pos !== false) {
                $delta += abs($key - $pos);
            //no match
            } else {
                $delta += 400;
            }
            //abort: this language already differs too much
            if ($delta > $max_delta) {
                break;
            }
        } //end comparison with current language
        //include only non-aborted languages in result array
        if ($delta < ($max_delta - 400)) {
            $result[$lm_basename] = $delta;
        }
    } //end comparison all languages
    //best match (fewest delta points) first; empty array if nothing matched
    asort($result);
    return $result;
}
//*******************************************************************//





/* When Form Submitted */
$display = '';
if (isset($_POST['dropdown']) && $_POST['dropdown'] != '') {

    $string = stripslashes($_POST['dropdown']);

    /* N-Gram Functions */
    $lm_ng = getFingerprint(FINGERPRINT);
    $sub_ng = createNGrams($string);
    $result_array = compareNGrams($sub_ng, $lm_ng, $max_delta);
    if (!is_array($result_array)) {
        $result_array = array(); //no language within $max_delta
    }

    //First item in result array is best matching language
    //(each() was removed in PHP 8, so the first key is read directly)
    reset($result_array);
    $result = (string) key($result_array);

    /* Display result, text, result_array */
    //result: language
    $display .= '<div id="result">' . ucfirst($result) . '</div>';

    //show list of best matching languages
    $display .= '<div id="info"><table><tr><th>Candidates</th><th>&Delta;-points</th></tr>';
    foreach ($result_array as $lang => $points) {
        $display .= '<tr><td>' . $lang . '</td><td>' . number_format($points, 0, '.', ',') . '</td></tr>';
    }
    $display .= '</table></div>';
} else {
    $display .= '<br>&nbsp; Path to Finger-Prints:<br>.........<b>' . FINGERPRINT . '</b>';
}

/* test STRING examples */
$eng = "This one is rather easy to recognize";
$por = "Eu não entendo como é possível uma coisa dessa. Parece grego";
$esp = "No tengo mucho tiempo para pensar lo que las palabras quieren decir";
$fra = "Je ne peut pas entendre ce qui se passe. On a besoin de réfléchir plus";
$ger = "Also diese Sprache kenne ich überhaupt nicht. Klingt spanisch";
$ita = "Io non capisco. Me lo puo' spiegare, per piacere ";
$bre = "Ar c'helenner er pezh a c'hoarveze, met ne zeuent ket a-benn";
$rsh = "I tagliaivan sü ils lefs e'l nas, digl veiver ansemmen";
$afr = "Nie wat ek van weet nie. Wat beteken dit?";
$dth = "Er beginstadium vrijwel altijd leuker en spannender zijn";
$cat = "Nits i efecte de garantir d'ara en endavant la l'estructura del Patronat";
$nor = "for de som hadde fortalt ham om melken også duften nå ";
$cro = "Nije se seksom i sentimentalne pjesme komponirane u cijelosti literaturu";
$lat = "Lorem ipsum dolor sit amet consectetur adipisicing elit";


?>
<!doctype html public "-//W3C//DTD HTML 4.01//EN">
<html>
   <head>
        <title>NG-test</title>
        <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
        <style>
        #result{
            margin:50px;
            font-size:28px;
        }
        #info{
            position:absolute;
            left:490px;
            top:25px;
            width: 200px;
            padding-left: 5px;
            font-size:10px;
            background-color: #F0F0F0;
            border: 2px #e2e2e2 solid;
        }
        body {
            font-size:10px;
            background-color:MintCream;
            font-family:verdana;
        }
        </style>
    </head>
<body>
   
<h3>Testing the N-Gram functions</h3>
<ul><li> Below is a selection of some sentences in different languages
</li><li>Click on a line to initiate language detection
</li></ul>
<br><br>

<form method="post" name="testform" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">
<select size="12" name="dropdown" onClick="document.testform.submit();">

<option value="<?php echo $eng ?>"><?php echo $eng ?></option>
<option value="<?php echo $lat ?>"><?php echo $lat ?></option>
<option value="<?php echo $esp ?>"><?php echo $esp ?></option>
<option value="<?php echo $fra ?>"><?php echo $fra ?></option>
<option value="<?php echo $nor ?>"><?php echo $nor ?></option>
<option value="<?php echo $por ?>"><?php echo $por ?></option>
<option value="<?php echo $ger ?>"><?php echo $ger ?></option>
<option value="<?php echo $ita ?>"><?php echo $ita ?></option>
<option value="<?php echo $rsh ?>"><?php echo $rsh ?></option>
<option value="<?php echo $dth ?>"><?php echo $dth ?></option>
<option value="<?php echo $afr ?>"><?php echo $afr ?></option>
<option value="<?php echo $cat ?>"><?php echo $cat ?></option>
<option value="<?php echo $cro ?>"><?php echo $cro ?></option>
<option value="<?php echo $bre ?>"><?php echo $bre ?></option>
</select>
</form>

   
   
   
    <?php echo $display ?>
   </body>
</html>
  Afrikaans
  Albanian
  Alemannic
  Amharic
  Arabic
  Armenian
  Basque
  Belarusian
  Bosnian
  Breton
  Bulgarian
  Catalan
  Chinese
  Croatian
  Czech
  Danish
  Dutch
  English
  Esperanto
  Estonian
  Finnish
  French
  Frisian
  Georgian
  German
  Greek
  Hawaian
  Hebrew
  Hindi
  Hungarian
  Icelandic
  Indonesian
  Irish_gaelic
  Italian
  Japanese
  Korean
  Latin
  Latvian
  Lithuanian
  Malay
  Manx
  Marathi
  Middle_frisian
  Mingo_iroquois
  Nepali
  Norwegian
  Persian_farsi
  Polish
  Portuguese_brazil
  Portuguese_europe
  Quechua
  Romanian
  Rumantsch
  Russian
  Sanskrit
  Scots
  Scots_gaelic
  Serbian
  Serbian_cyrillic
  Slovak
  Slovenian
  Spanish
  Swahili
  Swedish
  Tagalog
  Tamil
  Thai
  Turkish
  Ukrainian
  Vietnamese
  Welsh
  Yiddish