The Basic PHP N-gram Functions
To implement your own language detection, download the LangDetect class together with the ready-to-use library of fingerprints.
These short functions, combined with an extensive set of fingerprints, do all the work of statistical language detection. There is also a simple demo script at the bottom of this page (see this demo).
- getFingerprint(): reads the n-grams of all fingerprints into a multidimensional array. The first dimension is keyed by the language name, taken from the filename of the *.lm file, which follows the naming convention <language-encoding>.lm (e.g. polish-iso8859_2.lm). The second dimension holds the n-grams of that language, 400 by default (all fingerprints available for download contain 400 n-grams of 1 to 4 letters).
- createNGrams(): creates an array of the n-grams of the string to be analyzed ($string). The array contains at most the 350 most frequent n-grams ($ng_number), starting with the most frequent one; the index is the rank (0 = most frequent) and the value is the n-gram. The optional third parameter ($ng_max_chars) sets the maximum number of characters per n-gram. If you use the fingerprints downloaded from this website, leave this parameter unchanged. This function can also be used as a basis for generating a fingerprint for a new language (also see: PHP Class LangDetect).
- compareNGrams(): compares the array of the submitted string ($sub_ng, created by createNGrams()) to the multidimensional fingerprint array ($lm_ng, created by getFingerprint()) and assigns a total number of ranking-deviation points (Δ-points) to each fingerprint/language. Returns an array sorted by ascending Δ-points, with the basename of the .lm file (language) as key and the number of Δ-points as value. The optional third parameter $max_delta can reduce evaluation time when set below 140,000 (350 n-grams × 400 penalty points, the worst possible score); languages that exceed the threshold are dropped from the result.
- extractText(): utility function for language recognition in text files. It reads the non-empty lines of a file into a single string that can be passed to createNGrams().
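<?php
function getFingerprint($dir, $nb_grams = 400) {
    $lm_ng = array();
    //glob on the full path avoids changing the working directory with chdir()
    $files = glob(rtrim($dir, '/') . '/*.lm');
    if ($files === false) {
        return $lm_ng;
    }
    foreach ($files as $readfile) {
        if (is_file($readfile)) {
            $bsnm = basename($readfile, ".lm");
            $handle = fopen($readfile, 'r');
            if ($handle === false) {
                continue; //skip unreadable files
            }
            for ($i=0; $i < $nb_grams; $i++) {
                $line = fgets($handle);
                if ($line === false) {
                    break; //file has fewer than $nb_grams lines
                }
                //the n-gram is the first space-separated token of the line
                $part = explode(" ", $line);
                $lm_ng[$bsnm][]= trim($part[0]);
            }
            fclose($handle);
        }
    }
    return $lm_ng;
}
?>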
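The keys of the fingerprint array are file basenames such as polish-iso8859_2, so the language and the encoding are still glued together. The following is a minimal sketch of how the two parts could be separated for display. The helper name splitFingerprintName() is hypothetical and not part of the original functions, and it assumes that the language part itself contains no hyphen.

```php
<?php
// Hypothetical helper (not part of the original functions): split a
// fingerprint basename such as "polish-iso8859_2" into language and encoding.
// Assumption: the language part itself contains no hyphen.
function splitFingerprintName($basename) {
    // split at the first hyphen only; the encoding part may be absent
    $parts = explode('-', $basename, 2);
    return array(
        'language' => $parts[0],
        'encoding' => isset($parts[1]) ? $parts[1] : '',
    );
}

$info = splitFingerprintName(basename('polish-iso8859_2.lm', '.lm'));
echo ucfirst($info['language']), ' (', $info['encoding'], ")\n";
```

Knowing the encoding of a fingerprint can be useful when the submitted text has to be converted before it is compared.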
<?php
function createNGrams($string, $ng_number=350, $ng_max_chars=4) {
    $array_ngram = array();
    $array_words = explode(" ", $string);
    //iterate over each word, each character, all possible n-grams
    foreach($array_words as $word) {
        //underscores mark the beginning and the end of a word
        $word = "_". $word . "_";
        //strlen() counts bytes, which fits single-byte encodings
        $length = strlen($word);
        for ($pos=0; $pos < $length; $pos++){ //pos within word
            for ($chars=0; $chars<$ng_max_chars; $chars++) {//ng-length
                if (($pos + $chars) < $length) {//not beyond end of word
                    $array_ngram[] = substr($word, $pos, $chars+1);
                }
            }
        }
    }
    //count-> value(frequency, int)... key(ngram, string)
    $ng_frequency = array_count_values($array_ngram);
    //sort array by value(frequency) desc
    arsort($ng_frequency);
    //use only top frequent ngrams; preserve keys, otherwise
    //numeric n-grams such as "1" would be renumbered by array_slice()
    $most_frequent = array_slice($ng_frequency, 0, $ng_number, true);
    //index = rank (0 is the most frequent), value = n-gram
    $sub_ng = array();
    foreach ($most_frequent as $ng => $number_frequency){
        $sub_ng[] = (string) $ng;
    }
    return $sub_ng;
}
?>
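To see what the nested loops of createNGrams() produce, it can help to look at a single word in isolation. The sketch below repeats the same loop logic for one word; the helper name wordNGrams() is made up for this illustration and is not part of the original functions.

```php
<?php
// Hypothetical helper mirroring the inner loops of createNGrams() for one word.
function wordNGrams($word, $ng_max_chars = 4) {
    // underscores mark the beginning and the end of the word
    $word = '_' . $word . '_';
    $length = strlen($word);
    $ngrams = array();
    for ($pos = 0; $pos < $length; $pos++) {               // start position
        for ($chars = 1; $chars <= $ng_max_chars; $chars++) { // n-gram length
            if ($pos + $chars <= $length) {                 // not beyond end of word
                $ngrams[] = substr($word, $pos, $chars);
            }
        }
    }
    return $ngrams;
}

// "ab" is padded to "_ab_"; with a maximum length of 2 this yields
// _, _a, a, ab, b, b_, _
echo implode(', ', wordNGrams('ab', 2)), "\n";
```

Note that the single underscore appears twice, once for the start and once for the end of the word, which is why it usually ranks near the top of the frequency list.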
<?php
function compareNGrams($sub_ng, $lm_ng, $max_delta = 140000) {
    $result = array();
    foreach ($lm_ng as $lm_basename => $language) {
        $delta = 0;
        //compare each ngram of input text to current lm-array
        foreach ($sub_ng as $key => $existing_ngram){
            //one strict lookup replaces in_array() followed by array_search()
            $position = array_search((string) $existing_ngram, $language, true);
            if ($position !== false) {
                //match: add the distance between the two ranks
                $delta += abs($key - $position);
            } else {
                //no match: maximum penalty
                $delta += 400;
            }
            //abort: this language already differs too much
            if ($delta > $max_delta) {
                break;
            }
        } //end comparison with current language
        //include only non-aborted languages in result array
        if ($delta < ($max_delta - 400)) {
            $result[$lm_basename] = $delta;
        }
    } //end comparison all languages
    //always return an array (empty if no language qualified),
    //sorted by ascending Δ-points
    asort($result);
    return $result;
}
?>
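The Δ-point arithmetic is easiest to follow on a toy example. The sketch below scores one ranked n-gram list against one fingerprint using the same rule as compareNGrams(): the rank distance for a match, a fixed penalty of 400 otherwise. The helper name deltaPoints() and the three-element arrays are invented for illustration; real fingerprints contain 400 entries.

```php
<?php
// Hypothetical helper: Δ-points of one ranked n-gram list against one fingerprint.
function deltaPoints(array $sub_ng, array $fingerprint, $penalty = 400) {
    // build an "n-gram => rank" map once instead of searching per n-gram
    // (assumes the fingerprint contains no duplicate n-grams)
    $rank_of = array_flip($fingerprint);
    $delta = 0;
    foreach ($sub_ng as $rank => $ngram) {
        $delta += isset($rank_of[$ngram])
            ? abs($rank - $rank_of[$ngram]) // match: rank distance
            : $penalty;                     // no match: fixed penalty
    }
    return $delta;
}

$sub_ng      = array('_', 'e', 'th'); // ranks 0, 1, 2 in the submitted text
$fingerprint = array('e', '_', 'a');  // ranks 0, 1, 2 in the language profile

// '_' : |0 - 1| = 1, 'e' : |1 - 0| = 1, 'th' : not found = 400
echo deltaPoints($sub_ng, $fingerprint), "\n"; // prints 402
```

With 350 submitted n-grams and a penalty of 400 each, the worst possible total is 350 × 400 = 140,000, which is where the default value of $max_delta comes from.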
<?php
function extractText($readfile, $limit_lines = -1) {
    $string = '';
    if (is_file($readfile)) {
        $handle = fopen($readfile, 'r');
        $line_num = 1;
        while (!feof($handle)) {
            //default -1 (read all lines); otherwise reading stops as soon as
            //$line_num reaches $limit_lines, i.e. after $limit_lines - 1 non-empty lines
            if ($limit_lines == $line_num){
                break;
            }
            //read one line, up to 2^19 = 524288 bytes
            $line = trim((string) fgets($handle, 524288));
            if ($line != "") {
                $string .= " ". $line;
                $line_num++;
            }
        }
        fclose($handle);
    } else {echo "*** Text file NOT FOUND<br>";}
    return $string;
}
?>
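The description of createNGrams() mentions that it can serve as a basis for generating a fingerprint for a new language. As a rough sketch of that idea, the helper below turns raw n-gram counts into fingerprint lines, most frequent first. The name buildFingerprintLines() is hypothetical, and the line layout "n-gram, space, count" is an assumption: getFingerprint() only reads the first space-separated token of each line, so the exact layout of the downloadable .lm files may differ.

```php
<?php
// Hypothetical helper (not part of the original functions): turn n-gram
// counts into fingerprint lines, most frequent n-gram first.
function buildFingerprintLines(array $ng_frequency, $nb_grams = 400) {
    // sort by frequency, descending; the keys (n-grams) are kept
    arsort($ng_frequency);
    // keep the top entries; preserve keys so numeric n-grams are not renumbered
    $top = array_slice($ng_frequency, 0, $nb_grams, true);
    $lines = array();
    foreach ($top as $ngram => $count) {
        // assumed layout: n-gram, one space, count
        $lines[] = $ngram . ' ' . $count;
    }
    return $lines;
}

// toy counts, as array_count_values() would return them
$counts = array('th' => 4, '_' => 12, 'q' => 1, 'e' => 9);
echo implode("\n", buildFingerprintLines($counts, 3)), "\n";
```

The resulting lines could be written to a file named after the <language-encoding>.lm convention, for example with file_put_contents(). A useful fingerprint needs a much larger training text than the toy counts shown here.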
Simple Demo Example
To test your own PHP language-detection implementation, copy and paste the code below, download the fingerprints (zip), and change the path at the beginning of the script to the directory where you saved them. Take a quick look at the Demo.
<?php
//path to your finger-print directory
define('FINGERPRINT', $_SERVER['DOCUMENT_ROOT'].'/path/to/fingerprints/');
//set the value of $max_delta to 80000 and/or reduce the number of
//fingerprints in your directory if you want to speed up on the evaluation time
//if you set the value of $max_delta too low, no language will be recognized
$max_delta = 100000; //(best evaluation is 140000 with original definitions)
//************* The 3 basic N-Gram functions *********************************//
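function getFingerprint($dir, $nb_grams = 400) {
    $lm_ng = array();
    //glob on the full path avoids changing the working directory with chdir()
    $files = glob(rtrim($dir, '/') . '/*.lm');
    if ($files === false) {
        return $lm_ng;
    }
    foreach ($files as $readfile) {
        if (is_file($readfile)) {
            $bsnm = basename($readfile, ".lm");
            $handle = fopen($readfile, 'r');
            if ($handle === false) {
                continue; //skip unreadable files
            }
            for ($i=0; $i < $nb_grams; $i++) {
                $line = fgets($handle);
                if ($line === false) {
                    break; //file has fewer than $nb_grams lines
                }
                //the n-gram is the first space-separated token of the line
                $part = explode(" ", $line);
                $lm_ng[$bsnm][]= trim($part[0]);
            }
            fclose($handle);
        }
    }
    return $lm_ng;
}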
function createNGrams($string, $ng_number=350, $ng_max_chars=4) {
    $array_ngram = array();
    $array_words = explode(" ", $string);
    //iterate over each word, each character, all possible n-grams
    foreach($array_words as $word) {
        //underscores mark the beginning and the end of a word
        $word = "_". $word . "_";
        //strlen() counts bytes, which fits single-byte encodings
        $length = strlen($word);
        for ($pos=0; $pos < $length; $pos++){ //start position within word
            for ($chars=0; $chars<$ng_max_chars; $chars++) { //length of ngram
                if (($pos + $chars) < $length) { //if not beyond end of word
                    $array_ngram[] = substr($word, $pos, $chars+1);
                }
            }
        }
    }
    //count-> value(frequency, int)... key(ngram, string)
    $ng_frequency = array_count_values($array_ngram);
    //sort array by value(frequency) desc
    arsort($ng_frequency);
    //use only top frequent ngrams; preserve keys, otherwise
    //numeric n-grams such as "1" would be renumbered by array_slice()
    $most_frequent = array_slice($ng_frequency, 0, $ng_number, true);
    //index = rank (0 is the most frequent), value = n-gram
    $sub_ng = array();
    foreach ($most_frequent as $ng => $number_frequency){
        $sub_ng[] = (string) $ng;
    }
    return $sub_ng;
}
function compareNGrams($sub_ng, $lm_ng, $max_delta = 140000) {
    $result = array();
    foreach ($lm_ng as $lm_basename => $language) {
        $delta = 0;
        //compare each ngram of input text to current lm-array
        foreach ($sub_ng as $key => $existing_ngram){
            //one strict lookup replaces in_array() followed by array_search()
            $position = array_search((string) $existing_ngram, $language, true);
            if ($position !== false) {
                //match: add the distance between the two ranks
                $delta += abs($key - $position);
            } else {
                //no match: maximum penalty
                $delta += 400;
            }
            //abort: this language already differs too much
            if ($delta > $max_delta) {
                break;
            }
        } //end comparison with current language
        //include only non-aborted languages in result array
        if ($delta < ($max_delta - 400)) {
            $result[$lm_basename] = $delta;
        }
    } //end comparison all languages
    //always return an array (empty if no language qualified),
    //sorted by ascending Δ-points
    asort($result);
    return $result;
}
//*******************************************************************//
//initialize the output buffer so that .= never appends to an undefined variable
$display = '';
/* When Form Submitted */
if(isset($_POST['dropdown']) && $_POST['dropdown'] != '') {
    $string = stripslashes($_POST['dropdown']);
    /* N-Gram Functions */
    $lm_ng = getFingerprint(FINGERPRINT);
    $sub_ng = createNGrams($string);
    $result_array = compareNGrams($sub_ng, $lm_ng, $max_delta);
    if (is_array($result_array) && count($result_array) > 0) {
        //First item in result array is best matching language
        //(reset()/key() replace each(), which was removed in PHP 8)
        reset($result_array);
        $result = key($result_array);
        /* Display result, text, result_array */
        //result: language
        $display .= '<div id="result">'. ucfirst($result) .'</div>';
        //show list of best matching languages
        $display .= '<div id="info"><table><tr><th>Candidates</th><th>Δ-points</th></tr>';
        foreach ($result_array as $lang => $points ) {
            $display .= '<tr><td>'. $lang .'</td><td>'. number_format($points, 0, '.', ',') .'</td></tr>';
        }
        $display .= '</table></div>';
    } else {
        //happens when $max_delta is too low or no fingerprints were found
        $display .= '<div id="result">No language recognized</div>';
    }
} else {
    $display .= '<br> Path to Finger-Prints:<br>.........<b>'. FINGERPRINT .'</b>';
}
/*test STRING examples */
$eng = "This one is rather easy to recognize";
$por = "Eu não entendo como é possível uma coisa dessa. Parece grego";
$esp = "No tengo mucho tiempo para pensar lo que las palabras quieren decir";
$fra = "Je ne peut pas entendre ce qui se passe. On a besoin de réfléchir plus";
$ger = "Also diese Sprache kenne ich überhaupt nicht. Klingt spanisch";
$ita = "Io non capisco. Me lo puo' spiegare, per piacere ";
$bre = "Ar c'helenner er pezh a c'hoarveze, met ne zeuent ket a-benn";
$rsh = "I tagliaivan sü ils lefs e'l nas, digl veiver ansemmen";
$afr = "Nie wat ek van weet nie. Wat beteken dit?";
$dth = "Er beginstadium vrijwel altijd leuker en spannender zijn";
$cat = "Nits i efecte de garantir d'ara en endavant la l'estructura del Patronat";
$nor = "for de som hadde fortalt ham om melken også duften nå ";
$cro = "Nije se seksom i sentimentalne pjesme komponirane u cijelosti literaturu";
$lat = "Lorem ipsum dolor sit amet consectetur adipisicing elit";
?>
<!doctype html public "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<title>NG-test</title>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<style>
#result{
margin:50px;
font-size:28px;
}
#info{
position:absolute;
left:490px;
top:25px;
width: 200px;
padding-left: 5px;
font-size:10px;
background-color: #F0F0F0;
border: 2px #e2e2e2 solid;
}
body {
font-size:10px;
background-color:MintCream;
font-family:verdana;
}
</style>
</head>
<body>
<h3>Testing the N-Gram functions</h3>
<ul><li> Below is a selection of some sentences in different languages
</li><li>Click on a line to initiate language detection
</li></ul>
<br><br>
<form method="post" name="testform" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF'], ENT_QUOTES, 'ISO-8859-1') ?>">
<select size="12" name="dropdown" onClick="document.testform.submit();">
<option value="<?php echo $eng ?>"><?php echo $eng ?></option>
<option value="<?php echo $lat ?>"><?php echo $lat ?></option>
<option value="<?php echo $esp ?>"><?php echo $esp ?></option>
<option value="<?php echo $fra ?>"><?php echo $fra ?></option>
<option value="<?php echo $nor ?>"><?php echo $nor ?></option>
<option value="<?php echo $por ?>"><?php echo $por ?></option>
<option value="<?php echo $ger ?>"><?php echo $ger ?></option>
<option value="<?php echo $ita ?>"><?php echo $ita ?></option>
<option value="<?php echo $rsh ?>"><?php echo $rsh ?></option>
<option value="<?php echo $dth ?>"><?php echo $dth ?></option>
<option value="<?php echo $afr ?>"><?php echo $afr ?></option>
<option value="<?php echo $cat ?>"><?php echo $cat ?></option>
<option value="<?php echo $cro ?>"><?php echo $cro ?></option>
<option value="<?php echo $bre ?>"><?php echo $bre ?></option>
</select>
</form>
<?php echo $display ?>
</body>
</html>