The basic PHP N-gram Functions
These short functions, combined with an extensive set of finger-prints, do all the work for
statistical language detection. There is also a simple demo script at the bottom of this
page (see this demo).
- Reads the n-grams of all finger-prints into a multidimensional array.
The first dimension is keyed by the language name, taken from the filename of the *.lm file, which follows the naming convention 〈language-encoding〉.lm
(e.g. polish-iso8859_2.lm). The second dimension holds up to 400 n-grams per language in file order; the position in the file is later used as the n-gram's rank (all finger-prints available for download contain 400 n-grams of 1 to 4 characters each).
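<?php
function getFingerprint($dir, $nb_grams = 400) {
    $lm_ng = array();
    //glob with the full path, so the working directory is left unchanged
    $files = glob(rtrim($dir, '/') . '/*.lm');
    if ($files === false) {
        return $lm_ng;
    }
    foreach ($files as $readfile) {
        if (!is_file($readfile)) {
            continue;
        }
        $handle = fopen($readfile, 'r');
        if ($handle === false) {
            continue;
        }
        $bsnm = basename($readfile, ".lm");
        for ($i = 0; $i < $nb_grams; $i++) {
            $line = fgets($handle);
            if ($line === false) {
                break; //file has fewer than $nb_grams lines
            }
            //each line starts with the n-gram, followed by a space
            $part = explode(" ", $line);
            $lm_ng[$bsnm][] = trim($part[0]);
        }
        fclose($handle);
    }
    return $lm_ng;
}
?>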
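To make the shape of the returned array concrete, here is a minimal sketch that builds two tiny, made-up finger-print files in a temporary directory and loads them. The helper name readFingerprintFile, the file contents and the counts are assumptions for illustration only; real finger-prints contain 400 lines each.

```php
<?php
// Hypothetical helper: parse one finger-print file into an ordered list of n-grams.
// Assumes each line starts with the n-gram, followed by a space and a count.
function readFingerprintFile($path, $nb_grams = 400) {
    $ngrams = array();
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $ngrams;
    }
    while (count($ngrams) < $nb_grams && ($line = fgets($handle)) !== false) {
        $part = explode(" ", $line);
        $ngrams[] = trim($part[0]);
    }
    fclose($handle);
    return $ngrams;
}

// Build a throwaway finger-print directory with two tiny, made-up files.
$dir = sys_get_temp_dir() . '/lm_demo_' . uniqid();
mkdir($dir);
file_put_contents($dir . '/english.lm', "_ 120\ne 95\nth 60\n");
file_put_contents($dir . '/german-iso8859_1.lm', "_ 130\nn 90\nen 70\n");

// First dimension: basename of the file; second dimension: n-grams by rank.
$lm_ng = array();
foreach (glob($dir . '/*.lm') as $file) {
    $lm_ng[basename($file, '.lm')] = readFingerprintFile($file);
}
ksort($lm_ng);
print_r($lm_ng);

// Clean up the temporary files.
array_map('unlink', glob($dir . '/*.lm'));
rmdir($dir);
```

In this sketch, $lm_ng['english'][2] is 'th': the index is the rank of the n-gram within that language.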
- Creates an array of the n-grams of the string to be analyzed ($string).
The array contains at most 350 of the most frequent n-grams ($ng_number), ordered from most to
least frequent: the key is the rank (0 for the most frequent n-gram) and the value is the
n-gram itself.
The optional third parameter ($ng_max_chars) sets the maximum number of characters per
n-gram. If you use the finger-prints downloaded from this website, this parameter must not
be changed. This function can also serve as a basis for generating a finger-print for a new language
(see also: PHP Class LangDetect).
<?php
function createNGrams($string, $ng_number = 350, $ng_max_chars = 4) {
    $array_ngram = array();
    $array_words = explode(" ", $string);
    //iterate over each word, each character, all possible n-grams
    foreach ($array_words as $word) {
        $word = "_" . $word . "_"; //mark word boundaries
        $length = strlen($word); //strlen() counts bytes, not characters
        for ($pos = 0; $pos < $length; $pos++) { //pos within word
            for ($chars = 0; $chars < $ng_max_chars; $chars++) { //ng-length
                if (($pos + $chars) < $length) { //not beyond end of word
                    $array_ngram[] = substr($word, $pos, $chars + 1);
                }
            }
        }
    }
    //count-> value(frequency, int)... key(ngram, string)
    $ng_frequency = array_count_values($array_ngram);
    //sort array by value(frequency) desc
    arsort($ng_frequency);
    //use only top frequent ngrams; preserve keys, otherwise numeric
    //n-grams such as "1" would be renumbered by array_slice()
    $most_frequent = array_slice($ng_frequency, 0, $ng_number, true);
    $sub_ng = array();
    foreach ($most_frequent as $ng => $number_frequency) {
        $sub_ng[] = (string) $ng; //key = rank, value = n-gram
    }
    return $sub_ng;
}
?>
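As a small worked example of the n-gram extraction, the sketch below lists every n-gram of a single padded word and then ranks them by frequency. The helper name wordNGrams is hypothetical; it uses the same padding and length rules as the function above, restricted to one word.

```php
<?php
// Hypothetical helper: list every n-gram of one padded word, in scan order.
function wordNGrams($word, $ng_max_chars = 4) {
    $word = "_" . $word . "_";
    $length = strlen($word);
    $ngrams = array();
    for ($pos = 0; $pos < $length; $pos++) {
        // n-gram lengths 1..$ng_max_chars, never running past the end of the word
        for ($n = 1; $n <= $ng_max_chars && $pos + $n <= $length; $n++) {
            $ngrams[] = substr($word, $pos, $n);
        }
    }
    return $ngrams;
}

$ngrams = wordNGrams("hi");
echo implode(" ", $ngrams), "\n";
// prints: _ _h _hi _hi_ h hi hi_ i i_ _

// Count and rank: the most frequent n-gram gets rank 0.
$frequency = array_count_values($ngrams);
arsort($frequency);
$ranked = array_keys($frequency);
echo $ranked[0], "\n"; // prints: _
```

The underscore occurs twice (once at each word boundary), so it is the only n-gram with a count above 1 and ends up at rank 0.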
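- Compares the n-gram array of the submitted string ($sub_ng, created by createNGrams) with
the multidimensional finger-print array ($lm_ng, created by getFingerprint)
and assigns a total number of ranking-deviation points
(Δ-points) to each finger-print/language. For every n-gram, the deviation is the absolute
difference between its rank in the text and its rank in the finger-print; an n-gram that is
missing from the finger-print costs 400 points. Returns an array sorted ascending by Δ-points,
with key -> basename of the .lm file (language) and value -> number of Δ-points.
The optional third parameter $max_delta may reduce evaluation time when set lower
than 140,000 (350*400), because the comparison with a language is aborted as soon as its
Δ-points exceed this limit.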
<?php
function compareNGrams($sub_ng, $lm_ng, $max_delta = 140000) {
    $result = array();
    foreach ($lm_ng as $lm_basename => $language) {
        $delta = 0;
        //compare each ngram of input text to current lm-array
        foreach ($sub_ng as $key => $existing_ngram) {
            //single strict lookup instead of in_array() plus array_search()
            $pos = array_search((string) $existing_ngram, $language, true);
            if ($pos !== false) {
                //match: add the rank difference
                $delta += abs($key - $pos);
            } else {
                //no match: fixed penalty
                $delta += 400;
            }
            //abort: this language already differs too much
            if ($delta > $max_delta) {
                break;
            }
        } //end comparison with current language
        //include only non-aborted languages in result array
        if ($delta < ($max_delta - 400)) {
            $result[$lm_basename] = $delta;
        }
    } //end comparison all languages
    //best match (fewest Δ-points) first; empty array if nothing matched
    asort($result);
    return $result;
}
?>
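The Δ-point arithmetic can be checked by hand on a toy example. The sketch below is not the function above but a compact variant under stated assumptions: the helper name rankDelta and both finger-prints are made up, and array_flip is swapped in for the linear search, which assumes a finger-print contains no duplicate n-grams.

```php
<?php
// Hypothetical helper: rank-deviation distance between a ranked n-gram list
// and one language finger-print. $penalty is charged for every missing n-gram.
function rankDelta(array $text_ngrams, array $fingerprint, $penalty = 400) {
    // Map n-gram => rank for constant-time lookups (assumes no duplicates).
    $rank_of = array_flip($fingerprint);
    $delta = 0;
    foreach ($text_ngrams as $rank => $ngram) {
        $delta += isset($rank_of[$ngram]) ? abs($rank - $rank_of[$ngram]) : $penalty;
    }
    return $delta;
}

$text    = array('_', 'e', 'th'); // ranks 0, 1, 2 in the input text
$english = array('e', '_', 'th'); // made-up finger-print
$other   = array('a', '_', 'x');  // made-up finger-print

$scores = array(
    'other'   => rankDelta($text, $other),   // 1 + 400 + 400 = 801
    'english' => rankDelta($text, $english), // 1 + 1 + 0 = 2
);
asort($scores); // fewest Δ-points first
print_r($scores);
```

After sorting, 'english' comes first with 2 Δ-points, while 'other' collects 801 because two of the three n-grams are missing from its finger-print.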
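- Utility function for language recognition in text files: returns the non-empty lines of a file joined by spaces, optionally stopping early via $limit_lines.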
<?php
function extractText($readfile, $limit_lines = -1) {
    $string = '';
    if (is_file($readfile)) {
        $handle = fopen($readfile, 'r');
        $line_num = 0; //non-empty lines read so far
        while (!feof($handle)) {
            //default -1 (read all lines)
            if ($limit_lines == $line_num) {
                break;
            }
            //read buffer of 2^19 = 524288 bytes per line
            $line = trim((string) fgets($handle, 524288));
            if ($line != "") {
                $string .= " " . $line;
                $line_num++;
            }
        }
        fclose($handle);
    } else {
        echo "*** Text file NOT FOUND<br>";
    }
    return $string;
}
?>
Simple DEMO example
To test your own PHP language-detection implementation, copy and paste
the code below, download the finger-prints (zip) and
change the path
at the beginning of the script to the directory where you saved the finger-prints.
Take a quick look at the Demo.
<?php
//path to your finger-print directory
define('FINGERPRINT', $_SERVER['DOCUMENT_ROOT'].'/path/to/fingerprints/');
//set the value of $max_delta to 80000 and/or reduce the number of
//fingerprints in your directory if you want to speed up on the evaluation time
//if you set the value of $max_delta too low, no language will be recognized
$max_delta = 100000; //(best evaluation is 140000 with original definitions)
//************* The 3 basic N-Gram functions *********************************//
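function getFingerprint($dir, $nb_grams = 400) {
    $lm_ng = array();
    //glob with the full path, so the working directory is left unchanged
    $files = glob(rtrim($dir, '/') . '/*.lm');
    if ($files === false) {
        return $lm_ng;
    }
    foreach ($files as $readfile) {
        if (!is_file($readfile)) {
            continue;
        }
        $handle = fopen($readfile, 'r');
        if ($handle === false) {
            continue;
        }
        $bsnm = basename($readfile, ".lm");
        for ($i = 0; $i < $nb_grams; $i++) {
            $line = fgets($handle);
            if ($line === false) {
                break; //file has fewer than $nb_grams lines
            }
            $part = explode(" ", $line);
            $lm_ng[$bsnm][] = trim($part[0]);
        }
        fclose($handle);
    }
    return $lm_ng;
}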
function createNGrams($string, $ng_number = 350, $ng_max_chars = 4) {
    $array_ngram = array();
    $array_words = explode(" ", $string);
    //iterate over each word, each character, all possible n-grams
    foreach ($array_words as $word) {
        $word = "_" . $word . "_";
        $length = strlen($word);
        for ($pos = 0; $pos < $length; $pos++) { //start position within word
            for ($chars = 0; $chars < $ng_max_chars; $chars++) { //length of ngram
                if (($pos + $chars) < $length) { //if not beyond end of word
                    $array_ngram[] = substr($word, $pos, $chars + 1);
                }
            }
        }
    }
    //count-> value(frequency, int)... key(ngram, string)
    $ng_frequency = array_count_values($array_ngram);
    //sort array by value(frequency) desc
    arsort($ng_frequency);
    //use only top frequent ngrams; preserve keys, otherwise numeric
    //n-grams such as "1" would be renumbered by array_slice()
    $most_frequent = array_slice($ng_frequency, 0, $ng_number, true);
    $sub_ng = array();
    foreach ($most_frequent as $ng => $number_frequency) {
        $sub_ng[] = (string) $ng;
    }
    return $sub_ng;
}
function compareNGrams($sub_ng, $lm_ng, $max_delta = 140000) {
    $result = array();
    foreach ($lm_ng as $lm_basename => $language) {
        $delta = 0;
        //compare each ngram of input text to current lm-array
        foreach ($sub_ng as $key => $existing_ngram) {
            //single strict lookup instead of in_array() plus array_search()
            $pos = array_search((string) $existing_ngram, $language, true);
            if ($pos !== false) {
                //match
                $delta += abs($key - $pos);
            } else {
                //no match
                $delta += 400;
            }
            //abort: this language already differs too much
            if ($delta > $max_delta) {
                break;
            }
        } // End comparison with current language
        //include only non-aborted languages in result array
        if ($delta < ($max_delta - 400)) {
            $result[$lm_basename] = $delta;
        }
    } //End comparison all languages
    //best match first; empty array if no language was recognized
    asort($result);
    return $result;
}
//*******************************************************************//
/* When Form Submitted */
$display = ''; //initialize before appending with .=
if (isset($_POST['dropdown']) && $_POST['dropdown'] != '') {
    //stripslashes() was only needed for magic quotes, which PHP no longer has
    $string = $_POST['dropdown'];
    /* N-Gram Functions */
    $lm_ng = getFingerprint(FINGERPRINT);
    $sub_ng = createNGrams($string);
    $result_array = compareNGrams($sub_ng, $lm_ng, $max_delta);
    if (!is_array($result_array)) {
        $result_array = array(); //no language recognized
    }
    //First item in result array is best matching language
    //(each() no longer exists in PHP 8, so use reset()/key())
    reset($result_array);
    $result = count($result_array) > 0 ? (string) key($result_array) : 'unknown';
    /* Display result, text, result_array */
    //result: language
    $display .= '<div id="result">' . ucfirst($result) . '</div>';
    //show list of best matching languages
    $display .= '<div id="info"><table><tr><th>Candidates</th><th>Δ-points</th></tr>';
    foreach ($result_array as $lang => $points) {
        $display .= '<tr><td>' . $lang . '</td><td>' . number_format($points, 0, '.', ',') . '</td></tr>';
    }
    $display .= '</table></div>';
} else {
    $display .= '<br> Path to Finger-Prints:<br>.........<b>' . FINGERPRINT . '</b>';
}
/*test STRING examples */
$eng = "This one is rather easy to recognize";
$por = "Eu não entendo como é possível uma coisa dessa. Parece grego";
$esp = "No tengo mucho tiempo para pensar lo que las palabras quieren decir";
$fra = "Je ne peut pas entendre ce qui se passe. On a besoin de réfléchir plus";
$ger = "Also diese Sprache kenne ich überhaupt nicht. Klingt spanisch";
$ita = "Io non capisco. Me lo puo' spiegare, per piacere ";
$bre = "Ar c'helenner er pezh a c'hoarveze, met ne zeuent ket a-benn";
$rsh = "I tagliaivan sü ils lefs e'l nas, digl veiver ansemmen";
$afr = "Nie wat ek van weet nie. Wat beteken dit?";
$dth = "Er beginstadium vrijwel altijd leuker en spannender zijn";
$cat = "Nits i efecte de garantir d'ara en endavant la l'estructura del Patronat";
$nor = "for de som hadde fortalt ham om melken også duften nå ";
$cro = "Nije se seksom i sentimentalne pjesme komponirane u cijelosti literaturu";
$lat = "Lorem ipsum dolor sit amet consectetur adipisicing elit";
?>
<!doctype html public "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<title>NG-test</title>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<style>
#result{
margin:50px;
font-size:28px;
}
#info{
position:absolute;
left:490px;
top:25px;
width: 200px;
padding-left: 5px;
font-size:10px;
background-color: #F0F0F0;
border: 2px #e2e2e2 solid;
}
body {
font-size:10px;
background-color:MintCream;
font-family:verdana;
}
</style>
</head>
<body>
<h3>Testing the N-Gram functions</h3>
<ul><li> Below is a selection of some sentences in different languages
</li><li>Click on a line to initiate language detection
</li></ul>
<br><br>
<form method="post" name="testform" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF'], ENT_QUOTES, 'ISO-8859-1') ?>">
<select size="12" name="dropdown" onClick="document.testform.submit();">
<option value="<?php echo $eng ?>"><?php echo $eng ?></option>
<option value="<?php echo $lat ?>"><?php echo $lat ?></option>
<option value="<?php echo $esp ?>"><?php echo $esp ?></option>
<option value="<?php echo $fra ?>"><?php echo $fra ?></option>
<option value="<?php echo $nor ?>"><?php echo $nor ?></option>
<option value="<?php echo $por ?>"><?php echo $por ?></option>
<option value="<?php echo $ger ?>"><?php echo $ger ?></option>
<option value="<?php echo $ita ?>"><?php echo $ita ?></option>
<option value="<?php echo $rsh ?>"><?php echo $rsh ?></option>
<option value="<?php echo $dth ?>"><?php echo $dth ?></option>
<option value="<?php echo $afr ?>"><?php echo $afr ?></option>
<option value="<?php echo $cat ?>"><?php echo $cat ?></option>
<option value="<?php echo $cro ?>"><?php echo $cro ?></option>
<option value="<?php echo $bre ?>"><?php echo $bre ?></option>
</select>
</form>
<?php echo $display ?>
</body>
</html>