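Suppose I have an array like this:
The new array would contain:
How can I compare each string with every other string in a PHP list and remove the ones that are similar?
What I would consider similar is, for example:
Another example is: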
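Answered 2010-09-10 02:10:23
You have several options.
For each option, you should probably massage the album names before comparing them. You can do this by stripping punctuation, sorting the words in each album name alphabetically (in some cases), and so on.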
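The massaging step described above could be sketched roughly as follows. The function name `normalize_title` is hypothetical (it does not appear in the original answer), and the choice to keep only ASCII letters, digits, and spaces is an assumption that may not suit every data set.

```php
<?php
// Hypothetical helper (not from the original answer): lowercases a title,
// strips punctuation, and sorts its words alphabetically.
function normalize_title(string $title): string
{
    // Keep only ASCII letters, digits, and spaces (an assumed rule).
    $clean = preg_replace('/[^a-z0-9 ]/', '', strtolower($title));
    // Split on runs of spaces, dropping empty pieces.
    $words = preg_split('/ +/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    sort($words, SORT_STRING);
    return implode(' ', $words);
}

echo normalize_title('Band Of Horses - "Marry song"'), "\n";
```

With this kind of normalization, two titles that differ only in word order, case, or punctuation end up as the same string, which makes the later comparison less sensitive to formatting.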
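In each case, if you remove an album name from the array while comparing, the result is order-sensitive unless you set a rule for which album name to remove. So when two album names are compared and found to be "similar", it probably makes sense to always remove the longer one.
The main comparison options are:
similar_text() is more effective. You should strip the punctuation and sort the words alphabetically. Anyway... here are two solutions.
The first uses similar_text(), but it computes the similarity only after all punctuation has been removed, the words have been sorted alphabetically, and everything has been lowercased. The drawback is that you have to tune the similarity threshold. The second uses a simple case-insensitive substring test after all punctuation and whitespace have been removed.
Both snippets work by using array_walk() to run the compare() function on each album in the array. Inside compare(), I use foreach() to compare the current album with all the other albums. There is plenty of room to make this more efficient.
Note that I should be using the third argument of array_walk() as a reference; can anyone help me get that to work? The current workaround is a global variable:
Example (69% similarity threshold)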
function compare($value, $key)
{
    global $array; // Should use the 3rd argument of array_walk() instead

    // Lowercase, strip punctuation, sort the words, then collapse whitespace
    $value = strtolower(preg_replace("/[^a-zA-Z0-9 ]/", "", $value));
    $value = explode(" ", $value);
    sort($value);
    $value = implode($value);
    $value = preg_replace("/[\s]/", "", $value); // Remove any leftover \s

    foreach ($array as $key2 => $value2)
    {
        if ($key != $key2)
        {
            // Collapse and lowercase the album being compared to
            $value2 = strtolower(preg_replace("/[^a-zA-Z0-9 ]/", "", $value2));
            $value2 = explode(" ", $value2);
            sort($value2);
            $value2 = implode($value2);
            $value2 = preg_replace("/[\s]/", "", $value2);

            // Compute the similarity as a percentage
            similar_text($value, $value2, $sim);
            if ($sim > 69)
            {   // Remove the longer album name
                unset($array[ ((strlen($value) > strlen($value2)) ? $key : $key2) ]);
            }
        }
    }
}

array_walk($array, 'compare');
$array = array_values($array);
print_r($array);

The output of the above is:
Array
(
[0] => Band of Horses - Is There a Ghost
[1] => Band Of Horses - No One's Gonna Love You
[2] => Band of Horses - The Funeral
[3] => Band of Horses - Laredo
[4] => Band of Horses - "The Great Salt Lake" Sub Pop Records
[5] => Band of Horses perform Marry Song at Tromso Wedding
[6] => Band of Horses, On My Way Back Home
[7] => Band of Horses - cigarettes wedding bands
[8] => Band Of Horses - I Go To The Barn Because I Like The
[9] => Our Swords - Band of Horses
[10] => Band of Horses - Monsters
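)
Note that the short version of Marry Song is missing... so it must have been a false positive against something else, since the long version is still in the list. Otherwise, these are exactly the album names you wanted.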
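On the array_walk() question raised above: one way to sidestep the global is to skip array_walk() and use plain nested loops inside a function that returns a new array. The sketch below is only an illustration under that assumption; the name `remove_similar` and its `$threshold` parameter are hypothetical, and it compares the lengths of the original titles (rather than the normalized ones) when deciding which to drop.

```php
<?php
// Hypothetical function (not from the original answer): removes
// near-duplicate titles without a global, keeping the shorter title of
// each pair whose similar_text() percentage exceeds $threshold.
function remove_similar(array $titles, float $threshold = 69.0): array
{
    // Normalize once up front: lowercase, strip punctuation, sort the words.
    $normalized = [];
    foreach ($titles as $key => $title) {
        $clean = preg_replace('/[^a-z0-9 ]/', '', strtolower($title));
        $words = preg_split('/ +/', $clean, -1, PREG_SPLIT_NO_EMPTY);
        sort($words, SORT_STRING);
        $normalized[$key] = implode('', $words);
    }

    $keys = array_keys($titles);
    $removed = [];
    for ($i = 0; $i < count($keys); $i++) {
        for ($j = $i + 1; $j < count($keys); $j++) {
            [$a, $b] = [$keys[$i], $keys[$j]];
            // Skip pairs where either title has already been dropped.
            if (isset($removed[$a]) || isset($removed[$b])) {
                continue;
            }
            similar_text($normalized[$a], $normalized[$b], $percent);
            if ($percent > $threshold) {
                // Drop the longer original title; on a tie, drop the later one.
                $drop = strlen($titles[$a]) > strlen($titles[$b]) ? $a : $b;
                $removed[$drop] = true;
            }
        }
    }
    return array_values(array_diff_key($titles, $removed));
}
```

Because nothing is unset while the array is being traversed, the result does not depend on how array_walk() behaves when its callback modifies the array.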
Substring method:
Example
function compare($value, $key)
{
    // I should be using &$array as a 3rd variable.
    // For some reason couldn't get that to work, so I do this instead.
    global $array;

    // Take the current album name and remove all punctuation and white space
    $value = preg_replace("/[^a-zA-Z0-9]/", "", $value);

    // Compare the current album to all others
    foreach ($array as $key2 => $value2)
    {
        if ($key != $key2)
        {
            // Collapse the album being compared to
            $value2 = preg_replace("/[^a-zA-Z0-9]/", "", $value2);
            $subject = $value2;
            // $value is alphanumeric only at this point, so it is safe in a pattern
            $pattern = '/' . $value . '/i';

            // If there's a match, remove the album being compared to
            if (preg_match($pattern, $subject))
            {
                unset($array[$key2]);
            }
        }
    }
}

array_walk($array, 'compare');
$array = array_values($array);
echo "<pre>";
print_r($array);
echo "</pre>";

For the sample strings, the output of the above is (it shows 2 entries that should not be there):
Array
(
[0] => Band of Horses - Is There a Ghost
[1] => Band Of Horses - No One's Gonna Love You
[2] => Band of Horses - The Funeral
[3] => Band of Horses - Laredo
[4] => Band of Horses - "The Great Salt Lake" Sub Pop Records
[5] => Band of Horses perform Marry Song at Tromso Wedding // <== Oops
[6] => 'Laredo' by Band of Horses on Q TV // <== Oops
[7] => Band of Horses, On My Way Back Home
[8] => Band of Horses - cigarettes wedding bands
[9] => Band Of Horses - I Go To The Barn Because I Like The
[10] => Our Swords - Band of Horses
[11] => Band Of Horses - "Marry song"
[12] => Band of Horses - Monsters
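)
Answered 2010-09-10 01:49:55
You may want to try similar_text, perhaps combined with levenshtein, and experiment to find a score threshold that you consider similar enough. Also have a look at the user discussion for more hints. You can then loop over the array, compare each element with every other element, and remove the ones you consider too similar.
I hope this gets you started. The problem is fairly complex, because many things can be considered to have the same content while having a completely different syntax (for example, "Band of Horses - Our Swords"). Whether this rather simple solution is good enough depends on what you are trying to do.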
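The combination of similar_text and levenshtein suggested in this answer could look something like the sketch below. The helper name `is_similar` and both threshold values are assumptions made for illustration, not figures from the answer; they would need to be tuned by experiment as described.

```php
<?php
// Hypothetical helper (not from the original answer). Both thresholds are
// placeholder values to be tuned by experiment.
function is_similar(string $a, string $b, float $minPercent = 80.0, float $maxEditRatio = 0.2): bool
{
    $a = strtolower($a);
    $b = strtolower($b);
    similar_text($a, $b, $percent);
    // Edit distance relative to the longer string; 0.0 means identical.
    // Note: before PHP 8, levenshtein() only accepts strings up to 255 chars.
    $longest = max(strlen($a), strlen($b), 1);
    $editRatio = levenshtein($a, $b) / $longest;
    return $percent >= $minPercent || $editRatio <= $maxEditRatio;
}

// Keep a title only if it is not similar to one that was already kept.
$titles = [
    'Band of Horses - Laredo',
    'Band Of Horses - Laredo',
    'Our Swords - Band of Horses',
];
$kept = [];
foreach ($titles as $title) {
    $duplicate = false;
    foreach ($kept as $existing) {
        if (is_similar($title, $existing)) {
            $duplicate = true;
            break;
        }
    }
    if (!$duplicate) {
        $kept[] = $title;
    }
}
print_r($kept);
```

Treating either score as sufficient is one possible design; requiring both to agree would be stricter and would reduce false positives at the cost of missing some duplicates.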
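Answered 2010-09-10 02:27:29
The best implementation will depend heavily on your data. The more you know about your data, the better the results you can get with the least amount of work. Anyway, here is a sample script I put together: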
<?php
$list = array(); # source data
$groups = array();

foreach ($list as $item)
{
    // Lowercase, turn every run of non-letters into a space, split into unique words
    $words = array_unique(explode(' ', trim(preg_replace('/[^a-z]+/', ' ', strtolower($item)))));
    $matches = array();

    foreach ($groups as $i => $group)
    {
        foreach ($group as $g)
        {
            // $a is the shorter word list, $b the longer one
            if (count($words) < count($g['words']))
            {
                $a = $words;
                $b = $g['words'];
            }
            else
            {
                $a = $g['words'];
                $b = $words;
            }

            // Count the words in $a that have a close match in $b
            $c = 0;
            foreach ($a as $word1)
            {
                foreach ($b as $word2)
                {
                    if (levenshtein($word1, $word2) < 2)
                    {
                        ++$c;
                        break;
                    }
                }
            }

            if ($c / count($a) > 0.85)
            {
                $matches[] = $i;
                continue 2; // this group matches; move on to the next group
            }
        }
    }

    $me = array('item' => $item, 'words' => $words);
    if (!$matches)
    {
        $groups[] = array($me);
    }
    else
    {
        // Merge every matched group into the first one, then add the item
        for ($i = 1; $i < count($matches); ++$i)
        {
            $groups[$matches[0]] = array_merge($groups[$matches[0]], $groups[$matches[$i]]);
            unset($groups[$matches[$i]]);
        }
        $groups[$matches[0]][] = $me;
    }
}

foreach ($groups as $group)
{
    echo $group[0]['item']."\n";
    for ($i = 1; $i < count($group); ++$i)
    {
        echo "\t".$group[$i]['item']."\n";
    }
}
?>

The output with the list included:
Band of Horses - Is There a Ghost
Band Of Horses - No One's Gonna Love You
    Band Of Horses - "No One's Gonna Love You"
    Band Of Horses - No One's Gonna Love You
    Band Of Horses - No One's Gonna Love You
Band of Horses - The Funeral
    Band of Horses - The Funeral (lyrics in description)
Band of Horses - Laredo
    Band Of Horses - Laredo on Letterman 5.20.10
    'Laredo' by Band of Horses on Q TV
Band of Horses - "The Great Salt Lake" Sub Pop Records
Band of Horses perform Marry Song at Tromso Wedding
    Band Of Horses - "Marry song"
Band of Horses, On My Way Back Home
Band of Horses - cigarettes wedding bands
    Band Of Horses - "Cigarettes Wedding Bands"
Band Of Horses - I Go To The Barn Because I Like The
Our Swords - Band of Horses
Band of Horses - Monsters

The basic principle here is to group similar list items together. Any new item that comes in is compared against the existing groups. The shorter item is checked against the longer one. If enough of the words (more than 85%) are close enough (a Levenshtein distance below 2, that is, at most one character edit), it is treated as a match and added to that group.
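If you tweak the parameters, this may be good enough for you. Other things to consider: ignoring small words entirely, similar phrases, and so on.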
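As one illustration of the "ignore small words" idea, the word-splitting step of the script could be wrapped in a helper that drops short words before the comparison. The name `significant_words` and the three-character cutoff are assumptions, not part of the original script.

```php
<?php
// Hypothetical helper: same tokenization as the script above, but words
// shorter than $minLength (an assumed cutoff) are dropped.
function significant_words(string $item, int $minLength = 3): array
{
    // Lowercase and turn every run of non-letters into a single space.
    $clean = trim(preg_replace('/[^a-z]+/', ' ', strtolower($item)));
    $words = array_unique(explode(' ', $clean));
    // Keep only words that are at least $minLength characters long.
    $kept = array_filter($words, function (string $word) use ($minLength): bool {
        return strlen($word) >= $minLength;
    });
    return array_values($kept);
}

print_r(significant_words('Band of Horses, On My Way Back Home'));
```

Dropping words such as "of", "on", and "my" means the 85% ratio is computed over the more distinctive words only, which may change which items group together, so the threshold would likely need retuning.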
https://stackoverflow.com/questions/3681668