2012-07-08

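Remove words with only 3 characters or fewer using PHP

I'm using a PHP class to build a tag cloud from articles, but I want to remove words that have only 3 characters or fewer, and also remove numeric words.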

Example tags: 1111 monkey deer cat pig buffalo

The result I want is: monkey deer buffalo

PHP code from the class (the full code is here):

function keywords_extract($text) 
{ 
    $text = strtolower($text); 
    $text = strip_tags($text); 

    /* 
    * Handle common words first because they have punctuation and we need to remove them 
    * before removing punctuation. 
    */ 
    $commonWords = "'tis,'twas,a,able,about,across,after,ain't,all,almost,also,am,among,an,and,any,are,aren't," . 
     "as,at,be,because,been,but,by,can,can't,cannot,could,could've,couldn't,dear,did,didn't,do,does,doesn't," . 
     "don't,either,else,ever,every,for,from,get,got,had,has,hasn't,have,he,he'd,he'll,he's,her,hers,him,his," . 
     "how,how'd,how'll,how's,however,i,i'd,i'll,i'm,i've,if,in,into,is,isn't,it,it's,its,just,least,let,like," . 
     "likely,may,me,might,might've,mightn't,most,must,must've,mustn't,my,neither,no,nor,not,o'clock,of,off," . 
     "often,on,only,or,other,our,own,rather,said,say,says,shan't,she,she'd,she'll,she's,should,should've," . 
     "shouldn't,since,so,some,than,that,that'll,that's,the,their,them,then,there,there's,these,they,they'd," . 
     "they'll,they're,they've,this,tis,to,too,twas,us,wants,was,wasn't,we,we'd,we'll,we're,were,weren't,what," . 
     "what'd,what's,when,when,when'd,when'll,when's,where,where'd,where'll,where's,which,while,who,who'd," . 
     "who'll,who's,whom,why,why'd,why'll,why's,will,with,won't,would,would've,wouldn't,yet,you,you'd,you'll"; 

    $commonWords = strtolower($commonWords); 
    $commonWords = explode(",", $commonWords); 
    foreach($commonWords as $commonWord) 
    { 
     $text = $this->str_replace_word($commonWord, "", $text); 
    } 

    /* remove punctuation and newlines */ 
    /* 
    * Changed to handle international characters 
    */ 
    if ($this->m_bUTF8) 
     $text = preg_replace('/[^\p{L}0-9\s]|\n|\r/u',' ',$text); 
    else 
     $text = preg_replace('/[^a-zA-Z0-9\s]|\n|\r/',' ',$text); 

    /* remove extra spaces created */ 
    $text = preg_replace('/ +/',' ',$text); 
    $text = trim($text); 
    $words = explode(" ", $text); 
    $keywords = array(); 
    foreach ($words as $value) 
    { 
     $temp = trim($value); 
     if (is_numeric($temp)) 
      continue; 
     $keywords[] = trim($temp); 
    } 
    return $keywords; 
} 

I have tried various approaches, such as `if (strlen($words)<3 && is_numeric($words)==true)`, but none of them worked.

Please help.

...`is_numeric($words)==true)` is unreliable. It should be `if (strlen($words)<3 && is_numeric($words))`. More precisely, you should do the numeric check first if you want to test it this way: `if (is_numeric($words) && strlen($words)<3)`. – Lion 2012-07-08 04:21:05

@Lion: But even the former should work. [The Manual](http://php.net/manual/en/function.is-numeric.php) says it only returns true or false. – Shubham 2012-07-08 04:28:07

Answers

1

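I would modify your process slightly so that it runs faster (I believe it should).

Step 1: Instead of replacing every common word in `$text` with an empty string (replacement is expensive), I would store each common word in a hash table and use it for filtering later.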

$commonWords = explode(",", $commonWords); 
foreach($commonWords as $commonWord) 
    $hashWord[$commonWord] = $commonWord; 
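
To illustrate why the hash table helps, here is a minimal sketch of the lookup. It assumes a shortened, hypothetical stop-word list, and it builds the table with `array_flip()` rather than the `foreach` loop above; either way the words become array keys, so `isset()` can test membership without scanning the list.

```php
<?php
// Hypothetical, shortened stop-word list; the real list is much longer.
$commonWords = "a,an,and,the,of";

// array_flip() turns each word into an array key (the values become the
// old numeric indexes), so isset() can test membership directly.
$hashWord = array_flip(explode(",", $commonWords));

var_dump(isset($hashWord["the"]));    // bool(true)
var_dump(isset($hashWord["monkey"])); // bool(false)
```

Each word of the text is then checked once against the table, instead of running one replacement over the whole text per common word.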

Step 2: Filter out common words, numbers, and words with fewer than 4 characters in a single pass.

// Initialize the result so an array is returned even when nothing matches
$keywords = array();
$words = preg_split("/[\s\n\r]/", $text); 
foreach ($words as $value) 
{ 
    // Skip if it is a common word 
    if (isset($hashWord[$value])) continue; 
    // Skip if it is numeric 
    if (is_numeric($value)) continue; 
    // Skip if the word has fewer than 4 characters 
    if (strlen($value) < 4) continue; 

    $keywords[] = preg_replace('/[^a-zA-Z0-9\s].+/', '', $value); 
} 
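
As a rough, self-contained sketch of this single-pass filter, the snippet below runs the three checks on a sample sentence. The stop-word list and the sample text are hypothetical and shortened for illustration, and `PREG_SPLIT_NO_EMPTY` is an addition that is not in the code above.

```php
<?php
// Hypothetical, shortened stop-word list and sample text for illustration.
$hashWord = array_flip(array("the", "and", "with"));
$text = "the 1111 monkey and deer with cat pig buffalo";

$keywords = array();
// PREG_SPLIT_NO_EMPTY drops the empty strings that repeated whitespace
// would otherwise produce.
foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $value) {
    if (isset($hashWord[$value])) continue; // common word
    if (is_numeric($value)) continue;       // numeric word
    if (strlen($value) < 4) continue;       // 3 characters or fewer
    $keywords[] = $value;
}

echo implode(" ", $keywords), "\n"; // monkey deer buffalo
```

Note that `strlen()` counts bytes, so for multibyte (for example UTF-8) text `mb_strlen()` may be the better fit.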

Here is the complete source code of the function (in case you want to copy and paste it):

function keywords_extract($text) { 
    $text = strtolower($text); 
    $text = strip_tags($text); 

    $commonWords = "'tis,'twas,a,able,about,across,after,ain't,all,almost,also,am,among,an,and,any,are,aren't," . 
     "as,at,be,because,been,but,by,can,can't,cannot,could,could've,couldn't,dear,did,didn't,do,does,doesn't," . 
     "don't,either,else,ever,every,for,from,get,got,had,has,hasn't,have,he,he'd,he'll,he's,her,hers,him,his," . 
     "how,how'd,how'll,how's,however,i,i'd,i'll,i'm,i've,if,in,into,is,isn't,it,it's,its,just,least,let,like," . 
     "likely,may,me,might,might've,mightn't,most,must,must've,mustn't,my,neither,no,nor,not,o'clock,of,off," . 
     "often,on,only,or,other,our,own,rather,said,say,says,shan't,she,she'd,she'll,she's,should,should've," . 
     "shouldn't,since,so,some,than,that,that'll,that's,the,their,them,then,there,there's,these,they,they'd," . 
     "they'll,they're,they've,this,tis,to,too,twas,us,wants,was,wasn't,we,we'd,we'll,we're,were,weren't,what," . 
     "what'd,what's,when,when,when'd,when'll,when's,where,where'd,where'll,where's,which,while,who,who'd," . 
     "who'll,who's,whom,why,why'd,why'll,why's,will,with,won't,would,would've,wouldn't,yet,you,you'd,you'll,"; 

    $commonWords = explode(",", $commonWords); 
    $hashWord = array(); 
    foreach($commonWords as $commonWord) 
     $hashWord[$commonWord] = $commonWord; 

    // Initialize the result so an array is returned even when nothing matches 
    $keywords = array(); 
    $words = preg_split("/[\s\n\r]/", $text); 
    foreach ($words as $value) 
    { 
     // Skip if it is a common word 
     if (isset($hashWord[$value])) continue; 
     // Skip if it is numeric 
     if (is_numeric($value)) continue; 
     // Skip if the word has fewer than 4 characters 
     if (strlen($value) < 4) continue; 

     $keywords[] = preg_replace('/[^a-zA-Z0-9\s].+/', '', $value); 
    } 
    return $keywords; 
} 

Demo: ideone.com/obG6n

Thanks for your help – 2012-07-08 06:37:14

When I run my page it goes blank (an error), but when I add `{` and `}` around `$hashWord[$commonWord] = $commonWord` it shows the errors `Warning: preg_replace() ...` and `Warning: Invalid argument supplied for foreach()` – 2012-07-08 07:00:50

It works fine on my computer. Check this: http://ideone.com/obG6n – invisal 2012-07-08 07:28:44

0
if ((strlen($word) <= 3) && is_numeric($word)) { 
    //Don't add in the list 
} 
1

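You should replace `&&` with `||`.
From:
if (strlen($words)<3 && is_numeric($words)==true)
to:
if (strlen($words)<3 || is_numeric($words)==true)

And if you want to remove words that have 3 characters or fewer,
then you should use `<=` instead of `<`.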

/* remove extra spaces created */ 
$text = preg_replace('/ +/',' ',$text); 
$text = trim($text); 
$words = explode(" ", $text); 

to:
if (strlen($words) <= 3 || is_numeric($words)==true)
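
The difference between the two operators can be checked with a small sketch. It applies the test to each word separately (the variable name `$word` and the two result arrays are illustrative, not taken from the class) and uses the example tags from the question.

```php
<?php
// Example tags from the question.
$tags = array("1111", "monkey", "deer", "cat", "pig", "buffalo");

$keptWithAnd = array();
$keptWithOr  = array();

foreach ($tags as $word) {
    // && only skips words that are BOTH short and numeric.
    if (!(strlen($word) <= 3 && is_numeric($word))) {
        $keptWithAnd[] = $word;
    }
    // || skips words that are short OR numeric.
    if (!(strlen($word) <= 3 || is_numeric($word))) {
        $keptWithOr[] = $word;
    }
}

echo implode(" ", $keptWithAnd), "\n"; // 1111 monkey deer cat pig buffalo
echo implode(" ", $keptWithOr), "\n";  // monkey deer buffalo
```

Only the `||` version produces the result the question asks for, because a word such as `cat` is short but not numeric, and `1111` is numeric but not short.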

Where should it be changed? – Lion 2012-07-08 04:39:11

@Lion I have updated my answer – alfasin 2012-07-08 04:42:34

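How? It has to be `&&` rather than `||`. The variable `$words` must be numeric **as well as** have a length less than or equal to 3. (**Not** either one of them; both conditions must be satisfied at the same time.) – Lion 2012-07-08 04:50:38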

1

You can do it with a regular expression.

Change it to:

/* remove extra spaces created */ 
// PHP's PCRE has no "g" modifier; preg_replace() already replaces every match. 
$words = preg_replace('/\b\w{1,3}\s|[0-9]/i','',$text); 
return $words; 

and remove the `foreach` part below it, including the `return`.

Here is an explanation of the regular expression:

\b = Match a word boundary position (whitespace or the beginning/end of the string). 
\w = Match any word character (alphanumeric & underscore). 
{1,3} = Matches 1 to 3 of the preceding token. 
\s = Match any whitespace character (spaces, tabs, line breaks). 
| = or. 
[0-9] = Match any numeric character. 

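Here is a human-readable explanation of the pattern: "Find a word that starts at a word boundary, consists of 1 to 3 word characters, and is followed by a whitespace character, or find any numeric character, and replace it with an empty string."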
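
To see how this pattern behaves in practice, here is a small sketch that runs it on a few sample strings. The sample inputs and the variable names are illustrative, and the pattern is written without the `g` flag, which PHP does not accept.

```php
<?php
// Pattern from the answer, without the unsupported "g" flag.
$pattern = '/\b\w{1,3}\s|[0-9]/';

// Short words followed by whitespace are removed, and so is every digit.
$a = preg_replace($pattern, '', "1111 monkey deer cat pig buffalo");
echo '[', $a, "]\n"; // [ monkey deer buffalo]

// A short word at the very end has no trailing whitespace, so it survives.
$b = preg_replace($pattern, '', "cat monkey pig");
echo '[', $b, "]\n"; // [monkey pig]

// Digits inside longer words are stripped as well.
$c = preg_replace($pattern, '', "mp3player");
echo '[', $c, "]\n"; // [mpplayer]
```

The last two cases are worth keeping in mind: the trailing `\s` misses a short word at the end of the text, and the `[0-9]` branch removes digits anywhere, not only whole numeric words.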

Thanks for your help – 2012-07-08 06:37:34

Now I use `$text = preg_replace('!\\b\\w{1,3}\\b!', ' ', $text);` and it works for me :) – 2012-07-08 14:55:56

0

I have now added `$text = preg_replace('!\\b\\w{1,3}\\b!', ' ', $text);` before these lines:

$text = preg_replace('/ +/',' ',$text); 
$text = trim($text); 
$words = explode(" ", $text); 

Use this if you want the PHP class to work without errors :)
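
As a minimal, standalone sketch of this approach (outside the class, with the question's example tags as input), the steps can be put together as follows. The surrounding script is an illustration only, not the class's actual method.

```php
<?php
// Example tags from the question.
$text = "1111 monkey deer cat pig buffalo";

// In a single-quoted string "\\b" is the regex token \b, so this blanks out
// whole words of 1 to 3 word characters.
$text = preg_replace('!\\b\\w{1,3}\\b!', ' ', $text);

// Collapse the runs of spaces left behind, then trim and split.
$text = preg_replace('/ +/', ' ', $text);
$text = trim($text);
$words = explode(" ", $text);

$keywords = array();
foreach ($words as $value) {
    $temp = trim($value);
    if ($temp === '' || is_numeric($temp)) continue; // drop numeric words
    $keywords[] = $temp;
}

echo implode(" ", $keywords), "\n"; // monkey deer buffalo
```

Because `\w` also matches digits, short numbers such as `12` are removed by the regular expression, while longer ones such as `1111` are caught by the `is_numeric()` check in the loop.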

source

You can get the code here.

Thanks, everyone :)