丁宇 | DING Yu

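SPAM, Bayesian, and Chinese 4 - Integrating the Bayesian Algorithm into CakePHP

The previous post covered several open-source implementations of the Bayesian algorithm. This post explains how to integrate one of them, b8, into CakePHP.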

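Downloading and installing b8

  1. Download the latest version from the b8 site and extract it into the vendors directory, so that the main file sits at vendors/b8/b8.php;
  2. Open vendors/b8/etc/config_b8 in a text editor and change databaseType to mysql;
  3. Open vendors/b8/etc/config_storage in a text editor, set tableName to the name of the table you will use to store the keywords, and set createDB to TRUE. Note that b8 creates this table the first time it runs, after which you must set createDB back to FALSE;
  4. Open vendors/b8/lexer/shared_functions.php in a text editor and comment out the code on line 38 (inside echoError()); otherwise b8 prints its error messages straight into your Cake application. That output is still useful while debugging, of course.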

Writing a wrapper component for b8

For your Cake application to call b8, you need to write a component. Create spam_shield.php in controllers/components/ and add the following code:

class SpamShieldComponent extends Object
{
	/**
	 * b8 instance
	 */
	var $b8;
	
	/**
	 * standard rating
	 *
	 * comments with ratings which are higher than this one will be considered as SPAM
	 */
	var $standardRating = 0.7;
	
	/**
	 * text to be classified
	 */
	var $text;
	
	/**
	 * rating of the text
	 */
	var $rating;
	

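    /**
     * Component startup callback: creates the b8 instance and reads the configured rating
     * 
     * @param $controller Object the controller that loaded this component
     * @date 2009-1-20
     */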
    function startup(&$controller)
    {
    	//register a CommentModel to get the DBO resource link
    	$comment = ClassRegistry::init('Comment');

		//import b8 and create an instance
		App::import('Vendor', 'b8/b8');
		$this->b8 = new b8($comment->getDBOResourceLink());
		
		//set standard rating
		$this->standardRating = Configure::read('LT.bayesRating') ? Configure::read('LT.bayesRating') : $this->standardRating;
    }

    /**
     * Set the text to be classified
     * 
     * @param $text String the text to be classified
     * @date 2009-1-20
     */
    function set($text)
    {	
		$this->text = $text;
    }
    
    /**
     * Get Bayesian rating
     * 
     * @date 2009-1-20
     */
    function rate()
    {	
		//get Bayes rating and return
		return $this->rating = $this->b8->classify($this->text);
    }
    
    /**
     * Validate a message based on the rating, return true if it's NOT a SPAM
     * 
     * @date 2009-1-20
     */
    function validate()
    {
		return $this->rate() <= $this->standardRating;
    }
    
    /**
     * Learn a SPAM or a HAM
     * 
     * @date 2009-1-20
     */
    function learn($mode)
    {
		$this->b8->learn($this->text, $mode);
    }
    
    /**
     * Unlearn a SPAM or a HAM
     * 
     * @date 2009-1-20
     */
    function unlearn($mode)
    {
		$this->b8->unlearn($this->text, $mode);
    }
}

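A few notes:

  1. $standardRating is the threshold. If the Bayesian probability is higher than this value, the comment is considered spam; otherwise it is ham. I set it to 0.7, and you can adjust it to suit your own situation;
  2. Configure::read('LT.bayesRating') fetches the threshold dynamically from the application's runtime configuration. That is how I do it; you may not need it, so adapt it slightly or leave it untouched as you see fit;
  3. Comment refers to the model for comments;
  4. Because b8 needs a database handle to work with its table, startup() contains the line $this->b8 = new b8($comment->getDBOResourceLink()). The getDBOResourceLink() method it calls is covered in the next section.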
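The threshold logic from notes 1 and 2 can be sketched on its own, without CakePHP or b8. This is only an illustration: resolveThreshold() and isHam() are hypothetical names that do not exist in the component, and the ratings passed in are made-up values rather than real b8 output.

```php
<?php
// Hypothetical helpers for illustration; only the 0.7 default and the
// "higher than the threshold means spam" rule come from the post.

/**
 * Use the configured threshold when one is set, otherwise the default.
 * Like the ternary in startup(), a missing (null) or zero value falls back.
 */
function resolveThreshold($configured, $default = 0.7)
{
    return $configured ? $configured : $default;
}

/**
 * A comment counts as ham when its rating is not above the threshold.
 */
function isHam($rating, $threshold)
{
    return $rating <= $threshold;
}

$threshold = resolveThreshold(null);             // nothing configured, so 0.7
var_dump(isHam(0.35, $threshold));               // bool(true)
var_dump(isHam(0.92, $threshold));               // bool(false)
var_dump(isHam(0.92, resolveThreshold(0.95)));   // bool(true)
```

Raising the threshold lets more borderline comments through as ham, while lowering it sends more of them to moderation.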

Passing the database handle to b8

Add the following code to models/comment.php:

	/**
	 * get the resource link of MySQL connection
	*/
	public function getDBOResourceLink()
	{
		return $this->getDataSource()->connection;	
	}

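With that, all the preparation is done, and we can finally use the Bayesian algorithm to classify comments.

Classifying comments with b8

In controllers/comments_controller.php, first load SpamShieldComponent: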

    var $components = array('SpamShield');

Then, in the add() method, do the following:

		//set data for Bayesian validation
		$this->SpamShield->set($this->data['Comment']['body']);

		//validate the comment with Bayesian
		if(!$this->SpamShield->validate())
		{
			//set the status
			$this->data['Comment']['status'] = 'spam';
			
			//save
			$this->Comment->save($this->data);

			//learn it
			$this->SpamShield->learn("spam");

			//render
			$this->renderView('unmoderated');
			return;
		}

		//it's a normal post
		$this->data['Comment']['status'] = 'published';
		
		//save for publish
		$this->Comment->save($this->data);

		//learn it
		$this->SpamShield->learn("ham");

With this in place, b8 automatically classifies each incoming comment and learns from it, and you are basically insulated from spam!
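The component also defines unlearn(), which the add() code above never calls. One possible use, sketched here as an assumption rather than something the post describes, is correcting a comment that was filed under the wrong category. FakeSpamShield and reclassify() are hypothetical: the fake class stands in for SpamShieldComponent so the sketch runs without CakePHP or b8, and it only records the calls the real component would forward to b8.

```php
<?php
// Stand-in for SpamShieldComponent (hypothetical, for illustration only).
// It records calls instead of forwarding them to b8.
class FakeSpamShield
{
    public $text;
    public $calls = array();

    public function set($text)
    {
        $this->text = $text;
    }

    public function learn($mode)
    {
        $this->calls[] = "learn:$mode";
    }

    public function unlearn($mode)
    {
        $this->calls[] = "unlearn:$mode";
    }
}

/**
 * Hypothetical moderation helper: move a comment's text from one
 * category to the other after a human has reviewed it.
 */
function reclassify($shield, $text, $from, $to)
{
    $shield->set($text);
    $shield->unlearn($from); // undo the earlier automatic learning
    $shield->learn($to);     // teach the corrected category
}

$shield = new FakeSpamShield();
reclassify($shield, 'Thanks, this post helped a lot.', 'spam', 'ham');
print_r($shield->calls); // records unlearn:spam, then learn:ham
```

In a real controller the same two calls would go through $this->SpamShield, together with updating the comment's status field.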

A reminder: after the first run, don't forget to set createDB, mentioned earlier, back to FALSE.


  1. ehaagwlke @ 2009-06-25 20:15:44 +0800:

    Firefox 3.0.11 reports "Reported Attack Site!" when I visit your homepage.

  2. Felix @ 2009-07-01 06:26:08 +0800:

    @ehaagwlke Yes, see http://dingyu.me/blog/posts/view/notes-in-20090630