Speeding up jieba module loading when deploying on SAE

jieba (GitHub project: https://github.com/fxsjy/jieba) is a Chinese word segmentation component implemented in Python.

The first time jieba is loaded it has to build a dictionary tree (a trie) and write it to a cache file (jieba.cache), which slows down loading.

To work around this, generate the jieba.cache file ahead of time and upload it to the SAE code space.

When jieba runs, it then reads the cache file directly instead of rebuilding it every time, so the module loads faster.
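The idea can be sketched independently of jieba: look for a serialized copy of the expensive data structure first, and rebuild it only when that copy is missing or unreadable. The names `load_or_build` and `build` and the tiny demo dictionary are illustrative assumptions, not part of jieba; `marshal` is used here only as a simple standard-library serializer.

```python
import marshal
import os
import tempfile


def load_or_build(cache_file, build):
    """Return (data, from_cache): read cache_file if possible, else call build()."""
    if os.path.exists(cache_file):
        try:
            with open(cache_file, "rb") as f:
                return marshal.load(f), True
        except (EOFError, ValueError, TypeError):
            pass  # unreadable cache: fall through and rebuild
    data = build()
    try:
        with open(cache_file, "wb") as f:
            marshal.dump(data, f)
    except (IOError, OSError):
        pass  # location not writable: skip caching, still return the data
    return data, False


# Demo: the first call builds and writes the cache, the second call reads it back.
demo_cache = os.path.join(tempfile.mkdtemp(), "demo.cache")
first = load_or_build(demo_cache, lambda: {"word": 3})
second = load_or_build(demo_cache, lambda: {"word": 3})
```

In the demo, `first[1]` is `False` (the data was built) and `second[1]` is `True` (the data came from the cache file). Shipping the cache file with the code means every start can take the fast path, even where the cache location cannot be written.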

The steps below use jieba (v3.1) as the example.

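1. Run jieba in your local environment so that it generates jieba.cache (by default in the local temporary directory), then copy that file into the jieba/ directory.
2. Edit jieba/__init__.py and add an import near the top of the file: import sae.core
3. In jieba/__init__.py, find this code snippet: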

if abs_path == os.path.join(_curpath,"dict.txt"): #defautl dictionary
  cache_file = os.path.join(tempfile.gettempdir(),"jieba.cache")
else: #customer dictionary
  cache_file = os.path.join(tempfile.gettempdir(),"jieba.user."+str(hash(abs_path))+".cache")

load_from_cache_fail = True
if os.path.exists(cache_file) and os.path.getmtime(cache_file)>os.path.getmtime(abs_path):

4. Change the code above to:

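# Resolve the temporary directory: use the SAE runtime's temp dir when
# SERVER_SOFTWARE is set, otherwise fall back to the system default.
temp_dir = None
if 'SERVER_SOFTWARE' in os.environ:
  temp_dir = sae.core.get_tmp_dir()
else:
  temp_dir = tempfile.gettempdir()
# Look for the cache next to the jieba package (_curpath), where the
# pre-generated file was uploaded, instead of in the temp directory.
if abs_path == os.path.join(_curpath,"dict.txt"): # default dictionary
  cache_file = os.path.join(_curpath,"jieba.cache")
else: # custom dictionary
  cache_file = os.path.join(_curpath,"jieba.user."+str(hash(abs_path))+".cache")
print temp_dir # debug output (Python 2 print statement); can be removed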

load_from_cache_fail = True
if os.path.exists(cache_file):
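Step 1 can also be scripted. The sketch below assumes an unmodified jieba is installed locally and that, as in the snippet from step 3, the default cache is written to `tempfile.gettempdir()` under the name jieba.cache. The helper names `copy_cache` and `generate_and_copy` are hypothetical.

```python
import os
import shutil
import tempfile


def copy_cache(src_dir, dest_dir, name="jieba.cache"):
    """Copy the cache file `name` from src_dir into dest_dir; return the new path."""
    src = os.path.join(src_dir, name)
    if not os.path.isfile(src):
        raise IOError("cache file not found: " + src)
    dest = os.path.join(dest_dir, name)
    shutil.copy(src, dest)
    return dest


def generate_and_copy():
    """Build the cache with a local, unmodified jieba and copy it into the package."""
    import jieba  # imported here so copy_cache stays usable without jieba

    jieba.initialize()  # loads the dictionary, which writes the cache file
    package_dir = os.path.dirname(os.path.abspath(jieba.__file__))
    return copy_cache(tempfile.gettempdir(), package_dir)
```

Calling `generate_and_copy()` once on the local machine should leave jieba.cache inside the jieba/ directory, ready to upload with the rest of the code. Since Python's `marshal` format can change between Python versions, it is probably safest to generate the cache with the same Python version that SAE runs.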

Permalink: http://bookshadow.com/weblog/2014/09/18/sae-jieba-speed-up/
Please respect the author's work and credit the source when reposting. The 书影 blog reserves all rights to this article.
