Distributed NLTK with execnet | StreamHacker
Published: 2019-06-28



(A Belorussian translation of this article is also available.)

Want to speed up your natural language processing with NLTK? Have a lot of files to process, but don't know how to distribute NLTK across many cores?

Well, here's how you can use execnet to do distributed part-of-speech tagging with NLTK.

execnet

execnet is a simple library for creating a network of gateways and channels that you can use for distributed computation in Python. With it, you can start Python shells over ssh, send code and/or data, then receive results. Below are 2 scripts that will test the accuracy of NLTK's default part-of-speech tagger against every file in the brown corpus. The first script (the runner) does all the setup and receives the results, while the second script (the remote module) runs on every gateway, calculating and sending the accuracy of each file it receives for processing.

Runner

The runner does the following:

  1. Defines the hosts and number of gateways. I recommend 1 gateway per core per host.
  2. Loads and pickles the default NLTK part of speech tagger.
  3. Opens each gateway and creates a remote execution channel with the tag_files module (the remote module covered below).
  4. Sends the pickled tagger and the name of a corpus (brown) through the channel.
  5. Once all the channels have been created and initialized, it then sends all of the fileids in the corpus to alternating channels to distribute the work.
  6. Finally, it creates a receive queue and prints the accuracy response from each channel.
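The round-robin distribution in steps 5 and 6 can be sketched in isolation with plain Python (no execnet required; the fileids and channel names here are made up for illustration):

```python
from itertools import cycle

def distribute(fileids, channels):
    # Pair each fileid with the next channel in rotation, like step 5.
    assignments = {chan: [] for chan in channels}
    for fileid, chan in zip(fileids, cycle(channels)):
        assignments[chan].append(fileid)
    return assignments

# Hypothetical fileids spread across two channels:
work = distribute(['ca01', 'ca02', 'ca03', 'ca04', 'ca05'], ['chan0', 'chan1'])
print(work)
```

The runner below does the same thing with a manually incremented `chan` index instead, which keeps it to a single pass while sending.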

run_tag_files.py

```python
import execnet
import nltk.tag, nltk.data
import cPickle as pickle
import tag_files

HOSTS = {
    'localhost': 2
}

NICE = 20
channels = []

tagger = pickle.dumps(nltk.data.load(nltk.tag._POS_TAGGER))

for host, count in HOSTS.items():
    print 'opening %d gateways at %s' % (count, host)

    for i in range(count):
        gw = execnet.makegateway('ssh=%s//nice=%d' % (host, NICE))
        channel = gw.remote_exec(tag_files)
        channels.append(channel)
        channel.send(tagger)
        channel.send('brown')

count = 0
chan = 0

for fileid in nltk.corpus.brown.fileids():
    print 'sending %s to channel %d' % (fileid, chan)
    channels[chan].send(fileid)
    count += 1
    # alternate channels
    chan += 1
    if chan >= len(channels): chan = 0

multi = execnet.MultiChannel(channels)
queue = multi.make_receive_queue()

for i in range(count):
    channel, response = queue.get()
    print response
```

Remote Module

The remote module is much simpler.

  1. Receives and unpickles the tagger.
  2. Receives the corpus name and loads it.
  3. For each fileid received, evaluates the accuracy of the tagger on the tagged sentences and sends an accuracy response.
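The receive loop in step 3 can be exercised locally with a stand-in channel object (FakeChannel is a made-up test double, not part of execnet; the accuracy value is a placeholder for `tagger.evaluate(...)`):

```python
class FakeChannel:
    """Minimal stand-in for an execnet channel: iterating over it yields
    received items, and send() records outgoing messages."""
    def __init__(self, incoming):
        self._incoming = list(incoming)
        self.sent = []
    def __iter__(self):
        return iter(self._incoming)
    def send(self, item):
        self.sent.append(item)

channel = FakeChannel(['ca01', 'ca02'])
for fileid in channel:
    accuracy = 0.9  # placeholder for tagger.evaluate(corpus.tagged_sents(...))
    channel.send('%s: %f' % (fileid, accuracy))

print(channel.sent)  # → ['ca01: 0.900000', 'ca02: 0.900000']
```

In the real module, iterating over the channel blocks until the runner sends the next fileid or closes the channel, which is what lets each gateway sit in this loop for the life of the run.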

tag_files.py

```python
import nltk.corpus
import cPickle as pickle

if __name__ == '__channelexec__':
    tagger = pickle.loads(channel.receive())
    corpus_name = channel.receive()
    corpus = getattr(nltk.corpus, corpus_name)

    for fileid in channel:
        accuracy = tagger.evaluate(corpus.tagged_sents(fileids=[fileid]))
        channel.send('%s: %f' % (fileid, accuracy))
```

Putting it all together

Make sure you have execnet, NLTK, and the NLTK corpus data installed on every host. You must also have ssh access to each host from the master host (the machine you run run_tag_files.py on).

run_tag_files.py and tag_files.py only need to be on the master host; execnet will take care of distributing the code. Assuming run_tag_files.py and tag_files.py are in the same directory, all you need to do is run python run_tag_files.py. You should get a message about opening gateways followed by a bunch of send messages. Then, just wait and watch the accuracy responses to see how accurate the built-in part of speech tagger is on the brown corpus.

If you'd like to test the accuracy on a different corpus, make sure every host has the corpus data, then send that corpus name instead of brown, and send the fileids from the new corpus.

If you want to test your own part of speech tagger, pickle it to a file, then load and send it instead of NLTK's tagger. Or you can train the tagger on the master first, then send it once training is complete.
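The pickle round-trip is the same in both directions. Here is a minimal sketch with a stand-in object (a real tagger pickles the same way, provided its class is importable on every host; the article's scripts use cPickle, which in Python 3 is simply pickle):

```python
import pickle

# Stand-in for a trained tagger; any picklable object works the same way.
tagger = {'the': 'DT', 'dog': 'NN'}

payload = pickle.dumps(tagger)    # master side: serialize before channel.send(payload)
restored = pickle.loads(payload)  # remote side: deserialize after channel.receive()

assert restored == tagger
print(restored['dog'])  # → NN
```

Serializing on the master means the workers never need the training data, only the libraries required to unpickle and run the tagger.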

Distributed File Processing

In practice, it's often a pain to make sure every host has every file you want to process, and you'll want to process files outside of NLTK's builtin corpora. My recommendation is to set up a network file system (such as NFS) so that every host has a common mount point with access to every file that you want to process. If every host has the same mount point, you can send any file path to any channel for processing.
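With a shared mount, the master only ever sends relative names, and every worker resolves them against the same mount point. A tiny sketch (the mount point and filename here are hypothetical):

```python
import os.path

SHARED_MOUNT = '/mnt/corpora'  # hypothetical mount point, identical on every host

def shared_path(relative_name):
    # The master sends relative names over the channel; each worker
    # resolves them against the common mount before opening the file.
    return os.path.join(SHARED_MOUNT, relative_name)

print(shared_path('reviews/file001.txt'))  # → /mnt/corpora/reviews/file001.txt
```

Keeping the mount point in one place also makes it easy to relocate the storage later without touching the runner.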

