基于Python语言的大数据搜索引擎(3)

发布时间：2019-07-10 17:56 所属栏目：21 来源：简单艾

导读：上代码： classSplunk(object): def__init__(self): self.bf=Bloomfilter(64) self.terms={}#Dictionaryoftermtosetofevents self.events=[] defadd_event(self,event): Addsaneventtothisobject #GenerateauniqueI

上代码：

class Splunk(object): 
 def __init__(self): 
 self.bf = Bloomfilter(64) 
 self.terms = {} # Dictionary of term to set of events 
 self.events = [] 
 def add_event(self, event): 
 """Adds an event to this object""" 
 # Generate a unique ID for the event, and save it 
 event_id = len(self.events) 
 self.events.append(event) 
 # Add each term to the bloomfilter, and track the event by each term 
 for term in segments(event): 
 self.bf.add_value(term) 
 if term not in self.terms: 
 self.terms[term] = set() 
 self.terms[term].add(event_id) 
 def search(self, term): 
 """Search for a single term, and yield all the events that contain it""" 
 # In Splunk this runs in O(1), and is likely to be in filesystem cache (memory) 
 if not self.bf.might_contain(term): 
 return 
 # In Splunk this probably runs in O(log N) where N is the number of terms in the tsidx 
 if term not in self.terms: 
 return 
 for event_id in sorted(self.terms[term]): 
 yield self.events[event_id]

Splunk代表一个拥有搜索功能的索引集合
每一个集合中包含一个布隆过滤器，一个倒排词表(字典)，和一个存储所有事件的数组
当一个事件被加入到索引的时候，会做以下的逻辑
为每一个事件生成一个unqie id，这里就是序号
对事件进行分词，把每一个词加入到倒排词表，也就是每一个词对应的事件的id的映射结构，注意，一个词可能对应多个事件，所以倒排表的的值是一个Set。倒排表是绝大部分搜索引擎的核心功能。
当一个词被搜索的时候，会做以下的逻辑
检查布隆过滤器，如果为假，直接返回
检查词表，如果被搜索单词不在词表中，直接返回
在倒排表中找到所有对应的事件id，然后返回事件的内容

我们运行下看看把：

s = Splunk() 
s.add_event('src_ip = 1.2.3.4') 
s.add_event('src_ip = 5.6.7.8') 
s.add_event('dst_ip = 1.2.3.4') 
for event in s.search('1.2.3.4'): 
 print event 
print '-' 
for event in s.search('src_ip'): 
 print event 
print '-' 
for event in s.search('ip'): 
 print event 
src_ip = 1.2.3.4 
dst_ip = 1.2.3.4 
- 
src_ip = 1.2.3.4 
src_ip = 5.6.7.8 
- 
src_ip = 1.2.3.4 
src_ip = 5.6.7.8 
dst_ip = 1.2.3.4

是不是很赞!

更复杂的搜索

更进一步，在搜索过程中，我们想用And和Or来实现更复杂的搜索逻辑。

上代码：

class SplunkM(object): 
 def __init__(self): 
 self.bf = Bloomfilter(64) 
 self.terms = {} # Dictionary of term to set of events 
 self.events = [] 
 def add_event(self, event): 
 """Adds an event to this object""" 
 # Generate a unique ID for the event, and save it 
 event_id = len(self.events) 
 self.events.append(event) 
 # Add each term to the bloomfilter, and track the event by each term 
 for term in segments(event): 
 self.bf.add_value(term) 
 if term not in self.terms: 
 self.terms[term] = set() 
 self.terms[term].add(event_id) 
 def search_all(self, terms): 
 """Search for an AND of all terms""" 
 # Start with the universe of all events... 
 results = set(range(len(self.events))) 
 for term in terms: 
 # If a term isn't present at all then we can stop looking 
 if not self.bf.might_contain(term): 
 return 
 if term not in self.terms: 
 return 
 # Drop events that don't match from our results 
 results = results.intersection(self.terms[term]) 
 for event_id in sorted(results): 
 yield self.events[event_id] 
 def search_any(self, terms): 
 """Search for an OR of all terms""" 
 results = set() 
 for term in terms: 
 # If a term isn't present, we skip it, but don't stop 
 if not self.bf.might_contain(term): 
 continue 
 if term not in self.terms: 
 continue 
 # Add these events to our results 
 results = results.union(self.terms[term]) 
 for event_id in sorted(results): 
 yield self.events[event_id]

利用Python集合的intersection和union操作，可以很方便的支持And(求交集)和Or(求合集)的操作。

（编辑：ASP站长网）