上代码:
- class Splunk(object):
- def __init__(self):
- self.bf = Bloomfilter(64)
- self.terms = {} # Dictionary of term to set of events
- self.events = []
- def add_event(self, event):
- """Adds an event to this object"""
- # Generate a unique ID for the event, and save it
- event_id = len(self.events)
- self.events.append(event)
- # Add each term to the bloomfilter, and track the event by each term
- for term in segments(event):
- self.bf.add_value(term)
- if term not in self.terms:
- self.terms[term] = set()
- self.terms[term].add(event_id)
- def search(self, term):
- """Search for a single term, and yield all the events that contain it"""
- # In Splunk this runs in O(1), and is likely to be in filesystem cache (memory)
- if not self.bf.might_contain(term):
- return
- # In Splunk this probably runs in O(log N) where N is the number of terms in the tsidx
- if term not in self.terms:
- return
- for event_id in sorted(self.terms[term]):
- yield self.events[event_id]
- Splunk代表一个拥有搜索功能的索引集合
- 每一个集合中包含一个布隆过滤器,一个倒排词表(字典),和一个存储所有事件的数组
- 当一个事件被加入到索引的时候,会做以下的逻辑
- 为每一个事件生成一个unqie id,这里就是序号
- 对事件进行分词,把每一个词加入到倒排词表,也就是每一个词对应的事件的id的映射结构,注意,一个词可能对应多个事件,所以倒排表的的值是一个Set。倒排表是绝大部分搜索引擎的核心功能。
- 当一个词被搜索的时候,会做以下的逻辑
- 检查布隆过滤器,如果为假,直接返回
- 检查词表,如果被搜索单词不在词表中,直接返回
- 在倒排表中找到所有对应的事件id,然后返回事件的内容
我们运行下看看把:
- s = Splunk()
- s.add_event('src_ip = 1.2.3.4')
- s.add_event('src_ip = 5.6.7.8')
- s.add_event('dst_ip = 1.2.3.4')
- for event in s.search('1.2.3.4'):
- print event
- print '-'
- for event in s.search('src_ip'):
- print event
- print '-'
- for event in s.search('ip'):
- print event
- src_ip = 1.2.3.4
- dst_ip = 1.2.3.4
- -
- src_ip = 1.2.3.4
- src_ip = 5.6.7.8
- -
- src_ip = 1.2.3.4
- src_ip = 5.6.7.8
- dst_ip = 1.2.3.4
是不是很赞!
更复杂的搜索
更进一步,在搜索过程中,我们想用And和Or来实现更复杂的搜索逻辑。
上代码:
- class SplunkM(object):
- def __init__(self):
- self.bf = Bloomfilter(64)
- self.terms = {} # Dictionary of term to set of events
- self.events = []
- def add_event(self, event):
- """Adds an event to this object"""
- # Generate a unique ID for the event, and save it
- event_id = len(self.events)
- self.events.append(event)
- # Add each term to the bloomfilter, and track the event by each term
- for term in segments(event):
- self.bf.add_value(term)
- if term not in self.terms:
- self.terms[term] = set()
- self.terms[term].add(event_id)
- def search_all(self, terms):
- """Search for an AND of all terms"""
- # Start with the universe of all events...
- results = set(range(len(self.events)))
- for term in terms:
- # If a term isn't present at all then we can stop looking
- if not self.bf.might_contain(term):
- return
- if term not in self.terms:
- return
- # Drop events that don't match from our results
- results = results.intersection(self.terms[term])
- for event_id in sorted(results):
- yield self.events[event_id]
- def search_any(self, terms):
- """Search for an OR of all terms"""
- results = set()
- for term in terms:
- # If a term isn't present, we skip it, but don't stop
- if not self.bf.might_contain(term):
- continue
- if term not in self.terms:
- continue
- # Add these events to our results
- results = results.union(self.terms[term])
- for event_id in sorted(results):
- yield self.events[event_id]
利用Python集合的intersection和union操作,可以很方便的支持And(求交集)和Or(求合集)的操作。
(编辑:ASP站长网)
|