关键词匹配经典算法——DFA

起因
从网页中爬取的数据,需要判断是否包含预设的关键词,并返回所有匹配到的关键词。
文字匹配(过滤)是很多文字类网站必不可少的功能,采用一个高效的文字过滤算法至关重要。其本质需求就是判断集合A中哪些子集属于集合B。

可行思路:

  • 暴力解决,说说而已,实际不会这么做;
  • 正则表达式
  • DFA

  接下来主要介绍下DFA算法。DFA是一个实现状态转移的自动机,基本功能是通过event和当前state得到下一个state。其算法核心就是构建一棵一棵的树,这样我们就可以确定需要检索的那棵树,然后再在这棵树中进行检索。算法详细内容参见维基百科
  下面贴出用Python写的一DFA类

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
class DFA:
current_state = None
def __init__(self, states, alphabet, transition_function, start_state, accept_states):
self.states = states
self.alphabet = alphabet
self.transition_function = transition_function
self.start_state = start_state
self.accept_states = accept_states
self.current_state = start_state
def transition_to_state_with_input(self, input_value):
if ((self.current_state, input_value) not in self.transition_function.keys()):
self.current_state = None
self.current_state = self.transition_function[(self.current_state, input_value)]
def in_accept_state(self):
return self.current_state in accept_states
def go_to_initial_state(self):
self.current_state = self.start_state
def run_with_input_list(self, input_list):
self.go_to_initial_state()
for inp in input_list:
self.transition_to_state_with_input(inp)
continue
return self.in_accept_state()
pass

  给出一个测试用例:states:状态集,alphabet:字母表,tf:转移函数,start_state:开始状态,accept_states:可接受状态集.当输入‘bcda’时,由于(0,’b’)->(2,’c’)->(3,’d’)->(0,’a’)->final_state=1,不可接受的状态,所以返回False。

1
2
3
4
5
6
7
8
9
10
11
12
13
states = {0, 1, 2, 3}
alphabet = {'a', 'b', 'c', 'd'}
tf = dict()
tf[(0, 'a')] = 1;tf[(0, 'b')] = 2;tf[(0, 'c')] = 3;tf[(0, 'd')] = 0
tf[(1, 'a')] = 1;tf[(1, 'b')] = 2;tf[(1, 'c')] = 3;tf[(1, 'd')] = 0
tf[(2, 'a')] = 1;tf[(2, 'b')] = 2;tf[(2, 'c')] = 3;tf[(2, 'd')] = 0
tf[(3, 'a')] = 1;tf[(3, 'b')] = 2;tf[(3, 'c')] = 3;tf[(3, 'd')] = 0
start_state = 0
accept_states = {2, 3}
d = DFA(states, alphabet, tf, start_state, accept_states)
inp_program = list('bcda')
print(d.run_with_input_list(inp_program)) #False

DFA的神奇魔力
运用DFA解决 Valid Number 问题(LeetCode):
Description: Validate if a given string is numeric.
Some examples:
“0” => true
“ 0.1 “ => true
“abc” => false
“1 a” => false
“2e10” => true
思考
简单实现

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
class Solution(object):
def isNumber(self, s):
"""
:type s: str
:rtype: bool
"""
#define DFA
state = [{},
{'blank': 1, 'sign': 2, 'digit':3, '.':4},
{'digit':3, '.':4},
{'digit':3, '.':5, 'e':6, 'blank':9},
{'digit':5},
{'digit':5, 'e':6, 'blank':9},
{'sign':7, 'digit':8},
{'digit':8},
{'digit':8, 'blank':9},
{'blank':9}]
currentState = 1
for i in s:
if i >='0' and i<='9':
i = 'digit'
if i in ['+','-']:
i = 'sign'
if i ==' ' :
i = 'blank'
if i not in state[currentState].keys():
return False
currentState = state[currentState][i]
if currentState not in [3,5,8,9]:
return False
return True
#test
a = Solution()
print(a.isNumber('.e-1')) #False
print(a.isNumber('e2')) #False
print(a.isNumber('2e10')) #True

  So easy 吧!是不是比正则高级多了:)

Do things don’t scale !
分享