python linear_parsing.py filename_of_xml query
Algorithm (sketch):
This algorithm explanation is quite clear. The only missing part is the implementation of the failure transition
The failure transition tell us to which state we want to go when a failure is done (when path[i] != current_element).
def get_failure_transition(path):
path = np.array(path)
fail_array = [0]
for i in range(1,len(path)+1):
j = 0
while(j < i and i >= 1 and np.array_equal(path[:j] , path[i-j:i])):
j+=1
if(j > 0):
j -= 1
fail_array.append(j)
return fail_array
This code will give an array where array[i] will give the next state when there is a failure at the state i.
If we do not implement the failure transition and we consider that the state 0 is the only state when there is a failure, the query “//a/a” will be wrong for example.
The linear parse algorithm is in the src/linear_parsing.py file:
def linear_parse(file,path):
# Start document
S = []
state_i = [0]
i = 0
count = 0
Results = []
fail_array = get_failure_transition(path)
with open(file,'r') as f:
for line in f :
begin, current_el = line.split(' ')[0:2]
current_el = current_el.replace('\n','')
i = state_i[-1]
if(i == len(path)):
if(S[-1] not in Results):
Results.append(S[-1])
i = fail_array[i]
if(begin == '0'): # true, start Element
S.append(count)
count += 1
if(path[i] == current_el):
i += 1
else: #do failure transition
i = fail_array[i]
if(path[i] == current_el):
i+=1
state_i.append(i)
else:
if(S):
S.pop()
state_i.pop()
return Results
it follows the algorithm given.
The algorithm can be tested in the test.py file where there is a test case to check if the algorithm is correct.
python lazy_dfa.py filename_of_xml query
The lazy dfa is like a normal dfa with the possibily of stay put when there is an error.
So we need to separate the query into multiples p_i and then transform our failure transition : when we end a p_i xpath, the failure transition give the latest state when there is a failure.
This is done here :
for i in range(len(cuts) - 1 ):
indice_i = cuts[i]
indice_j = cuts[i+1]
fail_tmp = get_failure_transition(path[indice_i:indice_j])
fail_tmp = [x + indice_i for x in fail_tmp]
fail_array += fail_tmp
In this way we can process the desired queries.
The tests programms are called get_perf_lazy.py
and get_perf_linear.py
The memory usage captured does not work, I tried numerus techniques on jupyter notebook and python script but the results where never consistent. I suppose this due to the fact of how python use its data.
These results were not good, They show that there is not correlations between the height of the document and the query size, changed alone.
But when we modify both the number of node and height we observe that the algorithm should be linear in the term of number of node, and the multipling factor whould be correlated to the height of the document.
As for the linear parsing, the results were not really conclusive for the the number of node and query size variation alone.
But when we modify both the number of node and height we observe that the algorithm behave in a similar fashion as the linear one.
Although the results of my experiments were not conclusive, that the memory measurement did not work properly. I think that the two algorithm can be used to work with large data, as seen in the tests when there were a variation of the number of nodes and height of the xml document. In addition, the fact that theses algorithms can be used in a streaming way is quite useful, and that’s probably why my memory measurment were not working.