In this lab we want to separate the body, the header of a email. In order to do that, we will use an implementation of the Viterbi algorithm to compute the best segmentation between header and body.
The initial probability is $ \pi $ = $(1, 0 )$. We always start an email by a header, so the probability to begin with state one is 1.
The probability to move from state 1 to state 2 is 0.000781921964187974.
The probability to remain in state 2 is 1.
It seems to be unlikely to move from state 1 to state 2, and very likely to stay in its own state.
The lower probability is to move from state 2 to state 1 : 0.
The higher probability is to move frome state 2 to state 2 : 1.
I suppose that is because there are few specific words that make the transistion from a header to a body in a email.
The size of the corresponding matrix is by convention $(N, 2)$.
This is done in the notebook file. The results seems to be accurate, there are some errors but they do not exceed one or two line around the expected separation between the header and the body.
I would use an other transistion matrix with a dimension of (3,3), still triangular superior. And I would get a third row to the emission matrix. With these elements I would use the same algorithms to get the three parts.
I would change the emission matrix and associate a very high number for the character “>”, That change will help to separate the portions of mail included.
In the process of computing probabilities, with the data given, we had value under 10^-308 which is the double limit in python. This limit gave me a lot of debugging, the simplest solution i have found, discussing with my pairs was to multiplicate by a magic factor of 40. This factor will make this algorithm works only on this type of problems. In order to make this algorithm works to all types of problems, a transformation with a logarithm would work (but this necessitate handling the 0 values ).