EMLStructureParser can now dump the text of text/* parts. We are still finishing the code. Naïve Bayes will be able to take advantage of this feature.
The extraction of text from an email is done only once. We cache the result to avoid recomputing it.
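To illustrate the idea of extracting text once and caching it, here is a minimal sketch in Python. It is not the EMLStructureParser code: the class name CachedTextExtractor, its text() method, and the extractions counter are hypothetical names invented for this example, and the parsing is delegated to Python's standard email package rather than to our own parser.

```python
import email


class CachedTextExtractor:
    """Hypothetical helper, not the real EMLStructureParser API."""

    def __init__(self, raw_message):
        self._message = email.message_from_string(raw_message)
        self._cached_text = None  # filled in by the first call to text()
        self.extractions = 0      # how many times the real work was done

    def text(self):
        """Return the concatenated text of all text/* parts, computed once."""
        if self._cached_text is None:
            self.extractions += 1
            chunks = []
            for part in self._message.walk():
                # Multipart containers have no text of their own.
                if part.is_multipart():
                    continue
                if part.get_content_maintype() != "text":
                    continue
                payload = part.get_payload(decode=True) or b""
                charset = part.get_content_charset() or "utf-8"
                try:
                    chunks.append(payload.decode(charset, errors="replace"))
                except LookupError:
                    # Unknown charset name: fall back to UTF-8.
                    chunks.append(payload.decode("utf-8", errors="replace"))
            self._cached_text = "\n".join(chunks)
        return self._cached_text
```

Calling text() a second time returns the stored string without walking the message again, which is the behaviour a classifier needs if it asks for the same email's text repeatedly.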
We also added some error checks during parsing to detect malformed RFC 2822 files.
All of this is implemented as a finite-state machine that uses a stack when parsing body parts.
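As a rough illustration of a finite-state machine with a stack, here is a small Python sketch. It is a simplification and not the actual EMLStructureParser implementation: the function list_parts, the MalformedMessage exception, and the two state names are hypothetical, and it assumes the stack holds the boundaries of the multipart containers that are currently open. It ignores many details of real messages (for example, it looks for a boundary parameter on any header line).

```python
import re

HEADERS, BODY = "HEADERS", "BODY"  # the two states of this toy machine
TYPE_RE = re.compile(r"^content-type:\s*([^;\s]+)", re.IGNORECASE)
BOUNDARY_RE = re.compile(r'boundary="?([^";]+)"?', re.IGNORECASE)


class MalformedMessage(ValueError):
    """Raised when a multipart container is never closed."""


def list_parts(lines):
    """Return (depth, content_type) for each non-multipart part."""
    state = HEADERS
    stack = []                   # boundaries of the open multipart containers
    content_type = "text/plain"  # default when no Content-Type header is seen
    boundary = None
    parts = []
    for line in lines:
        line = line.rstrip("\r\n")
        if state == HEADERS:
            if line == "":
                # A blank line ends the header block.
                if boundary is not None:
                    stack.append(boundary)  # entering a multipart container
                else:
                    parts.append((len(stack), content_type))
                state = BODY
            else:
                match = TYPE_RE.match(line)
                if match:
                    content_type = match.group(1).lower()
                match = BOUNDARY_RE.search(line)
                if match:
                    boundary = match.group(1)
        elif stack and line == "--" + stack[-1]:
            # Delimiter: a new child part starts with its own headers.
            state = HEADERS
            content_type, boundary = "text/plain", None
        elif stack and line == "--" + stack[-1] + "--":
            stack.pop()  # closing delimiter: leave the innermost container
    if stack:
        raise MalformedMessage("unterminated multipart: " + stack[-1])
    return parts
```

The stack is what lets the machine handle a multipart nested inside another multipart, and a non-empty stack at the end of input is one simple sign of a malformed file.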
More improvements on EMLStructureParser