Supervised Machine Learning for Email Thread Summarization
By Jan Ulrich
Email has become a part of most people's lives, and the ever-increasing number
of messages people receive can lead to email overload. We attempt to mitigate
this problem using email thread summarization. We have built a machine learning
summarizer for emails and annotated a dataset for training it. While
previous research has shown that machine learning algorithms are a promising
approach to email summarization, there has not been a study on the impact of the
choice of algorithm. We explore new techniques in email thread summarization
using several regression-based classifiers, and the results show that the choice
of classifier is critical. We also present a novel feature set for email
summarization and perform analysis on two email corpora. We introduce the BC3
corpus, a new publicly available email dataset annotated for summarization, as
well as the open-source framework that we built to perform the annotation.