|
This research addresses the identification of sentential paraphrases; that is, the
ability of an estimator to predict accurately whether two sentence-level text fragments are
paraphrases. The paraphrase identification task has practical importance in the Natural
Language Processing (NLP) community because of the need to deal with the pervasive
problem of linguistic variation.
Accurate methods for identifying paraphrases should help to improve the performance of
NLP systems that require language understanding. This includes key applications such as
machine translation, information retrieval and question answering amongst others. Over
the course of the last decade, a growing body of research has been conducted on
paraphrase identification and it has become an individual working area of NLP.
Our objective is to investigate whether techniques for automated understanding of text
that require fewer resources can achieve results comparable to methods
employing more sophisticated NLP processing tools and other resources. These
techniques, which we call "knowledge-lean", range from simple, shallow overlap
methods based on lexical items or n-grams through to more sophisticated methods that
employ automatically generated distributional thesauri.
The work begins by focusing on techniques that exploit lexical overlap and text-based
statistical techniques that depend far less on NLP tools. We investigate the question
"To what extent can these methods be used for the purpose of a paraphrase identification
task?" Across the two gold-standard datasets, we obtained competitive results on the Microsoft
Research Paraphrase Corpus (MSRPC) and reached state-of-the-art results on the
Twitter Paraphrase Corpus, using only n-gram overlap features in conjunction with
support vector machines (SVMs).
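As an illustration only, n-gram overlap features of the kind mentioned above might be computed as in the following minimal sketch. The function names, the use of Jaccard overlap, the lower-casing, the whitespace tokenisation and the choice of n up to 3 are all assumptions made for this example and are not taken from the work itself; the resulting feature vector is the sort of input that could then be passed to an SVM classifier.

```python
from typing import List, Set, Tuple


def char_ngrams(text: str, n: int) -> Set[str]:
    """Return the set of character n-grams of a lower-cased string."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams, using plain whitespace tokenisation."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_features(s1: str, s2: str, max_n: int = 3) -> List[float]:
    """Hypothetical feature vector: word and character overlap for n = 1..max_n."""
    feats: List[float] = []
    for n in range(1, max_n + 1):
        feats.append(jaccard(word_ngrams(s1, n), word_ngrams(s2, n)))
        feats.append(jaccard(char_ngrams(s1, n), char_ngrams(s2, n)))
    return feats
```

No language-specific tool is involved in this sketch: the only preprocessing is lower-casing and splitting on whitespace, so the same code applies unchanged to English, Twitter text or Turkish.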
These techniques do not require any language specific tools or external resources and
appear to perform well without the need to normalise colloquial language such as that
found on Twitter. It was therefore natural to extend the scope of the research to a
language that is poor in resources. The scarcity of available paraphrase data led us to
construct our own paraphrase corpus in Turkish. This corpus is relatively small but
provides a representative collection, including a variety of texts. While there is still
debate as to whether binary or fine-grained judgements better suit a paraphrase corpus,
we chose to provide data for a sentential textual similarity task by agreeing on
fine-grained scoring, knowing that this could be converted to binary scoring, but not the
other way around. The correlation between the
results from different corpora is promising. Therefore, it can be surmised that languages
poor in resources can benefit from knowledge-lean techniques.
Having established the strengths of knowledge-lean techniques, we extended the work with
a new perspective, to techniques that use distributional statistical features of text by
representing each word as a vector (word2vec). While recent research focuses on larger fragments of text with
word2vec, such as phrases, sentences and even paragraphs, a new approach is presented
by introducing vectors of character n-grams that carry the same attributes as word
vectors. The proposed method has the ability to capture syntactic relations as well as
semantic relations without semantic knowledge. The method is shown to be competitive
with more sophisticated methods on Twitter data.
Keywords: Paraphrasing, Knowledge-Lean, Twitter, Turkish, MSRPC, SVMs, N-grams,
Overlap methods, Word2Vec |