Thesis No: 439533
Title: Türkçe tümcelerin yüklem odaklı anlam ve dilbilgisi çözümlemesi / Grammatical and semantic analysis of Turkish sentence based on predicate
Author: İLKNUR DÖNMEZ
Advisor: PROF. DR. EŞREF ADALI
Location: İstanbul Teknik Üniversitesi / Fen Bilimleri Enstitüsü / Bilgisayar Mühendisliği Ana Bilim Dalı / Bilgisayar Mühendisliği Bilim Dalı
Subject: Bilgisayar Mühendisliği Bilimleri-Bilgisayar ve Kontrol = Computer Engineering and Computer Science and Control
Status: Approved
Degree: Doctorate
Language: Turkish
Year: 2016
Pages: 124
Our study covers the semantic and grammatical analysis of sentences. Analysing a sentence semantically and grammatically is one of the main topics of Natural Language Processing (NLP). In our study, each sentence in the text is first decomposed into sub-sentences so that every sub-sentence is a simple sentence. The phrase-concept pairs of each sub-sentence are then found, and a new predicate-based method is proposed to detect the basic grammatical and semantic errors in each sub-sentence. In a Turkish sentence, the predicate carries subject and tense information. The predicate also determines which phrases the sentence may consist of. For example, the predicate "büyümek" (to grow) does not take an object in a sentence, but it does take an indirect-complement phrase ending in the suffix "-de". Thus the sentence "Ayşeyi büyüdü." ("Ayşe" marked as the object of "grew") is ill-formed, whereas "Sokakta büyüdü." (He grew up in the street.) is correct. The predicate also carries information about the concept that each phrase may contain. For example, the predicate "düşünmek" (to think) is specific to humans, so as its subject it is associated with the human concept. The sentence "Kapı bugün ne yapacağını düşündü." (The door thought about what it would do today.) does not make sense. Starting from these observations, our study finds the phrases of sentences, determines which concept each phrase is related to, and designs a model that performs the grammatical and semantic analysis of the sentence. Sentences are decomposed into 10 phrases using case suffixes and basic phrase structures. The concept contained in each phrase is assigned to one of 51 concept classes. When the compatibility of the phrase-concept pairs with the predicate is checked, voice, person and tense suffixes are also taken into account. In our study each phrase-concept pair is represented by one matrix element; in complex sentences, however, phrases may themselves contain embedded clauses. For example, the subject "okula sevinçle gelen Ayşe" (Ayşe, who comes to school joyfully) contains a separate embedded clause. Semantic and grammatical errors may occur in these embedded clauses.
For example, if the subject phrase were "okulda sevinçle gelen Ayşe" (Ayşe, who comes at school joyfully), the verb "gelmek" (to come) would not be compatible with the locative phrase, and so a sentence containing this subject would not be correct. For this reason, our study has been extended to cover embedded clauses as well. To this end, sentences are split into as many sub-sentences as the number of participles, converbs or verbal nouns they contain, and the analysis is repeated for each sub-sentence. In many NLP applications today, each word of a sentence is still represented in thousands of dimensions, sentences with different numbers of words do not have a fixed size, and the resulting sentence representation is very sparse. The fixed-length, relatively low-dimensional (10x51) coarse semantic matrix representation built in our study, expressed in terms of the phrase-concept pairs of the sentence, is suitable for use in many semantic NLP tasks. The last part of our study shows that this representation is successful in semantic applications. The matrix holding the surface meaning of the sentence, with the features of the predicate appended as its last row, was used in a document classification application. Using the WEKA package with five different classification algorithms, documents in five categories were classified, reaching a success value of 86.10 with 145 features. When the features of our model were added to the earlier features, the highest success value, 97.12, was obtained. In conclusion, this study builds a phrase-concept vector representation of the sentence and presents a new method that uses vector comparison to analyse the sentence grammatically and semantically. With this method, 64% of errors were detected within the targeted 81.16% portion of structural errors.
Our study also aims to be a Turkish sentence analysis resource in which the sub-sentences of sentences are found with 81.16% success, sentences and their sub-sentences are split into phrases with 89% success, the concepts they contain are found with 82.8% success, the tense type is examined and compared with the predicate, and the type of the subject and whether it is singular or plural are examined. Although the regular sentence structure and regular predicate structure of Turkish inspired this study, the phrase-concept representation is a method that can be used for all languages.
The grammatical and semantic analysis of sentences is one of the main subjects of Natural Language Processing (NLP). In this study, sentences are separated into their sub-sentences, the phrases and their concepts are found for each sentence, and a coarse-grained semantic representation is built for each sub-sentence. We present a novel method for detecting basic grammatical and semantic errors by concentrating on the predicate. In Turkish, the predicate includes information about the subject and tense. The predicate also helps to identify the phrases that make up the sentence. For example, "büyümek (to grow)" does not take an object, but it can take a locative phrase ending with the suffix "-de". The predicate is also informative about the semantic concept of a phrase. For example, "düşünmek (to think)" is specifically an action performed by a human, so its subject will be related to the human concept. With these properties in mind, a model has been designed to find the phrases in a sentence, identify their relations to specific concepts, and analyse the sentence grammatically and semantically. To analyse a sentence grammatically and semantically, the sentence is first divided into sub-sentences. The number of sub-sentences depends on the gerunds (verbal nouns), participles (verbal adjectives) and converbs (verbal adverbs) in the sentence. A compound sentence may contain more than one complex sentence, and each complex sentence may contain more than one sub-sentence. If the sentence is compound, the first complex sentence is taken and the remainder is stored. For the complex part, the number of light verbs gives the number of sub-sentences to be obtained. For each light verb form and its related phrases, a sub-sentence is generated according to predetermined rules.
After all sub-sentences of the complex sentence have been generated, the process starts again: the first complex sentence of the remainder is found, and the algorithm continues until all sub-sentences are found. Grammatical analysis in our study concerns the presence of argument phrases in the sub-sentences. The outputs of the İTÜ NLP dependency parser, the case markers in the sentence and a formal-language representation of the phrases that we defined are used to find the phrases in a sentence. Then the concept of each phrase is found. The resulting phrase-concept pairs are checked for compatibility with the predicate in each sub-sentence. The grammar checking problem has been studied alongside the development of language technologies since the 1970s. Today, a grammar checker (GC) for English can detect various errors, such as errors of agreement in tense and number and errors of word order, and in the last ten years GCs have come to recognize grammar errors based on the content of the surrounding words. Various rule-based, statistical and hybrid methods have been used for English grammar checking. Doğan and Karaağaç in 2012, İşgüder and Adalı in 2014, and Aygül analysed Turkish sentences grammatically. There are also studies on spelling error correction for Turkish text. Despite efficient GC applications, there are usually too many exceptions in the real usage of a natural language. In our study, sentences and texts are represented as condensed vectors or matrices. Condensed vector representations of words, sentences and texts have become crucial because of big-data processing issues. In most natural language applications, sparseness is an important issue. Vector representations of words were obtained via deep learning in 2013. The distance between word vectors can reflect the semantic and syntactic relations between the words. However, the best Pearson correlation for the semantic relatedness of word vectors is about 75%.
Computing the meanings of larger units compositionally is still an open issue for NLP and for deep learning applications in NLP. The focus of this study is the predicate, which Gottlob Frege viewed as a relation or function over arguments. To analyse the "concept effect" and the "phrase effect" separately, different models are formed. In the first model, the sentence is separated into phrases, and each phrase is checked against the predicate to see whether the predicate can take it. In the second model, the concepts of the phrases are found, and each phrase-concept pair is checked against the predicate. For example, if the subject is a dog, it belongs to the animal concept class, so the predicate of the sentence should be in a verb class that accepts the animal concept class as subject. In this example the predicate should not be "akmak (to flow)", which is in the liquid-concept verb class, or "düşünmek (to think)", which is in the human-concept verb class. A concept cannot simply be declared incompatible with a predicate, because compatibility can change with the phrase type. For example, it is possible to say "Ali thought about the dog." but not "The dog thought.". The predicate "to think" cannot take "dog" (animal concept) as its subject phrase, but it can take "dog" as its object phrase. Another example is "river", a liquid concept: "Dere yavaşça akıyordu." (The river was flowing gently.), "Dereye düştü." (He fell into the river.), "Balık derede yüzüyor." (The fish is swimming in the river.). Here "dere (river)" (liquid concept) is compatible with "akmak (to flow)" as a subject phrase, with "düşmek (to fall)" as a dative phrase and with "yüzmek (to swim)" as a locative phrase. In our study, we represent the sentence as the Cartesian product of phrase types and basic concept classes, that is, as a 10x51 matrix.
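The phrase-dependent compatibility in the examples above can be sketched as a lookup keyed on the predicate, the phrase type and the concept. This is only an illustrative sketch: the table `ALLOWED` and the function `is_compatible` are hypothetical names, and the entries are limited to the examples quoted in the text, not the actual verb classes of the thesis.

```python
# Hypothetical lookup: predicate -> set of (phrase type, concept) pairs it accepts.
# The entries cover only the examples given in the text.
ALLOWED = {
    "düşünmek": {("subject", "human"), ("object", "animal")},
    "akmak": {("subject", "liquid")},
    "düşmek": {("dative", "liquid")},
    "yüzmek": {("locative", "liquid")},
}


def is_compatible(predicate: str, phrase_type: str, concept: str) -> bool:
    """Return True if the predicate is listed as accepting this phrase-concept pair."""
    return (phrase_type, concept) in ALLOWED.get(predicate, set())


# "Ali thought about the dog.": animal concept in the object phrase.
print(is_compatible("düşünmek", "object", "animal"))
# "The dog thought.": animal concept in the subject phrase.
print(is_compatible("düşünmek", "subject", "animal"))
```

Keying the lookup on the pair rather than on the concept alone is what lets the same concept be accepted in one phrase and rejected in another.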
In morphologically rich languages, the meaning of a word is strongly affected by the suffixes attached to it. Some suffixes and morphological structures give information about meaning. In Turkish, the verb in particular takes different types of suffixes. Verb suffixes can affect the phrases that a sentence can have and can give information about time, possession and valency. These suffixes are considered in the study. The verb root type also affects the phrases that a sentence can have. In Turkish, the possessive suffixes of the predicate must be compatible with the subject of the sentence. The tense and mood suffixes of the predicate must be compatible with the time phrase in the sentence. A valency-changing suffix on the verb directly affects the phrases that the predicate can take. Verb valency refers to the number of arguments controlled by a verbal predicate. It plays an important role in a number of the syntactic frameworks developed in the last few decades. VerbNet, FrameNet and PropBank all define their arguments with respect to the predicate. For the past ten years, the relation between concepts and verbs has also been studied in Corpus Pattern Analysis. In our study, the decomposition into phrase-concept pairs overlaps with the roles of VerbNet at some points. For example, if a Turkish dative phrase (directed towards X) has the location concept, it is equated with the "Goal" role in VerbNet; if a Turkish ablative phrase (away from X) has the location concept, it is equated with the "Source" role in VerbNet. Within the scope of this study, the concepts and verb classes are determined. Basic-level concept categorization depends on the nature of everyday human interaction, both in a physical environment and in a culture. In our study, for the concept-based sentence analysis, the concepts are selected from ontological databases such as WordNet. Through concept selection, we take care to cover sentence roles such as those of VerbNet.
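The correspondence with VerbNet roles mentioned above can be pictured as a small mapping from phrase-concept pairs to role names. The sketch below is an assumption-laden illustration: `ROLE_MAP` and `verbnet_role` are hypothetical names, and only the two correspondences named in the text are included.

```python
from typing import Optional

# Only the two correspondences named in the text; every other pair is left unmapped.
ROLE_MAP = {
    ("dative", "location"): "Goal",
    ("ablative", "location"): "Source",
}


def verbnet_role(phrase_type: str, concept: str) -> Optional[str]:
    """Return the VerbNet role listed for a phrase-concept pair, or None if none is listed."""
    return ROLE_MAP.get((phrase_type, concept))


print(verbnet_role("dative", "location"))
print(verbnet_role("ablative", "location"))
```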
The concepts are determined under the guidance of two considerations. One is a good sentence representation, which may vary according to the application; the other is revealing which concepts are decisive for predicate compatibility. To fill the list of related noun phrases for each concept, BalkaNet and some special databases and dictionaries are used. Some categorical terms are taken from the Turkish Dictionary. From the viewpoint of the predicate, predicates are categorized into verb classes. The concepts of the noun phrases are directly related to the verb classes. In our study we have 51 concepts for each phrase and 510 verb classes. For example, one verb class contains verbs that take time as an object phrase, and another contains verbs that take location as a dative phrase. We obtained verb lists for compatibility with each of six phrases from the Turkish Language Association (TDK), a trustworthy source for lexical datasets and dictionaries. The verb categorization according to the other phrases and concepts was done by ourselves in this study. Once the phrase-concept pairs of the sentence are found, the sentence is represented as a matrix. One of the interests of the NLP community is finding representations for processing large amounts of unlabelled language data. In our study, the syntactic and semantic information of a sentence is represented by a 10x51 matrix (or a 510x1 vector). For the sake of data and model visualization, some concept classes are merged under the 51 basic classes, so the concept space considered by the checker is smaller in this study. Each element of the matrix takes a value between 0 and 1: 1 means the sentence has the corresponding phrase-concept pair, and 0 means it does not.
"Ayşe" is a person as the subject, "kırılan kalemi (the broken pen)" is an object as the accusative phrase, "sevdiği evinden (from her home that she loves)" is a location as the ablative phrase, "okula (to the school)" is a location as the dative phrase and "sevinçle (with happiness)" is the instrument phrase. In our model, automatic semantic labelling and grammatical and semantic checking of the sentence against the predicate are done simultaneously. First, the sentence is preprocessed via the İTÜ Turkish NLP Web Service. The dependency parsing result of the sentence is the input to our grammatical and semantic analyser. In the basic model, after this preprocessing, the Phrase Finder finds the 10 different phrases in the sentence using the dependencies between words and the case markers. After the phrases are found, they are categorized according to the concepts. As a result, the matrix representation of the sentence is prepared as the X matrix. If the sentence has a phrase-concept pair, the corresponding element of X is 1; otherwise it is 0. In parallel, the verb is looked up in the verb classes related to phrase-concept pairs, and the predicate compatibility matrix is prepared as the Y matrix. X is generated from the observed phrase-concept pairs of the sentence, and Y shows the phrase-concept pairs that the sentence may have. A problem exists when an element of Y is 0 but the corresponding element of X is 1: the verb does not take that concept in that phrase, but the sentence has it. The result is calculated by the function F = X' + Y. To summarize the grammatical and semantic analysis part, we divide our model into five models to see each model's contribution separately. The first model checks the predicate against the phrases that the sentence can have. The second model checks the predicate against the phrase-concept pairs that the sentence can have.
In the third model, the effect of valency-changing suffixes is added to the second model. In the fourth model, the effect of compatibility between predicate possession suffixes and subject phrases is added to the third model. In the last model, the effect of compatibility between predicate time suffixes and time phrases is added to the fourth model. These five models are then regenerated so that the sub-sentences are also considered. Detection of structural errors with our new method reached 64% accuracy with the sixth model within the 81.16% portion of structural errors. It means we reached 81.34% success within the targeted portion of structural errors. As a result of this study, a detailed semantic vector representation of the sentence is formed, and grammatical and semantic analysis of the sentence is performed with the presented feature-vector comparison process. Sentences are separated into their sub-sentences with a success ratio of 81.16%. The phrases of the sentences are found with a success ratio of 89%. The concepts of the sentences are found with a success ratio of 82.8%. The time class of the sentence is determined and checked for compatibility with the predicate, and the subject phrase type and its singularity or plurality are examined. The study aims to be a resource for the grammatical and semantic analysis of Turkish sentences and texts. Even though the study was done for Turkish, the method of representing the semantic arguments of a sentence with phrase-concept pairs can be applied to all languages. A sentence is represented by its phrase-concept pairs as a coarse-grained semantic matrix. This coarse-grained semantic matrix representation of sentences (texts) can be used as input for many semantic applications such as question answering, information extraction and text categorization.
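The comparison of the observed matrix X with the predicate compatibility matrix Y, summarized as F = X' + Y, can be sketched as follows. The sketch assumes that X' denotes the element-wise complement of X and that + denotes logical OR, a reading consistent with the stated error condition (Y is 0 while X is 1) but not confirmed by the abstract. The function names and the index pairs are hypothetical, and the matrices are treated as strictly binary.

```python
N_PHRASES, N_CONCEPTS = 10, 51  # matrix shape stated in the text


def pair_matrix(pairs):
    """Build a 10x51 0/1 matrix with 1 at each (phrase index, concept index) pair."""
    matrix = [[0] * N_CONCEPTS for _ in range(N_PHRASES)]
    for phrase_idx, concept_idx in pairs:
        matrix[phrase_idx][concept_idx] = 1
    return matrix


def compatibility(x, y):
    """Element-wise F = X' + Y, read here (as an assumption) as (not X) or Y."""
    return [[int((not xv) or yv) for xv, yv in zip(x_row, y_row)]
            for x_row, y_row in zip(x, y)]


def violations(x, y):
    """Positions where F is 0: the sentence has the pair (X = 1) but the verb does not allow it (Y = 0)."""
    f = compatibility(x, y)
    return [(i, j) for i, row in enumerate(f) for j, v in enumerate(row) if v == 0]


# Hypothetical indices: the sentence contains pairs (0, 3) and (2, 17),
# while the verb allows pairs (0, 3) and (4, 8).
x = pair_matrix([(0, 3), (2, 17)])
y = pair_matrix([(0, 3), (4, 8)])
print(violations(x, y))
```

Pairs that the verb allows but the sentence lacks are not reported, because Y describes what the sentence may have rather than what it must have.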