Unlike hearing individuals, who communicate through writing and speech, hearing-impaired individuals communicate through sign languages based on hand and body movements. Sign languages are the natural languages of hearing-impaired individuals and, like spoken natural languages, vary with the geographical and cultural environment in which communities live. Turkish Sign Language (TİD) is the natural language that is officially recognized in our country and used by deaf individuals living in Turkey. The inability of hearing-impaired individuals to communicate comfortably with hearing individuals in written or spoken form causes various problems in their access to information, education and job opportunities. Thanks to the level that technology has reached today, computer-aided solutions to such problems have begun to appear. In this context, translation systems that make it possible to translate between spoken and sign languages automatically at a certain level come to the fore. Although machine translation studies for the sign languages of deaf communities in developed countries continue to grow, no such study has yet been carried out for Turkish Sign Language.
This thesis aims to build the computational translation infrastructure required for machine translation from written Turkish to Turkish Sign Language. The thesis is the first academic study carried out on this topic. The output of the translation system is intended to be machine-readable and thereby usable as input to animation systems (avatar or humanoid robot). To this end, the following were examined: representations for the computer processing of sign languages, the grammatical properties of Turkish Sign Language to be used in Turkish-TİD machine translation, the natural language processing methods to be used in Turkish-TİD machine translation, machine translation studies on sign languages, electronic dictionaries developed for sign languages, and analysis and annotation infrastructures for sign languages. Because the large machine-readable parallel data sets that are indispensable for statistical machine translation systems do not exist for the Turkish-TİD language pair, rule-based machine translation methods were preferred in this thesis, which is the first attempt at machine translation from Turkish to TİD.
In this framework, the contributions of our work to science are listed below:
-A machine-readable knowledge representation of Turkish Sign Language was created,
-A parallel treebank study based on dependency analysis was carried out between a spoken language and a sign language,
-A Turkish Sign Language electronic dictionary infrastructure was created that enables annotations suitable for use in machine translation studies,
-A TİD-specific plugin was developed for the ELAN software, which is widely used for the linguistic annotation of sign language discourse, so that it can produce machine-readable output,
-It was defined how Turkish-TİD parallel data sets can be created in machine-readable form using the developed annotation infrastructures, and prototype corpus studies were presented,
-An ontology infrastructure was developed for Turkish and TİD,
-A rule-based translation system for machine translation from written Turkish to TİD, based on transfer at the syntactic and partially semantic level, was designed and basic translation rules were created.
The proposed translation infrastructure was tested on 306 Turkish-TİD parallel sentences selected from Ministry of National Education primary school textbooks and prepared within the scope of TÜBİTAK project no. 114E263, and the transfer performance was shown to be at an acceptable level. The translation system was designed with an architecture that is open to extension. New rules obtained from TİD linguistics research can easily be incorporated into the system. In addition, new sign entries added to the TİD dictionary and new semantic relations created in the Turkish wordnet will make it possible to improve the performance of the translation system.
Computer processing of human languages has been a research topic of interest ever
since the invention of computers. Sign languages are the native languages of many
prelingually deaf people. As is the case for spoken languages, sign languages used
in different countries/communities differ substantially from each other (and also from
the spoken languages used in these countries) at lexical, morphological and syntactic
levels, and systems tailored for a specific sign language are most of the time not directly
applicable to another one. Although sign languages are real human languages, the
research focused on their computerized processing remains rather limited compared to
that for spoken languages. A very important reason behind this phenomenon is the lack
of data resources (usable in computerized systems) for most of the under-studied sign
languages. The absence of a conventional written form for sign languages naturally
makes the collection of such resources even harder: since sign languages are
commonly not written languages, there is no written corpus available that would serve
as the data for computational studies.
The difficulties that hearing-impaired individuals face in communicating smoothly in
written or verbal ways create obstacles to their access to information, education and
job opportunities. Hearing-impaired individuals use sign languages, which are visual
and animated languages, as their natural language. Turkish Sign Language (TİD) is a natural
language officially recognized in our country and used by deaf individuals living in
Turkey. As is the case for many lesser-studied natural languages, TİD also introduces
unique challenges in the area of natural language processing. Developments in computer
technology make it possible to perform translation between oral and sign languages
automatically at a certain level.
Different countries use different native sign languages (e.g., TİD in Turkey, ASL in
the U.S.A.), and these sign languages have linguistic properties that differ
from those of the spoken language(s) used in those countries.
Several studies have aimed to produce translation systems that render official
documents and educational materials from written text into sign language for deaf
people. Signing avatars are being actively developed for
Tunisian, German, English, Dutch, French and American sign languages. However,
sign languages differ substantially from each other, just as spoken languages do.
Therefore, a machine translation system developed for another language
cannot be used directly for Turkish-TİD translation.
In a few recent studies on translation from Turkish to TİD, only selected words were
translated into TİD signs (with pictures/photos, videos and avatar animations).
However, this approach leads to incorrect translations at the sentence level. Turkish and
TİD are different languages and, as with all translation between spoken and sign languages,
this is a machine translation problem that has to be studied at the syntactic and
semantic levels. It therefore carries all the challenges of machine translation.
For the same reason, the recent systems developed for Turkish have been little more
than limited dictionary-like systems that translate words to signs.
This thesis aims to develop a machine translation infrastructure to be used in the
translation of written Turkish materials into Turkish Sign Language. The work
introduced in this thesis is the first academic study conducted on this topic. The
output of the translation system is intended to be a machine-readable representation
of TİD so that it may be fed to animation systems (e.g., an avatar or a humanoid robot)
as input. With this aim, the grammatical properties of Turkish Sign Language that
will be used in Turkish-TİD machine translation, the natural language processing methods
necessary for this translation, previous machine translation studies for sign languages,
electronic sign language dictionaries, and manual annotation platforms for sign languages
are investigated within the thesis. Since the Turkish-TİD language pair lacks bilingual
data resources, we are compelled to choose rule-based machine translation (RBMT) for
our initial translation system. With an increase in the number of bilingual text corpora,
it would become possible to create example-based and statistical machine translation
systems or hybrid ones. The representation scheme proposed in this thesis aims to
remove the obstacles to this process and pave the way for rapid resource
creation.
Turkish, as a morphologically rich language with flexible word order, presents challenges for natural language processing that are different from those of other widely studied languages such as English. Therefore, one cannot directly apply the methods and findings from other languages to Turkish. In this respect, the introduced structure is expected to be valuable for pairs of similar agglutinative oral languages (e.g., Finnish, Hungarian and Korean) and sign languages.
This system will increase natural language interaction between students and teachers and contribute to studies on computer-assisted cooperation. In line with the policies of the Ministry of National Education (MEB), communication in this setting will be from teacher to student. In other words, the content that the curriculum or teacher aims to deliver will be transformed into a form that a deaf student can understand more efficiently, thus enabling the student to adapt more quickly to the mixed education classroom setting.
We use a transfer-based machine translation approach, where our transfer model is the stage consisting of the translation rules from Turkish to TİD.
The input to the translation rules component is the analysis of the source language produced by the Turkish NLP pipeline; the output, which feeds the animation layer, is a generated machine-readable representation of the target language (TİD).
In our transfer model, we aim to use both syntactic and semantic transfer. To this end, the dependency formalism is chosen for the syntactic representation of both Turkish and TİD.
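The transfer stage described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Token` schema, the `transfer` function, the head-final linearization and the two rules are hypothetical placeholders, not the rule set or the representation defined in the thesis. It only shows the general shape of a rule-based transfer step that reads a dependency analysis and emits a gloss sequence.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """One node of a dependency analysis (hypothetical, simplified schema)."""
    idx: int      # 1-based position in the source sentence
    lemma: str    # dictionary form from morphological analysis
    pos: str      # coarse part-of-speech tag
    head: int     # index of the governing token, 0 for the root
    deprel: str   # dependency relation label
    feats: dict = field(default_factory=dict)  # morphological features


def tr_upper(text):
    """Upper-case with Turkish dotted/dotless i handled explicitly."""
    return text.replace("i", "İ").replace("ı", "I").upper()


def transfer(tokens):
    """Turn an analysed Turkish sentence into a list of glosses.

    Both rules below are illustrative assumptions, not the thesis rules.
    """
    children = {}
    for tok in tokens:
        children.setdefault(tok.head, []).append(tok)

    glosses = []

    def visit(tok):
        # Assumed ordering: emit all dependents before their head.
        for child in sorted(children.get(tok.idx, []), key=lambda t: t.idx):
            visit(child)
        if tok.pos == "PUNCT":
            return  # punctuation has no gloss in this sketch
        # Assumed rule 1: drop inflection and use the lemma as the gloss.
        glosses.append(tr_upper(tok.lemma))
        # Assumed rule 2: verbal negation becomes a separate gloss.
        if tok.pos == "VERB" and tok.feats.get("Polarity") == "Neg":
            glosses.append("DEĞİL")

    for root in children.get(0, []):
        visit(root)
    return glosses


# "Ali okula gitmedi." (Ali did not go to school.)
sentence = [
    Token(1, "Ali", "PROPN", 3, "nsubj"),
    Token(2, "okul", "NOUN", 3, "obl", {"Case": "Dat"}),
    Token(3, "git", "VERB", 0, "root", {"Polarity": "Neg"}),
    Token(4, ".", "PUNCT", 3, "punct"),
]
print(transfer(sentence))  # ['ALİ', 'OKUL', 'GİT', 'DEĞİL']
```

Keeping the rules as small, separate steps over the tree mirrors the idea of an extensible rule base: a new rule would be added as another check inside `visit` without changing the traversal.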
When no sign entry with the same lexical sense as a Turkish input can be found, we aim to map our senses to concepts for semantic transfer, using a model adapted from the "Lexicon Model for Ontologies (LEMON)". Although LEMON supports some features for agglutinative languages, it seems hard to represent all the possible lexical forms of a Turkish word due to the highly agglutinative nature of the language, which complicates the creation of morphological generation rules (handled mostly by finite-state transducers in the literature).
As a straightforward solution, we propose using the Turkish NLP pipeline to reach lexical entries from the given lexical forms.
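The lookup strategy in this paragraph can be sketched as follows. All tables and names here (`LEMMA_OF_FORM`, `SENSE_OF_LEMMA`, `SIGN_OF_SENSE`, `CONCEPT_OF_SENSE`, `lookup_sign`) are hypothetical stand-ins: in a real system the lemma would come from the Turkish NLP pipeline, and the senses, concepts and sign entries from the wordnet, ontology and TİD dictionary. The sketch also assumes one sense per lemma, which sidesteps word sense disambiguation.

```python
# Stand-in for the NLP pipeline: inflected surface form -> lemma (toy data).
LEMMA_OF_FORM = {"evlerimizde": "ev", "konutlarda": "konut"}

# Toy lexical resources, assumed for illustration only.
SENSE_OF_LEMMA = {"ev": "ev#1", "konut": "konut#1"}
SIGN_OF_SENSE = {"ev#1": "EV"}  # only "ev" has a direct sign entry
CONCEPT_OF_SENSE = {"ev#1": "concept:dwelling", "konut#1": "concept:dwelling"}


def lookup_sign(surface_form):
    """Return a sign gloss for a Turkish surface form, or None if none is found."""
    # Step 1: reach the lexical entry through the lemma instead of listing
    # every inflected form in the lexicon.
    lemma = LEMMA_OF_FORM.get(surface_form, surface_form)
    sense = SENSE_OF_LEMMA.get(lemma)
    if sense is None:
        return None

    # Step 2: direct transfer when a sign shares the same lexical sense.
    if sense in SIGN_OF_SENSE:
        return SIGN_OF_SENSE[sense]

    # Step 3: semantic fallback through a shared concept.
    concept = CONCEPT_OF_SENSE.get(sense)
    if concept is None:
        return None
    for other_sense, other_concept in CONCEPT_OF_SENSE.items():
        if other_concept == concept and other_sense in SIGN_OF_SENSE:
            return SIGN_OF_SENSE[other_sense]
    return None


print(lookup_sign("evlerimizde"))  # direct sense match
print(lookup_sign("konutlarda"))   # resolved through the shared concept
```

Resolving forms to lemmas first keeps the lexicon small, since only lexical entries, not every inflected form, need to be linked to senses and concepts.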
This thesis introduces a machine-readable knowledge representation of Turkish Sign Language for the first time in the literature. One of the biggest obstacles facing statistical machine translation systems for sign languages is the collection of bilingual text corpora in machine-readable form, which are a crucial component of current state-of-the-art approaches. The representation scheme proposed in this thesis also aims to remove the obstacles to this process and pave the way for rapid resource creation. The introduced machine-readable representation scheme of TİD is linked to the ELAN annotation tool in order to produce such corpora as well as the input from which the avatar system generates natural-looking continuous sign sequences. This study also provides an online dictionary platform that houses the unique glosses of the signs, their possible variations, and the layers required to feed the ELAN tool with adequate depth of information. The developed sign language infrastructure, as well as the sign database and corpus to be generated as part of the system, will be vital for researchers working in the TİD domain.
The contributions of the thesis are as follows:
-A machine-readable knowledge representation was proposed for Turkish Sign Language,
-A parallel treebank study based on dependency formalism was conducted for an oral language and sign language pair,
-A Turkish Sign Language electronic dictionary infrastructure, which makes it possible to use the annotations in machine translation studies, was developed,
-A TİD-specific plugin for the ELAN manual annotation platform (which is widely used in the linguistic annotation of sign language discourse) was developed so that it can produce machine-readable annotations to be used in machine translation studies,
-A prototype to develop Turkish-TİD parallel data sets for machine translation studies (using the proposed annotation infrastructures) was introduced,
-An ontology infrastructure for Turkish and TİD was developed,
-A rule-based machine translation system from written Turkish to TİD, based on syntactic and partially semantic transfer, was designed and basic translation rules were proposed.
The proposed machine translation infrastructure was tested on a parallel text composed of 306 Turkish-TİD sentence pairs (selected from primary school textbooks and prepared within the scope of TÜBİTAK project 114E263). The transfer success rates were shown to fall within acceptable performance levels.
The translation system architecture is designed to be expandable. New rules obtained from TİD linguistic research can easily be incorporated into the system. In addition, new sign entries added to the TİD dictionary and new semantic relations created in the Turkish wordnet will enhance the performance of the translation system.