Speech and text driven 3D face synthesis for the hearing impaired

Savran, Arman

Date

2004

Abstract

ÖZET (translated from Turkish)

The aim of this thesis is to develop a system that generates visual speech from the speech of any person, in order to help the hearing impaired through lip reading. In this work, a system that synthesizes face points to drive MPEG-4 facial animation was implemented. To produce realistic and natural speech animation, a codebook-based technique trained with audio and visual data from one speaker was used. Because training is performed with only one speaker, the technique is speaker-dependent, and its performance can drop significantly when it is used by different speakers. To improve the speaker-independent performance of the system, a new codebook was created by extending the single-speaker codebook with audio data from a small number of speakers. For training, audio-visual and audio-only databases were prepared using phonetically balanced Turkish texts. To collect synchronized audio and visual data, a 3D facial motion capture system was developed. This system uses a stereo camera and round stickers to track and reconstruct the speakers' 3D face points, and requires a personal computer to process the video. The synthesis performance of the system was measured through various tests. The system can animate faces for visual speech from the voice of any Turkish speaker.

ABSTRACT

The goal of this thesis is to develop a system that generates visual speech from the input speech of any speaker, in order to aid the hearing impaired by means of lip reading. In this study, an initial system that synthesizes face points to drive an MPEG-4 facial animation engine was implemented. To produce realistic and natural speech animation, a codebook-based technique, trained on audio and visual data from a single speaker, was employed. Since training is performed with only one speaker, this technique is speaker-dependent, and its performance can degrade considerably when it is used by different speakers. To improve the speaker-independent performance of the system, a new codebook was created by extending the single-speaker codebook with audio data from a small number of speakers. For training the system, audio-visual and audio-only speech databases were collected using a phonetically balanced Turkish speech corpus. To capture the synchronized audio-visual data, a 3D facial motion capture system was developed. This data capture system employs a stereo camera and circular stickers to track and reconstruct the 3D face points of the speakers, and requires a single PC to stream and process the video. The synthesis performance of the system was evaluated through objective tests. The system is capable of animating faces for visual speech from the input speech of any Turkish speaker.
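
The codebook-based mapping mentioned in the abstract can be illustrated with a minimal sketch. It assumes each codebook entry pairs an audio feature vector with a vector of face-point coordinates, and that synthesis picks the entry whose audio vector is nearest (Euclidean distance) to each incoming audio frame. The abstract does not state the thesis's actual features, distance measure, or codebook construction, so the function names `nearest_entry` and `synthesize` and the toy values are hypothetical and serve only as an illustration.

```python
import math

def nearest_entry(codebook, audio_frame):
    """Return the (audio, visual) entry whose audio vector is closest to audio_frame.

    Assumption: Euclidean distance on audio features; the thesis may use another measure.
    """
    return min(codebook, key=lambda entry: math.dist(entry[0], audio_frame))

def synthesize(codebook, audio_frames):
    """Map each audio frame to the face-point vector of its nearest codebook entry."""
    return [nearest_entry(codebook, frame)[1] for frame in audio_frames]

# Toy codebook (hypothetical values): 2-D audio features -> 3 face-point coordinates.
codebook = [
    ((0.0, 0.0), (1.0, 1.0, 1.0)),  # e.g. a closed-mouth configuration
    ((1.0, 1.0), (2.0, 2.0, 2.0)),  # e.g. an open-mouth configuration
]

# Two incoming audio frames, each close to one of the codebook entries.
frames = [(0.1, -0.1), (0.9, 1.2)]
trajectory = synthesize(codebook, frames)
print(trajectory)
```

In such a scheme, a codebook built from one speaker only matches that speaker's audio well, which is consistent with the speaker dependence the abstract describes; adding audio entries from more speakers widens the range of inputs that find a close match.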

URI

https://acikbilim.yok.gov.tr/handle/20.500.12812/77485

Collections

TEZLER

Except where otherwise noted, this item's license is described as info:eu-repo/semantics/embargoedAccess