Tokenization of Japanese Text: Using a Morphological Transducer

Date
2018
Department
Swarthmore College. Dept. of Linguistics
Language
en
Terms of Use
Full copyright to this work is retained by the student author. It may only be used for non-commercial, research, and educational purposes. All other uses are restricted.
Access Restrictions
No restrictions
Abstract
Word segmenters are a vital step in natural language processing pipelines. In languages such as English, which delimit words with spaces, the task is trivial. In non-segmented languages such as Japanese and Chinese, however, a system must accurately identify every word in a sentence before or while parsing it, and doing so requires a method of finding word boundaries without the aid of explicit delimiters. Much work has been done for the case of Chinese: because Chinese is a highly isolating language, identifying morphological units is nearly isomorphic to identifying syntactic units, and many functional Chinese word segmenters already exist. Japanese, on the other hand, is a synthetic language with both inflectional and agglutinative morphology, so the tasks of identifying morphological units and syntactic units are more distinct. However, much work has also been done on mapping inflected Japanese words to their root forms, a process known as transduction. In this paper, I modify an existing Chinese word segmenter to incorporate an existing Japanese transducer into its segmentation process: the transducer's ability to judge whether a sequence of characters forms a valid token is combined with dynamic programming, which enumerates all possible segmentations of an input string, to find the valid tokens the string contains. Testing shows that this approach does give valid results; furthermore, because the transducer returns grammatical tags for each token, extensions of the program could not only tokenize the text but also infer information about each token's syntactic role in the clause.
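The combination the abstract describes — dynamic programming over all possible character spans, with a transducer deciding token validity — can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: `is_valid_token` is a hypothetical stub backed by a toy lexicon, standing in for the morphological transducer's analysis lookup.

```python
# Toy stand-in for the transducer's lexicon (hypothetical entries).
TOY_LEXICON = {"猫", "が", "魚", "を", "食べた"}

def is_valid_token(s: str) -> bool:
    """Stand-in for the transducer: does `s` analyze as a valid token?"""
    return s in TOY_LEXICON

def segment(text: str):
    """Return one valid segmentation of `text` as a list of tokens, or None.

    best[i] holds a segmentation of text[:i]; each position is filled by
    trying every split point j < i whose prefix text[:j] is already
    segmentable and whose suffix text[j:i] the transducer accepts.
    """
    n = len(text)
    best = [None] * (n + 1)
    best[0] = []  # the empty string segments trivially
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and is_valid_token(text[j:i]):
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]

print(segment("猫が魚を食べた"))  # → ['猫', 'が', '魚', 'を', '食べた']
```

In a real system, `is_valid_token` would query the transducer (and could return grammatical tags alongside validity), and the DP could be extended to score competing segmentations rather than accept the first one found.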