Tokenization of Japanese Text: Using a Morphological Transducer
Date
2018
Department
Swarthmore College. Dept. of Linguistics
Language
en
Terms of Use
Full copyright to this work is retained by the student author. It may only be used for non-commercial, research, and educational purposes. All other uses are restricted.
Access Restrictions
No restrictions
Abstract
Word segmentation is a vital step in natural language processing. In languages
such as English, which already mark word boundaries with delimiters such as
spaces, this task is trivial. However, in non-segmented languages such as
Japanese and Chinese, a translator must accurately identify every word in a sentence
before or while attempting to parse it, which requires a method of finding word
boundaries without the aid of word delimiters. Much work has been done in this field for
Chinese, a highly isolating language, which makes the task of
identifying morphological units almost isomorphic to the task of identifying syntactic
units. As such, many functional Chinese Word Segmenter models already exist.
Japanese, on the other hand, is a synthetic language that uses both inflectional and
agglutinative morphology, and so the tasks of identifying morphological units and
syntactic units are more distinct. However, much work has also been done on
mapping inflected Japanese words to their root forms, a process known as
transduction. In this paper, I modify an existing Chinese Word Segmenter to incorporate
an existing Japanese transducer into its segmentation process. Specifically, the
transducer's ability to detect the validity of a combination of characters is
combined with dynamic programming's ability to enumerate all possible combinations of
characters in a string to find the overall number of valid tokens in a given input string.
Testing shows that this approach does give valid results. Furthermore, its ability
to return the grammatical tags of each token suggests that further
extensions of the program could not only tokenize the text, but also infer information
about each token's syntactic role in the clause.
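
The combination of a validity check with dynamic programming described above can be illustrated with a minimal sketch. This is not the thesis's implementation: the names `segmentations`, `toy_is_valid`, `TOY_LEXICON`, and `max_len` are hypothetical, and a small set lookup stands in for the morphological transducer, which in the actual work would decide whether a character sequence is a valid (possibly inflected) word.

```python
from typing import Callable, List

def segmentations(text: str,
                  is_valid: Callable[[str], bool],
                  max_len: int = 8) -> List[List[str]]:
    """Return every split of `text` into pieces accepted by `is_valid`."""
    # table[i] holds all valid segmentations of the prefix text[:i].
    table: List[List[List[str]]] = [[] for _ in range(len(text) + 1)]
    table[0] = [[]]  # the empty prefix has exactly one (empty) segmentation
    for end in range(1, len(text) + 1):
        # Only consider candidate tokens of at most max_len characters.
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            # Extend segmentations of text[:start] when the piece is valid.
            if table[start] and is_valid(piece):
                table[end].extend(seg + [piece] for seg in table[start])
    return table[len(text)]

# Hypothetical toy lexicon standing in for the transducer's validity check.
TOY_LEXICON = {"私", "は", "学生", "です", "学", "生"}

def toy_is_valid(candidate: str) -> bool:
    return candidate in TOY_LEXICON

result = segmentations("私は学生です", toy_is_valid)
for seg in result:
    print(" | ".join(seg))
```

With this toy lexicon, the input is ambiguous between treating 学生 as one token or as two, so more than one segmentation is returned. A real system would need a way to rank the candidates, for example by using the grammatical tags the transducer returns.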