Institutional Scholarship

Tokenization of Japanese Text: Using a Morphological Transducer


dc.contributor.advisor Washington, Jonathan
dc.contributor.author Hanlon, Clare
dc.date.accessioned 2018-01-29T16:22:49Z
dc.date.available 2018-01-29T16:22:49Z
dc.date.issued 2018
dc.description.abstract Word segmenters comprise a vital step in the methodology of natural language processing. In languages such as English, which already necessitate word delimiters such as spaces, this task is trivial. However, in non-segmented languages such as Japanese and Chinese, a translator must accurately identify every word in a sentence before or as they attempt to parse it, and to do that requires a method of finding word boundaries without the aid of word delimiters. Much has been done in this field for the case of Chinese, as Chinese is a highly isolating language which makes the task of identifying morphological units almost isomorphic to the task of identifying syntactic units. As such, many functional Chinese Word Segmenter models already exist. Japanese, on the other hand, is a synthetic language that utilizes both inflectional and agglutinative morphology, and so the tasks of identifying morphological units and syntactic units are more separate. However, much work has also been done in the field of mapping inflected Japanese words to their root form, a process known as transduction. In this paper, I modify an existing Chinese Word Segmenter to incorporate an existing Japanese transducer into its segmentation process: specifically, the transducer's ability to detect the validity of a combination of characters is used in parallel with dynamic programming's ability to compute all possible combinations of characters in a string to find the overall number of valid tokens in a given input string. Testing shows that this approach does indeed give valid results; furthermore, its ability to return information about the grammatical tags of each token suggests that further extensions of the program could not only tokenize the text, but also infer information about its syntactic meaning in the clause. en_US
dc.description.sponsorship Swarthmore College. Dept. of Linguistics en_US
dc.language.iso en en_US
dc.rights Full copyright to this work is retained by the student author. It may only be used for non-commercial, research, and educational purposes. All other uses are restricted.
dc.title Tokenization of Japanese Text: Using a Morphological Transducer en_US
dc.rights.access No restrictions en_US
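
The abstract describes pairing a validity check supplied by a transducer with dynamic programming over all ways of splitting a string. The following is only a minimal sketch of that general idea, not the thesis's implementation: the function `segmentations`, the `max_len` cutoff, and the toy lexicon standing in for a real morphological transducer are all hypothetical names introduced for illustration.

```python
from typing import Callable, List

def segmentations(
    text: str,
    is_valid: Callable[[str], bool],
    max_len: int = 8,  # assumed cap on token length, chosen arbitrarily
) -> List[List[str]]:
    """Return every split of `text` whose pieces all pass `is_valid`."""
    # table[i] holds every valid segmentation of the prefix text[:i].
    table: List[List[List[str]]] = [[] for _ in range(len(text) + 1)]
    table[0] = [[]]  # the empty prefix has exactly one (empty) segmentation
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            # Extend a segmentation of text[:start] only if the new piece
            # is accepted by the validity check.
            if table[start] and is_valid(piece):
                table[end].extend(prefix + [piece] for prefix in table[start])
    return table[len(text)]

# Hypothetical stand-in for a transducer lookup: a real system would ask the
# transducer whether the candidate string has at least one analysis.
TOY_LEXICON = {"私", "は", "学生", "です", "学", "生"}

def toy_is_valid(token: str) -> bool:
    return token in TOY_LEXICON

results = segmentations("私は学生です", toy_is_valid)
for tokens in results:
    print(" / ".join(tokens))
```

With this toy lexicon the input has two candidate segmentations, because 学生 can also be read as 学 followed by 生. Choosing between candidates (for example by preferring fewer tokens, or by using the grammatical tags the abstract mentions) is left out of the sketch.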
