Creator: Grady Ward
Licence: Public Domain
Notes: A large collection of lexical data, created and released into the public domain by Grady Ward.
Links
Grady Ward's Moby, updated 2000-10-24

Wikipedia Article, updated 2008-12-28
Files
moby.tar.bz2, updated 2008-08-05, download size 17714524 bytes
A local copy of the original distribution, recompressed with bzip2.

moby.tar.Z, updated 2008-08-05, download size 25939528 bytes
The complete original distribution.

import_moby.sql.bz2, updated 2009-02-17, download size 31592548 bytes
PostgreSQL dump of the imported data from the Moby Project.

This page contains a detailed analysis of the data distributed by the Moby Project and of the steps taken to convert it to a standardised form suitable for use in the WIXML language database.


Data Conversion Issues

Although most of the files in Moby are plain Macintosh text files, there were a number of data conversion issues which required special attention. The following table gives an overview of the files in the collection and how they have been handled.

Key File Lines Note Word Case Encoding
1 mpos/mobyposi.i 233355 parts of speech true MACINTOSH
2 mpos/readme 84
3 mhyph/mhyph.txt 187175 hyphenation dictionary true MACINTOSH
4 mhyph/readme 75
5 mlang/german.txt 159809 German words true true 437
6 mlang/japanese.txt 115519 Japanese words true MACINTOSH
7 mlang/spanish.txt 86059 Spanish words true
8 mlang/italian.txt 60453 Italian words true
9 mlang/readme 49
10 mlang/french.txt 138257 French words true
11 mpron/creadme 53
12 mpron/mobypron.unc 177267 pronunciation true MACINTOSH
13 mpron/cmudict0.3 110989 CMU pronunciation dictionary
14 mpron/phoneset.3 43
15 mpron/readme 152
16 mshak/shakespe.are 127568 complete works of Shakespeare true
17 mshak/readme 88
18 mthes/roget13a.txt 26588
19 mthes/readme 102
20 mthes/mobythes.aur 30260 thesaurus true
21 mwords/10001fr.equ 1002 frequency common words
22 mwords/10002fr.equ 1001 frequency usenet
23 mwords/6213acro.nym 6213 acronyms true true
24 mwords/usaconst.itu 684 American constitution true MACINTOSH
25 mwords/74550com.mon 74550 common words true true
26 mwords/3897male.nam 3897 male names true true
27 mwords/10196pla.ces 10196 American place names true true
28 mwords/4160offi.cia 4160 official Scrabble supplement true
29 mwords/113809of.fic 113809 official crossword words true
30 mwords/readme.txt 100
31 mwords/354984si.ngl 354984 single words true true
32 mwords/366often.mis 366 often misspelled words true true
33 mwords/21986na.mes 21986 common names true true 437
34 mwords/467popul.arf 467 frequency fiction
35 mwords/256772co.mpo 256765 compound words true true MACINTOSH
36 mwords/4946fema.len 4946 female names true true
37 mwords/1185kjvf.req 1185 frequency Bible


Errors and Corrections

In most files, collocations were defined using an underscore character instead of a space. The following words in the pronunciation file did not use underscores and were corrected before further processing: betes noires, hors d'oeuvre, leit motiv, maitre d'hotel, objet d'art, raison d'etre.
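
The correction described above could be sketched as follows. This is only an illustration, not the script that was actually used for the import; the function name `fix_collocations` is hypothetical, and the phrase list is simply the six collocations named above.

```python
# The six collocations listed above that were written with spaces.
COLLOCATIONS = [
    "betes noires",
    "hors d'oeuvre",
    "leit motiv",
    "maitre d'hotel",
    "objet d'art",
    "raison d'etre",
]


def fix_collocations(line):
    """Replace the spaces inside each known collocation with underscores."""
    for phrase in COLLOCATIONS:
        line = line.replace(phrase, phrase.replace(" ", "_"))
    return line
```

Text that does not contain any of the listed phrases is returned unchanged, so the function can be applied to every line of the file.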

Spurious characters were found in some files.

Please email if you know of any other errors which are not listed here.


Character Encodings

The language files used various ASCII representations for accented characters. The following character sequences in the Spanish, Italian and French files were converted to UTF-8 encoding before processing. Note that the French language file seems to use a different method from the other files.

Files             ASCII  UTF-8  Character
Italian           a'     à      "a" grave
French            a`     à      "a" grave
Spanish           a`     á      "a" acute
French            a^     â      "a" circumflex
French            a"     ä      "a" diaeresis
French            c/     ç      "c" cedilla
Italian           e'     è      "e" grave
French            e`     è      "e" grave
French            e'     é      "e" acute
Spanish           e`     é      "e" acute
French            e^     ê      "e" circumflex
French            e"     ë      "e" diaeresis
Italian           i'     ì      "i" grave
French            i`     ì      "i" grave
Spanish           i`     í      "i" acute
French            i^     î      "i" circumflex
French            i"     ï      "i" diaeresis
Italian           o'     ò      "o" grave
French            o`     ò      "o" grave
Spanish, Italian  o`     ó      "o" acute
French            o^     ô      "o" circumflex
French            o"     ö      "o" diaeresis
Italian           u'     ù      "u" grave
French            u`     ù      "u" grave
Spanish, Italian  u`     ú      "u" acute
French            u^     û      "u" circumflex
French            u"     ü      "u" diaeresis
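
Because the same ASCII sequence stands for different characters in different files, the conversion has to be done per language. The following is a minimal sketch of how the table above might be applied. The names `ACCENT_MAPS` and `decode_accents` are hypothetical, and only lower-case sequences are handled, which is an assumption about the data.

```python
import re

# Per-language mappings transcribed from the table above.
ACCENT_MAPS = {
    "french": {
        "a`": "à", "a^": "â", 'a"': "ä", "c/": "ç",
        "e`": "è", "e'": "é", "e^": "ê", 'e"': "ë",
        "i`": "ì", "i^": "î", 'i"': "ï",
        "o`": "ò", "o^": "ô", 'o"': "ö",
        "u`": "ù", "u^": "û", 'u"': "ü",
    },
    "italian": {
        "a'": "à", "e'": "è", "i'": "ì",
        "o'": "ò", "o`": "ó", "u'": "ù", "u`": "ú",
    },
    "spanish": {
        "a`": "á", "e`": "é", "i`": "í", "o`": "ó", "u`": "ú",
    },
}


def decode_accents(text, language):
    """Replace each two-character ASCII sequence with its accented letter."""
    table = ACCENT_MAPS[language]
    # Escape the keys because some contain regex metacharacters such as "^".
    pattern = re.compile("|".join(re.escape(seq) for seq in table))
    return pattern.sub(lambda match: table[match.group(0)], text)
```

For example, `decode_accents("fac/ade", "french")` gives `façade`. A sequence such as `e'` could in principle also be a letter followed by a genuine apostrophe, which this sketch does not try to distinguish.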


Pronunciation Scheme

The Moby pronunciation file uses a scheme that approximates the International Phonetic Alphabet (IPA) using ASCII. The following table was compiled from information included with the Moby distribution, and also a suggested mapping described in the Wikipedia article about the Moby Project. Please note that this table is known to be incomplete, and applying it to the Moby pronunciation file is still a work in progress. For now, the pronunciation table in the WIXML database only contains the original ASCII version of the pronunciation.

ASCII  UTF-8  Examples
&      æ      a in dab
(@)    ɛ      a in air
-      ə      ir glide in tire; dl glide in handle; den glide in sodden
@      ʌ      a in ado; u in cup; glide e in system
@r     ɚ      u in burn
A      ɑ      a in ami; a in far; o in bob
AU            ow in how
D      ð      th in the
E      ɛ      e in red
I      ɪ      i in hid
N      ŋ      ng in bang; n in Francoise
O      ɔ      o in dog
Oi     ɔɪ     oi in oil
R      ɻ      r in Der
S      ʃ      sh in she
T      θ      th in bath
U      ʊ      oo in book
Y      u      u in Dubois
Z      ʒ      s in vision
[@]r   ɝ
aI            i in ice
b      b      b in nab
d      d      d in pod
dZ     ʤ      g in vegetable
eI            a in day
f      f      f in elf
g      g      g in fig
h      h      h in had
hw     ʍ      w in white
i      i      e in see
ir     ɪɹ
j      j      y in you
k      k      c in act
l      l      l in ail
m      m      m in aim
n      n      n in and
oU            o in boat
oUr    ɔɹ
p      p      p in imp
r      ɹ      r in ire
s      s      s in sip
t      t      t in tap
tS     ʧ      ch in ouch
u      u      oo in too
v      v      v in average
w      w      w in win
x      x      ch in Bach
y      y      eu in cordon bleu
z      z      z in zoo
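
As a rough illustration of how this table might eventually be applied, the sketch below maps a pronunciation string to IPA. It assumes that the phoneme codes in a pronunciation are separated by slashes, which is not stated on this page, and the names `MOBY_TO_IPA` and `to_ipa` are hypothetical. Codes with no IPA equivalent in the table (AU, aI, eI, oU) and any other unrecognised segments are passed through unchanged.

```python
# Codes from the table above that have an IPA equivalent.
MOBY_TO_IPA = {
    "&": "æ", "(@)": "ɛ", "-": "ə", "@": "ʌ", "@r": "ɚ", "A": "ɑ",
    "D": "ð", "E": "ɛ", "I": "ɪ", "N": "ŋ", "O": "ɔ", "Oi": "ɔɪ",
    "R": "ɻ", "S": "ʃ", "T": "θ", "U": "ʊ", "Y": "u", "Z": "ʒ",
    "[@]r": "ɝ", "b": "b", "d": "d", "dZ": "ʤ", "f": "f", "g": "g",
    "h": "h", "hw": "ʍ", "i": "i", "ir": "ɪɹ", "j": "j", "k": "k",
    "l": "l", "m": "m", "n": "n", "oUr": "ɔɹ", "p": "p", "r": "ɹ",
    "s": "s", "t": "t", "tS": "ʧ", "u": "u", "v": "v", "w": "w",
    "x": "x", "y": "y", "z": "z",
}


def to_ipa(pronunciation):
    """Convert a slash-separated string of Moby codes to IPA.

    Assumption: codes are separated by "/". Segments that are not in
    MOBY_TO_IPA are kept in their original ASCII form.
    """
    segments = pronunciation.split("/")
    return "".join(MOBY_TO_IPA.get(seg, seg) for seg in segments)
```

Under these assumptions, `to_ipa("k/&/t")` gives `kæt`, while a segment such as `aI` is left as it is.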

Bear in mind that the specification of pronunciation varies in precision from narrow or allophonic transcription (specific sounds) to broad or phonemic transcription (similar sounds).

Phonemic transcription is indicated by enclosing the symbols between slashes ('/'), and allophonic transcription by enclosing them between square brackets ('[' and ']'). The vast majority of the Moby pronunciation data is phonemic, which means it should be precise enough for speech recognition applications but not for high-quality speech synthesis.

For more information please see the Wikipedia article on the subject.


Database Implementation

To install this database with PostgreSQL, download the SQL database dump from import_moby.sql.bz2 and issue the following shell commands from the Linux command line.

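# Create an empty database named wixml
createdb wixml
# Decompress the dump (this replaces the .bz2 file with import_moby.sql)
bunzip2 import_moby.sql.bz2
# Load the dump into the new database
psql -f import_moby.sql -d wixml
# Vacuum the database and update the planner statistics
vacuumdb -z -d wixml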

Schema Overview

As the diagram shows, the entities extracted from Moby can be divided into two levels. The first level contains lists of words, hyphenations and pronunciations. The second level records the associations between these entities. Since all these associations are many-to-many in nature, it is necessary to store words, hyphenations and pronunciations in different tables.

While it might seem unexpected that a hyphenation could be associated with more than one word, this is because the word list contains both plain and accented versions of many words. In some cases there is more than one accented version of a word.

Note that there are no separate tables for language or proper noun classes in this database design. This information is stored in an array of file numbers associated with each word record. For example, a word is French if the file number array contains 10, and German if it contains 5. Some words may, of course, be present in both languages. The following table lists all of the relevant file numbers and the predicates they represent.

Key Note Key Note
5 German words 27 American place names
6 Japanese words 28 official Scrabble supplement
7 Spanish words 29 official crossword words
8 Italian words 31 single words
10 French words 32 often misspelled words
23 acronyms 33 common names
25 common words 35 compound words
26 male names 36 female names
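
The lookup described above can be sketched in a few lines. The names `FILE_PREDICATES` and `predicates_for` are hypothetical, and the input list stands in for the file number array of a single word record.

```python
# File numbers and the predicates they represent, from the table above.
FILE_PREDICATES = {
    5: "German words", 6: "Japanese words", 7: "Spanish words",
    8: "Italian words", 10: "French words", 23: "acronyms",
    25: "common words", 26: "male names", 27: "American place names",
    28: "official Scrabble supplement", 29: "official crossword words",
    31: "single words", 32: "often misspelled words", 33: "common names",
    35: "compound words", 36: "female names",
}


def predicates_for(file_numbers):
    """Return the predicates implied by a word's file number array."""
    return [
        FILE_PREDICATES[number]
        for number in sorted(set(file_numbers))
        if number in FILE_PREDICATES  # other files carry no predicate
    ]
```

A word whose array is `[10, 5]` is therefore both German and French. Inside PostgreSQL the same membership test would presumably be written with the array operator, for example `10 = ANY (fFiles)`.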


Schema Definition

schema Import_Moby

table tClass
kClass integer primary key
fName array of text not null unique
sWords integer
fCode1 text
fCode2 text

table tSynonym
kSynonym integer primary key
fGroup integer
fIndex integer
pWord integer not null references tWord
unique (fGroup, fIndex, pWord)

table tFile
kFile integer primary key
fName text not null unique
sLines integer
fNote text
fWord boolean
fCase boolean
fEncoding text

table tWord
kWord integer primary key
fText text not null unique
fFiles array of integer

table tHyphenation
kHyphenation integer primary key
fBreaks array of text not null unique

table tWordClass
kWordClass integer primary key
pWord integer not null references tWord
pClass integer not null references tClass
fRank integer
unique (pWord, pClass)

table tPhoneme
kPhoneme integer primary key
fCode text not null unique
fIPA text
fExamples array of text

table tWordHyphenation
kWordHyphenation integer primary key
pWord integer not null references tWord
pHyphenation integer not null references tHyphenation
unique (pWord, pHyphenation)

table tPronunciation
kPronunciation integer primary key
fPhonemes text not null unique

table tWordPronunciation
kWordPronunciation integer primary key
pWord integer not null references tWord
pPronunciation integer not null references tPronunciation
pClass integer references tClass
unique (pWord, pPronunciation, pClass)
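
To illustrate how the association tables are meant to be used, the sketch below rebuilds tWord, tHyphenation and tWordHyphenation in an in-memory SQLite database and looks up every word that shares one hyphenation. It is not the PostgreSQL import itself: SQLite has no array type, so fBreaks is stored here as a hyphen-joined string, and the two sample rows are invented purely for the example.

```python
import sqlite3


def build_demo():
    """Create a tiny in-memory copy of three of the tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE tWord (
            kWord INTEGER PRIMARY KEY,
            fText TEXT NOT NULL UNIQUE
        );
        CREATE TABLE tHyphenation (
            kHyphenation INTEGER PRIMARY KEY,
            fBreaks TEXT NOT NULL UNIQUE  -- assumption: array flattened to text
        );
        CREATE TABLE tWordHyphenation (
            kWordHyphenation INTEGER PRIMARY KEY,
            pWord INTEGER NOT NULL REFERENCES tWord,
            pHyphenation INTEGER NOT NULL REFERENCES tHyphenation,
            UNIQUE (pWord, pHyphenation)
        );
    """)
    # Hypothetical sample data: a plain and an accented form of one word.
    db.executemany(
        "INSERT INTO tWord (kWord, fText) VALUES (?, ?)",
        [(1, "resume"), (2, "résumé")],
    )
    db.execute(
        "INSERT INTO tHyphenation (kHyphenation, fBreaks) VALUES (1, 're-su-me')"
    )
    db.executemany(
        "INSERT INTO tWordHyphenation (pWord, pHyphenation) VALUES (?, ?)",
        [(1, 1), (2, 1)],
    )
    return db


def words_for_hyphenation(db, breaks):
    """Return every word linked to the given hyphenation."""
    rows = db.execute(
        "SELECT w.fText FROM tWord w "
        "JOIN tWordHyphenation wh ON wh.pWord = w.kWord "
        "JOIN tHyphenation h ON h.kHyphenation = wh.pHyphenation "
        "WHERE h.fBreaks = ? ORDER BY w.kWord",
        (breaks,),
    )
    return [row[0] for row in rows]
```

Here both the plain and the accented word point at the same hyphenation record, which is the many-to-many case described in the schema overview.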


Database development, website content and design, copyright 2009 by Andrew Smith




