| Creator | Grady Ward |
|---|---|
| Licence | Public Domain |
| Notes | A huge collection of lexical data which was created and released into the public domain by Grady Ward. |
| Links | Grady Ward's Moby, updated 2000-10-24. Wikipedia Article, updated 2008-12-28. |
| Files | A local copy of the original distribution, recompressed with bzip2. The complete original distribution. PostgreSQL dump of the imported data from the Moby Project. |
This page contains a detailed analysis of the data that was distributed by the Moby project and the steps taken to convert it to a standardised form, suitable for use in the WIXML language database.
Although most of the files in Moby are plain Macintosh text files, there were a number of data conversion issues which required special attention. The following table gives an overview of the files in the collection and how they have been handled.
| Key | File | Lines | Note | Word | Case | Encoding |
|---|---|---|---|---|---|---|
| 1 | mpos/mobyposi.i | 233355 | parts of speech | true | | MACINTOSH |
| 2 | mpos/readme | 84 | ||||
| 3 | mhyph/mhyph.txt | 187175 | hyphenation dictionary | true | | MACINTOSH |
| 4 | mhyph/readme | 75 | ||||
| 5 | mlang/german.txt | 159809 | German words | true | true | 437 |
| 6 | mlang/japanese.txt | 115519 | Japanese words | true | | MACINTOSH |
| 7 | mlang/spanish.txt | 86059 | Spanish words | true | ||
| 8 | mlang/italian.txt | 60453 | Italian words | true | ||
| 9 | mlang/readme | 49 | ||||
| 10 | mlang/french.txt | 138257 | French words | true | ||
| 11 | mpron/creadme | 53 | ||||
| 12 | mpron/mobypron.unc | 177267 | pronunciation | true | | MACINTOSH |
| 13 | mpron/cmudict0.3 | 110989 | CMU pronunciation dictionary | |||
| 14 | mpron/phoneset.3 | 43 | ||||
| 15 | mpron/readme | 152 | ||||
| 16 | mshak/shakespe.are | 127568 | complete works of Shakespeare | true | ||
| 17 | mshak/readme | 88 | ||||
| 18 | mthes/roget13a.txt | 26588 | ||||
| 19 | mthes/readme | 102 | ||||
| 20 | mthes/mobythes.aur | 30260 | thesaurus | true | ||
| 21 | mwords/10001fr.equ | 1002 | frequency common words | |||
| 22 | mwords/10002fr.equ | 1001 | frequency usenet | |||
| 23 | mwords/6213acro.nym | 6213 | acronyms | true | true | |
| 24 | mwords/usaconst.itu | 684 | American constitution | true | | MACINTOSH |
| 25 | mwords/74550com.mon | 74550 | common words | true | true | |
| 26 | mwords/3897male.nam | 3897 | male names | true | true | |
| 27 | mwords/10196pla.ces | 10196 | American place names | true | true | |
| 28 | mwords/4160offi.cia | 4160 | official Scrabble supplement | true | ||
| 29 | mwords/113809of.fic | 113809 | official crossword words | true | ||
| 30 | mwords/readme.txt | 100 | ||||
| 31 | mwords/354984si.ngl | 354984 | single words | true | true | |
| 32 | mwords/366often.mis | 366 | often misspelled words | true | true | |
| 33 | mwords/21986na.mes | 21986 | common names | true | true | 437 |
| 34 | mwords/467popul.arf | 467 | frequency fiction | |||
| 35 | mwords/256772co.mpo | 256765 | compound words | true | true | MACINTOSH |
| 36 | mwords/4946fema.len | 4946 | female names | true | true | |
| 37 | mwords/1185kjvf.req | 1185 | frequency Bible | |||
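The Encoding column above can be turned into a decoding step along the following lines. This is only a sketch, not the import code actually used for the database: the function name `decode_moby`, the mapping of the labels to Python's `mac_roman` and `cp437` codecs, and the choice to treat unlabelled files as ASCII are all assumptions made for illustration.

```python
# Assumed mapping from the table's Encoding column to Python codec names.
CODECS = {"MACINTOSH": "mac_roman", "437": "cp437"}


def decode_moby(raw: bytes, encoding_label: str = "") -> list[str]:
    """Decode the bytes of one Moby file and return its non-empty lines."""
    # Assumption: files with no Encoding label are plain ASCII.
    codec = CODECS.get(encoding_label, "ascii")
    # errors="replace" turns undecodable bytes into U+FFFD so that
    # spurious characters can be located after decoding.
    text = raw.decode(codec, errors="replace")
    # Classic Macintosh text uses a bare CR as line ending; normalise
    # CRLF and CR to LF before splitting.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Blank lines are dropped, which suits word lists but not running text.
    return [line for line in text.split("\n") if line]
```

For a file on disk, the bytes could be obtained with `pathlib.Path("mlang/german.txt").read_bytes()` and passed in with the label `"437"`.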
In most files, collocations were defined using an underscore character instead of a space. The following words in the pronunciation file did not use underscores and were corrected before further processing: betes noires, hors d'oeuvre, leit motiv, maitre d'hotel, objet d'art, raison d'etre.
Spurious characters were found in some files.
Please email if you know of any other errors which are not listed here.
The language files used various ASCII representations for accented characters. The following character sequences in the Spanish, Italian and French files were converted to UTF-8 encoding before processing. Note that the French language file seems to use a different method from the other files.
| Files | ASCII | UTF8 | Character | Files | ASCII | UTF8 | Character |
|---|---|---|---|---|---|---|---|
| Italian | a' | à | "a" grave | Spanish | i` | í | "i" acute |
| French | a` | à | "a" grave | French | i^ | î | "i" circumflex |
| Spanish | a` | á | "a" acute | French | i" | ï | "i" diaeresis |
| French | a^ | â | "a" circumflex | Italian | o' | ò | "o" grave |
| French | a" | ä | "a" diaeresis | French | o` | ò | "o" grave |
| French | c/ | ç | "c" cedilla | Spanish, Italian | o` | ó | "o" acute |
| Italian | e' | è | "e" grave | French | o^ | ô | "o" circumflex |
| French | e` | è | "e" grave | French | o" | ö | "o" diaeresis |
| French | e' | é | "e" acute | Italian | u' | ù | "u" grave |
| Spanish | e` | é | "e" acute | French | u` | ù | "u" grave |
| French | e^ | ê | "e" circumflex | Spanish, Italian | u` | ú | "u" acute |
| French | e" | ë | "e" diaeresis | French | u^ | û | "u" circumflex |
| Italian | i' | ì | "i" grave | French | u" | ü | "u" diaeresis |
| French | i` | ì | "i" grave | | | | |
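The table above can be applied mechanically, one language at a time. The following is a minimal sketch under that assumption; the names `ACCENTS` and `convert_accents` are invented here, and the code makes no attempt to tell an accent mark from a genuine apostrophe or quotation mark, so real data may need extra checks.

```python
import re

# Per-language replacement tables transcribed from the table above.
ACCENTS = {
    "italian": {"a'": "à", "e'": "è", "i'": "ì", "o'": "ò", "u'": "ù",
                "o`": "ó", "u`": "ú"},
    "spanish": {"a`": "á", "e`": "é", "i`": "í", "o`": "ó", "u`": "ú"},
    "french": {"a`": "à", "a^": "â", 'a"': "ä", "c/": "ç",
               "e`": "è", "e'": "é", "e^": "ê", 'e"': "ë",
               "i`": "ì", "i^": "î", 'i"': "ï",
               "o`": "ò", "o^": "ô", 'o"': "ö",
               "u`": "ù", "u^": "û", 'u"': "ü"},
}


def convert_accents(word: str, language: str) -> str:
    """Replace two-character ASCII accent sequences with accented letters."""
    table = ACCENTS[language]
    # Every key is exactly two characters, so a plain alternation is enough.
    pattern = re.compile("|".join(re.escape(key) for key in table))
    return pattern.sub(lambda match: table[match.group(0)], word)
```

The result is an ordinary Python string; it becomes UTF-8 only when written out, for example with `open(path, "w", encoding="utf-8")`. Keeping one table per language matters because the same sequence means different things in different files: `e'` is "e" grave in Italian but "e" acute in French.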
The Moby pronunciation file uses a scheme that approximates the International Phonetic Alphabet (IPA) using ASCII. The following table was compiled from information included with the Moby distribution, and also a suggested mapping described in the Wikipedia article about the Moby Project. Please note that this table is known to be incomplete, and applying it to the Moby pronunciation file is still a work in progress. For now, the pronunciation table in the WIXML database only contains the original ASCII version of the pronunciation.
| ASCII | UTF8 | Examples | ASCII | UTF8 | Examples |
|---|---|---|---|---|---|
| & | æ | a in dab | f | f | f in elf |
| (@) | ɛ | a in air | g | g | g in fig |
| - | ə | ir glide in tire; dl glide in handle; den glide in sodden | h | h | h in had |
| @ | ʌ | a in ado; u in cup; glide e in system | hw | ʍ | w in white |
| @r | ɚ | u in burn | i | i | e in see |
| A | ɑ | a in ami; a in far; o in bob | ir | ɪɹ | |
| AU | aʊ | ow in how | j | j | y in you |
| D | ð | th in the | k | k | c in act |
| E | ɛ | e in red | l | l | l in ail |
| I | ɪ | i in hid | m | m | m in aim |
| N | ŋ | ng in bang; n in Francoise | n | n | n in and |
| O | ɔ | o in dog | oU | oʊ | o in boat |
| Oi | ɔɪ | oi in oil | oUr | ɔɹ | |
| R | ɻ | r in Der | p | p | p in imp |
| S | ʃ | sh in she | r | ɹ | r in ire |
| T | θ | th in bath | s | s | s in sip |
| U | ʊ | oo in book | t | t | t in tap |
| Y | u | u in Dubois | tS | ʧ | ch in ouch |
| Z | ʒ | s in vision | u | u | oo in too |
| [@]r | ɝ | | v | v | v in average |
| aI | aɪ | i in ice | w | w | w in win |
| b | b | b in nab | x | x | ch in Bach |
| d | d | d in pod | y | y | eu in cordon bleu |
| dZ | ʤ | g in vegetable | z | z | z in zoo |
| eI | eɪ | a in day | | | |
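One way to apply a table like this is a greedy, longest-match-first substitution, sketched below. This is an illustration only and not the conversion used for the database, which as noted above is still incomplete. The names `MOBY_TO_IPA` and `to_ipa` are invented here, symbols that map to themselves are omitted, and any character not in the table (including whatever stress or separator marks the file uses) is passed through unchanged.

```python
import re

# Non-identity mappings transcribed from the table above; symbols such as
# b, d, f or k map to themselves and are therefore left out.
MOBY_TO_IPA = {
    "&": "æ", "(@)": "ɛ", "-": "ə", "@": "ʌ", "@r": "ɚ", "A": "ɑ",
    "AU": "aʊ", "D": "ð", "E": "ɛ", "I": "ɪ", "N": "ŋ", "O": "ɔ",
    "Oi": "ɔɪ", "R": "ɻ", "S": "ʃ", "T": "θ", "U": "ʊ", "Y": "u",
    "Z": "ʒ", "[@]r": "ɝ", "aI": "aɪ", "dZ": "ʤ", "eI": "eɪ",
    "hw": "ʍ", "ir": "ɪɹ", "oU": "oʊ", "oUr": "ɔɹ", "r": "ɹ", "tS": "ʧ",
}

# Longest symbols first, so that "oUr" wins over "oU" followed by "r".
_PATTERN = re.compile(
    "|".join(re.escape(s) for s in sorted(MOBY_TO_IPA, key=len, reverse=True))
)


def to_ipa(moby: str) -> str:
    """Convert a string of Moby ASCII symbols to IPA; unknown characters pass through."""
    return _PATTERN.sub(lambda match: MOBY_TO_IPA[match.group(0)], moby)
```

Greedy matching is a simplification: a sequence such as `dZ` is always read as one symbol, even where the original might have meant `d` followed by `Z`.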
Bear in mind that the specification of pronunciation varies in precision from narrow or allophonic transcription (specific sounds) to broad or phonemic transcription (similar sounds).
Phonemic transcription is indicated by enclosing the symbols between slashes '/' and allophonic transcription is indicated by enclosing the symbols between square brackets '[' and ']'. The vast majority of the Moby pronunciation data is phonemic, meaning that it would be sufficiently precise for speech recognition applications, but not for high quality speech synthesis.
For more information please see the Wikipedia article on the subject.
To install this database with PostgreSQL, download the SQL database dump import_moby.sql.bz2 and run the following commands from a Linux shell.

    createdb wixml
    bunzip2 import_moby.sql.bz2
    psql -f import_moby.sql -d wixml
    vacuumdb -z -d wixml

As the diagram shows, the entities extracted from Moby can be divided into two levels. The first level contains lists of words, hyphenations and pronunciations. The second level records the associations between these entities. Since all these associations are many-to-many in nature, it is necessary to store words, hyphenations and pronunciations in different tables.
It might seem unexpected that a hyphenation could be associated with more than one word, but this happens because the word list contains both plain and accented versions of many words. In some cases there is more than one accented version of a word.
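The many-to-many arrangement described above can be illustrated with a small in-memory example. This sketch uses SQLite from the Python standard library as a stand-in for PostgreSQL. The table names, column names and sample rows are all hypothetical and are not taken from the actual dump.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create hypothetical word, hyphenation and association tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE word (id INTEGER PRIMARY KEY, text TEXT UNIQUE NOT NULL);
        CREATE TABLE hyphenation (id INTEGER PRIMARY KEY, text TEXT UNIQUE NOT NULL);
        -- Second-level table: one row per (word, hyphenation) association.
        CREATE TABLE word_hyphenation (
            word_id INTEGER NOT NULL REFERENCES word(id),
            hyphenation_id INTEGER NOT NULL REFERENCES hyphenation(id),
            PRIMARY KEY (word_id, hyphenation_id)
        );
    """)
    # Made-up sample data: a plain and an accented spelling share one hyphenation.
    db.executemany("INSERT INTO word (id, text) VALUES (?, ?)",
                   [(1, "cafe"), (2, "café")])
    db.execute("INSERT INTO hyphenation (id, text) VALUES (1, 'ca-fe')")
    db.executemany("INSERT INTO word_hyphenation VALUES (?, ?)", [(1, 1), (2, 1)])
    return db


def words_for(db: sqlite3.Connection, hyphenation: str) -> list[str]:
    """Return every word associated with the given hyphenation."""
    rows = db.execute(
        """SELECT w.text FROM word AS w
           JOIN word_hyphenation AS wh ON wh.word_id = w.id
           JOIN hyphenation AS h ON h.id = wh.hyphenation_id
           WHERE h.text = ? ORDER BY w.id""",
        (hyphenation,),
    )
    return [row[0] for row in rows]
```

Calling `words_for(build_demo(), "ca-fe")` returns both spellings, which is the situation the separate association table is there to handle.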
Note that there are no separate tables for language or proper noun classes in this database design. This information is stored in an array of file numbers associated with each word record. For example, a word is French if the file number array contains 10 and German if it contains 5. Some words may of course be present in both languages. The following table lists all of the relevant file numbers and the predicates they represent.
| Key | Note | Key | Note |
|---|---|---|---|
| 5 | German words | 27 | American place names |
| 6 | Japanese words | 28 | official Scrabble supplement |
| 7 | Spanish words | 29 | official crossword words |
| 8 | Italian words | 31 | single words |
| 10 | French words | 32 | often misspelled words |
| 23 | acronyms | 33 | common names |
| 25 | common words | 35 | compound words |
| 26 | male names | 36 | female names |
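The file-number predicates can be modelled in a few lines. The sketch below works on a plain Python list standing in for the array stored with each word record; the names `FILE_PREDICATES`, `has_predicate` and `describe` are invented for illustration.

```python
# File numbers and their predicates, transcribed from the table above.
FILE_PREDICATES = {
    5: "German words", 6: "Japanese words", 7: "Spanish words",
    8: "Italian words", 10: "French words", 23: "acronyms",
    25: "common words", 26: "male names", 27: "American place names",
    28: "official Scrabble supplement", 29: "official crossword words",
    31: "single words", 32: "often misspelled words", 33: "common names",
    35: "compound words", 36: "female names",
}

GERMAN, FRENCH = 5, 10


def has_predicate(files: list[int], key: int) -> bool:
    """True if the word's file number array contains the given key."""
    return key in files


def describe(files: list[int]) -> list[str]:
    """List the predicates that hold for a word, in file number order."""
    return [FILE_PREDICATES[k] for k in sorted(set(files)) if k in FILE_PREDICATES]
```

A word whose array is `[10, 5, 31]` is both French and German. In PostgreSQL itself the same test could be written with the `= ANY(...)` array comparison, for example `10 = ANY(files)`, where the column name `files` is an assumption.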
Database development, website content and design, copyright 2009 by Andrew Smith