This page gives a brief overview of the kind of data and resources that can be useful in building a new language pair for Apertium, and how to go about building them if they do not already exist.

Each Apertium language pair requires 3 dictionary files. For instance, for the English-Afrikaans pair, these would be:

- apertium-en-af.af.dix.xml: a list of Afrikaans words and their variants.
- apertium-en-af.en.dix.xml: a list of English words and their variants.
- A third file: a list which maps the Afrikaans words in af.dix to their equivalent English words in en.dix.

These dictionary files are not discussed further on this page - more information on their layout and structure is available at the HOWTO.

Before these files can be produced, you need a collection of linguistic data which can be inserted into them. This data might consist of wordlists, word-corpora derived from web-crawlers such as Crubadán, grammar notes, existing translations of open-source software such as KDE or GNOME, etc. Some practical suggestions on how to build some starter wordlists can be found at Building dictionaries, but if you feel that this is too technical, just ask one of the Apertium team to put together something like this for you.

A crucial point here is that the data must either have been gathered from scratch, or must be available under a license which is compatible with Apertium's GPL. In other words, you definitely cannot just start copying published dictionaries or other material wholesale into your data store. It is unlikely that this data will be appropriate "as is" for use in Apertium, and it will need a greater or lesser amount of revision first. You do not have to be a first-language speaker to collect and systematise the data, but you should have a reasonable knowledge of the language, and be working in consultation with first-language speakers.

It is possible to collect a small amount of linguistic data, and start testing it with Apertium. However, this is not recommended - your views of how the data should be segmented may change, leading to wasted work. Once initial contact has been made with the Apertium team, it is better to aim at collecting a sizeable wordlist (1,000-2,000 words) and coming to some preliminary decisions on how the language's sentences are structured. In the meantime, maintain contact with the Apertium team, and discuss any issues that have arisen in regard to your data collection.

You will have a good idea of how the language works from your own knowledge of it, and from reviewing published materials (e.g. dictionaries, grammars) about it. From this you can decide on the particular elements of information that need to be noted down for each word in order to capture its meaning and variants.

For some languages (e.g. English, German, Spanish) there may be a range of material available, and there may also be a significant number of people willing and able to work with you on collecting and systematising the data. For others (e.g. Breton, Kashubian) the amount of material and the number of helpers may be small - many of the lesser-used languages in KDE, for instance, only have one or two people working on them. If you are in this position, it is important to remember that "the best is the enemy of the good". You will not be able to create your language's equivalent of the Oxford English Dictionary overnight, and there is no point in trying. It is better to aim at a good basic foundation which will allow you to develop it later and fill in the gaps as time and manpower present themselves. So, for instance, while it might be ideal to have citation sentences for each word giving its typical use in context, this may be a luxury you cannot as yet afford.

For some languages, the free dictionary project Wiktionary has reasonable data. For example, for Faroese, nouns come with their declension, which can be automatically extracted by means of scripts. Some of this can be quite complicated, but if there is a site which has this information freely available in a standardised format, contact someone on the Apertium team, and they'd be glad to help you write a script, or to write the script for you.

For Wiktionary, contact Francis Tyers, who has some pre-rolled scripts for retrieving morphological information, and Wei En, who has written a crawler for obtaining such information.

The Wiki of the Association for Computational Linguistics has a List of resources by language which often contains useful links (often clearly marked for whether the resources are Free Software or Proprietary).

Ways of storing data

See also: Speling format

It is easiest to start storing the words in a spreadsheet or database. Gnumeric, KSpread or OOCalc are examples of the former.
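Since a spreadsheet is suggested above as an easy place to start storing words, here is a minimal sketch of how a wordlist exported from one as CSV might be loaded, de-duplicated and grouped by part of speech before it is handed to the Apertium team. The column names (lemma, pos, gloss), the tag values and the sample rows are assumptions made for illustration; they are not an Apertium file format.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV export of a wordlist spreadsheet. The column names
# (lemma, pos, gloss) and the sample rows are assumptions for this sketch.
SAMPLE_CSV = """lemma,pos,gloss
hond,n,dog
kat,n,cat
loop,vblex,walk
hond,n,dog
"""


def load_wordlist(text):
    """Parse CSV text into unique (lemma, pos, gloss) tuples, in first-seen order."""
    seen = set()
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        key = (
            record["lemma"].strip(),
            record["pos"].strip(),
            record["gloss"].strip(),
        )
        # Skip rows with an empty lemma and exact duplicates.
        if key[0] and key not in seen:
            seen.add(key)
            rows.append(key)
    return rows


def group_by_pos(rows):
    """Group (lemma, gloss) pairs under their part-of-speech label."""
    grouped = defaultdict(list)
    for lemma, pos, gloss in rows:
        grouped[pos].append((lemma, gloss))
    return dict(grouped)


rows = load_wordlist(SAMPLE_CSV)
by_pos = group_by_pos(rows)
print(len(rows), "unique entries")
for pos, entries in sorted(by_pos.items()):
    print(pos, entries)
```

Keeping one word per row with a small, fixed set of columns makes it easier to change how the data is segmented later, which matters because early decisions about segmentation may need revising.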