Who | When | Messages
Tom Gewecke | 305 | 02-21-2007 12:07 PM ET (US)
BisharatNet | 306 | 03-15-2007 10:25 PM ET (US)
The issue of machine translation (MT) for Yoruba and other African languages per /m303 and /m304 deserves a lot more attention. Good to see what LDC and a few other centers are doing, but one has the impression that if some serious resources were made available, it should be possible to have MT in a relatively short time. I am no expert, just looking at what is available now for language pairs like English <-> Chinese, or Japanese, or Arabic. Unfortunately there is not a lot of ready material to work with as Mike suggests, even for a language like Yoruba that has some literature. Don
BisharatNet | 307 | 03-15-2007 10:27 PM ET (US)
BisharatNet | 308 | 03-15-2007 10:33 PM ET (US)
The Common Locale Data Repository (CLDR) is accepting new locale data and, in the case of Yoruba which has a locale, additional information and corrections. See http://lists.kabissa.org/lists/archives/pu...forum/msg00581.html . Samuel Olamijulo mailed the following appeal out to several people and lists. I repost it here for the record.
Don
Respected Yoruba People and friends everywhere, good evening. Please read the request below. The time is short but the request appears of tremendous importance for future easier more effective Internet communication in Yoruba than at present. I earnestly BEG Yoruba people with Language, IT and other relevant competencies to urgently and openly discuss and cooperate on this one for Harmonious Competent input. I am a Yoruba Pediatrician and I BEG more competent volunteers to coordinate the discussion and submission of the best input possible on Yoruba.
Thank you,
Dr. Samuel Kayode Olamijulo
BisharatNet | 309 | 03-15-2007 10:37 PM ET (US)
Andrew Cunningham's response:
Having a quick look at http://unicode.org/cldr/apps/survey?_=yo, it would appear that most of the work that needs to be done is adding terminology, e.g. country names, language names, currency names, calendar/time terminology, etc. The basic character requirements have been done. Not sure about number formats, etc.
Andrew
Mike Maxwell | 310 | 03-15-2007 10:50 PM ET (US)
QT - BisharatNet wrote:
> The issue of machine translation (MT) for Yoruba and other African languages per /m303 and /m304 deserves a lot more attention. Good to see what LDC and a few other centers are doing, but one has the impression that if some serious resources were made available, it should be possible to have MT in a relatively short time. I am no expert, just looking at what is available now for language pairs like English <-> Chinese, or Japanese, or Arabic. Unfortunately there is not a lot of ready material to work with as Mike suggests, even for a language like Yoruba that has some literature. Don
Not sure which Mike that was (probably not me), but the amount of material in Yoruba is precisely the problem for statistically-based MT. The necessary resource is machine-readable bilingual text. I was at the LDC until a year or so ago, and at least at that point there was hardly any *monolingual* Yoruba text, much less bilingual text, in electronic form. At that point, the plans at LDC for getting electronic bilingual text were to buy printed newspapers and key them in. Primitive, to put it mildly.
For Chinese, Japanese and Arabic, on the other hand, there is "tons" of bilingual text available on the Internet. And for many other major languages, there is at least monolingual text: Tagalog, Cebuano, other major Philippine languages; Swahili, and maybe Zulu (not sure about the other languages of South Africa, apart of course from Afrikaans); Hindi, Bengali, Tamil, many other Indic languages; Amharic and to a lesser extent Tigrinya; all European languages; Thai, Bahasa Indonesian, Vietnamese, Persian/Farsi, and so forth. There are even some indigenous languages of the Americas that have some Internet text, such as Guarani and maybe some of the Quechua languages.
FWIW, I suspect Igbo and Hausa are in the same situation as Yoruba, although I haven't checked lately.
There is also rule-based MT. I believe the African Languages Technology Initiative, a group in Nigeria, was looking at that, but last I heard they didn't have any funding for MT.
For the record, here's what was available in the way of computer-readable resources when I was at the LDC: http://lodl.ldc.upenn.edu/found.cgi?lan=YORUBA
The list of potential resources is a template we used; as you can see, it's virtually blank. Here's a more recent survey, done about a year ago: http://lodl.ldc.upenn.edu/LCTL/Yoruba_harvest.html
-- Mike Maxwell maxwell@ldc.upenn.edu
BisharatNet | 311 | 03-15-2007 11:22 PM ET (US)
Dear Samuel, Andrew, all,
Yes, there was an effort last year that produced basic locale data for OpenOffice, which was then rewritten for submission to CLDR. This (along with locales for some other Nigerian languages including Hausa and Igbo) was facilitated by Alberto Escudero-Pascual and Louise Berthilson using the locale generator at http://www.it46.se/localegen/ . So there is something usable for basic localization.
CLDR, however, is on a different site. See http://unicode.org/cldr/ - there is some introductory information. The way that CLDR is set up, it is almost like a glossary of basic terms, and it would help to have someone review those. See another view of the data at http://www.unicode.org/cldr/data/charts/summary/yo.html (this is a different page from the one Andrew gave, but it presents the same info in a different way).
Not sure if there are established Yoruba words for some of the territories or languages listed, or how those will help localize, say, a cellphone or browser or web-page. Nevertheless, the Yoruba language experts should have a look. BTW, out of the total 2136 lines, the entire alphabet is listed on line 1629 (with auxiliary characters, such as those used for loan words, on line 1628).
Don Osborn Bisharat.net PanAfrican Localisation project
Dr. Samuel Olamijulo | 312 | 03-16-2007 05:38 AM ET (US)
Yoruba-English Bilingual Texts on Same of 1978 Pages
For the special attention of all who are working on Machine Translation for Yoruba, I am sure you will find very useful this relatively recent Yoruba and English publication. Corresponding Yoruba and English translations are on each half of 1978 pages. The publishers must have it in digital format.
Title: BIBELI MIMO HOLY BIBLE King James Version
Produced year 2004 by Bible Society of Nigeria
Tel: 234 1 545 7524 ; 587 6471
Website: http://www.biblesociety-nigeria.org/
Contacts: http://www.biblesociety-nigeria.org/nigeria-3.htm
18 Wharf Road, Apapa Lagos, Nigeria
Thank you.
Dr. Samuel Kayode Olamijulo
Mike Maxwell | 313 | 03-16-2007 06:49 PM ET (US)
QT - Dr. Samuel Olamijulo wrote:
> For the special attention of all who are working on Machine Translation for Yoruba,
I'm not sure that anyone is...
> I am sure you will find very useful this relatively recent Yoruba and English publication. Corresponding Yoruba and English translations are on each half of 1978 pages. The publishers must have it in digital format.
>
> Title: BIBELI MIMO - HOLY BIBLE King James Version
I neglected to mention in my earlier post that the bilingual text one uses for statistical MT should be in the same genre that one hopes to translate. So parallel news text if you hope to use MT for news, parallel text of computer manuals if you hope to use MT to translate computer manuals, etc. The Bible is indeed a source, and is available in print form (if not in electronic form) for most written languages. But it would probably work well as an MT source only if you wanted to translate other texts of a religious sort. (I think one might be able to extract other kinds of information, e.g. about morphology, from a Bible; but then Yoruba doesn't have much in the way of morphology.)
-- Mike Maxwell maxwell@ldc.upenn.edu
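To make the idea of "parallel text" concrete, here is a minimal sketch in Python of how two versions of the same work might be paired up, assuming (hypothetically) that each version is already available in electronic form as a mapping from a shared key, such as a verse reference, to its text. The function name `pair_by_key` and the placeholder strings are illustrative only; nothing in this thread indicates such data actually exists.

```python
from typing import Dict, List, Tuple


def pair_by_key(source: Dict[str, str], target: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair segments that share a key; keys missing on either side are skipped."""
    return [(source[key], target[key]) for key in source if key in target]


# Hypothetical placeholder data: real text would be loaded from files.
yoruba = {
    "GEN 1:1": "<Yoruba verse 1>",
    "GEN 1:2": "<Yoruba verse 2>",
    "GEN 1:3": "<Yoruba verse 3>",
}
english = {
    "GEN 1:1": "<English verse 1>",
    "GEN 1:2": "<English verse 2>",
}

pairs = pair_by_key(yoruba, english)
for yo_text, en_text in pairs:
    print(yo_text, "|||", en_text)
```

Pairs like these are the raw material a statistical system learns from, which is why the genre of the paired text matters so much for what it can later translate.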
BisharatNet | 314 | 05-03-2007 12:56 PM ET (US)
Belated thanks to Mike and Samuel for the feedback. I think that advanced technologies such as machine translation in the case of Yoruba need some long term planning. Knowing the practical steps, such as the utility of parallel texts, and the paths, such as context-specific work, helps.
While I think it's important to discuss these things, one side effect is that people sometimes get the impression that it's a project actively underway.
Ultimately if such projects - from basic issues like corpora to advanced applications - are to really progress, I think there would need to be more training of Nigerians in Nigeria in aspects of language and ICT. Not sure how much of that is going on already, but given how multilingual Africa is, any strategy to advance use of ICT should consider this aspect. Just a thought.
Don Osborn Bisharat.net PanAfricanL10n.org
BisharatNet | 315 | 05-03-2007 01:41 PM ET (US)
BisharatNet | 316 | 06-18-2007 11:02 PM ET (US)
Remi-Niyi Alaran | 317 | 09-06-2007 11:24 AM ET (US)
I would like to share a writing system that I have developed for the Yoruba language. Called Yoruba FaYe (meaning "draw it so we can understand it"), it may also be extended to other languages. It comprises 48 characters: a 38-letter alphabet and 10 numerals. This is a smaller character set than the 62 symbols (26 capitals, 26 lower-case and 10 numerals) in the standard English language, which uses Roman script.
The present Yoruba script was developed by the missionary Ajayi Crowther in 1836. The Ajayi script features a standard Roman script with the addition of diacritical marks to reflect Yoruba tonality and accent. So the Ajayi script is larger than the 62 characters used in English. It has proved difficult to convert Yoruba into computer machine code because of the diacritical marks. Without the diacritical marks, it is very difficult for even accomplished Yoruba linguists to efficiently read or write Yoruba.
The FaYe system does away with diacritical marks altogether. It is phonetic, so that every sound in Yoruba is now represented by a unique character. Research indicates that humans can only delineate between about 34 unique sounds. Yoruba FaYe has a 38 character alphabet and it has a natural rhythm able to accommodate the sophistication of Yoruba's complex and rich oral literature.
It is hoped that the FaYe system helps in considerably improving the quantity and quality of literature available to document African Literary Heritage. You can view or download the FaYe system at www.ijebudrums.blogspot.com
maxwell@ldc.upenn.edu | 318 | 09-06-2007 12:31 PM ET (US)
Quoting QT - Remi-Niyi Alaran <qtopic+15-KKgbRqJUAR8@quicktopic.com>:
> Research indicates that humans can only delineate between about 34 unique sounds.
What research is this? (Warning in advance: I don't think this is correct, although I suppose it depends on what you mean by "unique sounds".)
> The FaYe system...is phonetic, so that every sound in Yoruba is now represented by a unique character.
I suspect you mean phonemic, not phonetic. It is certainly possible to write Yoruba (or any other language) phonetically, using e.g. the IPA system, but that typically multiplies the number of "unique sounds" over what a phonemic system would require.
Mike Maxwell CASL/ U MD
Adé | 319 | 09-06-2007 06:10 PM ET (US)
Rẹmi,
I see that you accounted for ọ and ṣ (that is s with sub-dot) but not ẹ. Also what about the nasal n sound?
Do you plan to submit your system to Unicode for inclusion in the Unicode Standard?
BisharatNet | 320 | 09-14-2007 09:18 PM ET (US)
Edited by author 09-14-2007 09:20 PM
Dear Remi-Niyi,
Thanks for bringing the new FaYe system to our attention /m317. There have been several new alphabets discussed in recent years (one each in Senegal, Gambia and Cameroon), not to mention older African writing systems. Each one proposes new advantages but have there been any efforts to compare with the others? This is not to discourage you, just to ask.
An alternative argument would be that even a sub-optimal but workable script that is already established is better to keep working with than to change. (A similar argument was made in the case of Bambara of Mali concerning several changes in the orthography over the last 30 years - people who learn one system have to learn the new one and old publications have to be revised, etc., even though the older systems were not bad.)
In any event, I am sure that in the longer run, computer tools for transliteration would enable transforming text in a particular language from one script to another. So work on a proposed new script need not slow work in the existing one.
Similarly, the current problems with diacritics on Latin characters should not be overstated. Many more complex scripts are already fully used on computers, the internet, cellphones, SMS. Computer systems are being improved in their ability to handle combining diacritics, complex scripts, etc.
As for Adé's question about Unicode, my understanding is that a script needs to be pretty well established before getting into the pipeline. The Mandombe script used in parts of D.R. Congo (invented in the 1970s) is still not in process at all, as far as I know. Nor does it have an identifier code. Again, this is not to discourage you, but rather to outline the situation.
On the other hand, I suppose that if the Nigerian government were to decide that FaYe is what it will use in all schools for Yoruba and whatever other languages, the importance of the new writing system for encoding in Unicode would be significant. This is a little bit like what happened with Tifinagh - there was a proposal to encode this ancient but still used script in Unicode/ISO 10646, but only when the Moroccan government decided it was going to use this in schools for Tamazight instruction did it get enough attention to finalize a version of the proposal for approval.
Hope this is of help. Good luck.
Don Osborn Bisharat.net PanAfriL10n.org
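As a small illustration of how combining diacritics are handled in practice, the sketch below uses Python's standard `unicodedata` module to show that a Yoruba letter such as ẹ can be stored either as one precomposed code point or as a base letter plus a combining dot below, and that Unicode normalization makes the two forms compare equal. The helper names `canonical` and `same_text` are made up for this example, and the choice of NFC as the comparison form is an assumption rather than a recommendation from anyone in this thread.

```python
import unicodedata


def canonical(text: str) -> str:
    """Return the NFC (composed) normalization of the text."""
    return unicodedata.normalize("NFC", text)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return canonical(a) == canonical(b)


precomposed = "\u1EB9"   # LATIN SMALL LETTER E WITH DOT BELOW
combining = "e\u0323"    # "e" followed by COMBINING DOT BELOW

# The raw strings differ, but they are canonically equivalent.
print(precomposed == combining)
print(same_text(precomposed, combining))

# Adding a tone mark: the order in which the two marks were typed
# does not matter once the text has been normalized.
dot_then_acute = "e\u0323\u0301"
acute_then_dot = "e\u0301\u0323"
print(same_text(dot_then_acute, acute_then_dot))

# List the code points in the normalized form.
for ch in canonical(dot_then_acute):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Software that normalizes text before searching or sorting can therefore treat differently typed versions of the same Yoruba word as identical.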