Utilika Foundation’s flagship resource, PanLex, is a 17-million-word, 6-thousand-language database relying on thousands of dictionaries, thesauri, glossaries, and wordlists. Want to translate “vóhkoohéh” from Cheyenne into kinyaRwanda? With PanLex, you can. Implemented in PostgreSQL on Red Hat Enterprise Linux, and with a set of Perl-scripted utilities and a nascent API, PanLex is based on pioneering research begun at the University of Washington’s Turing Center under the leadership of Oren Etzioni.
In March 2011 the foundation announced the availability of paid internships to collaborate on the development and deployment of PanLex. The projects were designed to appeal primarily to advanced students in computational linguistics, computer science, informatics, and related fields. PanLex is managed in Berkeley, California, but interns can work on their projects anywhere in the world, during flexible periods from May 2011 through December 2012. The foundation provides project support, server access, data, a forum for inter-project discussion, and cooperation with schools’ requirements for academic credit.
The internship projects are described below. If a project is still open, its estimated effort and the compensation we expect to pay the intern who carries it out are also shown. If any of the open projects interests you, we look forward to your application.
| Category | Project | Description |
|---|---|---|
| lexical acquisition (Apply) | OCR project | Christa Mowry, University of Washington. Optical character recognition, whether commercial (ABBYY FineReader, Adobe Acrobat, OmniPage, Readiris, Presto, etc.) or open-source (Tesseract, OCRopus, etc.), could make masses of printed lexical data accessible for use in PanLex. But OCR’s quality limitations (e.g., Rodrigues et al. 2010, pp. 14, 15, 29) have caused some lexicographic projects (e.g., Digital Dictionaries of South Asia) to work without it. Test and evaluate existing OCR engines trained on mixed-script and highly diacritic-containing text in page-image resources collected for PanLex. Write a report on your results and on the potential for practical use of OCR in the acquisition of PanLex lexical data from page images. |
| | lexicographic parsing project | Yuancheng Tu, University of Illinois at Urbana-Champaign. PanLex is a graph of lexical translations and synonymies attested by thousands of monolingual, bilingual, and multilingual resources. Different, inconsistent, and complex source formats (e.g., Pool 2011) complicate the conversion of these resources’ data into normalized database records (cf. Rodrigues et al. 2010). As an alternative to the rule-based approach of Baldwin, Pool, and Colowick (2010), develop and test a system that uses prior conversion decisions of PanLex’s human editors as training data to learn jointly to identify entries and to identify and classify fields within entries, resolving some common punctuation ambiguities exhibited by such resources (cf. Srikumar et al. 2008). |
| | crowdsourcing project | [Open] Page-image lexical data can also be made accessible to PanLex via crowdsourcing or another flavor of human computation, either one minimal fragment at a time (cf. Duolingo, multilingual CAPTCHA) or in fragments of a page or more (cf. Distributed Proofreaders). Design and user-test a prototype Web application to obtain crowdsourced data for PanLex via user conversion, or verification of automated conversions, of lexical translations displayed as images of printed words. 300 hours, $6,500. |
| | game project | Vamshi Ambati, Carnegie Mellon University. Well-designed word-based games powerfully motivate volunteer users to provide data (e.g., ESP Game) and contribute to good causes (e.g., UN World Food Program’s Free Rice). Design and user-test a Web prototype of a PanLex-based single- or multi-player game of skill, leveraging and improving PanLex’s data in thousands of languages. |
| infrastructure (Apply) | translation inference project | Jason Shaw, University of Washington. Mausam et al. (2010) developed the SenseUniformPaths algorithm (p. 628) for translation inference and applied it to the smaller set of data from which PanLex originated. They showed that the resulting PanDictionary was more effective in lexical translation than a large Wiktionary. Reimplement the SenseUniformPaths algorithm in PanLex so it can provide translation inference and serve as a baseline in future competitions among inference algorithms. |
| | grammar engineering project | Michael Wayne Goodman, University of Washington. Messages containing words and phrases (lemmas) from PanLex, translated lemma-by-lemma, can be understood often, though not always (Everitt et al. 2010). What if we enhanced such “lemmatic communication” with a little syntax? Using a multilingual computational grammar and translation platform, such as the LinGO Grammar Matrix customization system (Bender et al. 2010), define starter grammars for a set of syntactically diverse, morphologically bare pseudo-languages (cf. Drellishak 2009) and populate their lexicons with PanLex. On the basis of this experience, propose a strategy for eliciting pseudo-language grammars from language experts and feeding their lexical type assignments back into PanLex. 350 hours, $6,500. |
| | graph visualization project | Viet-An Nguyen, University of Maryland, College Park. PanLex can be interpreted as a graph with about 17 million lexemes (in about 6000 languages) connected to each other via shared meanings. Computational navigation of this graph, even when practical, remains opaque. Intuitive user-directed navigation through visualizations of the graph would be a valuable feature (cf. Visual Complexity). Design, prototype, and user-test a Web application that permits graphical exploration of the lexeme-language-meaning graph of PanLex. |
| applications (Apply) | image search project | David M. Howcroft, Ohio State University. PanImages, described in Etzioni et al. 2007, Colowick 2008, and Colowick and Pool 2010, demonstrated the practical value of PanLex-type data for image search on the Web. It had a UI in 53 languages and permitted input in 171 languages. The application, no longer operational, relied on a subset of the data now in PanLex. Implement and user-test a light version of PanImages as an application prototype calling the PanLex API. Extend the API as needed. Omit the crowdsourcing features of PanImages, but remove its limitations on the set of query and UI languages supported. |
| | mobile app project | Brandon Loudermilk, University of California, Davis. It is challenging to make PanLex data and services mobile. Mobile platforms offer limited support for I/O in multiple scripts; small displays and slow connections make long lists of languages, translations, etc. impractical. Design and user-test a prototype PanLex-based Web translation application that detects at least iPhone and Android clients and adapts appropriately to their capabilities (cf. TeraDict and the Panlingual Translator demo). |
| | social app project | Li Wang, University of Melbourne. Social media contain some communications in telegraphic and list-based formats (tags, ratings, topic lists, product features, micro-messages, lists of personal interests, etc.). Such communications might be amenable to PanLex-based translation across thousands of languages, allowing even monolingual speakers of tiny languages to participate. Design and user-test a PanLex-based social application for any translingual purpose (use your imagination!) running on an existing social-networking or commerce platform (Wikipedia, Facebook, Twitter, Second Life, Digg, YouTube, eBay, Global Voices, etc.). |
| custom project | | If you have another project idea, let us know. |
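
For applicants unfamiliar with the data model, the description of PanLex as a graph of lexemes connected via shared meanings can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `LexGraph` class, its method names, the meaning IDs, and the three toy entries are hypothetical and do not reflect the actual PanLex schema, API, or data.

```python
from collections import defaultdict


class LexGraph:
    """Toy lexeme-meaning graph (hypothetical; not the PanLex schema)."""

    def __init__(self):
        # (language code, text) -> set of meaning IDs the lexeme expresses
        self._meanings_of = defaultdict(set)
        # meaning ID -> set of (language code, text) lexemes expressing it
        self._lexemes_of = defaultdict(set)

    def add(self, meaning_id, lang, text):
        """Record that a source attests `text` in `lang` for `meaning_id`."""
        self._meanings_of[(lang, text)].add(meaning_id)
        self._lexemes_of[meaning_id].add((lang, text))

    def translate(self, lang, text, target_lang):
        """Return target-language lexemes sharing a meaning with the input."""
        results = set()
        for meaning_id in self._meanings_of.get((lang, text), ()):
            for other_lang, other_text in self._lexemes_of[meaning_id]:
                if other_lang == target_lang:
                    results.add(other_text)
        return sorted(results)


# Invented toy data: meaning 1 comes from one bilingual source,
# meaning 2 from a different one.
graph = LexGraph()
graph.add(1, "eng", "dog")
graph.add(1, "spa", "perro")
graph.add(2, "spa", "perro")
graph.add(2, "fra", "chien")

print(graph.translate("eng", "dog", "spa"))  # direct, attested translation
print(graph.translate("eng", "dog", "fra"))  # empty: no shared meaning
```

In this toy graph the English and French lexemes share no meaning directly, so the second lookup returns nothing even though a path through the Spanish lexeme exists. Deciding when such indirect paths are trustworthy is the kind of problem the translation inference project addresses; this sketch does not implement SenseUniformPaths or any other inference algorithm.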