The open-source OAQA project is dedicated to open advancement in the engineering of question answering systems - language software systems that provide direct answers to questions posed in natural language. Since 2007, Carnegie Mellon has collaborated with IBM Research and other universities to advance the state of the art in language systems architecture, component algorithms, and end-to-end performance by establishing a shared vision, architectural commitments, and process to prioritize and guide an agile approach to developing high-performance applications.
- Vision. We believe that research and development of complex language technologies can be accelerated significantly by adhering to a set of architectural principles coupled with a formal, iterative development process.
- Commitments. In order to define and effectively search the space of possible solutions (software systems) for a task, a team must commit to a shared architecture, resources, tools and metrics - at both the component and system level.
- Process. In order to make rapid progress, each system iteration must undergo a formal build / test / evaluate / analyze / prioritize cycle that keeps the team focused on the improvements with the greatest impact.
The Watson question answering system developed at IBM Research is the first highly visible example of what is possible with this approach. The OAQA approach is the foundation for many sponsored research and development projects in the Language Technologies Institute at CMU, not only for question answering but also for related language applications such as Multimedia Event Detection.
At the base of the Open Advancement stack is the CSE-Framework, which provides a common ground for experimentation and analysis.
On top of the CSE-Framework, we've built task-level resources that provide pipelines, tools and metrics for building specific language processing components: BaseQA for question answering, etc.
To benefit developers who are already familiar with the UIMA framework, we have developed a tutorial on CSE in alignment with the examples in the official UIMA tutorial: http://github.com/oaqa/oaqa-tutorial/wiki/Tutorial.
Analogous to a typical UIMA CPE descriptor, components, configurations, and collection readers in the CSE framework are declared in extended configuration descriptors, which are based on the YAML format: http://github.com/oaqa/uima-ecd.
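To illustrate what declaring components and configurations in one descriptor makes possible, here is a minimal Python sketch of expanding a descriptor-like structure into every concrete pipeline it describes. The field names (`pipeline`, `phase`, `options`, `params`) and the component names are hypothetical and are not the uima-ecd schema; see the repository linked above for the actual descriptor format.

```python
from itertools import product

# Hypothetical descriptor: each phase lists alternative components, and each
# component lists candidate values for its parameters. Not the uima-ecd schema.
descriptor = {
    "collection_reader": "filesystem-reader",
    "pipeline": [
        {"phase": "keyterm-extractor",
         "options": [
             {"component": "simple-tokenizer", "params": {}},
             {"component": "ner-extractor", "params": {"model": ["small", "large"]}},
         ]},
        {"phase": "retrieval",
         "options": [
             {"component": "bm25", "params": {"hits": [10, 50]}},
         ]},
    ],
}


def expand_option(option):
    """Yield one (component, settings) pair per combination of parameter values."""
    names = sorted(option["params"])
    for values in product(*(option["params"][n] for n in names)):
        yield option["component"], dict(zip(names, values))


def enumerate_configurations(desc):
    """Yield every end-to-end pipeline: one concrete option per phase."""
    per_phase = [
        [choice for opt in phase["options"] for choice in expand_option(opt)]
        for phase in desc["pipeline"]
    ]
    for combo in product(*per_phase):
        yield list(combo)


configurations = list(enumerate_configurations(descriptor))
for config in configurations:
    print(config)
```

The number of pipelines grows multiplicatively with the options in each phase, which is why a declarative descriptor is a convenient starting point for exploring the space systematically.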
Global resource caching. Online resources such as biomedical ontologies, EntrezGene, and MeSH may become temporarily unavailable or have their contents updated, which makes it difficult to reproduce specific experimental results. We therefore implement a generic resource caching strategy as part of the CSE framework implementation: http://github.com/oaqa/resource-wrappers.
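The idea behind such caching can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the resource-wrappers implementation: the `CachedResource` class, the on-disk JSON layout, and the `fake_ontology_lookup` stand-in for a remote service are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class CachedResource:
    """Wrap a lookup function so each distinct query is fetched once,
    stored on disk, and replayed from the stored copy afterwards."""

    def __init__(self, name, fetch, cache_dir):
        self.fetch = fetch
        self.dir = Path(cache_dir) / name
        self.dir.mkdir(parents=True, exist_ok=True)

    def _path(self, query):
        # Hash the serialized query so any JSON-serializable query maps to a file name.
        serialized = json.dumps(query, sort_keys=True).encode("utf-8")
        return self.dir / (hashlib.sha256(serialized).hexdigest() + ".json")

    def lookup(self, query):
        path = self._path(query)
        if path.exists():
            # Cache hit: the remote resource is not contacted.
            return json.loads(path.read_text(encoding="utf-8"))
        result = self.fetch(query)
        path.write_text(json.dumps(result), encoding="utf-8")
        return result


calls = []


def fake_ontology_lookup(term):
    """Hypothetical stand-in for a remote ontology service."""
    calls.append(term)
    return {"term": term, "synonyms": [term.upper()]}


resource = CachedResource("ontology", fake_ontology_lookup, tempfile.mkdtemp())
first = resource.lookup("brca1")
second = resource.lookup("brca1")  # served from disk; fake_ontology_lookup is not called again
```

Because the second lookup is answered from the stored copy, an experiment rerun against the same cache directory sees the same resource content even if the online source has changed or is unreachable.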
The component ranking strategy can be configured by the user; several heuristic strategies are implemented in the open-source software: https://github.com/oaqa/bagpipes.
The implemented components, benchmarks, and task-specific evaluation methods are included in a domain-specific layer named BioQA, which is plugged into the BaseQA framework: http://github.com/oaqa/bioqa.
- OAQA tutorial
- Wiki content, structure, and general rules
- OAQA development teams
- Etiquette for working on another team's repo
- OAQA development model/developer manual
- OAQA development coding conventions
- Z. Yang, E. Garduno, Y. Fang, A. Maiberg, C. McCormack, and E. Nyberg. Building optimal information systems automatically: Configuration space exploration for biomedical information systems. In Proceedings of the CIKM’13, 2013.
- Alkesh Patel, Zi Yang, Eric Nyberg and Teruko Mitamura. Building Optimal Question Answering System Automatically using Configuration Space Exploration (CSE) QA4MRE 2013 Tasks
- Elmer Garduno, Zi Yang, Avner Maiberg, Collin McCormack, Yan Fang, Eric Nyberg. CSE Framework: A UIMA-based Distributed System for Configuration Space Exploration. Unstructured Information Management Architecture (UIMA) 3rd UIMA@GSCL Workshop.