Architecture overview diagram (SVG) (PNG)
A general non-functional requirement of all components is that precision is to be valued over recall. False negatives (missed annotations) have a less damaging effect on how the system is perceived than false positives (wrong annotations).
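To make the trade-off concrete, the following is a minimal sketch of how precision and recall are computed from annotation counts. The function name and the counts are illustrative assumptions, not measurements from this project.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from counts of true positives,
    false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: a cautious annotator misses entities (more false
# negatives) but rarely mislabels text (few false positives).
cautious = precision_recall(tp=80, fp=5, fn=40)

# Hypothetical counts: an eager annotator finds more entities but also
# produces many wrong annotations.
eager = precision_recall(tp=110, fp=60, fn=10)
```

Under the requirement above, the cautious configuration is the preferred one: it has lower recall but higher precision.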
The project follows an iterative development methodology with fixed-length development periods of 4 weeks each (apart from the first cycle). We will aim to get something integrated and working in the first cycle and then improve different areas of the functionality in the remaining 6. The schedule for each cycle will be decided by the whole team a fortnight ahead of its start, so that any prerequisite work can be assessed.
In cycle 1, the system integrated PDFBox and Oscar3 to produce standoff annotations from PDF theses, and then transformed these (using XSLT) into RDF/XML. A corpus of 60-70 PDF theses was collected in a Subversion repository.
Cycle 2 looked at issues that arose in parsing PDF files; a set of filters was developed to avoid these problems. A short list of 10 theses to analyze in more depth was drawn up. Early potential problems in Oscar3 output were identified (particularly misrecognition of the initials in proper names as chemical elements).
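The initials problem can be illustrated with a small heuristic. This is only a sketch under stated assumptions: the regular expression, the function names and the sample sentence are hypothetical, and this is not how Oscar3 or the project's filters work. It flags an annotation as a probable initial when it falls inside a run of "X." tokens that is directly followed by a capitalised word.

```python
import re
from typing import List, Tuple

# One or more "X." initials followed by a capitalised word,
# e.g. the "H. C. " in "H. C. Brown". Heuristic only.
INITIALS_BEFORE_SURNAME = re.compile(r"\b(?:[A-Z]\.\s?)+(?=[A-Z][a-z]+)")


def initial_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character spans that look like runs of initials."""
    return [m.span() for m in INITIALS_BEFORE_SURNAME.finditer(text)]


def is_probable_initial(text: str, start: int, end: int) -> bool:
    """True if the annotation span [start, end) lies inside a run of initials."""
    return any(s <= start and end <= e for s, e in initial_spans(text))


text = "H. C. Brown reported that H2O reacts."
# "H" at offset 0 and "C" at offset 3 are initials, not hydrogen and carbon.
h_is_initial = is_probable_initial(text, 0, 1)
c_is_initial = is_probable_initial(text, 3, 4)
# The "H" of "H2O" is a genuine chemical mention and is left alone.
water_start = text.index("H2O")
water_is_initial = is_probable_initial(text, water_start, water_start + 1)
```

A rule like this will also flag an element symbol that ends a sentence (for example "bonded to C. The ..."), so in practice it would need to be combined with other evidence.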
The whole thesis corpus will be classified according to subject area, using a predefined upper ontology of chemistry (probably from the RSC). A decision was made not to deal with OCR'd documents. The short list will be updated with both the classification and the OCR decision in mind.
A recent branch of Oscar3 that includes confidence scores in the standoff annotation file will be integrated into the document processing program.
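Once confidence scores are available, low-confidence annotations can be dropped in line with the precision-over-recall requirement. The sketch below assumes a hypothetical in-memory `Annotation` record and an arbitrary threshold of 0.8; it does not reflect the actual format of the Oscar3 standoff annotation file.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    # Hypothetical in-memory form of one standoff annotation;
    # this is not Oscar3's actual schema.
    start: int
    end: int
    surface: str
    confidence: float


def keep_confident(annotations: List[Annotation],
                   threshold: float = 0.8) -> List[Annotation]:
    """Keep only annotations whose confidence is at or above the threshold."""
    return [a for a in annotations if a.confidence >= threshold]


# Illustrative data only.
sample = [
    Annotation(0, 7, "benzene", 0.97),
    Annotation(20, 21, "H", 0.35),
    Annotation(40, 47, "toluene", 0.88),
]
kept = keep_confident(sample)
```

Raising the threshold trades recall for precision, so the value would need to be tuned against the short-listed theses.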
The thesis short list will be analyzed for structure, taking account of which sections are a) prose with chemistry, b) data, c) metadata-rich text, or d) not useful to us ;-)
An automated document structure analysis program will be started, by attempting to parse tables of contents.
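As a starting point, table-of-contents parsing might look like the sketch below. It assumes the text has already been extracted from the PDF as plain lines, and that each entry has the form "optional section number, title, dot leaders or spaces, Arabic page number". The `TocEntry` type and the sample lines are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class TocEntry(NamedTuple):
    number: Optional[str]  # section number such as "2.1", or None
    title: str
    page: int


TOC_LINE = re.compile(
    r"""^\s*
        (?:(?P<number>\d+(?:\.\d+)*)\.?\s+)?  # optional section number
        (?P<title>.+?)                        # title, as short as possible
        [\s.]*\s                              # dot leaders and/or spaces
        (?P<page>\d+)\s*$                     # trailing page number
    """,
    re.VERBOSE,
)


def parse_toc(lines: List[str]) -> List[TocEntry]:
    """Return the lines that look like table-of-contents entries."""
    entries = []
    for line in lines:
        m = TOC_LINE.match(line)
        if m:
            entries.append(
                TocEntry(m.group("number"),
                         m.group("title").strip(),
                         int(m.group("page")))
            )
    return entries


# Illustrative input only.
sample_lines = [
    "Contents",
    "1 Introduction ........ 1",
    "2.1 Synthesis of ligands . . . . 34",
    "References 120",
]
toc = parse_toc(sample_lines)
```

Lines without a trailing Arabic page number, such as the "Contents" heading or front matter paginated with Roman numerals, are skipped by this sketch.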
Areas of work: