Putting the AnnoMarket platform through its paces
Some colleagues from the University of Sheffield came to us last month asking “Hey, aren’t you the guys doing language processing in the clouds? Can you please help us?”. We like to be nice, so we replied “Sure, what’s the problem?”. It turned out they needed to process a large number of documents (about 10 million) in a hurry, and push all the generated annotations into a Mímir index.
Now, this request arrived shortly after we had finished preparing the first platform prototype for the project’s first year review. We had this great pile of software ready and raring to go, and we needed victims, or rather beta testers. It was an opportunity too good to miss, and we couldn’t say “no”. The rest of this post describes what we did and how it all worked.
The dataset to be processed consisted of 9,551,404 documents, of which about 8.5 million were PubMed abstracts, just over 200,000 were Cochrane abstracts, and almost 1 million were patents. The documents were already stored on S3, as .zip and .tgz archives, which the AnnoMarket platform can handle natively, so we didn’t need to do anything special.
The patent documents required a bit of pre-processing because they were supplied as large XML files including the contents of 100 documents each. This makes them hard to process (as the very large files would require large amounts of RAM), and difficult to distribute to a swarm of processing nodes. A simple Groovy script took care of that and we then had individual documents as XML files, packaged into large .zip archives, stored in S3.
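The original splitting script was written in Groovy and is not reproduced here. As a rough illustration of the idea only, the following Python sketch streams a large multi-document XML file and writes each document as its own entry in a .zip archive. The element name `document` and the entry naming scheme are assumptions made for the example; the real patent schema is not described in this post.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def split_documents(xml_source, zip_target, doc_tag="document"):
    """Write every <doc_tag> element of a large XML file as its own zip entry.

    `doc_tag` is a hypothetical element name chosen for this sketch.
    Returns the number of documents written.
    """
    count = 0
    with zipfile.ZipFile(zip_target, "w", zipfile.ZIP_DEFLATED) as archive:
        # iterparse reports each element once its end tag has been read,
        # so documents can be handled one at a time.
        for _event, elem in ET.iterparse(xml_source, events=("end",)):
            if elem.tag != doc_tag:
                continue
            count += 1
            entry_name = f"doc-{count:06d}.xml"
            archive.writestr(entry_name, ET.tostring(elem, encoding="utf-8"))
            # Drop the children of the document just written to limit memory use.
            elem.clear()
    return count
```

Both arguments may be file paths or file-like objects, so the same function can be tried out on in-memory data before being pointed at real files.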
The processing pipeline was supplied by the ‘customers’: an information extraction application that generates semantically enriched annotations. It is a GATE pipeline, relying mainly on semantically aware gazetteers and JAPE grammars.
The output required was a Mímir index. However, our colleagues were using some new features that are only available in the trunk version of Mímir, which is due to become version 5.0 when it ripens. Because of that, we had to create a custom Mímir server image for them, based on a developer build.
Setting up the job
In total, the job included 527 input archives, each of them containing documents of the same type (PubMed, Cochrane, or patents). The annotation pipeline included some logic to treat different document types differently, so it required the MIME type to be correctly indicated for each input archive. We didn’t like the idea of setting this up manually so it was, once again, time to enlist the help of Groovy. We wrote a simple script that uses the AnnoMarket REST API to configure the annotation job inputs.
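The script itself is not shown in this post, and neither are the REST endpoints it calls. The Python sketch below therefore only illustrates the part that replaces the manual work: deriving a MIME type for each archive from its name and building one input description per archive. The file-name prefixes, the MIME type strings, and the field names `location` and `mimeType` are all hypothetical placeholders, not the actual AnnoMarket values.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from archive-name prefix to MIME type.
# Neither the prefixes nor the type strings come from the real job.
MIME_BY_PREFIX = {
    "pubmed": "application/x-example-pubmed+xml",
    "cochrane": "application/x-example-cochrane+xml",
    "patents": "application/x-example-patent+xml",
}


def input_spec(s3_key):
    """Build one job-input description for an archive stored in S3."""
    name = PurePosixPath(s3_key).name.lower()
    for prefix, mime_type in MIME_BY_PREFIX.items():
        if name.startswith(prefix):
            return {"location": s3_key, "mimeType": mime_type}
    # Fail loudly instead of guessing: a wrong type would be annotated wrongly.
    raise ValueError(f"cannot infer document type for {s3_key!r}")


def build_job_inputs(s3_keys):
    """Return the input descriptions for all archives, in the given order."""
    return [input_spec(key) for key in s3_keys]
```

Each resulting dictionary would then be submitted to the job configuration API. That step is left out because the request format is not described here.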
Running the job
Once everything was set up, we were able to press the ‘Run’ button. The whole processing took 1.5 days, and consumed 13 days of machine time (or 73 CPU days (1)). We used a processing swarm of between 5 and 15 annotator nodes (2), and a single Mímir indexing node. In total we processed just over 43 GiB of (uncompressed) data, which produced a 40 GiB Mímir index.
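To put those figures in perspective, here is a small back-of-the-envelope calculation in Python. It uses only the numbers quoted in this post, which are rounded, so the derived rates are approximate. Treating the 13 machine days as time spent by the processing machines is an assumption.

```python
SECONDS_PER_DAY = 86_400

documents = 9_551_404    # total documents in the dataset
wall_clock_days = 1.5    # elapsed time for the whole job
machine_days = 13        # total machine time consumed (rounded)
cpu_days = 73            # total CPU time consumed (rounded)
input_gib = 43           # uncompressed input size, in GiB

elapsed_seconds = wall_clock_days * SECONDS_PER_DAY

# Overall throughput of the job.
docs_per_second = documents / elapsed_seconds
mib_per_second = input_gib * 1024 / elapsed_seconds

# Average number of machines running at any one time.
avg_machines = machine_days / wall_clock_days

# Average number of busy threads per machine.
busy_threads_per_machine = cpu_days / machine_days

print(f"{docs_per_second:.1f} documents/s")
print(f"{mib_per_second:.2f} MiB/s of uncompressed input")
print(f"{avg_machines:.1f} machines on average")
print(f"{busy_threads_per_machine:.1f} busy threads per machine")
```

This works out to roughly 74 documents per second overall, on an average of just under 9 machines, which sits inside the 5 to 15 node range of the swarm. The ratio of CPU time to machine time is about 5.6, close to the 6 threads each node runs.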
The whole process took place as expected, there were no errors, and no compute nodes became over- or under-loaded. I think we shall have to call this one a success!
What’s it all about?
The generated Mímir index is used to power an analytics tool that supports the work of immunology experts. An example output is the graphic below: for each matrix cell the depth of colour indicates the number of documents in which the terms from the X and Y axes co-occur. In this case, the terms are ontology instances, so all different lexicalisations of the same concept are counted together.
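The counting behind such a matrix can be illustrated with a short Python sketch. This is not how the analytics tool is implemented (it queries the Mímir index); it is an in-memory illustration that assumes each document has already been reduced to the set of ontology instances it mentions. The instance identifiers used in the usage example are made up.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(doc_concepts, x_terms, y_terms):
    """Count, for each (x, y) pair, the documents that mention both terms.

    doc_concepts: iterable of collections of concept identifiers, one per
    document. Because these are ontology instances, different surface
    forms of the same concept are already merged.
    """
    counts = Counter()
    for concepts in doc_concepts:
        present = set(concepts)
        xs = [x for x in x_terms if x in present]
        ys = [y for y in y_terms if y in present]
        # Each document contributes at most one count per matrix cell.
        counts.update(product(xs, ys))
    return counts
```

For example, with `docs = [{"IL6", "asthma"}, {"IL6", "asthma", "TNF"}, {"TNF"}]`, calling `cooccurrence_counts(docs, ["IL6", "TNF"], ["asthma"])` gives a count of 2 for the cell `("IL6", "asthma")` and 1 for `("TNF", "asthma")`. The depth of colour in the matrix would then be derived from these counts.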
(1) Each processing node runs 6 parallel threads.
(2) We were trying to estimate the indexing capacity of a single Mímir server. The processing swarm started with 5 nodes, and we kept increasing its size whilst monitoring the Mímir server. At around 15 annotator nodes, the data throughput coming into the Mímir server started to exceed 1 Gbit/s and stabilised at around 1.2 Gbit/s. We were surprised to see that AWS instances can get network throughput higher than 1 Gbit/s, but we decided to stop pushing our luck. At this point, the CPU utilization of the Mímir server was averaging 50% (while indexing documents annotated by 15 nodes).