Parallel Ecore parsing in EMF
XML has emerged as a de-facto standard for semi-structured documents and is often at the core of modern web service architectures. Yet processing XML documents has a poor reputation regarding time and space performance. Although optimizations keep emerging, attacking the problem from different perspectives, none has been completely satisfactory. One very promising but also very complex approach is not to optimize linear parsing any further, but to exploit today's multi-core hardware and split the task of parsing into parallel processes.
In our project we explore the idea of parsing XML in parallel, building on related research, and develop an experimental approach for generated Ecore models. A parallel approach to improve parsing performance can be realised in several ways. One option is to divide the complete parsing procedure into stages, each executed by a different thread or processor, which may yield a substantial speedup. However, a drawback of this scenario is the additional complexity caused by controlled memory access, synchronisation and load balancing. A more reasonable approach, in our view, is a data-parallel parser: the XML file is divided into a number of chunks, each processor or thread works independently, and after every chunk has been parsed successfully the results are merged.
Ecore
The core of the EMF framework includes a meta-model called Ecore, which is used to describe models and also provides runtime support for them. It is based on the XML Metadata Interchange format and can be viewed as an implementation of a subset of the class diagrams of the MOF or UML metamodels. The EMF generator uses Ecore as the primary source of information when producing code.
Parallel Parsing - Approaches
The concept of parallelism can be applied at different levels of granularity. Parsing solutions are often hybrid variants of the following three base concepts:
A pipelined approach divides the parsing into multiple stages, which are then assigned to the available CPUs. While this approach can be efficient on a low number of cores, scaling and balancing to more cores can be difficult due to communication and synchronization bottlenecks. Pipelined algorithms typically use memory to transfer data between stages: one stage writes data to memory and the succeeding stage must read that data back again. That is one reason why such systems are usually designed for a fixed number of cores, and therefore do not scale when additional processing resources become available.
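The write-then-read memory transfer between pipeline stages can be sketched with two threads and a shared queue. This is a minimal illustration only, not EMF code; the class, the uppercase "work" and the end-of-stream marker are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal two-stage pipeline sketch: stage 1 writes tokens into an
// in-memory queue, stage 2 reads them back and processes them. The
// queue is exactly the write-then-read transfer described above.
public class PipelineSketch {

    private static final String END = "\u0000"; // hypothetical end-of-stream marker

    static List<String> runPipeline(List<String> tokens) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);
        List<String> processed = new ArrayList<>();

        Thread stage1 = new Thread(() -> {
            try {
                for (String t : tokens) queue.put(t); // stage 1: write to memory
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread stage2 = new Thread(() -> {
            try {
                String t;
                while (!(t = queue.take()).equals(END)) { // stage 2: read it back
                    processed.add(t.toUpperCase());       // stand-in for real work
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        stage1.start();
        stage2.start();
        try {
            stage1.join();
            stage2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(runPipeline(List.of("a", "b", "c")));
    }
}
```

Note that every token crosses the queue once, so the two stages can never run faster than that shared hand-off allows; this is the bottleneck the paragraph above refers to.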
A task-parallel approach aims to split the work into separate tasks that can be performed concurrently. The distinction from the pipelining approach is that, while in the former the output of one stage serves as the input of another, here the tasks are mostly independent of one another. Again this approach is limited in flexibility and scaling, since an implementation is typically written for a fixed number of cores.
A data-parallel approach divides the input data into pieces, each of which is assigned to a different core that performs all the processing for that piece. For an XML document, each core would parse a separate piece of the XML stream independently, and after all cores have finished, the results would be merged back together. Such an implementation can scale better, since adding a core only requires dividing the input into one more piece. The major issue is that all pieces have to be independent of each other: it must be possible to parse a piece without any information from any other piece, and since XML is a sequential format this poses a major problem.
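The divide / process-independently / merge scheme can be sketched with a parallel stream over pre-made chunks. This toy example only counts element openers per chunk and sums the partial results; the class and the chunking are hypothetical, and each chunk is already a self-contained element, which sidesteps the boundary problem real XML splitting must solve.

```java
import java.util.Arrays;
import java.util.List;

// Data-parallel sketch: one chunk per worker, an independent partial
// result per chunk, and a final merge (here, a sum) at the end.
public class DataParallelSketch {

    static long countElements(List<String> chunks) {
        return chunks.parallelStream()                  // process chunks concurrently
                     .mapToLong(chunk -> chunk.chars()  // count '<' openers per chunk
                                              .filter(c -> c == '<')
                                              .count())
                     .sum();                            // merge the partial results
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList("<a/><b/>", "<c/>");
        System.out.println(countElements(chunks)); // 3
    }
}
```

Because each chunk is processed without looking at any other chunk, adding a core merely means cutting the input into one more piece, which is the scaling advantage noted above.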
Our approach
The Java SAX parser used in EMF already keeps memory consumption during parsing very low, so we decided to tackle parsing speed instead. We also aimed to keep the modifications to EMF as small as possible, so we concentrated on changes to the existing source code. The basic structure of our parallel approach is split into three phases:
- Splitting a large Ecore file into multiple parts: There are many constructs, such as tags, attributes or simple content, that have to be considered. One option is to use extensive backtracking and communication to resolve this, but that would cause massive overhead and would probably void any speedup achieved by parallel processing. So instead of simply dividing the file into equally sized pieces, a decomposition based on the logical structure is the crucial and necessary part of a parallel parsing approach. Since we wanted to reuse as much original code as possible, we decided to split the initial file into multiple smaller files that each retain a parseable Ecore structure. That means each partial file contains the information (e.g. namespace declarations) needed to parse it independently of the others.
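A simplified version of this splitting step might look as follows. This is a sketch under strong assumptions, not the project's actual splitter: it scans for one known child tag (`eClassifiers`, as in a typical .ecore file) with explicit end tags, and ignores self-closing elements, nesting and cross-references. The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Splitting sketch: each top-level child element is wrapped in a copy
// of the root open/close tags, so every part carries the namespace
// declarations it needs and is independently parseable.
public class EcoreSplitter {

    static List<String> split(String document, String rootOpen,
                              String rootClose, String childTag) {
        List<String> parts = new ArrayList<>();
        String open = "<" + childTag;
        String close = "</" + childTag + ">";
        int pos = document.indexOf(open);
        while (pos >= 0) {
            int end = document.indexOf(close, pos);
            if (end < 0) break;                  // no matching end tag: stop
            end += close.length();
            // wrap the child in a copy of the root so it parses standalone
            parts.add(rootOpen + document.substring(pos, end) + rootClose);
            pos = document.indexOf(open, end);
        }
        return parts;
    }

    public static void main(String[] args) {
        String rootOpen =
            "<ecore:EPackage xmlns:ecore=\"http://www.eclipse.org/emf/2002/Ecore\">";
        String doc = rootOpen
            + "<eClassifiers name=\"A\"></eClassifiers>"
            + "<eClassifiers name=\"B\"></eClassifiers>"
            + "</ecore:EPackage>";
        for (String part : split(doc, rootOpen, "</ecore:EPackage>", "eClassifiers")) {
            System.out.println(part);
        }
    }
}
```

Each emitted part is a complete miniature Ecore document, which is exactly the property the splitting phase has to guarantee.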
- Parsing the smaller chunks in parallel threads: Parsing is the most expensive stage of our three-phase approach. Since we did not want to rewrite the whole existing Eclipse SAX handler, in order to keep the error sources to a minimum, our approach uses the same parsing code as EMF. For this reason we had to build the parallel parsing around the existing SAX code and the corresponding handlers. Given a split Ecore file, we use the implementation of asynchronous tasks available in Java 7: for every part a Future object is created and added to a collection, and a ParserRunnable executes the parsing method for its input in a background thread, saving the result in a resource reference. These Runnables are executed asynchronously to speed up the whole process.
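The Future-based plumbing of this phase can be sketched with standard `java.util.concurrent` facilities. In the real approach each task runs EMF's existing SAX handler and stores its result in a resource reference; here a plain JDK `SAXParser` that merely counts start tags stands in for that handler, so the executor/Future structure is the point, not the parsing itself. Class and method names are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Parallel parsing sketch: submit one Callable per chunk, collect the
// Futures, then wait on each Future to gather the per-chunk results.
public class ParallelParse {

    // Stand-in for EMF's SAX handler: counts start-element events.
    static int parseChunk(String xml) {
        final int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local,
                                             String qName, Attributes atts) {
                        count[0]++; // one unit of "parsing work" per element
                    }
                });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return count[0];
    }

    static List<Integer> parseAll(List<String> chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String chunk : chunks) {
                // one Future per part, added to a collection, as described above
                futures.add(pool.submit((Callable<Integer>) () -> parseChunk(chunk)));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : futures) {
                results.add(f.get()); // block until this chunk is parsed
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseAll(List.of("<a><b/></a>", "<c/>")));
    }
}
```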
- Merging the parsed sections back together: Every split segment contains all the necessary start and end tags, which allows us to merge the parsed resources. We start with the first two results and compare which elements of their tree structures are equal. If we find equal elements, we go one step deeper in the hierarchy and check again for every element; if an element is not present in the current result, we add it. Once all segments are processed, the resulting resource holds the whole Ecore file represented as objects in memory.
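The recursive comparison scheme of the merge phase can be sketched over a deliberately simplified tree: each node is just a map from element name to children. Real EMF resources hold EObjects with richer equality rules, so this class and its names are purely illustrative, but the descend-on-equal / adopt-on-missing logic follows the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Merge sketch: shared elements are descended into one level at a
// time; elements missing from the accumulated result are added.
public class TreeMerge {

    // One tree node: children keyed by element name.
    static final class Node {
        final Map<String, Node> children = new LinkedHashMap<>();

        Node child(String name) { // add-or-get helper for building trees
            return children.computeIfAbsent(name, n -> new Node());
        }
    }

    static void mergeInto(Node result, Node part) {
        for (Map.Entry<String, Node> e : part.children.entrySet()) {
            Node existing = result.children.get(e.getKey());
            if (existing != null) {
                mergeInto(existing, e.getValue());             // equal element: go deeper
            } else {
                result.children.put(e.getKey(), e.getValue()); // new element: add it
            }
        }
    }

    public static void main(String[] args) {
        Node a = new Node(); a.child("pkg").child("ClassA");
        Node b = new Node(); b.child("pkg").child("ClassB");
        mergeInto(a, b); // merge the second parsed result into the first
        System.out.println(a.child("pkg").children.keySet()); // [ClassA, ClassB]
    }
}
```

Applying `mergeInto` repeatedly over all segment results yields a single tree, mirroring how the segments accumulate into one resource in memory.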
Performance Evaluation
The resulting benchmarks, executed on 1, 2 or 4 threads, show that parsing an Ecore file in parallel does indeed bring noticeable performance improvements.
There is still room for improvement, though. Splitting an XML document correctly is not trivial, and that topic alone can consume a lot of time and research. The outcome of this first phase determines any further improvement: well-balanced parts can result in good performance, while the performance may not increase at all if the initial file cannot be split into equal chunks.
Furthermore, for our benchmarks we used generic generated test models. Although these can be considered valid and correct, there may be far more complex models that take much longer to parse or cannot be split and merged back together in a simple and time-effective way. Such models would have to be analysed and given special consideration before making a general statement about the validity of our approach.