Monday, July 28, 2014

Parallel Ecore parsing in EMF


XML has emerged as a de-facto standard for semi-structured documents and is often at the core of modern web service architectures. Processing XML documents, however, has a bad reputation regarding time and space performance. Although optimizations keep emerging, each addressing the issue from a different perspective, none has been completely satisfactory. One very promising, but also very complex, approach is not to optimize linear parsing any further but to make use of today's multi-core hardware and split the task of parsing into parallel processes.


In our project we explore the idea of parsing XML in parallel, based on related research, and develop an experimental approach for generated Ecore models. Parallelism can be introduced into parsing in several ways. One option is to divide the complete parsing procedure into different stages; each stage is then executed by a different thread or processor, which may yield a substantial speedup. One drawback of this scenario, however, is the additional complexity caused by the need for controlled memory access, synchronisation and load balancing. A more reasonable approach in our eyes is a data-parallel parser: the XML file is divided into a number of chunks, each processor or thread works independently, and after each chunk has been parsed successfully the results are merged.


Ecore

The core of the EMF framework is a meta-model called Ecore, which is used to describe models and to provide runtime support for them. Ecore models are serialized using the XML Metadata Interchange (XMI) format, and Ecore itself can be viewed as an implementation of a subset of the class diagram part of the MOF and UML metamodels. The EMF generator uses Ecore as the primary source of information when producing code.
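
To make this more concrete, the following minimal sketch builds a tiny Ecore model programmatically with the standard EMF API. The package and class names ("library", "Book", "title") are purely illustrative and not part of our project.

import org.eclipse.emf.ecore.EAttribute;
import org.eclipse.emf.ecore.EClass;
import org.eclipse.emf.ecore.EPackage;
import org.eclipse.emf.ecore.EcoreFactory;
import org.eclipse.emf.ecore.EcorePackage;

public class MinimalEcoreExample {
    public static void main(String[] args) {
        // An EPackage groups the classifiers of a model.
        EPackage pkg = EcoreFactory.eINSTANCE.createEPackage();
        pkg.setName("library");
        pkg.setNsPrefix("lib");
        pkg.setNsURI("http://www.example.org/library"); // illustrative URI

        // An EClass with a single EString attribute.
        EClass book = EcoreFactory.eINSTANCE.createEClass();
        book.setName("Book");
        pkg.getEClassifiers().add(book);

        EAttribute title = EcoreFactory.eINSTANCE.createEAttribute();
        title.setName("title");
        title.setEType(EcorePackage.Literals.ESTRING);
        book.getEStructuralFeatures().add(title);

        System.out.println(pkg.getName() + " contains "
                + pkg.getEClassifiers().size() + " classifier(s)");
    }
}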



Parallel Parsing - Approaches

The concept of parallelism can be applied at different levels of granularity. Parsing solutions are often hybrid variants of the following three base concepts:

A pipelined approach divides the parsing into multiple stages, each of which is assigned to one of the available CPUs. While this approach can be efficient on a small number of cores, scaling and balancing to more cores can be difficult due to communication and synchronization bottlenecks. Pipelined algorithms typically use memory to transfer data between stages, meaning that one stage has to write data to memory and the succeeding stage must read it back again. That is one reason why such systems are usually designed for a fixed number of cores and do not scale if additional processing resources become available.

A task-parallel approach aims to split the work into separate tasks that can be performed concurrently. The distinction from the pipelining approach is that, while in a pipeline the output of one stage serves as the input of the next, here the tasks are mostly independent of one another. Again, this approach is limited in flexibility and scaling, since an implementation is typically written for a fixed number of cores.

A data-parallel approach divides the input data into pieces, where each piece is assigned to a different core. That core then takes care of all the processing for its piece. For an XML document, each core would parse a separate piece of the XML stream independently, and after all cores have finished, the results would be merged back together. Such an implementation scales better, since adding another core only requires dividing the input data into one more piece. The major issue is that all the pieces have to be independent of each other: it must be possible to parse a piece without any information from any other piece, and since XML is a sequential format this poses a real problem.


Our approach

The Java SAXParser used by EMF already keeps memory consumption during parsing very low, so we decided to tackle parsing speed instead. We also wanted to keep the modifications to EMF as small as possible, so we concentrated on adapting the existing source code. The basic structure of our parallel approach is split into three phases:

  • Splitting a large Ecore file into multiple parts: There are many constructs like tags, attributes or simple content that have to be considered. One option is to use extensive backtracking and communication to resolve ambiguities, but this would cause massive overhead and would probably void any speedup achieved by parallel processing. So instead of simply dividing the file into equally sized pieces, a decomposition based on the logical structure is the crucial and necessary step for parallel parsing approaches. Since our approach was to reuse as much original code as possible, we decided to split the initial file into multiple smaller files that still form parseable Ecore documents. That means each partial file contains the information (e.g. namespace declarations) needed to parse it independently of the others.

  • Parsing the smaller chunks in parallel threads: Parsing is the most expensive stage in our 3-phase approach. Since we did not want to rewrite the existing Eclipse SAX handler, in order to keep the sources of error to a minimum, our approach uses the same parsing code as EMF. For this reason we had to build the parallel parsing around the existing SAX code and the corresponding handlers. Given a split Ecore file, we use the asynchronous task facilities available in Java 7: for every part a 'Future' object is created and added to a collection, and a ParserRunable executes the parsing method for its input in a background thread, saving the result in a resource reference. These runnables are executed asynchronously to speed up the whole process (a sketch follows after this list).

  • Finally, the parsed sections need to be merged together: Every split segment contains all the necessary start and end tags, so the parsed resources can be merged back together. We start with the first two results and compare which elements of their tree structures are equal. If we find equal elements, we go one step deeper in the hierarchy and check again for every element. If we encounter an element that is not yet present in the current result, we add it. Once all segments have been processed, the resulting resource holds the whole Ecore file represented as objects in memory.
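
To make phase 2 concrete, the following minimal sketch parses previously split chunk files concurrently with Java Futures. It only shows the general shape of our approach: the class and method names are illustrative, the split files are assumed to exist already, and the merge of phase 3 is only hinted at.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.eclipse.emf.common.util.URI;
import org.eclipse.emf.ecore.resource.Resource;
import org.eclipse.emf.ecore.resource.impl.ResourceSetImpl;
import org.eclipse.emf.ecore.xmi.impl.XMIResourceFactoryImpl;

public class ParallelChunkLoader {

    public List<Resource> parseChunks(List<String> chunkFiles) throws Exception {
        // In a standalone run the *.ecore extension must be mapped to a resource factory.
        Resource.Factory.Registry.INSTANCE.getExtensionToFactoryMap()
                .put("ecore", new XMIResourceFactoryImpl());

        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<Resource>> futures = new ArrayList<Future<Resource>>();

        for (final String file : chunkFiles) {
            // One Future per chunk; each task runs EMF's ordinary SAX-based loading.
            futures.add(pool.submit(new Callable<Resource>() {
                public Resource call() throws Exception {
                    // A separate ResourceSet per task avoids shared state between threads.
                    return new ResourceSetImpl().getResource(URI.createFileURI(file), true);
                }
            }));
        }

        List<Resource> parsed = new ArrayList<Resource>();
        for (Future<Resource> future : futures) {
            parsed.add(future.get()); // wait for each chunk; phase 3 then merges the resources
        }
        pool.shutdown();
        return parsed;
    }
}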

Performance Evaluation 

The resulting benchmarks, executed with 1, 2 or 4 threads, show that parsing an Ecore file in parallel does indeed bring noticeable performance gains.



There is still room for improvement, though. Correctly splitting an XML document is not trivial, and that topic alone can consume a lot of time and research. The outcome of this first phase determines the possible gains: well-balanced parts can result in good performance, while performance may not increase at all if the initial file cannot be split into chunks of roughly equal size.

Furthermore, for our benchmarks we used generic, generated test models. Although these can be considered valid and correct, there may be far more complex models that need much more time for parsing or cannot be split and merged back together in a simple and time-effective way. Such models would have to be analysed and given special consideration before a general statement about the validity of our approach can be made.

Monday, July 07, 2014

Improved OCL Expression Evaluation on Henshin Generated State Spaces


Model checking is becoming increasingly relevant in many scientific and industrial areas. This can also be observed in the field of Model Engineering, where OCL expressions allow the formulation of versatile constraints for models and also for state spaces that emerge from the incremental application of model transformations to models. But checking OCL expressions over state spaces means verifying all constraints in every single intermediate state, which clearly leads to serious efficiency and performance issues when done naively. In this article we demonstrate possible improvements to overcome these issues.

Background

There are quite different motivations for using OCL expressions that require efficient analysis; at least two research directions can be identified. The first deals with OCL expression evaluation within IDEs, supporting developers who work on huge heterogeneous models with many, often complex, OCL expressions. Here efficiency matters because people working with IDEs expect fast responses to their actions. It is not acceptable for modellers when changes to their models or OCL expressions trigger evaluation cycles that take minutes to present results; ideally evaluation happens in the background and in real time to provide immediate feedback on the impact of the modellers' actions.

The second direction deals with the evaluation of less complex OCL expressions on large homogeneous data, for example databases or state spaces representing possible program execution paths. Here the importance of efficiency arises from the size of the model rather than from real-time feedback to the user: without an efficient strategy, consistency evaluation of growing models quickly becomes intractable.

Implementation

Practical solutions exist for both directions, but the implementations vary strongly. For real-time evaluation of complex models with complex constraints, approaches like graph pattern matching have proved usable: constraints are defined as rules in the form of graph patterns, which are matched against the underlying models, themselves represented as graphs. Another possibility is to keep meta-data about the model structure and state, to optimize OCL expressions before use by splitting them up or simplifying them, and to combine this meta-data with the change events applied to the model to obtain a high-performance solution. Yet another approach for evaluating constraints on big datasets converts OCL to the Monoid Calculus: the models to be evaluated are stored in databases and the transformed constraints are expressed as corresponding stored procedures, which supports efficient execution within database environments.

Important for nearly all approaches is that the whole model should not be re-evaluated every time a change is applied to it. Instead, one starts from a clean state and iteratively processes the change events on the model to estimate whether specific constraints could be affected or not. This ensures that the model is always kept in a valid state and that efficiency does not degrade as the amount of data grows.
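
As an illustration of this event-driven idea in the EMF context, the following minimal sketch attaches an EContentAdapter to a model and reacts only to changes of features that some constraint refers to. The constraint registry and the re-evaluation call are hypothetical placeholders; the Impact Analyzer provides a far more sophisticated analysis.

import java.util.Collections;
import java.util.List;

import org.eclipse.emf.common.notify.Notification;
import org.eclipse.emf.ecore.EObject;
import org.eclipse.emf.ecore.EStructuralFeature;
import org.eclipse.emf.ecore.util.EContentAdapter;

public class ConstraintImpactListener extends EContentAdapter {

    @Override
    public void notifyChanged(Notification notification) {
        super.notifyChanged(notification); // keeps the adapter attached to newly added children
        Object feature = notification.getFeature();
        Object notifier = notification.getNotifier();
        if (feature instanceof EStructuralFeature && notifier instanceof EObject) {
            // Only constraints that mention the changed feature are re-checked,
            // and only on the object that actually changed.
            for (String constraint : constraintsFor((EStructuralFeature) feature)) {
                reEvaluate(constraint, (EObject) notifier);
            }
        }
    }

    // Hypothetical lookup: which OCL constraints refer to this feature?
    private List<String> constraintsFor(EStructuralFeature feature) {
        return Collections.emptyList();
    }

    // Hypothetical re-evaluation of a single constraint on a single context object.
    private void reEvaluate(String constraint, EObject context) {
        // e.g. delegate to an OCL query evaluation
    }
}

The adapter would be registered on the model root, e.g. via root.eAdapters().add(new ConstraintImpactListener()), so that it receives notifications for the whole containment tree.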

Laboratory Environment and Approach

In the following we give insights into the environment and the approach for the laboratory part of our work. As environment we chose the Eclipse Modelling Framework together with the integrated Eclipse OCL implementation. Eclipse OCL provides a feature called Impact Analyzer that promises iterative and efficient evaluation of OCL constraints on EMF models. We decided to base our approach on it, with Henshin for state space generation; the combination of these formed the basis of our laboratory work. Setting up the environment is mostly straightforward, but one important adaptation is necessary to reproduce the exact environment we used: some dependencies must be added to the project manually because they are not available out of the box. The following figure lists which ones must be added.




Gossiping Girls Problem

We chose the Gossiping Girls Problem as a starting point for our research. It is based on a group of girls, each of whom can have an arbitrary number of secrets. A girl can start a phone call with any other girl of the group; after the call, she knows all secrets of the other girl that were previously unknown to her, and vice versa. The question to answer is how many calls are at most necessary until every girl knows all existing secrets. A small plain-Java sketch of the call semantics follows; the related EMF Ecore model for the problem is then shown in the graph below.
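
As a minimal, model-independent illustration of the call semantics (this class is only a hypothetical stand-in, not part of the Ecore model), a call can be thought of as a symmetric union of the two secret sets:

import java.util.HashSet;
import java.util.Set;

public class Girl {

    private final Set<String> secrets = new HashSet<String>();

    public Girl(String ownSecret) {
        secrets.add(ownSecret);
    }

    // After a call, both participants know the union of their secrets.
    public void call(Girl other) {
        Set<String> union = new HashSet<String>(secrets);
        union.addAll(other.secrets);
        secrets.addAll(union);
        other.secrets.addAll(union);
    }

    public Set<String> knownSecrets() {
        return secrets;
    }
}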





To generate a state space with Henshin, suitable transformation rules representing the behaviour of the problem must be provided. We present them in the following graph.


 

Impact Analyzer on Henshin State Space

To integrate the functionality of the Impact Analyzer into Henshin-generated state spaces and their OCL expression evaluation, Henshin's internal processing must be adapted. The first problem is that Henshin first generates the complete state space by iteratively applying rules and only afterwards evaluates the OCL constraints on it as a whole. This technique is especially bad for state spaces that grow much bigger than the available memory can hold. For this reason we changed the behaviour to an iterative one: the state space is expanded by applying the transformation rules once, and afterwards OCL expression evaluation is triggered on the newly expanded set. The previous set can then be reduced to free space for new expansions. This stepwise approach also supports the integration of the Impact Analyzer, which works in an event-listening and model-intrusive way. A schematic sketch of the stepwise loop is shown below.
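
Schematically, the changed processing can be pictured as the following loop. The expansion and evaluation functions stand in for Henshin's rule application and the Eclipse OCL/Impact Analyzer evaluation; they are placeholders, not actual Henshin API calls.

import java.util.HashSet;
import java.util.Set;

public class StepwiseExploration<S> {

    public interface Expander<S> {
        Set<S> expandOnce(Set<S> frontier); // apply all transformation rules once
    }

    public interface Evaluator<S> {
        void evaluate(Set<S> newStates);    // trigger OCL evaluation on the new states
    }

    public void explore(Set<S> initialStates, Expander<S> expander, Evaluator<S> evaluator) {
        Set<S> seen = new HashSet<S>(initialStates);
        Set<S> frontier = new HashSet<S>(initialStates);
        evaluator.evaluate(frontier);
        while (!frontier.isEmpty()) {
            Set<S> expanded = expander.expandOnce(frontier);
            expanded.removeAll(seen);          // keep only states reached for the first time
            evaluator.evaluate(expanded);      // evaluate constraints step by step
            seen.addAll(expanded);
            frontier = expanded;               // the previous frontier can now be freed
        }
    }
}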

Results and Benchmarks

Unfortunately, the desired positive effects and efficiency improvements promised by the Impact Analyzer could not be achieved in our test setup. This is due to the way Henshin stores the models resulting from a single exploration step of the state space. We suppose that Henshin generates a new set of models during state space generation, thus invalidating the change event listeners that the Impact Analyzer depends on to compute the necessary re-evaluations for a given OCL expression. Therefore the following benchmarks do not show any performance improvements; instead, due to the changed stepwise processing of OCL expression evaluation, the execution times rose dramatically.
Without Impact Analyzer
With Impact Analyzer


Further Work

The next research steps toward the desired goal in this context would deal with the following issues:

  1. Detailed analysis of the underlying EMF model data structure managed by Henshin in relation to the EMF model of the Gossiping Girls Problem.
  2. Code analysis of the stepwise exploration processing to understand how Henshin decides which transitions are added to the model of the previous step.
  3. Investigate how Henshin maintains the data structure between different exploration steps.
  4. Try to find a solution for a working integration of the Impact Analyzer with the gained knowledge of the previous steps.

Sunday, July 06, 2014

Code highlighting/completion for MocOCL Web-Interface

The motivation for this project is to enhance the user experience of the existing MocOCL user interface by extending it with two features: code-completion and syntax-highlighting. The existing web interface offers only a simple HTML text field in which the user specifies the cOCL expressions to be evaluated by MocOCL. The evaluation request is then sent to an application server that handles the processing of the evaluation. Unfortunately, the user has to know the structure of the underlying model exactly in order to formulate these cOCL expressions correctly.

Code-Completion

The code-completion feature suggests keywords or properties that are likely to occur at the current position within the cOCL expression. The keywords contained in the suggestions are specified by a grammar, which is defined in a file using Xtext. Therefore we have to extract the keywords from the existing Xtext grammar and provide them as content for the code completion. How an Xtext file can be parsed is documented below.

Syntax-Highlighting

Syntax-highlighting means that essential keywords of the cOCL language visually stand out from the rest of the code, providing a better overview of the written cOCL expression. It supports users in grasping the structure of a cOCL expression and understanding it more quickly when reading it. The keywords are defined in an Xtext grammar file, as mentioned previously. The syntax-highlighting was realized using CodeMirror, a JavaScript framework for advanced editor functionality.

Implementation of Syntax-Highlighting using CodeMirror

CodeMirror is a versatile text editor implemented in JavaScript, focused on providing advanced functionality for editing code. CodeMirror can be integrated into a simple HTML page as follows:

<!-- CodeMirror -->
<link rel="stylesheet" href="codemirror/lib/codemirror.css">


<!-- Highlighting -->
<script src="codemirror/lib/codemirror.js"></script>
<script src="codemirror/mode/mococl/mococl.js"></script>


<!-- Completion -->
<link rel="stylesheet" href="codemirror/addon/hint/show-hint.css">
<script src="codemirror/addon/hint/show-hint.js"></script>
<script src="codemirror/addon/hint/mococl-hint.js"></script>


CodeMirror comes with a simple directory structure. The main functionality is located in the file codemirror.js in the lib directory, and codemirror.css defines the default style for the editor. The MocOCL mode, which defines the actual language to highlight, is specified in mococl.js. The optional files mococl-hint.js, show-hint.js and show-hint.css, located in the addon/hint directory, are required to offer code-completion.
The enhancement of an HTML text area element is done as follows:

<label for="codeArea">Code:</label><br />
<textarea id="codeArea" rows="20" cols="100"></textarea>

<script>

var editor = CodeMirror.fromTextArea(document.getElementById("codeArea"), {
    mode: "mococl",
    value: "Start Code...",
    extraKeys: {"Ctrl-Space": "autocomplete"},
    textWrapping: true,
    lineNumbers: true,
    smartIndent: false
});
</script>


The extraKeys definition triggers the auto-completion implemented in the JavaScript files, which requests the keywords/properties from the application via an HTTP GET request. An example of the auto-completion in action is shown below.



Xtext parsing at the application server

The server-side implementation is responsible for parsing the provided Xtext and Ecore files as well as for processing the HTTP GET requests it receives from the client side.

Since the Xtext parsing is not done within the Eclipse environment, a standalone Xtext setup must be used and dependencies that would otherwise be provided by Eclipse must be added. This concerns libraries from org.eclipse.emf, org.eclipse.xtext, guice, guava and javax.inject. The following code demonstrates how a standalone parser can be obtained:

import org.eclipse.xtext.parser.IParser;

import com.google.inject.Inject;
import com.google.inject.Injector;

public class XtextParser {

    @Inject
    private IParser parser;

    public XtextParser() {
        setupParser();
    }

    private void setupParser() {
        // XtextStandaloneSetup is the standalone setup class generated for the grammar;
        // it registers the EMF packages and provides the Guice injector.
        Injector injector = new XtextStandaloneSetup().createInjectorAndDoEMFRegistration();
        injector.injectMembers(this);
    }
}


After initialisation the parser can be used to parse Xtext files. It generates the abstract syntax tree (AST) and returns the root element.

public EObject parse(Reader reader) throws IOException {
    IParseResult result = parser.parse(reader);
    if (result.hasSyntaxErrors()) {
        // ParseException is an application-specific exception type of the web service.
        throw new ParseException("Provided input contains syntax errors.");
    }
    // Root of the abstract syntax tree of the parsed grammar file.
    return result.getRootASTElement();
}


Starting from the root element, we recursively obtain all contained objects and check whether they are instances of AssignmentImpl, KeywordImpl or ParserRuleImpl. These instances contain the required information, which is collected and sent to the client for the auto-completion. A sketch of this traversal is shown below.
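
The traversal can be sketched as follows; checking against the interfaces Keyword, Assignment and ParserRule covers the *Impl classes mentioned above (the collector class and method names themselves are illustrative):

import java.util.LinkedHashSet;
import java.util.Set;

import org.eclipse.emf.common.util.TreeIterator;
import org.eclipse.emf.ecore.EObject;
import org.eclipse.xtext.Assignment;
import org.eclipse.xtext.Keyword;
import org.eclipse.xtext.ParserRule;

public class CompletionWordCollector {

    public Set<String> collect(EObject grammarRoot) {
        Set<String> words = new LinkedHashSet<String>();
        TreeIterator<EObject> contents = grammarRoot.eAllContents();
        while (contents.hasNext()) {
            EObject element = contents.next();
            if (element instanceof Keyword) {
                words.add(((Keyword) element).getValue());        // literal keywords of the grammar
            } else if (element instanceof Assignment) {
                words.add(((Assignment) element).getFeature());   // assigned feature names
            } else if (element instanceof ParserRule) {
                words.add(((ParserRule) element).getName());      // rule names
            }
        }
        return words;
    }
}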





Tuesday, July 01, 2014

Refactoring UML models: Co-refactoring fUML conform class diagrams and activity diagrams

By Kristof Meixner and Sebastian Geiger on 30. June 2014

In this blog post we introduce our project and the results of our work on refactorings. The work has been done as part of the lecture "Advanced Model Engineering". Our task was to evaluate and create model refactorings for fUML-conform UML models. As part of this work it was also required to formulate pre- and postconditions in OCL for each of the refactorings we created, which ensure that the refactorings preserve the model semantics.

But let's start with some basic terms first. What is model refactoring? What is fUML and what is model execution? And what is semantic preservation?

Everyone familiar with software development knows refactoring. When we talk about refactoring we usually think of source code written in languages such as Java or C and of actions such as renaming a variable or a class, or extracting code into a new method. Model refactoring is basically the same, except that we refactor models rather than source code. The actions include simple ones such as renaming a Class, a Property (the UML term for fields) or an Operation (the UML term for methods), but also more complex ones such as encapsulating a property. As part of our work we implemented the following refactorings:
  • Rename class
  • Rename property
  • Rename operation
  • Encapsulate property
  • Pull up property
  • Pull up operation
  • Extract super class
There is existing research on model refactorings, but so far it has focused on refactoring the static parts of UML, in particular class diagrams. Our work also takes the dynamic parts, that is activity diagrams, into consideration. Why this is important becomes clear after a look at fUML.

fUML is a subset of UML that defines additional semantics such that fUML-conform UML models can be executed. This means that when we consider refactorings of executable fUML models, we must consider both class and activity diagrams. Since class and activity diagrams are connected, we cannot simply change the class diagram without also performing a corresponding change in the activity diagrams. This procedure is called co-refactoring. Accordingly, each of our refactorings of a class diagram performs a co-refactoring of the related activity diagrams if necessary.

Co-refactoring activity diagrams means that the semantics of the affected activities can potentially change. This requires measures to ensure that the model is not changed in incorrect ways or in ways that do not preserve its semantics. There are two approaches to tackle this problem. The static approach evaluates the model before and after the refactoring with OCL constraints: such pre- and postconditions define constraints that the model must satisfy, and with them it can be ensured that the model is in a state that rules out incorrect refactorings. A sketch of such a precondition check follows below.
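
As an illustration, the following sketch evaluates one such precondition with the Eclipse OCL Ecore binding on a UML2 class: before a "rename operation" refactoring, the owning class must not already contain an operation with the new name. The concrete OCL expression and class names here are illustrative assumptions, not the exact conditions used in our implementation.

import org.eclipse.emf.ecore.EClassifier;
import org.eclipse.emf.ecore.EOperation;
import org.eclipse.emf.ecore.EStructuralFeature;
import org.eclipse.ocl.ParserException;
import org.eclipse.ocl.ecore.Constraint;
import org.eclipse.ocl.ecore.EcoreEnvironmentFactory;
import org.eclipse.ocl.ecore.OCL;
import org.eclipse.ocl.helper.OCLHelper;
import org.eclipse.uml2.uml.UMLPackage;

public class RenameOperationPrecondition {

    public boolean check(org.eclipse.uml2.uml.Class owner, String newName) throws ParserException {
        OCL ocl = OCL.newInstance(EcoreEnvironmentFactory.INSTANCE);
        OCLHelper<EClassifier, EOperation, EStructuralFeature, Constraint> helper =
                ocl.createOCLHelper();
        // UML2 is an Ecore-based metamodel, so a UML metaclass can serve as OCL context.
        helper.setContext(UMLPackage.Literals.CLASS);
        // Precondition: no operation of the class already uses the new name.
        Constraint pre = helper.createInvariant(
                "self.ownedOperation->forAll(op | op.name <> '" + newName + "')");
        return ocl.createQuery(pre).check(owner);
    }
}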

The other approach is a dynamic one: the model is executed before and after the refactoring, and the execution traces as well as the resulting model states of the refactored and non-refactored models are compared.

For each of the refactorings mentioned above we have implemented the required model transformations, including the co-refactoring of activities, as well as pre- and postconditions. In order to evaluate our refactorings and explain them in our report, we have created a comprehensive example model including class and activity diagrams. We have also written a test suite with test cases for each refactoring that executes the defined pre- and postconditions and performs the model refactoring.