Loop Execution of Data Transformation

Written by

Attention: This approach is obsolete. Jobflows introduced in later version of CloverETL are prefered for this kind of task.

Case study description

Czech Insolvency Registry (http://isir.justice.cz) basically contains data about economic subjects that entered insolvency and have financial difficulties with paying off their debts. The registry allows everybody to download data using public SOAP Web Service. It can be done manually or automatically with the right software.

CloverETL can easily help with the automatically download that would save time and technical difficulties. CloverETL graph can get required data by calling the web service, processes data and store it in required format. Unfortunately the Registry’s web service is very poorly designed. The service doesn't give you current status of each of the economic subjects, but provides the whole history of the required company. Therefore we have to download not only the current information we need but the whole information since the year 2008 (the registry foundation). That is a lot of data to process – actually thousands of log records for each company! Moreover the Registry’s Web service „GetIsirPub0012“ provides only maximum of 1000 records per one call. If one company has few thousands of records you have to undertake more calls.  So we have to download data in thousand-records bunches, but we don't know in advance exactly how many of these bunches (records) there are for each company. That makes the whole process quiet difficult.

But solution with CloverETL is simple. CloverETL Server provides features “graph event listener” and “groovy task” that help us with all the above described challenges. Firstly, we will of course design a CloverETL graph that processes for the beginning just one thousand –record bunch of data (see picture bellow).

Transformation graph

This graph has a parameter „startID“ which has value “0” by default. If we want to process 1000 records starting let’s say from no. 2541, then  start ID will be startID=2541, and the first downloaded record will be identified by. If we run graph without parameters, it'll download and process first thousand of records (no. 0 – 999).

Loop Execution

Graph also contains couple of components to store ID of last downloaded record so that  the next bunch to download may use the last ID  as the startID. It will be automatically stored to graph parameters as lasted parameter. It can be done by in-line Java code in Reformat component:
String id = GetVal.getString(source[0],"id");
getGraph().getGraphProperties().setProperty("lastID", id );

The loop

The graph we designed must be executed n-times to download and process all records. At first we don't know how many times, but we know, that we can stop the downloading process as soon as there are no more records to read. It means we can stop the process as soon as  “started” and “lasted” are equal.

How to achieve such loop?

Graph event listener

To achieve the automatic loop, for the graph that we designed and described previously, we'll define graph event listener for “FINISHED_OK” graph event on CloverETL Server. So every time transformation finishes without error („FINISHED_OK“), listener will trigger task that we selected. We need to specify this tasks now. Since we want to  execute the same graph repeatedly, we have to specify “execute graph” task. This task will repeat executing the graph indefinitely. However we need to stop this loop at some point. We need to “break the loop” when the startID and the lastID parameters are equal. Therefore it is actually better to create “groovy task” instead of „execute graph“.

Groovy task

Groovy is scripting language with Java syntax. In addition, groovy scripts may access java objects and use java libraries. See Groovy project site http://groovy.codehaus.org/ for more details.

We'll create a simple groovy script which decides whether execute the graph again or not. To decide it, we'll need to get graph properties from the finished graph. These properties are accessible by calling method event.getProperties().

Then, we'll need to execute graph using CloverETL Server Java API. It's done by calling method serverFacade.executeGraph().

Script may return String value which is stored in „Task history log“.

// these variables are predefined:
// sessionToken
// event
// serverFacade

import com.cloveretl.server.persistent.RunRecord;
import org.apache.log4j.Logger;
import com.cloveretl.server.api.*;
import org.springframework.web.context.WebApplicationContext;
import org.springframework.web.context.support.WebApplicationContextUtils;

Logger log = Logger.getLogger("groovy-ISIR-graphEventListener");

Properties eventProps = event.getProperties();
log.info("event properties: " + eventProps);

// get lastID and startID from previous graph execution
String lastIDString = eventProps.getProperty("lastID");
String startIDString = eventProps.getProperty("startID");
long lastID = Long.valueOf(lastIDString);
long startID = Long.valueOf(startIDString);

// lastID and startID from last graph execution are equal – break the loop
if (lastID == startID)
return "no more records to download";

// prepare startID which will be passed to next graph execution
Properties properties = new Properties();
properties.setProperty("startID", lastIDString);

String SANDBOX = eventProps.getProperty("SANDBOX_CODE");
String GRAPH = eventProps.getProperty("GRAPH_FILE");
GraphExecutionCommand graphExecutionCommand = new GraphExecutionCommand(
null, SANDBOX, GRAPH, null, null, null, true, properties, null, null);
Response respExec = serverFacade.executeGraph(sessionToken, graphExecutionCommand);
String result = "graph "+SANDBOX+"/"+GRAPH+" executed: "+respExec.getBean();
return result;

Graph results

All graph results for each bunch of data are stored to only one CSV file. They are always added, so don’t worry there is no danger that some of them will be overwritten :-). So when the whole batch of transformations is finished, we have only one CSV file with all processed records. Or if somebody wishes we can consolidate records and store them directly into database where the data can be stored in more friendly and usable format.

Posted on April 28, 2010.

Where to go next