Here's How to Embed Python Scripts into the CloverDX Data Pipeline
An easy learning curve and a rapidly growing ecosystem of libraries make Python (along with R) a favored choice for data prep and analytics. Python is no doubt one of the drivers behind the modern self-service approach of data scientists and business analysts, who can do much more data massaging on their own without needing any outside help.
However, I've learned the hard way that doing all your data massaging in Python will eventually lead you into the trap of reinventing the wheel. Parsing files, handling exceptions, or scheduling scripts in cron are just a few tedious jobs you shouldn't be doing yourself in the 21st century. Combine your Python skills with a data integration platform like CloverDX and you can focus on writing Python business logic and analytic pieces while leaving the boring (yet necessary) stuff to the platform. CloverDX can take care of data parsing and formatting, connecting to on-premise and cloud data sources, jobflow orchestration, automation, monitoring, scaling out, etc. CloverDX is designed to be the data backbone of an organization, and Python-based analysis and data manipulation can surely be part of it.
Let’s say I have some logic that I wrote in Python (let’s call it calculate_age.py - yes, amazingly it calculates a person’s age from the date of birth!) and I want to use this logic inside CloverDX.
Normally I would have to use the Reformat component and write the logic in CTL or Java. But with the help of Jython, a third-party library for integrating Python with Java, combined with the provided PythonBridge class (see below), I can use Python directly within CloverDX!
You need to have the Jython library JAR linked to your CloverDX project. Right-click the project in Navigator and select “Properties”. Go to Java Build Path > Libraries and then click “Add JARs” or “Add External JARs” (depending on whether you have the JARs in your project or elsewhere).
While on the Properties screen, check that you have Java SE Development Kit (JDK) installed. JDK is required for the PythonBridge class (see below). If it says “JRE System Library [jdk1.7.0_xx]”, it’s ok.
Python Development in Designer
If you want to create your Python scripts in CloverDX, we recommend installing PyDev, an open-source plugin for Eclipse. It is a Python IDE (integrated development environment) that supports Python editing with features like code completion, refactoring, quick navigation, templates, code analysis and many more.
Writing Python Script
We’re writing a simple Reformat transformation in Python instead of the default CTL or Java.
Reformat takes one input record, processes it using the transform() function (this is what we’ll write in Python) and creates one output record (with a potentially different structure).
How does it work? In the Reformat component we use the PythonBridge class, a custom piece of Java code that delegates Reformat’s transform() function to the Python script.
When writing your Python script, keep the following in mind:
You must define a transform() function, the main function for Reformat
For each incoming record, the transform() function is executed once, producing one output record
Input and output records are available via the “input_record” and “output_record” variables
For logging purposes you can use the “logger” object
This is my python/calculate_age.py script adapted to work as a Reformat transformation:
    from datetime import date

    def transform():
        # read fields using methods from clover_utils.py
        name = get_string_field("Name")
        surname = get_string_field("Surname")
        birth_date = get_date_field("BirthDate")
        country = get_string_field("Country")
        # start legacy process; it returns a list whose last item is
        # country == "United States of America"
        res = legacy_process_person(name, surname, birth_date, country)
        # set fields by legacy process output
        set_field("Age", res[0])
        set_field("FromUSA", res[1])
Notice the strange functions such as get_string_field(), get_date_field(), set_field(), etc. These are defined in python/clover_utils.py and are just simple data access functions working with the “input_record” and “output_record” objects.
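The post does not list clover_utils.py itself, so here is a minimal sketch of what such helpers might look like. Treat all of it as an assumption: FakeField and FakeRecord are hypothetical stand-ins I made up so the snippet runs outside CloverDX, and I assume the record objects are accessed through getField(name).getValue() and setValue(). Inside a graph, PythonBridge would supply the real “input_record” and “output_record”.

```python
from datetime import date


# Hypothetical stand-ins for the record objects, so the sketch runs on its own.
class FakeField(object):
    def __init__(self, value=None):
        self._value = value

    def getValue(self):
        return self._value

    def setValue(self, value):
        self._value = value


class FakeRecord(object):
    def __init__(self, **values):
        self._fields = dict((name, FakeField(v)) for name, v in values.items())

    def getField(self, name):
        # Create the field on first access so an output record can start empty.
        return self._fields.setdefault(name, FakeField())


input_record = FakeRecord(Name="Ada", Country="United Kingdom",
                          BirthDate=date(1815, 12, 10))
output_record = FakeRecord()


# Assumed shape of the helpers in clover_utils.py (not shown in the post).
def get_string_field(name):
    value = input_record.getField(name).getValue()
    return None if value is None else str(value)


def get_date_field(name):
    return input_record.getField(name).getValue()


def set_field(name, value):
    output_record.getField(name).setValue(value)


set_field("Age", 36)
print("%s %s" % (get_string_field("Name"),
                 output_record.getField("Age").getValue()))
```

The point of keeping these helpers tiny is that the transformation script never touches the record objects directly, so the same script body reads like plain Python.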
The code goes through three stages:
Reading record values into variables (name, surname, etc.)
Running legacy_process_person() function that returns a new set of values
And finally writing it back as output (set_field() calls).
Of course, this is a truly basic example, but that’s all the magic!
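The body of legacy_process_person() is not shown here, so the following is only a plausible sketch of the age calculation, not the original code. I assume that birth_date arrives as a datetime.date and that the function returns a two-item list [age, from_usa]. The optional “today” parameter is my own addition to make the example repeatable.

```python
from datetime import date


def legacy_process_person(name, surname, birth_date, country, today=None):
    """Return [age, from_usa]. Hypothetical reconstruction, not the original."""
    today = today or date.today()
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    age = today.year - birth_date.year - (0 if had_birthday else 1)
    return [age, country == "United States of America"]


res = legacy_process_person("John", "Doe", date(1990, 6, 15),
                            "United States of America",
                            today=date(2020, 6, 14))
print(res)  # [29, True]
```

Comparing (month, day) tuples avoids the off-by-one you get from simply subtracting years: the day before the 30th birthday still counts as 29.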
Make sure you have Jython installed (both 2.x and 3.x will work with this example).
Make sure you have JDK as your default Java environment (Window > Preferences > Java > Installed JREs).
Your CloverDX project should have the Jython JAR on its build path (right-click on the project in Navigator > Java Build Path > Libraries: jython-standalone-x.y.z.jar should be there)
Open graph/PythonIntegration.grf and Run it.
No need to worry if you encounter this error: Failed to install '': java.nio.charset.UnsupportedCharsetException: cp0. It is a known bug in Jython for Python 2.7 and 3.4 (http://bugs.jython.org/issue2222). You have three options:
Ignore the message (recommended)
Set Default VM arguments to “-Dpython.console.encoding=UTF-8”: go to Window > Preferences > Java > Installed JREs, select the current JDK and click Edit
Or use an earlier version of Python
Reformat (Python) Subgraph Explained
As you can see, I’ve wrapped the Reformat with PythonBridge into a reusable subgraph called Reformat(Python). This way I get not only a neat icon for the component, but also a user-friendly interface for setting PythonScriptURL and PythonScript parameters making reusing the “Python enabled Reformat” much more transparent - you just provide the Python script via the parameter!
Notice that PythonScriptURL and PythonScript parameters in Reformat(Python) subgraph are marked as unused in Outline. That’s not a bug. It’s the PythonBridge Java code that actually uses those parameters and unfortunately Designer can’t see into the Java code so it thinks it’s not used anywhere.
More on PythonBridge
If you wonder what PythonBridge actually is and you’re familiar with Java, you can adapt it to your needs. It’s a Java class that we created specifically for use with the Reformat component. Keep in mind it’s not a standard part of CloverDX.
How does it work?
First, it looks for graph/subgraph parameters PythonScriptURL or PythonScript
If PythonScriptURL is set, it gets priority over PythonScript (inline script)
For each record:
It creates variable bindings for input_record, output_record and logger
It executes the script’s transform() function
If there’s an error in your Python script, it will fail the transformation and report the error
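To make the steps above more concrete, here is a rough simulation of that flow in plain Python. It is not the PythonBridge source (that is a Java class); pick_script() and run_reformat() are hypothetical names, records are plain dicts, and I assume the script is loaded once and transform() is then called per record.

```python
import logging


def pick_script(params, read_url):
    """PythonScriptURL gets priority over the inline PythonScript parameter."""
    url = params.get("PythonScriptURL")
    if url:
        return read_url(url)
    return params["PythonScript"]


def run_reformat(params, records, read_url, logger):
    source = pick_script(params, read_url)
    namespace = {"logger": logger}
    # Run the script once so that transform() gets defined.
    exec(source, namespace)
    outputs = []
    for input_record in records:
        output_record = {}
        # Rebind the per-record variables the script expects to find.
        namespace["input_record"] = input_record
        namespace["output_record"] = output_record
        try:
            namespace["transform"]()
        except Exception as error:
            # An error in the script fails the whole transformation.
            raise RuntimeError("Python transform() failed: %s" % error)
        outputs.append(output_record)
    return outputs


inline_script = """
def transform():
    output_record["Greeting"] = "Hello " + input_record["Name"]
"""
params = {"PythonScriptURL": None, "PythonScript": inline_script}
result = run_reformat(params, [{"Name": "Ada"}, {"Name": "Bob"}],
                      read_url=lambda url: open(url).read(),
                      logger=logging.getLogger("PythonBridge"))
print(result)  # [{'Greeting': 'Hello Ada'}, {'Greeting': 'Hello Bob'}]
```

Because transform() looks up “input_record” and “output_record” as globals of the script, rebinding them before each call is enough; the script itself never needs to be reloaded.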
To use Python in other CloverDX components you will need to adapt the PythonBridge to match the interface of the particular component. We’ll cover this in some future blog post.
Python is a great tool for quickly implementing complex business logic or advanced analytics procedures. When you combine it with a platform like CloverDX that takes care of automating the pipeline and has built-in functionality for standard data manipulation, you can focus only on solving things that matter and streamline the rest.