CloverDX is a new name for CloverETL.
Ever wondered what to do with those annoyingly slow operations inside otherwise healthy and fast transformations? You’ve done everything you could do to meet the processing time window, and now there’s this wicked API call that looks up some data, or a calculation that just sits there and takes ages to complete, record by record, dripping like water from an old faucet. And yet, all that sits on a beefy machine that’s capable of running dozens of these operations.
So naturally, you’d go from this:
To something like this (old-fashioned now!):
This approach gets you the desired increase in throughput, for sure. You split the stream into four parallel branches in a round robin, wait for those slow (now parallel) operations to finish and then collect everything back into a single file.
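To make the branching idea concrete, here is a minimal sketch in plain Python (not CloverDX code). It deals records out to four branches in a round robin, runs a slow step on each branch in parallel, then gathers everything back into one list. The names `slow_operation`, `round_robin_split` and `run_branched` are made up for illustration, and the "slow" step is just a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def slow_operation(record):
    # Hypothetical stand-in for the slow API call or calculation.
    return record * 2


def round_robin_split(records, branches):
    # Record i goes to branch i % branches, one record at a time.
    buckets = [[] for _ in range(branches)]
    for i, record in enumerate(records):
        buckets[i % branches].append(record)
    return buckets


def process_branch(bucket):
    # Each branch runs the same slow step on its share of the records.
    return [slow_operation(r) for r in bucket]


def run_branched(records, branches=4):
    buckets = round_robin_split(records, branches)
    with ThreadPoolExecutor(max_workers=branches) as pool:
        results = list(pool.map(process_branch, buckets))
    # Simple gather: concatenate the branch outputs one after another.
    # Note that this does not restore the original record order.
    return [r for branch in results for r in branch]
```

Even in this toy version, the number of branches is baked into the structure: every branch is a copy of the same step, which is exactly what gets painful when the copies are real components with real configuration.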
There are numerous drawbacks to branching as a way of parallel execution:
Do these troubles ring a bell? I bet they do.
Let’s look at Data Partitioning, a feature that has long been part of CloverDX Cluster and, since version 4.2, is available in CloverDX Server Corporate as well. It’s an elegant answer to the above issues for performance scaling.
With Data Partitioning, you don’t create additional branches. Neither do you have to jump to a completely different design philosophy, convert to a different job type, use add-ons or even a different engine.
You simply specify the number of instances you want a particular component or bunch of components to spawn. That’s it. In the background, CloverDX will start multiple workers to handle the operation in parallel for you.
The beauty lies in the simplicity of changing the level of parallelism and keeping the transformation clean without duplicating components and their configurations.
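As a rough analogy only (plain Python, not how CloverDX is implemented or configured), the sketch below treats the level of parallelism as a single number passed to a worker pool. The step itself is written once; changing `instances` is the only thing needed to try a different degree of parallelism. `slow_lookup`, `run_partitioned` and `timed` are hypothetical names, and the sleep simulates a slow call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_lookup(record, delay=0.05):
    # Hypothetical slow step: pretend each record needs a remote lookup.
    time.sleep(delay)
    return record.upper()


def run_partitioned(records, instances):
    # The step is defined once; `instances` alone sets how many
    # workers run it at the same time.
    with ThreadPoolExecutor(max_workers=instances) as pool:
        return list(pool.map(slow_lookup, records))


def timed(records, instances):
    # Return the results together with the elapsed wall-clock time.
    start = time.perf_counter()
    out = run_partitioned(records, instances)
    return out, time.perf_counter() - start
```

Calling `timed(records, 1)` and then `timed(records, 8)` on the same input is a quick way to see the effect of the single setting. In this sketch `pool.map` returns results in input order; that is a property of the Python helper, not a statement about how CloverDX orders partitioned records.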
Data Partitioning actually solves performance scaling issues on multiple fronts:
Data Partitioning can speed up your processing by a factor of ten or twenty. It also makes it extremely easy to try out what happens when you execute operations in parallel. Let’s look at a few examples where this is particularly useful.
However, no matter how great the above examples might turn out, don’t miss the last section of this post for a list of things to consider when going parallel.
Data Partitioning has been around for a while now in CloverDX Cluster. With the release of 4.2 in June 2016, we made this available to Server Corporate as well.
So, to get Data Partitioning, all you need to do is upgrade your Server Corporate to the latest version and read on to learn how to use it.
You can also try it on our CloverDX Server Demo instance. Go to Try CloverDX to get the latest Designer and use the built-in Server examples to play around.
Data Partitioning is NOT available in Designer running local projects (you can design such graphs, but they won’t run), nor is it available in the Community Edition.
You will need to grasp a bunch of concepts borrowed from Cluster to use Data Partitioning but none of it is rocket science. Here are a few basics:
Obviously, not everything can be run in parallel and you don’t always want to overload your servers with arbitrarily greedy transformation graphs. So consider these:
That’s all for this little introduction to the fabulous world of turning regular transformations into parallel heavens.