An Overview of the Sitecore Data Exchange Framework
Sitecore released the first version of the Data Exchange Framework (DXF) last summer alongside Sitecore 8.2, and has already shipped two updates, the most recent being 1.2 in December. They built the Microsoft CRM connector on it, but apart from the guide that comes with the DXF documentation, there isn't much that explains how to use it, and unless you build something on top of it yourself, there isn't much you can do with it.
The guide walks you through creating a new File End Point and Pipeline Step, but it doesn't do a great job of it. The code provided with the guide loads the entire file into memory and then builds a custom array provider to access each column of a record, despite having the forethought to include a checkbox that lets you specify that the first record contains column names.
Over the past few months, I've been extending the framework with additional capabilities that make it easier to integrate other data sources with Sitecore and xDB without having to write any custom code. I've fixed the file provider, implemented an RSS reader provider and even built a generic database reader that can work with SQL, OLE or ODBC to process the results of a custom configured query. I'll post details about those and other modifications I've made or am planning to make in the future.
Today, I figured I'd give a good overview of how things are supposed to work when designing pipeline batches with DXF, and what is available out of the box (OOB) in both the core DXF framework and the components the Sitecore provider gives you. To make it easy to follow along, here's a diagram that shows the typical pipeline flow for a pipeline batch that reads data from one system and saves it to another.
The first part of any good data routine is actually going and fetching the data. OOB, DXF provides nothing that helps feed the pipe. The guide shows you how to build a custom file end point and read pipeline step, but in most cases you'll need to go custom here.
So when building your own Read Pipeline Step, keep in mind that you should try to keep the process streaming. If you load everything into memory, you limit your ability to deal with larger data sets. If you have access to a stream, consider using "yield return" to ensure you're not loading more data than you're processing at a time.
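To make the streaming point concrete, here's a minimal sketch of a lazy line reader built on "yield return". It is plain .NET and not part of the DXF API; the class and method names (StreamingReader, ReadLines) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not a DXF type.
public static class StreamingReader
{
    // Lazily yields one line at a time. Nothing is read from the
    // underlying reader until the caller starts enumerating.
    public static IEnumerable<string> ReadLines(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
```

Because the method is an iterator, only the line currently being processed has to be in memory. For files on disk, the built-in File.ReadLines gives you the same lazy behaviour, unlike File.ReadAllLines, which loads everything up front.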
The other consideration is what format to save the data in. DXF only requires that you create an IterableDataSettings object, which can take literally any iterable object. While you can use the property reader to get at simple POCO objects, or follow the guide and build your own accessor objects, I recommend mapping records to a string dictionary, as it's fairly lightweight and easy to manipulate, and you can have your Accessors configure which "keys" to use to access values. You could use Item Model, which is itself already a Dictionary of objects, but that brings along some other fields related to managing Sitecore Items.
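As a rough illustration of the string dictionary idea, here's a sketch that turns delimited lines into dictionaries keyed by column name, honouring a "first record has column names" flag. Everything here (RecordMapper, ToRecords, the parameter names) is hypothetical and not taken from DXF, and the naive Split does not handle quoted fields.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not a DXF type.
public static class RecordMapper
{
    // Turns delimited lines into string dictionaries keyed by column name.
    // When firstLineHasNames is false, keys fall back to the column
    // index as a string ("0", "1", ...).
    public static IEnumerable<Dictionary<string, string>> ToRecords(
        IEnumerable<string> lines, char delimiter, bool firstLineHasNames)
    {
        string[] names = null;
        foreach (var line in lines)
        {
            var values = line.Split(delimiter);
            if (firstLineHasNames && names == null)
            {
                names = values; // header row: remember the names, emit nothing
                continue;
            }

            var record = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            for (var i = 0; i < values.Length; i++)
            {
                var key = names != null && i < names.Length ? names[i] : i.ToString();
                record[key] = values[i];
            }
            yield return record;
        }
    }
}
```

Since this is also an iterator, it stays streaming when it's fed from a lazy source, and the resulting sequence of dictionaries is the kind of iterable you could hand to the rest of the pipeline.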
One thing to note is that you can save your data to a queue instead of the IterableDataSettings plugin. The existing queue implementation is actually an in-memory C# Queue, so I see little advantage to it, but extending the framework to use a real out-of-process queue may greatly increase the scalability of the process.
DXF ships with an out-of-the-box pipeline step that iterates through data and executes another pipeline for each record. This is the step that allows you to go from the macro level (pulling data sets) to the record level (processing individual records). Lucky for us, we don't need to build anything to get this working.
One note on the out-of-the-box step: it iterates through the records in a single thread. While that may be what you want, one of the things I've been planning to do is replace this step with a custom version that processes records in a thread pool with configurable settings. Without a concurrent approach, you'll be limited in how fast you can load data, so you might have to do this anyway if you're trying to improve load performance.
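I haven't built the threaded version yet, but the core of it could look something like the sketch below, which uses Parallel.ForEach from the .NET Task Parallel Library with a configurable maximum number of workers. The names (ParallelIterator, ProcessAll, maxThreads) are hypothetical, and the sketch assumes that whatever you do per record is safe to run concurrently, which I haven't verified for the DXF pipeline context.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper, not a DXF type.
public static class ParallelIterator
{
    // Runs processRecord once for every record, using at most
    // maxThreads concurrent workers. The call blocks until all
    // records have been processed.
    public static void ProcessAll<T>(
        IEnumerable<T> records, int maxThreads, Action<T> processRecord)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };
        Parallel.ForEach(records, options, processRecord);
    }
}
```

Records are not guaranteed to be processed in source order with this approach, so it only fits loads where ordering doesn't matter.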
Now that we're in our own record processing pipeline, the first thing we need to do is identify the item that we want to map our source to. The DXF Sitecore Provider offers two types of resolvers: a Sitecore Item Resolver and an xDB Contact Resolver. The idea is to take a field from the source data, check whether a matching record already exists in Sitecore, and if so load it as the target item. If it doesn't exist, the resolver can also create the target as a new item.
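The resolve-or-create logic is easier to see stripped of anything Sitecore-specific. The sketch below is a toy version that uses an in-memory dictionary as a stand-in for the Sitecore repository; InMemoryResolver, Resolve, identifierKey and createIfMissing are all hypothetical names, not the provider's actual API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a resolver; the dictionary plays the
// role of the Sitecore item or contact repository.
public class InMemoryResolver
{
    private readonly Dictionary<string, Dictionary<string, string>> _store =
        new Dictionary<string, Dictionary<string, string>>();

    // Returns the existing target for the source record's identifier,
    // or creates one when createIfMissing is true. Returns null when
    // the source has no identifier or nothing matches and creation is off.
    public Dictionary<string, string> Resolve(
        Dictionary<string, string> source, string identifierKey, bool createIfMissing)
    {
        string id;
        if (!source.TryGetValue(identifierKey, out id) || string.IsNullOrEmpty(id))
        {
            return null; // no usable identifier in the source record
        }

        Dictionary<string, string> target;
        if (_store.TryGetValue(id, out target))
        {
            return target; // existing record becomes the target
        }

        if (!createIfMissing)
        {
            return null;
        }

        target = new Dictionary<string, string> { { identifierKey, id } };
        _store[id] = target;
        return target;
    }
}
```

Resolving the same identifier twice returns the same target, which is what lets a repeated import update records instead of duplicating them.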
One note on the DXF Sitecore Provider for Sitecore Items is that it has no concept of language version. So if you have the language of the item in the source data, you'll need to extend the resolver to take the language field as a parameter and save the Sitecore Item properly. You'll even need to override the standard SitecoreItemRepository with one that can persist data to a specific language version. I'll write a post on that separately.
Now that we have resolved our target record, we want to map values from our source record to the target record. This is done with configurable mapping rules, which let us model the source and target fields as "Value Accessor Sets" and then map field to field. If you're writing custom objects to the iterator, you may have to build custom Accessors so that people can configure which fields can be accessed and mapped from the item.
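Conceptually, field-to-field mapping boils down to something like the sketch below, where both source and target are string dictionaries and the mappings are pairs of source key and target key. This is only an illustration of the idea, not how DXF's Value Accessors are implemented; FieldMapper and Apply are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not a DXF type.
public static class FieldMapper
{
    // Copies each mapped source key to its target key. Each mapping is
    // (source key -> target key). Source keys that are missing are
    // skipped. Returns the number of values copied.
    public static int Apply(
        IDictionary<string, string> source,
        IDictionary<string, string> target,
        IEnumerable<KeyValuePair<string, string>> mappings)
    {
        var copied = 0;
        foreach (var mapping in mappings)
        {
            string value;
            if (source.TryGetValue(mapping.Key, out value))
            {
                target[mapping.Value] = value;
                copied++;
            }
        }
        return copied;
    }
}
```

Keeping the mappings as data rather than code is the point: the same step can serve a different source just by configuring a different set of key pairs.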
The DXF Sitecore Provider provides Value Accessors for both Sitecore Items and xDB Facets, which are pretty straightforward to use. The Facet accessors are a bit trickier, as there are separate ones to deal with some of the collection-based facets like Emails.
Finally, with the target ready to go, we can call a Save/Update Pipeline step to persist the target item to wherever it needs to go. The DXF Sitecore Provider comes with steps to persist Sitecore Items and xDB contacts, but they have the same language issue that the resolver steps do. More on how to fix that in a future blog post.
So that's a pretty typical DXF pipeline process. You can add additional steps as needed. For example, I've implemented a pipeline step that enrolls a contact in an Engagement Plan, which I can configure to run after the contact is persisted to xDB. Keeping steps distinct and focused on one thing improves your ability to reuse them and to create new configurations for different types of data without having to write a lot of code.
Over the next few blog posts, I'll take a deeper dive into some of the custom providers and components I've built on top of DXF.