Using PDI Metadata Injection to import CSV files

Many of you may already have received a bunch of CSV files with the mission of importing them into a database.

As you may also know, PDI (Pentaho Data Integration) makes this kind of task easy to implement.

However, when we need to make things more dynamic, reuse artifacts and, above all, handle a really big set of files, this task becomes massive.

Here at Oncase, we often get this kind of task, so we decided to dive deep into the already well-documented PDI step named "Metadata Injection".

What does it do?

Taking the task described above as an example, what would we have to do manually when the files land on our machine?

  1. Create the database in the given DBMS with the DDL the customer sent us;
  2. Create, for each file sent, one ETL stream with at least "CSV Input" / "Select values" / "Table output" steps, if you want to send the data to a database. And don't forget that, in every single one of these streams, you have to preview the CSV file so Kettle can infer data types, and you also have to deal with date formats and the like.

So, I guess this is what the Pentaho folks thought when they developed the "Metadata Injection" step:

– Why do ETL developers have to build so many repetitive files and steps? How can we make this process generic and repeatable?

Now, with Metadata Injection, we can build, with just a few simple files, a repeatable process that imports many files (or does whatever we want with them) by using a template transformation.

If we already have the metadata in the DBMS schema, why would we have to manually enter it into as many "CSV Input" steps as we have files? We don't. :)

I'm gonna give you an example of how it can be done.

What do we have?

A thousand CSV files, with a CREATE TABLE DDL script for every single one of them.

What do we have to deliver?

A populated database. :)

 

Let’s do it.

First of all, we have to create the database and tables defined in the SQL script the customer sent us. We'll need to run some automatic queries against this database, yes, even before it holds any data.

Second, be organized. Make sure your files are all in one place to keep things easier. In this example I'm only using 5 files, as follows:

In this example, the file names match the names of the tables they will fill. So at this point I've already created TB_ARRECADACAO, TB_CNAE, TB_GRUPO_CNAE, TB_GRUPO_RECEITA and TB_RECEITA in my database.
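
Just to give you an idea, here's a minimal sketch of what one of those DDL scripts could look like. The table name comes from the list above, but the columns and types are hypothetical; the real ones come from the customer's script:

    -- Hypothetical example: the actual columns come from the customer's DDL
    CREATE TABLE TB_CNAE (
        CD_CNAE     VARCHAR(10)  NOT NULL,  -- assumed code column
        DS_CNAE     VARCHAR(255),           -- assumed description column
        DT_CADASTRO DATE,                   -- assumed date column, sent as yyyy-MM-dd
        PRIMARY KEY (CD_CNAE)
    );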

So let's go over the top-level tasks we want to perform:

    1. Load the file names from a given folder;
    2. For each file name found, we will:
      1. Get the metadata of the table with the same name as the file;
      2. Inject this metadata into a template transformation;
      3. Let the template transformation output the CSV stream data into the corresponding table in the database.

 

The first file: the top-level job.

In this job, all I need is a transformation that loads the file names and then copies them to the result set using the "Copy rows to result" step.

In this case I used a Data Grid to provide the file names, but it can easily be replaced by a "Get File Names" step or whatever you need to feed the stream.

 

Then, in the job entry named "set-var and meta-inject", we have to configure some settings in the Advanced and Parameters tabs so that the entry is executed for every input row and the parameters are passed down to the next job, as shown in the picture below.

 

It's self-explanatory: our first transformation ("Copy rows to result") has to provide a field named curr_table, which holds the file name without its extension. We then pass this file name to the next job as a parameter with the same name as the field.

The second job, which is gonna be executed for every file

Since we passed a parameter down, as shown above, we now have to configure how it is received. This can be done by right-clicking the canvas and choosing Job Settings: you only have to fill in the parameter name in the Parameters tab. Filling in a Default Value is handy for testing; just be careful, because you may think your job works when it's actually only using the default value.

Now, let’s show it!

The main reason why I have a job here, and not a transformation, is to set a variable. Why?

At this point, I'm not able to use parameters inside the template transformation called by another transformation. In other words, if I have a parameterized transformation calling another transformation through Metadata Injection, the only artifact able to read the parameters is the first one. And I'm also not able to inject data, or metadata, into the Table output step.

I could still pass the name of the table I want to populate by enabling the "Add filename to result" property in CSV Input, then handling the filename, sending it to Table output and using it in the "Field that contains name of the table" property. I just preferred this approach.

So now I'm just setting a variable and calling a transformation that is gonna use it. The variables are set as follows:

 

Now that the variable is set for each file we run the transformation for, let's move on and explain the next piece:

 

The Metadata Injection Transformation

What we have to do is load a few streams with the metadata we're going to inject.

As shown in the picture, I have:

  • retrieves from db – runs a query that returns a single dummy record, left outer joined with ${current_table}, which is now set as a variable; don't forget to enable the "Replace variables in script" property (see the sketch after this list);
  • catches metadata – a "Metadata structure of stream" step that outputs the metadata values as stream fields;
  • Value Mapper – assigns the "yyyy-MM-dd" format to fields of the Date data type, since that's the format the customer sent;
  • headerpresent – sets a boolean value for the "Header Present" property of the "CSV Input" step in the template transformation;
  • Get Variables – gets ${current_table} and ${Internal.Transformation.Filename.Directory} as fields in the stream;
  • filename – builds the complete path by combining the directory and the file name;
  • csvDelimiter – provides the delimiter character of the given CSV file.
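
To make the first item in the list above more concrete, here's a minimal sketch of what the "retrieves from db" query could look like. The one-row subselect is an assumption on my part and the exact syntax depends on your DBMS; the point is that the query always returns exactly one row, even while ${current_table} is still empty, so the "catches metadata" step can read the table's structure from it:

    -- Hypothetical sketch of the "retrieves from db" query
    -- (remember to enable "Replace variables in script")
    SELECT t.*
    FROM (SELECT 1 AS dummy) d
    LEFT OUTER JOIN ${current_table} t ON 1 = 1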

Before talking about the Metadata Injection step itself (aka "injects it!"), I'm gonna show you another file: the template transformation.

That’s it!

As shown, you only have to place a CSV Input step with nothing set, and a Table output step with your connection, the ${current_table} variable as the table name and the "Truncate table" property set to true.

And now, as promised, let’s get back to the Metadata Injection step:

In this step, we point to our template transformation so that the steps available for injection get loaded. The supported steps are listed in the documentation (http://wiki.pentaho.com/display/EAI/ETL+Metadata+Injection).

After loading the transformation, its supported steps are loaded into a grid.

There, we can map a Source Step / Source Field to each property of the loaded steps. Properties that we don't provide data for will keep their default values.

In the case of the fields area of CSV Input, what's expected is a result set containing one or many rows, each one representing a field that is gonna be loaded. That's where the magic really happens!

With the files set up, all we have to do now is run our main job and use the data in the chosen DBMS!

Files available at: http://oncase.com.br/artifacts/metadata_example.zip

Since we always used environment variables to build the file paths, these files can be placed, together, anywhere you like and everything should still work.

Some tips for adapting this to your own database/filesystem:

  • Pay attention to the database connections and the query syntax;
  • Pay attention to the CSV files' date format;
  • The files are configured to look for CSV files at path + '/files_to_import/' + filename + '.txt'.
Marcello Pontes
Analytics Architect & Front-ender at Oncase