- Learning Pentaho Data Integration 8 CE (Third Edition)
- María Carina Roldán
- 798 words
- 2025-04-04 17:49:50
Reading a simple file
In this section, you will learn to read one of the most common input sources: plain text files.
For demonstration purposes, we will use a simplified version of sales_data.csv that comes with the PDI bundle. Our sample file looks as follows:
ORDERDATE,ORDERNUMBER,ORDERLINENUMBER,PRODUCTCODE,PRODUCTLINE,QUANTITYORDERED,PRICEEACH,SALES
2/20/2004 0:00 ,10223,10,S24_4278 ,Planes ,23,74.62,1716.26
11/21/2004 0:00,10337,3,S18_4027 ,Classic Cars ,36,100 ,5679.36
6/16/2003 0:00 ,10131,2,S700_4002,Planes ,26,85.13,2213.38
7/6/2004 0:00 ,10266,5,S18_1984 ,Classic Cars ,49,100 ,6203.4
10/16/2004 0:00,10310,4,S24_2972 ,Classic Cars ,33,41.91,1383.03
12/4/2004 0:00 ,10353,4,S700_2834,Planes ,48,68.8 ,3302.4
1/20/2005 0:00 ,10370,8,S12_1666 ,Trucks and Buses,49,100 ,8470.14
3/11/2004 0:00 ,10229,6,S24_2300 ,Trucks and Buses,48,100 ,5704.32
7/19/2004 0:00 ,10270,6,S12_1666 ,Trucks and Buses,28,100 ,4094.72
8/25/2003 0:00 ,10145,4,S32_4485 ,Motorcycles ,27,100 ,3251.34
Before reading a file, it's important to examine its format and content: Does the file have a header? Does it have a footer? Is it a fixed-width file? What are the data types of the fields? You need to know these properties to read the file properly.
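These questions can also be answered with a quick script before opening Spoon. The following Python sketch is not part of PDI; the function name `describe_delimited_text` and the list of candidate separators are assumptions made for illustration. It takes the first lines of the sample file, guesses the separator, counts the fields, and checks that every line has the same number of fields.

```python
import csv
import io

# First lines of the sample file shown above.
SAMPLE = """ORDERDATE,ORDERNUMBER,ORDERLINENUMBER,PRODUCTCODE,PRODUCTLINE,QUANTITYORDERED,PRICEEACH,SALES
2/20/2004 0:00 ,10223,10,S24_4278 ,Planes ,23,74.62,1716.26
11/21/2004 0:00,10337,3,S18_4027 ,Classic Cars ,36,100 ,5679.36
"""


def describe_delimited_text(text, candidates=",;\t|"):
    """Guess the separator and summarize the layout of a delimited text sample."""
    first_line = text.splitlines()[0]
    # Assumption: the real separator is the candidate that occurs most often
    # in the first line.
    delimiter = max(candidates, key=first_line.count)
    # Skip empty rows so that blank lines do not distort the field counts.
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    field_counts = {len(row) for row in rows}
    return {
        "delimiter": delimiter,
        "first_row": rows[0],
        "field_count": len(rows[0]),
        "consistent": len(field_counts) == 1,
    }


info = describe_delimited_text(SAMPLE)
print(info["delimiter"], info["field_count"], info["consistent"])
```

If `consistent` comes back as `False`, the file may have a footer, embedded separators, or a fixed-width layout, and deserves a closer look before you configure the step.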
In our sample file, we observe that it has one row per order, there is a one-line header, and there are eight fields separated by commas. With all this information, along with the data type and format of the fields, we are ready to read the file. Here are the instructions:
- Start Spoon and create a new Transformation.
- Expand the Input branch of the Steps tree, and drag a Text file input step onto the canvas.
- Double-click the Text file input icon and give the step a name.
- Click on the Browse... button and search for the sales_data.csv file.
- Select the file. The File or directory textbox will be temporarily populated with the full path of the file, for example, D:/LearningPDI/SAMPLEFILES/sales_data.csv.
Note that the path contains forward slashes. If your system is Windows, you may use back or forward slashes. PDI will recognize both notations.
- Click on the Add button. The full file reference will be moved from the File or directory textbox to the grid. The configuration window should appear as follows:

Adding a file for reading in a text file input step
- Click on the Content tab, and fill it in, as shown in the following screenshot:

Configuring the content tab in a text file input step
By default, PDI assumes DOS format for the file. If your file has a Unix format, you will be warned that the DOS format for the file was not found. If that's the case, you can change the format in the Content tab. If you are not sure about the format of your file, you can safely choose mixed, as in the previous example, as PDI will recognize both formats.
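If you want to check the format yourself before choosing an option, you can count the line terminators in the file. The following Python sketch is only an illustration and does not reflect how PDI performs its own check; the function name `detect_line_format` is hypothetical, and the labels it returns simply mirror the terms used above.

```python
def detect_line_format(data: bytes) -> str:
    """Classify the line endings of raw file content as DOS, Unix, mixed, or none."""
    dos = data.count(b"\r\n")
    # Every CRLF pair also contains one LF, so subtract the pairs
    # to obtain the number of bare LF terminators.
    unix = data.count(b"\n") - dos
    if dos and unix:
        return "mixed"
    if dos:
        return "DOS"
    if unix:
        return "Unix"
    return "none"


print(detect_line_format(b"a,b\r\nc,d\r\n"))  # DOS
print(detect_line_format(b"a,b\nc,d\n"))      # Unix
print(detect_line_format(b"a,b\r\nc,d\n"))    # mixed
```

To apply it to a real file, read the content in binary mode, for example with `open(path, "rb").read()`, so that Python does not translate the line endings.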
- Click on the Fields tab. Then click on the Get Fields button. You will be prompted for the number of lines to sample.
The Get Fields functionality tries to guess the metadata but might not always get it right, in which case you can manually overwrite it.
- Click on Cancel. You will see that the grid was filled with the list of fields found in your file, all of type String.
- Click on the Preview rows button and then click on the OK button. The previewed data should look like the following screenshot:

Previewing an input file
There is still one more thing that you should do: provide the proper metadata for the fields.
- Change the Fields grid as shown in the following screenshot:

Configuring the fields tab
- Run a new preview. You should see the same data, but with the proper format. This may not be obvious from looking at the screen, but you can confirm the data types by moving the mouse cursor over each column.
- Close the window.
This is all you have to do to read a simple text file. Once you read it, the data is ready for further processing.
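To make the role of the field metadata more concrete, here is a rough Python analogue of what the configured step does: skip the header, trim each value, and convert it to a declared type. This is a sketch, not PDI code. The screenshot of the Fields grid is not reproduced here, so the types and the date mask below are assumptions inferred from the sample data, and the names `FIELD_TYPES` and `read_typed_rows` are hypothetical.

```python
import csv
import io
from datetime import datetime


def parse_date(value):
    # Assumed mask, inferred from values such as "2/20/2004 0:00".
    return datetime.strptime(value, "%m/%d/%Y %H:%M")


# Assumed field types, listed in the order in which the fields appear in the file.
FIELD_TYPES = [
    ("ORDERDATE", parse_date),
    ("ORDERNUMBER", int),
    ("ORDERLINENUMBER", int),
    ("PRODUCTCODE", str),
    ("PRODUCTLINE", str),
    ("QUANTITYORDERED", int),
    ("PRICEEACH", float),
    ("SALES", float),
]

# First lines of the sample file shown earlier in this section.
SAMPLE = """ORDERDATE,ORDERNUMBER,ORDERLINENUMBER,PRODUCTCODE,PRODUCTLINE,QUANTITYORDERED,PRICEEACH,SALES
2/20/2004 0:00 ,10223,10,S24_4278 ,Planes ,23,74.62,1716.26
11/21/2004 0:00,10337,3,S18_4027 ,Classic Cars ,36,100 ,5679.36
"""


def read_typed_rows(text):
    """Yield one dictionary per data row, with every value trimmed and converted."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the one-line header
    for raw in reader:
        yield {
            name: convert(value.strip())
            for (name, convert), value in zip(FIELD_TYPES, raw)
        }


rows = list(read_typed_rows(SAMPLE))
print(rows[0])
```

A value that does not match its declared type raises an error here, which is comparable to what a preview reveals when the metadata in the Fields grid does not fit the data.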
It's important to highlight that the existence of the file is not mandatory when you are creating the Transformation. It helps, however, when it's time to configure the input step.
When you don't specify the name and location of a file, or when the real file is not available at design time, you can neither use the Get Fields button nor check whether the step is configured correctly. The trick is to configure the step by using a real file with the same structure as the expected one. After that, change the name and location of the file in the configuration as needed.
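One way to gain confidence that a stand-in file really matches the expected one is to compare their header rows once the real file becomes available. The following Python sketch assumes comma-separated files with a one-line header; the helper names `read_header` and `same_layout` are hypothetical, and the short texts are abbreviated from the sample data purely for illustration.

```python
import csv
import io


def read_header(text, delimiter=","):
    """Return the trimmed field names found in the first line of a delimited text."""
    first_row = next(csv.reader(io.StringIO(text), delimiter=delimiter))
    return [name.strip() for name in first_row]


def same_layout(stand_in_text, real_text, delimiter=","):
    """Check that both texts declare the same fields in the same order."""
    return read_header(stand_in_text, delimiter) == read_header(real_text, delimiter)


# Abbreviated examples: three of the eight fields of the sample file.
stand_in = "ORDERDATE,ORDERNUMBER,SALES\n2/20/2004 0:00,10223,1716.26\n"
real = "ORDERDATE,ORDERNUMBER,SALES\n6/16/2003 0:00,10131,2213.38\n"
reordered = "ORDERNUMBER,ORDERDATE,SALES\n10131,6/16/2003 0:00,2213.38\n"

print(same_layout(stand_in, real))
print(same_layout(stand_in, reordered))
```

The comparison only covers field names and their order. It does not verify data types or formats, so a preview against the real file is still worthwhile.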
After configuring an input step, you can preview the data just as you did, by clicking on the Preview rows button. This is useful to discover if there is something wrong with the configuration. In that case, you can make the adjustments and preview again, until your data looks fine.