Parsing and transforming data from structured data sources
Introduction
I worked on this project a few years ago.
We had many Microsoft Excel documents containing information about product versions. Each of these documents went through many versions, and each product line had a document of its own.
Use cases
Users of this data wanted a centralised location from which they could search, view and visualize the data. If a new version of a document is released, or an existing document is updated, the destination dataset should reflect those changes as well.
Considerations
- The data should be available for analysis - using interactive querying and/or visualization methods.
- The data should refresh automatically via batch jobs when new data arrives. This introduces the additional use case of detecting new and changed documents (see the sketch after this list).
- The data should be secure and accessible only to an approved list of individuals.
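A minimal sketch of that detection use case, assuming the documents are synced to a local folder and a small manifest of modification times is kept between batch runs (the folder and manifest paths below are illustrative):

import json
from pathlib import Path

# Hypothetical paths: the folder the documents are synced to, and a small
# state file that persists between batch runs.
SOURCE_DIR = Path("excel_documents")
MANIFEST = Path("processed_manifest.json")

def find_new_or_changed_files():
    # Load the modification times recorded during the previous run.
    seen = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    changed = []
    for path in sorted(SOURCE_DIR.glob("*.xlsx")):
        mtime = path.stat().st_mtime
        if seen.get(path.name) != mtime:
            changed.append(path)
            seen[path.name] = mtime
    # Persist the updated manifest for the next run.
    MANIFEST.write_text(json.dumps(seen))
    return changed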
Extraction
Instead of focusing on the extraction first, I started to think of different ways in which one could store the data and then build a pipeline to read from the data sources.
Based on the considerations above, the data needed to be persisted in a durable store. Some choices were:
- Relational databases.
- Spreadsheets.
- Other semi-structured formats such as JSON.
I will discuss the storage destination in its own section below; for now, let us look at the design for landing the data in the destination.
Design choices:
- ETL tools.
- Connectivity from Microsoft Excel to Microsoft SQL Server (or any relational database).
- Custom scripting.
Off-the-shelf ETL tools
I considered a host of options to reliably extract data from an XLSX document. Initially, I explored readily available products, such as contemporary ETL tools (Informatica et al.). I performed small, quick proof-of-concept exercises to test the feasibility of the different tools. The tools were able to connect with the XLSX files and extract data, but I ran into the following challenges:
- The connectivity mainly depended on ODBC drivers, which could not be customized to suit the use case – such as parsing through different sheets in the same document or extracting cells based on specific conditions.
- The tools were able to connect and extract entire sheets into relational database tables. However, the data had to be imported in its raw state as tables, and that caused complexities: once imported into the ETL tool, the raw data could not be fully transformed into the desired structure.
- The customizations were too cumbersome, given the limited options within the tools. Some tools do allow scripting to be embedded into the ETL pipeline, which helps, but that led to the idea of scripting the entire solution end to end. That is what I ended up doing, since it was far more customizable and, at the same time, easier to develop from scratch.
Connecting MS Excel to MS SQL Server directly
A second option was to connect Microsoft Excel directly to Microsoft SQL Server. Microsoft offers a plethora of ways to do so, and all of them work quite well. The problem for this use case was customization and the final structure of the dataset.
This option could have been pursued as an ELT pipeline/connection, where you land all the data as-is from Microsoft Excel into Microsoft SQL Server and then perform all the transformations inside MS SQL Server using ANSI SQL. The documents are created and uploaded manually to Microsoft SharePoint, so the ability to detect updates to existing documents or the arrival of new documents would still have been available.
This would have worked quite well, but it came with some restrictions. It also requires using Microsoft SSIS along with MS SQL Server, adding to the number of hops for the data as well as to the complexity.
Custom scripting
Finally, having decided to pursue the scripting route, I chose Python. Based on my past experience with Python for similar use cases (professionally as well as in side projects), it is well suited for data analysis thanks to its large ecosystem of libraries for data exploration, modification and analysis. Python also offers a variety of libraries, including Pandas, which make data transformation even simpler and more customizable.
Transformation
Going back to the considerations section, we have both querying and reporting use cases. As explained in the extraction section above and in the storage section below, the destination is a relational database.
The most suited structure to allow efficient reporting is a star schema.
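To make the shape of the model concrete, the star schema could look roughly like the following sketch. The table names, columns and connection details are illustrative, not the exact schema used in the project:

import pyodbc

# Illustrative DDL for a simple star schema: one dimension per key entity
# and a fact table that references them. Names are hypothetical.
DDL_STATEMENTS = [
    """CREATE TABLE dim_product (
        product_key  INT IDENTITY(1,1) PRIMARY KEY,
        product_name NVARCHAR(200) NOT NULL
    )""",
    """CREATE TABLE dim_version (
        version_key  INT IDENTITY(1,1) PRIMARY KEY,
        version_name NVARCHAR(100) NOT NULL
    )""",
    """CREATE TABLE dim_date (
        date_key  INT PRIMARY KEY,  -- e.g. 20240131
        full_date DATE NOT NULL
    )""",
    """CREATE TABLE fact_product_version (
        product_key INT NOT NULL REFERENCES dim_product(product_key),
        version_key INT NOT NULL REFERENCES dim_version(version_key),
        date_key    INT NOT NULL REFERENCES dim_date(date_key)
    )""",
]

# Placeholder connection string.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=products;Trusted_Connection=yes")
cursor = conn.cursor()
for statement in DDL_STATEMENTS:
    cursor.execute(statement)
conn.commit()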
Implementation steps
- Extract the data from the MS Excel sheets via Python. This includes the logic to identify and parse all the relevant files, to detect changes to existing files and to detect newly added files.
- Write the required/selected data from the sheets/documents into a staging schema in MS SQL Server.
- Write an ETL flow to convert the transactional schema to a dimensional schema, with appropriate dimensions and facts.
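As an illustration of step 2 above, the extracted rows can be pushed into the staging schema with a plain parameterised insert. The staging table, its columns and the connection string below are hypothetical:

import pyodbc

def load_to_staging(rows):
    # rows is a list of (product_name, version_name, release_date) tuples
    # produced by the extraction step; names are illustrative.
    conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                          "SERVER=myserver;DATABASE=products;Trusted_Connection=yes")
    cursor = conn.cursor()
    # Reload the staging table from scratch on every batch run.
    cursor.execute("TRUNCATE TABLE staging.product_versions")
    cursor.executemany(
        "INSERT INTO staging.product_versions (product_name, version_name, release_date) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    conn.close()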
Some nuances and additional facts related to the above steps
Initially, I explored Pandas to extract and transform the data. However, due to the structure of the data, this approach proved cumbersome and slow: the entire dataset had to be read into a Pandas DataFrame, and the selection, filtering and reshaping of the data for the final step were tedious.
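For reference, the Pandas approach looked roughly like this: every sheet of a workbook is read into a DataFrame up front, and the relevant cells then have to be located by filtering those frames. The file name is illustrative:

import pandas as pd

# Read every sheet of the workbook into a dict of DataFrames in one go.
sheets = pd.read_excel("product_line_a.xlsx", sheet_name=None)

for name, frame in sheets.items():
    # Each frame holds the entire sheet, including headers, notes and
    # blank rows, which then have to be filtered out before use.
    print(name, frame.shape)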
So, I used the openpyxl library. The library contains a rich set of features for reading MS Excel data.
Example code snippets using openpyxl:

from openpyxl import load_workbook

# Open and load data from an existing Microsoft Excel workbook
wb = load_workbook(filename='empty_book.xlsx')
for sheet in wb:
    print(sheet.title)

# To access all rows in the active sheet
ws = wb.active
for row in ws.iter_rows(min_row=1, max_col=3, max_row=2):
    for cell in row:
        print(cell)

# To access all columns in the active sheet
for col in ws.iter_cols(min_row=1, max_col=3, max_row=2):
    for cell in col:
        print(cell)

# Selecting all the values from a sheet
for row in ws.values:
    for value in row:
        print(value)
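Applied to this use case, the same API makes it straightforward to walk every sheet of a document and pick out only the cells that match a condition. The sheet layout assumed below (a version label in the first column, a value in the second) is purely illustrative:

from openpyxl import load_workbook

def extract_versions(path):
    # Collect (sheet, version, value) triples from rows whose first cell
    # looks like a version label; the layout is hypothetical.
    wb = load_workbook(filename=path, read_only=True, data_only=True)
    extracted = []
    for ws in wb:
        for row in ws.iter_rows(min_row=2, max_col=2, values_only=True):
            label, value = row
            if isinstance(label, str) and label.lower().startswith("version"):
                extracted.append((ws.title, label, value))
    return extracted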
Storage and Business Intelligence
I discussed this briefly in the transformation section’s implementation steps.
Once the data from the spreadsheets lands in the staging schema of Microsoft SQL Server, a separate pipeline ingests it into a dimensional model, with identified dimensions and facts (measures).
Some key dimensions: date, products, versions.
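A simplified sketch of that ingestion, using a MERGE to keep a dimension in sync and a join back to the dimensions when inserting facts. The table names match the illustrative star schema sketched earlier, and the SQL is trimmed down to a single dimension:

import pyodbc

MERGE_DIM_PRODUCT = """
MERGE dim_product AS tgt
USING (SELECT DISTINCT product_name FROM staging.product_versions) AS src
    ON tgt.product_name = src.product_name
WHEN NOT MATCHED THEN
    INSERT (product_name) VALUES (src.product_name);
"""

INSERT_FACTS = """
INSERT INTO fact_product_version (product_key, version_key, date_key)
SELECT p.product_key, v.version_key, d.date_key
FROM staging.product_versions s
JOIN dim_product p ON p.product_name = s.product_name
JOIN dim_version v ON v.version_name = s.version_name
JOIN dim_date    d ON d.full_date    = s.release_date;
"""

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=products;Trusted_Connection=yes")
cursor = conn.cursor()
cursor.execute(MERGE_DIM_PRODUCT)  # dim_version and dim_date follow the same pattern
cursor.execute(INSERT_FACTS)       # a real pipeline would also guard against reloading the same facts
conn.commit()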
For the reporting layer, Tableau was a good choice. MS SQL Server connects well with Tableau, and different kinds of dashboards were built (by me and by users) on top of the star schema tables. I also wrote some views on top of the tables, primarily to join the dimension and fact tables, as shown below.
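The views were essentially pre-joined, Tableau-friendly versions of the star schema, along these lines (names again match the illustrative schema above):

import pyodbc

# A denormalised view that joins the fact table back to its dimensions,
# so report builders do not have to repeat the joins in Tableau.
CREATE_VIEW = """
CREATE VIEW v_product_versions AS
SELECT p.product_name, v.version_name, d.full_date
FROM fact_product_version f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_version v ON v.version_key = f.version_key
JOIN dim_date    d ON d.date_key    = f.date_key;
"""

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=products;Trusted_Connection=yes")
cursor = conn.cursor()
cursor.execute(CREATE_VIEW)
conn.commit()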
Conclusion
This was an interesting project: converting structured data into a dimensional model. It also allowed some experimentation with Python, which was later put to use in other projects.
Overall, the dimensional model and the Tableau dashboards allowed users to access and visualize the data in new ways and to make decisions from it.