Deploying on-premise big data pipelines.
Introduction
This post is about a big data deployment that a few colleagues and I did at my company a few years ago. At the time, only a few companies were cloud-first; many hadn't ventured into the cloud at all, and cloud adoption (mainly AWS back then) ranged from part on-premise/part cloud to completely on-premise.
This made for some interesting choices in the tech stack. For instance, compute and storage had to be carefully managed, and permissions had to be handled manually as well.
Use case and background
Prior to this exercise, most of our data warehouse was on Postgres (again, an on-premise deployment). That worked for a few years, until the data volume grew from gigabytes per week or month to terabytes per month (gigabytes per day or week). We had to start exploring alternatives that were:
- Scalable.
- Maintainable.
- Performant for the volume of data.
- Aligned with the overall stack being used by the company at the time.
- Proven to be effective via a Proof Of Concept.
After exploring many such alternatives and discussing them with teams that had similar use cases, we decided on the following:
Tech stack
Almost all of the tools and technologies used are open source; the exception is the Hadoop cluster, which was deployed via Cloudera.
Area | Tools/technology |
---|---|
Compute | Hadoop |
Storage | HDFS |
Orchestration, workflow management | Apache Airflow |
Containerization | Docker (published to GitLab container registry) |
Programming Language | Python 2.7 at the time and later 3.x, PySpark |
Analytics/data presentation layer | Presto on Apache Hive with Spark compute (wrapper on Apache Hive) |
CI/CD | GitLab CI, custom-built Airflow operators |
Architecture
The diagram below shows the overall architecture. We had two data sources, both of them housing fairly large amounts of data - on the order of gigabytes per week to petabytes on a monthly basis.

Implementation
The implementation contains the following components:
From a Data Engineering perspective (a minimal PySpark sketch of these steps follows the two lists):
- Reading from the data sources.
- Transforming the data.
- Creating appropriate destination data structures (Apache Hive tables).
- Loading the data.
- Backfills, fault tolerance of the data pipeline.
- Alerting - for workflow/job failures or upstream dependency problems.
From a Data Infrastructure perspective:
- Managing code versions.
- Handling dependencies (and their versions) required for the code to run.
- Deploying code changes.
- Permissions.
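To make the data engineering steps concrete, here is a minimal PySpark sketch of the read, transform and load flow. The paths, column names and table name are purely illustrative assumptions, not our actual pipeline code.

```python
from pyspark.sql import SparkSession, functions as F

# Hypothetical source path and target table, for illustration only.
SOURCE_PATH = "hdfs:///data/raw/events/2019-01-01"
TARGET_TABLE = "analytics.events"

spark = (
    SparkSession.builder
    .appName("events_ingestion")
    .enableHiveSupport()          # register tables in the Hive metastore
    .getOrCreate()
)

# 1. Read from the data source (CSV here, purely as an example).
raw = spark.read.option("header", "true").csv(SOURCE_PATH)

# 2. Transform: type casting, deduplication, derive a partition column.
events = (
    raw.withColumn("event_ts", F.to_timestamp("event_ts"))
       .withColumn("ds", F.to_date("event_ts"))
       .dropDuplicates(["event_id"])
)

# 3. Load into a partitioned Hive table stored as Parquet.
(
    events.write
          .mode("overwrite")
          .format("parquet")
          .partitionBy("ds")
          .saveAsTable(TARGET_TABLE)
)
```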
Design decisions
Programming Language
Python/PySpark.
The Spark API is available in four languages: Scala, Python, Java and R. Of these, Scala and Python were the top contenders, since Java was new to our tech stack and offered few unique advantages compared to Python and Scala.
The trade-offs between Scala and Python:
- Spark performance: although Spark offers APIs in both Python and Scala, the initial versions of the Python API were not as performant as the Scala API. Later versions of the Python API largely caught up with the Scala API's performance.
- Ease of use: Did we have enough knowledge within the team to use Scala comfortably? Is it worth maintaining one project in Scala whilst all the other projects have historically been in Python?
- Maintenance: How easy is it to maintain code? How are dependencies managed?
- Learning curve: since Python was our primary language, some of us would have had to pick up Scala as a new programming language, which added to the complexity.
To explore the above items, we performed small Proof of Concept exercises such as the following (a rough sketch of the Python side of the benchmark appears after the list):
- Benchmarking API performance between Scala and Python.
- Building dependencies in Scala using sbt vs. pip in Python.
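For illustration, the PySpark side of such a benchmark could look roughly like the sketch below. The dataset path, the aggregation and the timing approach are assumptions; an equivalent Scala job would be timed the same way for the comparison.

```python
import time

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pyspark_benchmark").getOrCreate()

# Hypothetical dataset used for the comparison.
df = spark.read.parquet("hdfs:///benchmarks/sample_events")

start = time.time()
# A representative aggregation; the equivalent job written in Scala
# would be timed in the same way.
(
    df.groupBy("country")
      .agg(F.countDistinct("user_id").alias("users"))
      .write.mode("overwrite")
      .parquet("hdfs:///benchmarks/out/pyspark")
)
print("PySpark job took %.1fs" % (time.time() - start))
```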
Orchestration and workflow management
Apache Airflow
Task scheduling and workflow management had to be handled somehow. Apache Airflow, hosted with a Postgres backend, was already available, so we chose to use it.
Airflow did not have all the operators we required, so we ended up writing custom operators for the following (a minimal sketch of one such operator appears after the list):
- Python (running Python scripts)
- Apache Hive (running queries on Apache Hive)
- Shell operator (executing bash or other shell commands)
- Docker operator (running Docker commands)
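As an example, a custom shell operator might look roughly like the sketch below. This is not our original code: the class name and error handling are illustrative, and the exact BaseOperator import path and constructor conventions vary between Airflow versions.

```python
import subprocess

from airflow.models import BaseOperator


class ShellCommandOperator(BaseOperator):
    """Runs a shell command and fails the task on a non-zero exit code.

    Illustrative only; the real operators also handled templating and
    the Hive/Docker specifics.
    """

    def __init__(self, command, **kwargs):
        super().__init__(**kwargs)
        self.command = command

    def execute(self, context):
        self.log.info("Running: %s", self.command)
        completed = subprocess.run(
            self.command, shell=True, capture_output=True, text=True
        )
        if completed.returncode != 0:
            raise RuntimeError(
                "Command failed with code %s: %s"
                % (completed.returncode, completed.stderr)
            )
        return completed.stdout
```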
Analytics
Apache Hive/Presto
Beyond the data pipelines themselves, we also had to make decisions about data consumption.
Some options were:
- Apache Hive -> Postgres -> Tableau/Queries/Jupyter notebooks.
- Apache Hive -> Presto -> Tableau(using Presto connector)/Jupyter notebooks.
File formats
Parquet.
Options:
- JSON.
- CSV.
- Parquet.
- ORC.
Reasons for choosing Parquet: compression options, schema evolution, partition discovery, columnar storage, encryption and overall better performance.
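A small PySpark sketch of two of those features - partition discovery and schema evolution - is shown below; the HDFS path is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet_features").getOrCreate()

# Partition discovery: reading the table root picks up the ds=... partition
# directories and exposes them as a "ds" column automatically.
events = spark.read.parquet("hdfs:///warehouse/events")   # hypothetical path
events.select("ds").distinct().show()

# Schema evolution: merge the schemas of files written at different times,
# e.g. after a new column was added to the pipeline.
merged = (
    spark.read
         .option("mergeSchema", "true")
         .parquet("hdfs:///warehouse/events")
)
merged.printSchema()
```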
Compute
A shared Hadoop cluster was already available; only permissions had to be set up to access it.
Storage
Straightforward HDFS storage with an Apache Hive metastore. We created a new Apache Hive metastore for our project, since the underlying storage was shared.
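As a sketch of how tables can be laid out on HDFS and registered in the metastore, something along the following lines can be run through spark.sql; the database name, schema and location are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("create_tables")
    .enableHiveSupport()
    .getOrCreate()
)

# Database, columns and location are illustrative.
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events (
        event_id   STRING,
        user_id    STRING,
        event_ts   TIMESTAMP
    )
    PARTITIONED BY (ds DATE)
    STORED AS PARQUET
    LOCATION 'hdfs:///warehouse/analytics/events'
""")
```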
CI/CD
We used GitLab for the following:
- Code versions.
- Publishing Docker image versions to the GitLab container registry.
- Running pytest checks (a small illustrative example appears after this list).
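A small, illustrative pytest check might look like the sketch below; the transformation function under test is hypothetical, not one of our real pipeline functions.

```python
# test_transforms.py -- illustrative only.
import pytest
from pyspark.sql import SparkSession, functions as F


@pytest.fixture(scope="session")
def spark():
    return (
        SparkSession.builder.master("local[1]").appName("tests").getOrCreate()
    )


def add_ds_column(df):
    """Hypothetical transformation under test: derive a date partition column."""
    return df.withColumn("ds", F.to_date("event_ts"))


def test_add_ds_column(spark):
    df = spark.createDataFrame(
        [("e1", "2019-01-01 10:00:00")], ["event_id", "event_ts"]
    )
    result = add_ds_column(df).collect()[0]
    assert str(result["ds"]) == "2019-01-01"
```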
Consumption and analytics
Now that we have discussed the ingestion pipeline, the next step is making the datasets available to users.
Since we were dealing with large data volumes, querying Apache Hive directly wasn't scalable, so queries went through Presto on top of the Apache Hive tables, with Spark as the compute layer.
Analytics was done in Jupyter notebooks, where the query results were used to build visualizations.
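As one way to query the datasets through Presto from a notebook (not necessarily how we did it), a sketch using PyHive's Presto client is shown below; the host, schema and query are made up.

```python
# Querying Presto from a Jupyter notebook via PyHive (illustrative
# host/port, catalog, schema and query).
import pandas as pd
from pyhive import presto

conn = presto.connect(
    host="presto.internal.example.com",   # hypothetical coordinator host
    port=8080,
    catalog="hive",
    schema="analytics",
)

df = pd.read_sql(
    "SELECT ds, count(*) AS events FROM events "
    "WHERE ds >= date '2019-01-01' GROUP BY ds ORDER BY ds",
    conn,
)
df.plot(x="ds", y="events")   # quick visualization in the notebook
```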
Analytics teams built their own DAGs using the datasets we created.
Learning
Performance
The Hadoop cluster was performant enough to ingest data at the rate we expected. Query response times depended on how the Apache Hive tables were partitioned.
Improvements
Some future improvements:
- Managing Python dependencies and packaging with better tooling.
- Creating leaner docker images to improve deployment times.
- Moving away from Apache Hive (see the section on Moving to the cloud).
Moving to the cloud
When the migration to the cloud was started, the following changes were made to the architecture:
Type | On-premise | Cloud |
---|---|---|
Query Engine | Hive | Snowflake |
Storage | HDFS | AWS S3 |
Compute | Hadoop | AWS EMR, EC2 |
CI/CD | GitLab CI | AWS CodeBuild |
Docker registry | GitLab container registry | AWS ECR |
Orchestration, workflow management | Apache Airflow | Amazon Managed Workflows for Apache Airflow (MWAA) |
Permissions | Manual | AWS IAM |
Conclusion
Overall, the project helped us make a foray into the big data space and keep up with the current and future data needs of the organization.
The cloud allowed us to scale and automate many of the steps for development, deployment and maintenance.