The diversity of modern data technologies creates new challenges in establishing a consistent and accurate view of data for its consumers. A semantic data layer addresses this by providing a single, harmonized view of business metrics, no matter how many different data stores or data consumer tools are in use. The practice was popularized by large technology companies such as Airbnb, LinkedIn, and Twitter. Before delving into practical data methodologies, we will briefly cover the history, challenges, and purpose of the semantic layer concept.
A semantic layer presents itself as a business-friendly data model, designed as a critical mediator, facilitating an interface between data silos and end-users. It streamlines enterprise data interpretation by consolidating metrics from various data systems into a single source of truth, fostering accuracy and semantic consistency. The foundational idea of a semantic layer isn't new, and variants of it have played a similar role in the past.
The concept of a semantic layer has existed since the early 2000s, originating from monolithic unified semantic services such as Microsoft's SSAS, IBM's Cognos, and SAP BusinessObjects. It served as a costly and inflexible abstraction layer that was crucial for securing enterprise analytical workloads, primarily report generation. Any change requested by business users led to considerable delivery delays, mainly because these sophisticated systems required substantial IT effort to manage and support the reports.
The monolithic systems fell short in meeting the heightened demands of analytics and data consumer tools. In response, tools such as Power BI, Tableau, Qlik, and ThoughtSpot began incorporating their independent semantic layer tools. The goal was to empower business users to make modifications autonomously, accelerating the time-to-market and reducing the total cost of ownership (TCO) for reports.
Despite the advantages, this development resulted in a wide variety of siloed data consumer and storage tools, introducing new challenges to the modern data stack. As the diversity of these tools increased, the need for a unified approach to the semantic layer began to emerge. We will dig deeper into these challenges in the following section.
In the evolving landscape of data-driven organizations, tools like business intelligence (BI) solutions play a critical role. Moreover, Snowflake's customer research, presented at the Semantic Layer Summit 2023, reveals that half of their customers use more than one BI tool to look at and understand their data. But with each tool comes an array of capabilities, including the ability to create its own semantic model. As a result, many companies end up with as many semantic layers as the number of BI tools they use.
Furthermore, this proliferation doesn't stop at BI tools; it extends to data science teams and data apps, each implementing its own semantic layer. It's akin to having a multitude of different dialects spoken within the same company, leading to miscommunication and confusion.
This 'data dialect dilemma' gives rise to several tangible issues. Firstly, data inconsistency across multiple semantic layers often results in inaccurate decisions. Secondly, there are hidden costs associated with reconciliation efforts when trying to align mismatched data views. Lastly, governing data and ensuring its security becomes a complex task.
So, how do we navigate this labyrinth of semantic layers? The solution lies in building a unified semantic model that provides a single, consolidated view of metrics for all data consumers. Such a model provides a shared data view, eliminating confusion and enhancing data accessibility and security across the organization. This article will delve deeper into the value that a unified semantic layer brings to data applications, painting a clear picture of how it can improve data management and decision-making in your organization.
Additionally, Grid Dynamics offers an Analytics Platform Starter Kit, available on the AWS Marketplace, implemented using cloud-native services and Cube that enables rapid creation of unified semantic layers.
The concept of building a unified data view can be put into practice through the integration of a unified semantic layer between data stores and data consumer systems. As a result, this intermediary layer consolidates all metrics, measures, and dimensions from the siloed storages and data consumer tools in a manner conducive to business understanding. The architectural vision is depicted in Figure 1.
In the modern data stack, cloud data warehouses, lakehouses, data lakes, real-time data streams, and product stores in a data mesh topology can all serve as data sources for a unified semantic model. In this scenario, the siloed source data is integrated into the unified semantic model by applying transformations.
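To make the idea concrete, here is a minimal sketch of a semantic model that defines measures and dimensions once and compiles them into SQL for whichever consumer asks. It is not how any particular product works internally. The class names, the `analytics.orders` table, and its columns are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Measure:
    name: str  # business-facing metric name, e.g. "revenue"
    sql: str   # aggregate expression over the source table


@dataclass(frozen=True)
class Dimension:
    name: str  # business-facing attribute name, e.g. "country"
    sql: str   # column or expression in the source table


@dataclass
class SemanticModel:
    """Hypothetical model: one table, with named measures and dimensions."""
    table: str
    measures: dict = field(default_factory=dict)
    dimensions: dict = field(default_factory=dict)

    def compile(self, measures, dimensions):
        """Translate business names into one SQL statement.

        Unknown names raise KeyError, so consumers cannot invent
        their own definition of a metric.
        """
        dims = [self.dimensions[d] for d in dimensions]
        meas = [self.measures[m] for m in measures]
        select = [f"{d.sql} AS {d.name}" for d in dims]
        select += [f"{m.sql} AS {m.name}" for m in meas]
        sql = f"SELECT {', '.join(select)} FROM {self.table}"
        if dims:
            sql += " GROUP BY " + ", ".join(d.sql for d in dims)
        return sql


# Hypothetical model of an orders table.
orders = SemanticModel(
    table="analytics.orders",
    measures={"revenue": Measure("revenue", "SUM(amount)")},
    dimensions={"country": Dimension("country", "shipping_country")},
)

print(orders.compile(["revenue"], ["country"]))
# SELECT shipping_country AS country, SUM(amount) AS revenue FROM analytics.orders GROUP BY shipping_country
```

Because every consumer goes through `compile`, the definition of `revenue` lives in exactly one place, which is the property the unified model is meant to guarantee.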
The data transformations in the semantic layer provide a centralized and business-friendly representation to data consumers. Additionally, data consumers can manage a semantic layer with minimal involvement from IT or data teams, ensuring the self-service nature of the process. The upcoming sections will explore the capabilities of the semantic layer and the benefits it offers to end-users.
This proxy layer, which encompasses a unified semantic model providing a singular source of truth for metrics, is equipped with a set of capabilities to ensure its efficacy. These capabilities are outlined in Figure 2.
Data modeling: The service should provide functionality to build a business-friendly data model encompassing metrics, measures, and dimensions. It should be easily manageable by end-users with minimal assistance from data engineers. To ensure this, the service must provide:
Data governance: Centralized data governance facilitates easy access to policy management for end-users, eliminating the need to navigate multiple storages and data streams. The onboarding of new end-users should be easy and fast. To achieve this, the data access approach must be aligned with corporate standards. The security policy should extend across data apps using industry-standard protocols, such as GraphQL, SQL or HTTP API.
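As a rough illustration of centralized governance, the sketch below keeps access policies in one place and checks every request against them before any query reaches storage. The roles, measure names, and row filter are hypothetical, and a real deployment would load policies from the corporate identity and policy systems rather than a hard-coded dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessPolicy:
    allowed_measures: frozenset       # metrics the role may read
    row_filter: Optional[str] = None  # SQL predicate restricting visible rows


# Hypothetical policies, defined once for all consumer tools.
POLICIES = {
    "finance_analyst": AccessPolicy(frozenset({"revenue", "order_count"})),
    "emea_manager": AccessPolicy(frozenset({"order_count"}), "region = 'EMEA'"),
}


def authorize(role, measures):
    """Return the row filter to apply, or raise PermissionError.

    The same check runs whether the request arrived over SQL,
    GraphQL, or an HTTP API, so the policy is enforced uniformly.
    """
    policy = POLICIES.get(role)
    if policy is None:
        raise PermissionError(f"unknown role: {role}")
    denied = sorted(set(measures) - policy.allowed_measures)
    if denied:
        raise PermissionError(f"{role} may not read: {', '.join(denied)}")
    return policy.row_filter


print(authorize("emea_manager", ["order_count"]))  # region = 'EMEA'
```

Onboarding a new end-user then amounts to assigning a role, rather than configuring access separately in each storage system or BI tool.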
Caching & query optimization: The proxy layer is a natural place to speed up hot paths for loading data from storage by using an in-memory cache. Furthermore, analytical queries can be optimized by caching pre-aggregated results, which are stored as materialized query outcomes.
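The two techniques can be sketched together in a few lines. This is a toy, in-process illustration with made-up data, not a description of how any semantic layer product implements caching: a rollup is pre-aggregated once from the raw rows, and repeated queries are answered from an in-memory cache.

```python
from collections import defaultdict

# Hypothetical raw fact rows as they might sit in the warehouse.
RAW_ORDERS = [
    {"day": "2023-05-01", "country": "US", "amount": 120.0},
    {"day": "2023-05-01", "country": "DE", "amount": 80.0},
    {"day": "2023-05-02", "country": "US", "amount": 50.0},
    {"day": "2023-05-02", "country": "US", "amount": 30.0},
]


def build_rollup(rows):
    """Pre-aggregate revenue by (day, country), like a materialized rollup."""
    rollup = defaultdict(float)
    for row in rows:
        rollup[(row["day"], row["country"])] += row["amount"]
    return dict(rollup)


class CachedRevenue:
    def __init__(self, rows):
        self._rollup = build_rollup(rows)  # built once, e.g. on a refresh schedule
        self._cache = {}                   # in-memory cache of query results
        self.cache_hits = 0

    def revenue_by_country(self, country):
        if country in self._cache:         # hot path: no aggregation at all
            self.cache_hits += 1
            return self._cache[country]
        # Cold path: scan the small rollup instead of the raw rows.
        total = sum(v for (_, c), v in self._rollup.items() if c == country)
        self._cache[country] = total
        return total


service = CachedRevenue(RAW_ORDERS)
print(service.revenue_by_country("US"))  # 200.0, computed from the rollup
print(service.revenue_by_country("US"))  # 200.0, served from the cache
print(service.cache_hits)                # 1
```

A production setup would also need cache invalidation and a rollup refresh policy, which this sketch deliberately leaves out.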
Analytics API: The consumption of data necessitates a specialized API, such as SQL API, GraphQL, or REST API. This API enables seamless connection to various consumer systems, including BI tools, Jupyter notebooks, and data apps.
Rather than maintaining siloed data access layers within each consumer tool, data consumers receive a consistent and unified data view via a self-service approach. This transition offers new value by enabling businesses to build faster and more accurate data-driven campaigns. Let's explore this value further to comprehensively understand the key motivations for integrating a unified semantic layer in the existing organizational culture and data stack.
Having a single source of truth for metrics empowers the creation of dashboards, reports, ML models, or embedded enterprise-wide analytics rooted in reliable data, thereby leading to accurate data-driven decisions for end-users. Moreover, each category of data consumer reaps unique benefits:
In the following section, we will discuss our semantic layer reference implementation.
As we transition from theory to practice, we can leverage semantic layer services like Cube, AtScale, Transform, and dbt Semantic Layer for implementation.
Each service has advantages and disadvantages. However, the most general-purpose, cloud-native, and cloud-agnostic service is Cube, which can be used in both startups and big enterprises. Cube can be hosted on Microsoft Azure, Amazon Web Services (AWS), Google Cloud Platform (GCP), or even on-premises through a Cube Core deployment. It consolidates data definitions from a wide range of data stores into the semantic model and efficiently handles proxy queries from data consumers. The Cube implementation aligns with the semantic layer capabilities described earlier and is illustrated in Figure 3.
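As an illustration of what a proxy query from a data consumer might look like, the sketch below builds a request for Cube's REST API `load` endpoint using only the Python standard library. The host, the token placeholder, and the `orders` cube with its members are assumptions made for the example; the actual member names depend on your own data model, and the request is only constructed here, not sent.

```python
import json
import urllib.parse
import urllib.request


def build_load_request(base_url, token, query):
    """Build (but do not send) a GET request to the /cubejs-api/v1/load endpoint."""
    params = urllib.parse.urlencode({"query": json.dumps(query)})
    url = f"{base_url.rstrip('/')}/cubejs-api/v1/load?{params}"
    # The API token (a JWT) is passed in the Authorization header.
    return urllib.request.Request(url, headers={"Authorization": token})


# Hypothetical query: monthly revenue by country from an assumed "orders" cube.
query = {
    "measures": ["orders.revenue"],
    "dimensions": ["orders.country"],
    "timeDimensions": [
        {
            "dimension": "orders.created_at",
            "granularity": "month",
            "dateRange": "last 6 months",
        }
    ],
}

request = build_load_request("http://localhost:4000", "<API_TOKEN>", query)
print(request.full_url)

# To execute against a running deployment (not done here):
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)["data"]
```

The consumer refers only to business names such as `orders.revenue`; which data store holds the underlying rows, and how the metric is calculated, stays inside the semantic layer.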
The overall reference architecture is depicted in Figure 4.
This design outlines a generic implementation of a semantic layer and serves as a starting point for adoption. It’s important to highlight that the given architecture is cloud-agnostic and accommodates any Cube-supported data storage, including Amazon Redshift, GCP BigQuery, Azure Synapse, Databricks, Snowflake, Dremio, or data streams powered by ksqlDB or Materialize.
An example of this cloud-agnostic architecture is provided in Figure 5.
This architecture framework outlines a general semantic layer architecture that can be adopted for cloud-native solutions as needed. The semantic layer, powered by Cube, can be seamlessly integrated with cloud-native or cloud-agnostic data warehouses, reporting, and analytics dashboards.
Semantic layer adoption involves consolidating problematic core metrics into the semantic service. The adoption strategy is divided into discovery and implementation phases.
Discovery entails exploring functional and non-functional requirements. Initially, problematic metrics need to be identified across teams and departments. Key indicators include:
Subsequently, the identified metrics should be categorized by domains and prioritized.
Finally, a solution framework must be constructed with suitable tools and technical prerequisites aligned with the company's DataOps strategy and compliance requirements. Our reference architecture can be a valuable starting point.
Implementation, the most time-intensive phase, involves semantic data modeling, integration of data sources and data consumer tools, and the establishment of a new data security process.
Additionally, changing the data culture represents a complex aspect of adoption. The integration of the semantic layer significantly alters how people interact with business data. Investment in workshops and training programs can alleviate certain challenges associated with the adoption of a unified semantic layer.
The notion of a unified semantic layer has matured and advanced over time, aligning with modern data stacks and presenting novel benefits to businesses. However, semantic layer implementation is intricate and demands an expert understanding of data integration solutions. To mitigate these challenges, Grid Dynamics offers an Analytics Platform Starter Kit that accelerates the creation of a tailored proof-of-concept to suit your unique case. Moreover, Cube, which forms the core of our solution, is versatile and independent of data stores and consumer tools, facilitating seamless integration with various services.
Our starter kit is designed to expedite your time-to-market, aiding you right from the initial stages of semantic layer adoption to enhancing the overall data culture. Get in touch with us to start a conversation about integrating a unified semantic layer into your data environment.