How solid is your data estate?
Time and again, we see data estates built on outdated patterns that run into problems as they try to scale.
In this article, we offer four tips to consider when laying the foundations of your data estate.
#1: Beware the siren song of ‘drag-and-drop’ configuration
It can be very tempting to put together a solution using drag-and-drop tools. They have a low barrier to entry, are quick to edit, and can produce fast results. However, they can also:
- Introduce an iceberg effect, making complex tasks appear simpler than they are
- Become difficult to scale in large solutions, as they generally favour manual configuration over automation (see the sketch after this list)
- Make it a challenge to enforce consistency across a solution
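To illustrate the contrast, below is a minimal, hypothetical sketch of a code-first alternative: one template stamps out any number of consistent pipeline definitions. The source names and schema are invented for illustration, but the principle is what matters: code favours automation and uniformity where drag-and-drop favours one-off manual work.

```python
# A minimal sketch of why code-first definitions scale: one shared template
# can stamp out many consistent pipelines. All names here are hypothetical.
SOURCES = ["sales", "inventory", "customers"]

def make_pipeline(source: str) -> dict:
    """Build one pipeline definition from a shared template."""
    return {
        "name": f"ingest_{source}",
        "source": f"raw/{source}",
        "destination": f"curated/{source}",
        "schedule": "daily",
    }

# Three pipelines, consistent by construction; adding a fourth is one line.
pipelines = [make_pipeline(s) for s in SOURCES]
```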
#2: DevOps is the path
DevOps has long been the gold standard for application development. However, we still see slow uptake of it in data estates, which results in manual, error-prone deployments that add stress and reduce the pace of change. DevOps has proven to:
- Reduce development cycles
- Reduce implementation failure, for example by catching broken configuration before it ships (see the sketch after this list)
- Increase communication and cooperation
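As a small, concrete example, a continuous-integration step can validate every pipeline definition before anything is deployed, so a broken configuration fails the build rather than the production run. This is a minimal sketch assuming pipeline configs are stored as JSON files with a hypothetical three-key schema; it is not tied to any particular CI product.

```python
# A minimal pytest-style check a CI pipeline could run before deployment.
# The pipelines/ directory and the required keys are hypothetical.
import json
from pathlib import Path

REQUIRED_KEYS = {"source", "destination", "schedule"}

def test_pipeline_configs_are_valid():
    """Fail the build if any pipeline definition is missing required keys."""
    for path in Path("pipelines").glob("*.json"):
        config = json.loads(path.read_text())
        missing = REQUIRED_KEYS - config.keys()
        assert not missing, f"{path.name} is missing keys: {missing}"
```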
#3: Consider Spark
You do not need a ‘big data’ workload to benefit from using Apache Spark as your data transformation engine. Spark enables:
- Combining ‘set-based’ logic (i.e. SQL queries) with ‘imperative’ logic (e.g. Python code), giving your developers a consistent mechanism to perform any data transformation, regardless of its complexity (see the sketch after this list)
- Combining real-time and batch transformation in a unified processing engine
- Close collaboration between your data scientists and data engineers. Historically they have worked in separate toolsets, but with Spark they share a common platform
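Here is a minimal PySpark sketch of that first point: a set-based SQL aggregation followed by an imperative Python transformation, both running on the same engine. The data, table, and column names are hypothetical.

```python
# Mixing set-based (SQL) and imperative (Python) logic on one Spark engine.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("estate-demo").getOrCreate()

orders = spark.createDataFrame(
    [("o1", "retail", 120.0), ("o2", "retail", 80.0), ("o3", "online", 200.0)],
    ["order_id", "channel", "amount"],
)
orders.createOrReplaceTempView("orders")

# Set-based logic: a plain SQL query over the registered view.
totals = spark.sql(
    "SELECT channel, SUM(amount) AS total FROM orders GROUP BY channel"
)

# Imperative logic: a Python transformation on the same result.
flagged = totals.withColumn("high_volume", F.col("total") > F.lit(150.0))
flagged.show()
```

Because both steps run against the same DataFrame on the same engine, developers can choose whichever style suits each transformation without switching tools.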
#4: Look towards automation
We created the product ‘LakeFlow’ to help you rapidly build resilient data estates. LakeFlow is a data engineering service that will:
- Deploy a data estate within your Azure environment using only Azure first-party components
- Generate pipelines and onboard new data sources to your data estate quickly. This allows you to focus on your dashboards and insights
- Automatically maintain a historical record of your data, in a cost-effective data lake
- Proactively monitor your pipelines, picking up anomalies in data volume flows before failures occur (a simplified illustration follows this list)
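To make that last point concrete, here is a simplified illustration of volume-based anomaly detection: flag any load whose row count falls outside three standard deviations of recent history. This is our own sketch of the general idea with invented numbers, not a description of LakeFlow’s internals.

```python
# Illustrative only: flag a data load whose volume deviates from the
# recent trend. The daily row counts below are hypothetical.
from statistics import mean, stdev

def volume_anomaly(history: list[int], latest: int, sigmas: float = 3.0) -> bool:
    """Return True when `latest` sits outside `sigmas` standard deviations."""
    mu, sd = mean(history), stdev(history)
    return sd > 0 and abs(latest - mu) > sigmas * sd

daily_row_counts = [10_120, 9_980, 10_340, 10_050, 10_210]  # recent loads
print(volume_anomaly(daily_row_counts, latest=3_500))   # True: investigate
print(volume_anomaly(daily_row_counts, latest=10_100))  # False: normal
```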
If you would like to know more, or need assistance building rock-solid data estates, contact us.
*This blog is sourced from acquired company Data Addiction.