What is Data Engineering?
We define Data Engineering as the process of analyzing, designing, building, testing, improving, and maintaining the tools and systems that store, provide access to, transform, merge, and move heterogeneous organizational and external data.
What is the difference between Data Engineering and Data Science?
No formal definition of Data Science exists, but we can think of it as a collection of tools and processes for applying the scientific method to available data in order to generate new knowledge, usually with the goal of solving one of three problems: prediction, inference, or optimization. Because Data Science can begin only after Data Engineering has occurred, we can think of Data Engineering as a prerequisite for Data Science: without quality data available, there is nothing to apply the scientific method to!
Data-driven projects are a team sport
The biggest mistake business leadership can make when staffing a data-intensive project is assuming that a deliverable is one person’s job. Guided by this erroneous assumption, HR and management embark on a search for a unicorn and make a series of bad decisions about the required combination of skills, the interviewing process, compensation, and the job performance expectations for the individual hired for the role.
The reality is a bit different: to deliver value-generating solutions, Data Engineers, Data Scientists/BI Engineers, Business Analysts, Software Engineers, and Product/Project owners must work together, so that each team member can focus on their own area of expertise. Depending on the complexity of the solution and the confidentiality of the data, the team may also need a Statistician/Mathematician Subject Matter Expert and a security/encryption specialist. The entire team can then deliver greater results through effective collaboration, reaping the benefits of specialization. It is quicker and often cheaper to assemble an effective team of several people than to find and retain a unicorn.
If we take a look at how responsibilities differ among the members of the data project team, it will quickly become obvious why most of the “Data Science” projects delivered by “Jacks-of-all-trades” fail:
Data Engineers build systems and choose tools to serve one purpose: to make the data available for further processing in a particular business use case. They don’t need to know the difference between logistic regression and random forest algorithms; it is the Data Scientist’s job to select the right predictive model given the nature of the business problem, the business conditions, and the nature of the data. Data Scientists and BI Engineers, for their part, need to work with Data Engineers and Business Analysts to communicate their requirements for retrieving the desired variables in their datasets.
Data Scientists and BI Engineers are experts in their own fields: Data Scientists use statistical methods, machine learning techniques, or optimization algorithms, while BI Engineers use visualization tools and reporting automation technologies to cater to the needs of a particular group of business users. They deliver a working prototype and then work with Software Engineers to integrate, test, and automate their solution. Data Scientists and BI Engineers do not need to be experts in building NoSQL data stores, in what GraphQL is useful for, in DevOps tools, or in cloud computing: these are often the Data Engineer’s responsibilities.
It is the job of the Business Analyst to work with Data Scientists and BI Engineers to provide insights into underlying business processes, industry practices, and regulatory constraints, and to complement the data with anecdotes for consumption by decision-makers. Business Analysts help Data Scientists define requirements for Data Engineers when deciding which data to collect and store, and whether to encrypt it at rest. People in this role are the first to find the data governance policy document. SQL knowledge is a very useful skill for a Business Analyst to have, but what makes them most valuable is their detailed knowledge of the underlying business processes, strategy, internal policies, and industry regulations, and of the needs and tendencies of the decision-makers within the company (or a client group).
Software Engineers put machine learning models or reporting automation prototypes into production, integrating and maintaining the solution as part of the company’s existing tech stack. They don’t need to know what Recurrent Neural Networks are, why the data was transformed in a particular way, or why some variable is coded as a boolean rather than a category. They leave data format and algorithm selection to Data Engineers and Data Scientists. Software Engineers also support and maintain the solution on an ongoing basis.
Finally, it is the Project Manager’s job to communicate with business stakeholders, negotiate around bottlenecks and constraints, and agree on tradeoffs between the deliverables of the project. The Project Manager is responsible for facilitating collaboration within this diverse team of experts and for fostering a productive working environment while keeping a bird’s-eye view of its many moving pieces. The individual in this role is ultimately responsible for the delivery of the final solution, but need not be an expert in each field.
What we do best at Ballast Lane is Data Engineering
We at Ballast Lane Applications are fortunate to have a dedicated team of experts. We can either augment your existing team in an area of need or supply the entire team to own a data-intensive project: from conceptual design through detailed analysis, build, testing, and ongoing maintenance. Here are a few examples of recent Data Engineering deliverables for our clients:
InsureTech Startup: Data vendor selection and data pipeline architecture for a workflow automation use case. We performed business and technical validation of 30+ external data sources, comparing data from different vendors on accuracy, consistency, fill-rate, cost of acquisition, quality of documentation, skills and responsiveness of the vendor support team, and latency.
FinTech Startup: Data lake for business intelligence and analytics use cases. We engineered data pipelines, SQL and NoSQL data stores, and a microservices architecture. Our tools pull and transform data from hundreds of sources, including financial institutions and publicly and privately available analytical services, to deliver real-time portfolio value and composition data to the app’s end users.
Department of Economics at one of the largest research universities in the United States: Monitoring web application for a business intelligence and analytics use case. BLA designed and built the data pipeline infrastructure and databases to pull, transform, and store data from thousands of devices and external data sources, and built a dashboard to display real-time operational data.
We work together with our clients to determine how their current or potential use of analytical tools and data helps them achieve their desired objective. After we determine the strategy, our Data Engineers architect a cost-effective minimum viable product (MVP) that serves the basic analytics needs of our client. If the initial solution proves valuable, we adapt the product based on client feedback, adding features and scaling it using agile methodology.
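To make one of the vendor comparison criteria from the InsureTech example more concrete: fill-rate is the share of records in which a given field actually carries a value. The following is a minimal sketch in Python, not the tooling used on that project; the `fill_rate` helper, the field names, and the sample records are all hypothetical and exist only for illustration.

```python
from collections import Counter


def fill_rate(records, fields):
    """Return, for each field, the share of records with a non-empty value."""
    if not records:
        # No records to inspect: report 0.0 rather than dividing by zero.
        return {field: 0.0 for field in fields}
    filled = Counter()
    for record in records:
        for field in fields:
            value = record.get(field)
            # Treat missing keys, None, and empty strings as "not filled".
            if value is not None and value != "":
                filled[field] += 1
    return {field: filled[field] / len(records) for field in fields}


# Hypothetical sample delivered by one vendor.
vendor_a = [
    {"policy_id": "P1", "zip": "02110", "year_built": 1987},
    {"policy_id": "P2", "zip": "", "year_built": None},
    {"policy_id": "P3", "zip": "10001", "year_built": 2004},
    {"policy_id": "P4", "zip": "60601", "year_built": None},
]

rates = fill_rate(vendor_a, ["policy_id", "zip", "year_built"])
print(rates)
```

Running the same function over comparable samples from each vendor gives one column of a side-by-side comparison; the other criteria, such as accuracy or latency, would need their own measurements.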
Food for thought: what factors drive demand for Data Engineering?
Data Engineering is an intermediate economic good, and demand for this type of good is determined by the demand for the final product: what are the final consumers actually buying? Consumers are willing to pay for things that bring them value. Is your data-driven initiative able to satisfy the needs of your final consumers at the lowest cost to them? We show how each data use case helps individual consumers satisfy one or more of their needs: safety, attainment of their purpose, wealth, desire for power and control, or entertainment. Businesses normally pursue profit, while non-profit organizations and governments have a specific purpose.
Data Value Chain example:
An autonomous drone surveillance system at a large manufacturing facility satisfies the need for safety and for reduced risk of theft, vandalism, and sabotage, and of loss of life or health due to workplace accidents. The manufacturing company is willing to pay for the drones and their data systems because it is cheaper and safer to operate a fleet of drones from a control room than to hire multiple drivers and buy vehicles to patrol the premises. Drone cameras can also reach areas that are difficult or dangerous for a human security officer to access. The drone is part of the “Autonomous Systems” data use-case group from the diagram below.
Complex data engineering tools are used to capture, store, and retrieve the data coming from the drones. Streaming image and video data must be available immediately to the control room operators and must also be stored; flight and path data must be stored and available for later retrieval. Video and image data must remain retrievable for future use as well, by investigation teams and police and as evidence in court, although the need to access it is infrequent. Data Engineers design systems that make video and image data available to these users under strict access protocols. Flight data may be analyzed by the security operations team and facility management to make better fleet operations and maintenance decisions, and can be governed by more permissive access protocols. Finally, engineers at the drone manufacturer may need to analyze the flight data to improve flight performance: for example, they may want to know how the drone handles wind, how it behaves as the battery runs out, or how often communication with the operator fails.
Data Engineering therefore supplies data to the manufacturing company to make its factory a safer place, to the drone company to improve its product and pursue profit, and perhaps even to the public for consumption and entertainment, if an event of public interest is captured by a drone’s cameras and subsequently released by plant management.
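The storage and access requirements in the drone example can be written down as a small policy table. This is only a sketch under stated assumptions: the data kinds, the storage tier labels ("hot", "warm", "cold"), the role names, and the `can_access` helper are hypothetical names chosen to mirror the example, not a description of any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPolicy:
    storage_tier: str         # assumed labels: "hot", "warm", "cold"
    allowed_roles: frozenset  # roles permitted to read this kind of data


# Hypothetical policy table mirroring the drone example.
POLICIES = {
    # Live streams must reach control room operators immediately.
    "live_video": DataPolicy("hot", frozenset({"control_room_operator"})),
    # Archived footage is rarely read and tightly restricted.
    "archived_video": DataPolicy("cold", frozenset({"investigator", "police"})),
    # Flight data is analyzed by several groups under looser rules.
    "flight_data": DataPolicy(
        "warm",
        frozenset({"security_operations", "facility_management",
                   "manufacturer_engineer"}),
    ),
}


def can_access(role, data_kind):
    """Return True only if the role is explicitly allowed for this data kind."""
    policy = POLICIES.get(data_kind)
    return policy is not None and role in policy.allowed_roles


print(can_access("police", "archived_video"))
print(can_access("manufacturer_engineer", "archived_video"))
```

Denying access by default, including for unknown data kinds, keeps the strict case safe; the more permissive treatment of flight data is expressed simply by listing more roles.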