 
Designing a Data Pipeline to Meet Your Business Requirements
Designing a data pipeline is a critical step to get right. The design should always start with your business use case: you need to ask yourself, “What problem am I trying to solve?” In our case, we need anomaly detection on massive amounts of logs in real time, with virtually zero latency, and we need the data to be anonymized to protect PII. Our ML models train and predict on embeddings that are extracted using deep neural models for NLP. For this purpose, we had to build two pipelines: one for training and one for prediction. We separate training from prediction so that our training engine can take a little more time while our prediction engine remains lightning fast. Our training pipeline doesn’t need to be real-time, because we require a human in the loop to maintain the fairness and quality of the models. Hence, our training data pipelines are batch pipelines that work through massive amounts of raw data in a few minutes. Our prediction data pipeline, on the other hand, is real-time. An astute reader may ask why the same functionality, processing raw data, lives in two different pipelines. It doesn’t: we don’t duplicate functionality across pipelines. Our architecture is designed to let us create logical data pipelines and reuse modular code in several of them.

Micro Data Lakes Help with Privacy and Cost Reduction
Our design philosophy centers on micro data lakes. The main design criterion for our data lake is storing identity event messages securely, with no Personally Identifiable Information (PII). To achieve this, we chose a micro data lake architecture. Rather than bringing the data to a central location, we spin up elastic, on-demand pipelines that process streaming data and store the extracted features in a feature lake, along with the raw information we deem necessary for explainability. This gives us great flexibility to consume different sources of data while keeping our cost footprint small.

Elastic On-Demand Pipelines Provide Massive Scale
Thanks to our micro data lake architecture, our data pipelines can be launched on demand depending on the flow and volume of data. We use Apache Beam, Google Dataflow, GKE and our own homegrown metadata eventing system to trigger, run and shut down our data pipelines efficiently. We can run our data pipelines on different runner environments, both on-premises and in the cloud.

Which is Better: Horizontal or Vertical Scaling?
Decisions around scaling are completely data driven for us. We use custom metrics around the functional sections of our pipelines to record CPU usage, memory usage and other system- and app-level metrics. These metrics guide how we scale our pipelines vertically, which we achieve by refactoring our pipeline code, introducing asynchronous processing with advanced threading techniques, and buffering data. One such optimization led us to embed our embedding model directly in our pipeline, which cut our processing time by a factor of 3.5.

We use horizontal scaling to increase our capacity based on the volume of data. This is easy to achieve with a serverless design using runners like Google Dataflow. Google Dataflow takes hints from our configuration files and scales horizontally as required. This kind of on-demand scaling lets us add processing capacity as data traffic increases.
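
To make the vertical scaling idea more concrete, here is a minimal sketch of threaded processing fed through a bounded buffer, written in plain Python. It is an illustration under assumptions, not our production code: the names `extract_features` and `process`, the buffer size and the worker count are all hypothetical, and the feature extraction step is a trivial stand-in for real work such as computing an embedding.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the stream for one worker


def extract_features(record: str) -> int:
    # Hypothetical stand-in for the real per-record work (e.g. an embedding call).
    return len(record)


def _worker(buf: queue.Queue, results: list, lock: threading.Lock) -> None:
    # Pull records from the shared buffer until a sentinel arrives.
    while True:
        item = buf.get()
        if item is _SENTINEL:
            break
        value = extract_features(item)
        with lock:
            results.append(value)


def process(records, buffer_size: int = 8, num_workers: int = 4) -> list:
    # A bounded queue acts as the data buffer: put() blocks when it is full,
    # so a fast producer cannot outrun the workers or exhaust memory.
    buf = queue.Queue(maxsize=buffer_size)
    results: list = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=_worker, args=(buf, results, lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for record in records:
        buf.put(record)
    for _ in threads:
        buf.put(_SENTINEL)  # one sentinel per worker so every thread exits
    for t in threads:
        t.join()
    return results  # completion order, not input order
```

Calling `process(["login", "logout"])` returns one feature per record, in whatever order the workers finish. The bounded buffer is the design choice worth noting: it gives backpressure for free, which is one simple way to keep memory use predictable while several threads work in parallel.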
 
 
 
 
 
 
 
 
