Lambda Architecture in data systems and possible meaning of this name

(2021-Sep-21) There is one scene in "The Core" movie that I really like, in which two geophysicists are asked to explain certain anomalies projected on a computer screen. One of them starts to mumble in his attempt to answer the question, and the other tells him, "Say it with me: I don't know". Though the scientific accuracy of this movie is more than questionable, this character shows a very important and valuable virtue: recognizing your own ignorance instead of pretending to be informed.

Here is a common data scenario: you have massive incoming data that flows into your data storage. Your immediate intention is to have all pre-calculated aggregations based on this incoming data right away, but you also understand that achieving correct and accurate data aggregation is a time-consuming process. You’re facing a dilemma: how to balance your data latency (the time between the creation of data in a source system and the moment the same data becomes available to end users on the business intelligence platform) and throughput (a measure of how many units of information a system can process in a given amount of time).

There are several options to balance data latency and throughput:

  1. You give up on getting accurate results in a short period of time from the incoming source data and settle for semi-correct aggregations that can be produced in almost no time: low latency ⇔ high throughput ⇔ incrementally processed data is almost correct.
  2. Opposite to the first option, you settle for a significant waiting time to produce accurate output data that will meet all of your data quality standards: high latency ⇔ low throughput ⇔ accurate processed data.
  3. Or you can tell yourself: I will be fine with the almost correct data (1st option), but only for a short period, and during this time I can wait for the same source data to be thoroughly and correctly processed (2nd option) to replace the short-lived “almost” correct results from the 1st option. You're looking at the 1st and 2nd options combined.

That’s where the Lambda Architecture will shine by providing two data layers to process your incoming source data: [1] Stream layer to support low latency & high throughput and [2] Batch layer to support high latency and low throughput.

Image source: Big Data on Azure with No Limits Data, Analytics and Managed Clusters

The Stream (or Speed) layer handles recent source data and only exists between the last and the next data batches. This layer’s data becomes available to end users almost immediately, but the accuracy and cleanness of its results are sacrificed for that speed of access.

The Batch layer, in contrast to the speed layer, takes more time to produce correct results and is used to fix or repair any incompleteness or incorrectness in the data generated by the stream layer.

The Serving layer provides a combined view: accurate data up to the last processed batch, along with the almost immediately available new data from the stream layer, which is only "almost" accurate.
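The interplay of the three layers can be sketched in a few lines of Python. This is only a toy illustration, not how any particular platform implements it: the event list, the `cutoff` value, and the function names `batch_view`, `speed_view`, and `serve` are all made up for this example.

```python
from collections import Counter

# Hypothetical event log: (event_id, page) pairs; ids grow with arrival order.
events = [(1, "home"), (2, "cart"), (3, "home"), (4, "home"), (5, "cart")]


def batch_view(all_events, up_to_id):
    """Batch layer: recompute counts from scratch for everything up to the cutoff."""
    return Counter(page for event_id, page in all_events if event_id <= up_to_id)


def speed_view(recent_events):
    """Speed layer: count events that arrived after the last batch, one by one."""
    view = Counter()
    for _, page in recent_events:
        view[page] += 1
    return view


def serve(batch, speed):
    """Serving layer: merge the batch view with the fresh speed view."""
    return batch + speed


cutoff = 3  # assumed id of the last event covered by the most recent batch run
batch = batch_view(events, cutoff)
speed = speed_view([event for event in events if event[0] > cutoff])
combined = serve(batch, speed)

# Once the next batch run covers all events, its view replaces the speed view.
next_batch = batch_view(events, 5)
```

In this toy case the speed view happens to be exact, so `combined` and `next_batch` agree. In a real system the speed layer would typically use approximations or see late and duplicate records, and the next batch run is what corrects them.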

This agility to produce output data both quickly and accurately by using the Speed and Batch layers together comes at a cost: you have to maintain two different technological solutions, which may be difficult in the long run.

But what about the meaning of the Lambda Architecture? What does Lambda mean in this term?

Initially, I thought that it was related to the Lambda Calculus, which expresses computation based on function abstraction and application using variables, similarly to the data system equation in Nathan Marz’s article "How to beat the CAP theorem", which defines a data system:

Query = Function(All Data)
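Read literally, the equation says that a query is a side-effect-free function applied to the complete dataset. A minimal Python sketch of that reading follows; the record layout and the names `query` and `total_for_ann` are assumptions made for illustration and do not come from the article.

```python
from typing import Callable, Iterable, Tuple

Event = Tuple[str, int]  # hypothetical record: (user, amount)


def query(function: Callable[[Iterable[Event]], int],
          all_data: Tuple[Event, ...]) -> int:
    """Query = Function(All Data): apply a pure function to the full dataset."""
    return function(all_data)


# The dataset is an immutable tuple: records are only ever appended upstream,
# never updated in place, so the same query over the same data gives the same answer.
all_data: Tuple[Event, ...] = (("ann", 10), ("bob", 5), ("ann", 7))


def total_for_ann(data: Iterable[Event]) -> int:
    """Example query function: sum of amounts for one user."""
    return sum(amount for user, amount in data if user == "ann")


result = query(total_for_ann, all_data)
```

Any other function over the same data, such as `len`, can be passed to `query` in the same way.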

However, this formula is very generic and can be applied to other data system architectures as well.

Then I thought that the Lambda Architecture, and Lambda specifically, is named this way because of the visual shape of the Greek letter (λ), where two data streams (speed and batch layers) split off from the same origin.

But what the real origin of the Lambda Architecture term was, I still don’t know, and I’m OK with this :-)

PS. If you have endured and read my whole blog post, here is a link to a video clip from “The Core” movie where both geophysicists expose their ignorance (or lack of it) by admitting that there are some things they don’t understand.



Comments

  1. I think it does come from functional programming, the core tenets in both being immutability and querying via functions without side effects (idempotency).

    I don't like lambda b/c it assumes a different ETL path depending on whether it is batch or stream. Double the work is not good. A better approach is the kappa architecture (kappa being "one better than lambda"), where the batch layer is removed and all data access goes through streams. This is what the Kafka and Spark folks advocate, and it is seriously much easier to work with. In fairness, I think when people say "lambda" they probably just mean kappa.

    1. Thanks, Dave. I also agree that the streaming data approach is better. My next journey would be to explore why they named it Kappa :-)

    2. Genesis of the name Kappa is "one better than Lambda"

    3. Also, [Kappa] comes before [Lambda] in the Greek alphabet. There must be something there too :-)

