02 Sep 2015
This is a blog post written in the past and brought here with minor changes only. There will be another version with more insights coming out soon.
Elasticsearch provides an extremely robust platform for building custom analytics applications through its flexible aggregations. However, as data grows into the 100 GB range, query performance starts to degrade (especially on fields with high cardinality), so a mechanism like the lambda architecture is needed to keep queries fast without increasing infrastructure cost.
We used a similar architecture and design for a mobile advertising product, Adatrix. The raw data and the batch views (created through background jobs) were both stored in Elasticsearch, in different indexes. The UI used only the batch views, and the speed layer was not implemented because the requirement for real-time statistics was not strong; since we were already using an in-memory database, Aerospike, the speed layer could be implemented whenever the requirement came up.
Cardinality is an important factor in Elasticsearch query performance, and fields with high cardinality show the most degradation. In raw data, there is a very high chance of an id field (like session id or message id) having high cardinality, and statistics often have to be built on those fields.
A few important considerations for creating batch views:
- breaking batch views by time (hourly, daily, weekly, etc.) keeps the cardinality of these fields reasonable compared to queries that run on raw data
- breaking batch views by dimension (the master dimensions are included when creating the batch view; for example, if your data has attributes like location, device, advertiser, and publisher, you can select some of those to create batch views, and the rest of the dimensions will be flattened)
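To make the time and dimension breakdown concrete, here is a minimal sketch in plain Python of a rollup job that turns raw events into hourly batch-view rows keyed by a chosen subset of dimensions. The field names (`ts`, `advertiser`, `location`) are illustrative, not from the actual Adatrix schema:

```python
from collections import Counter
from datetime import datetime, timezone

def hourly_batch_view(events, dims):
    """Roll raw events up into hourly batch-view rows keyed by the
    selected dimensions; all other dimensions are flattened away."""
    rows = Counter()
    for e in events:
        # Truncate the event timestamp to the hour it falls in.
        hour = datetime.fromtimestamp(e["ts"], tz=timezone.utc).strftime("%Y-%m-%dT%H:00")
        key = (hour,) + tuple(e[d] for d in dims)
        rows[key] += 1
    return [
        {"hour": k[0], **dict(zip(dims, k[1:])), "impressions": n}
        for k, n in rows.items()
    ]
```

Each resulting row can then be indexed into a separate batch-view index; queries against that index touch far fewer documents, and the high-cardinality id fields disappear from the view entirely.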
A few important considerations for queries:
- direct queries on raw data (should not be done on very large data)
- rolling up of summaries or batch views
- batch views limit the kinds of queries that can be done (since dimensions are flattened, cross-dimension analysis is not possible)
Also, for storing raw data in Elasticsearch, a few important optimizations are required:
- keeping only the inverted index and not the raw documents (disabling _source storage)
- using keyed (not_analyzed) fields instead of raw text fields
- optimizing Elasticsearch storage (disabling _all, and keeping the cardinality of fields reasonable where possible)
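As a sketch of what these optimizations look like together, here is an index mapping expressed as a Python dict, assuming Elasticsearch 1.x-era settings (current at the time of writing); the type and field names are illustrative:

```python
# Sketch of an Elasticsearch 1.x-era mapping applying the optimizations above:
# _source disabled (only the inverted index is kept), _all disabled, and
# string fields stored as exact (not_analyzed) keys instead of analyzed text.
raw_events_mapping = {
    "event": {
        "_source": {"enabled": False},   # drop the stored raw document
        "_all": {"enabled": False},      # no catch-all field
        "properties": {
            "session_id": {"type": "string", "index": "not_analyzed"},
            "country": {"type": "string", "index": "not_analyzed"},
            "ts": {"type": "date"},
        },
    }
}
```

Disabling _source means the original documents can no longer be retrieved or reindexed from Elasticsearch, so this only makes sense when the raw data also lives somewhere else (or the index is purely a query-serving artifact).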
Lastly, one important consideration is finding unique values for high-cardinality fields like audience. Unique values cannot be rolled up; for example, unique audience per day can't simply be added to find uniques per week. So an approximate algorithm like HyperLogLog has to be used to compute unique statistics for high-cardinality fields.
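To illustrate why sketches solve the rollup problem, below is a minimal HyperLogLog implementation in Python. This is a teaching sketch, not production code (in practice you would use Elasticsearch's cardinality aggregation or a library); the key property is that two sketches merge register-by-register, so daily sketches roll up into weekly uniques without double counting:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch (p=14 gives roughly 1 % typical error)."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)           # low p bits pick a register
        w = h >> self.p                  # remaining bits
        # rank = 1-based position of the lowest set bit of w
        rank = (w & -w).bit_length() if w else 64 - self.p + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Union of two sketches: take the max of each register pair."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)  # small-range correction
        return int(est)
```

A sketch per day can be stored alongside each batch view; the weekly unique count is then the count of the merged daily sketches, at the cost of a small, bounded approximation error.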
Background - Lambda Architecture
If data generation goes above 3 GB per day, you are moving into big-data territory, and the most popular architecture for this kind of workload is the Lambda Architecture, defined by Nathan Marz of Apache Storm (contributed by Twitter); more details are provided at the end. Briefly, the lambda architecture is a layered architecture with a speed layer (creating real-time views on recent data), a batch layer (storing raw data and creating batch views on older data), and a serving layer (which answers queries by combining the batch views and real-time views).
Interesting video from Yieldbot (an advertising company)
10 Sep 2011
This is a post written in the past and brought here with minor changes only.
Introduction
Accurate attribution has become an increasingly important aspect of digital advertising because users are being reached through multiple channels and touchpoints. To determine the ROI from a particular channel or touchpoint, it is extremely important that returns are rightly attributed. This not only helps in better understanding the results of campaigns already executed but also provides great insight and direction for the media planning of future campaigns.
The concept of the purchase funnel, which starts with creating awareness, then generating interest, and later invoking desire that finally results in an action, has not been built into digital advertising systems (which mostly depend on the last-click model). As these systems mature, attribution reporting and modeling are being added, which will help in accurate ROI calculation.
Attribution reporting is basically retroactive reporting which helps compare the contribution of each type of touchpoint and ad event, while attribution modeling is more like a proactive “what if” analysis which helps optimize how ad events occur based on the attribution model. For example, a campaign creating awareness for a newly launched product will give the most importance to the last impression, reducing the weight further for older impressions (decay). When a conversion happens, attribution is given to all the touchpoints on the path to conversion, based on the attribution model designed.
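As a sketch of the decay idea (my own illustrative model, not the exact Adatrix implementation): weight each touchpoint by an exponential decay of its age at conversion time, then normalize so the credit shares sum to one:

```python
def time_decay_attribution(touchpoints, conversion_ts, half_life=7 * 24 * 3600):
    """Split conversion credit across touchpoints, giving more recent
    events more weight; half_life is in seconds (7 days by default).
    touchpoints is a list of (name, unix_timestamp) pairs."""
    # A touchpoint exactly one half-life before conversion gets half
    # the weight of one at the moment of conversion.
    weights = [0.5 ** ((conversion_ts - ts) / half_life) for _, ts in touchpoints]
    total = sum(weights)
    return [(name, w / total) for (name, _), w in zip(touchpoints, weights)]
```

An impression seven days before conversion and a click at conversion time (with the default half-life) would split credit one-third / two-thirds; shortening the half-life shifts more credit to the most recent events.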
Attribution modeling can also help in better bidding with RTB systems, as the bid will be determined using the attribution model. It can also provide optimization opportunities based on consumer responses. But given the impact it has, the modeling has to be done with caution. With Adatrix, we are building attribution reporting and providing the ability to specify simple attribution models. In the future, we will bring attribution modeling into the bidding process and trend-based optimization.
Presentation
I have created the first feature presentation on attribution reporting & modeling below. I will convert it into a reveal.js presentation sometime.
https://www.slideshare.net/bizense/attribution-reporting-and-modeling
07 Sep 2011
This is a post from the past and brought here with minor changes only.
Background
Last week, during discussions on integrations with exchanges, DSPs, etc., we were curious about how these have evolved and what their uniqueness is in the digital advertising landscape. Briefly putting below my understanding of the various jargon being used in online advertising today.
Considering the general workflow, advertisers hire agencies (for expertise) to spend their money; these agencies have buying desks which have relationships/partnerships with different entities on the supply side. Two things have evolved over the last three years, mostly surrounding real-time bidding.
Buying desks, along with their traditional buying relationships and partnerships, now have something called automated trading desks, which are similar to DSPs but are fully owned by agencies. Most automated trading desks run on technology either licensed from another company or gained through acquisition. The video highlights the point of conflict when a technology company works as an agency or vice versa. But this conflict has not occurred so far with Google’s display network, which is pretty strange.
Ad networks, or publisher networks, started out with the simple model of combining publisher sites into verticals, but due to the abundance of networks and a lack of differentiation, these networks have re-branded themselves as DSPs and SSPs. Even now, many of them don’t have a technology platform but provide a combination of licensed technology with the inventory they had earlier. Many of these will now evolve into private exchanges (providing the benefits of real-time bidding along with the inventory they bring), which seems to be the next logical step. The private exchange concept is picking up as a way to increase direct-buy spending through the real-time bidding model. Exchanges were mostly used for remnant stuff, not just in terms of inventory left over but also in terms of money left over after the premium buy.
Some links which bring clarity on these are provided. Most of them are interesting reads, each giving a slightly different perspective and argument.
Definitions
- DSP – Demand side platform – providing integrations with exchanges and ad networks to buy with RTB Ex: AdChemy, X+1, Media Math, DataXu
- Exchange – RTB – Real Time Bidding – providing auction capability across different kinds of systems (DSP, SSP, Networks) Ex: DoubleClick, Right Media
- SSP – Supply side platform – providing integrations with exchanges and ad networks to sell with RTB Ex: Rubicon, AdMeld, Pubmatic
- Networks – used to refer to ad networks, which are basically publisher networks – thousands of networks exist