Big Data Guide

Mar 13, 2018

By now it is a tradition: with each new edition of the Big Data Guide, Hurence presents its flagship work of the year. This year we have been working on a problem common to most of our customers operating in e-commerce.


The goal is to retrieve this navigation data into the Data Lake and enrich it, for several reasons. The first is simply the re-appropriation of the data: our customers want full access to their data (Google provides only a partial and/or aggregated view of navigation data, not all of the raw data), and they also want to better control and protect their business data. The second reason is to set up real-time analytical processing: recommendations and contextual information are offered to users instantly according to their behavior. This content is produced in real time from user and company data (depending on needs, margins, stocks, ongoing partnerships, …). The third objective is to analyze why a sale failed, and to identify what information the user was missing to complete it. Hence the value of repatriating this navigation data and crossing and enriching it with other external and business data sources (ERP, PIM, user base, competitive intelligence, …).
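As an illustration, enrichment can be as simple as joining each navigation event with product attributes from the business systems. The sketch below is hypothetical: the event fields and the in-memory catalog merely stand in for a real ERP/PIM lookup.

```python
# Hypothetical sketch: enrich a raw navigation event with business data
# (an in-memory product catalog standing in for an ERP/PIM lookup).

PRODUCT_CATALOG = {
    "sku-123": {"name": "Trail shoes", "margin": 0.32, "stock": 14},
    "sku-456": {"name": "Rain jacket", "margin": 0.18, "stock": 0},
}

def enrich_event(event: dict) -> dict:
    """Return a copy of the event joined with product attributes."""
    enriched = dict(event)
    product = PRODUCT_CATALOG.get(event.get("sku"), {})
    enriched["product"] = product
    # Flag purchase intent on an out-of-stock product:
    # a typical reason a sale fails.
    enriched["lost_sale_risk"] = (
        event.get("action") == "add_to_cart" and product.get("stock", 0) == 0
    )
    return enriched

evt = {"user": "u42", "sku": "sku-456", "action": "add_to_cart"}
print(enrich_event(evt)["lost_sale_risk"])  # True: intent on an out-of-stock item
```

Crossing events with margins and stocks in this way is what makes the third objective possible: the enriched record carries, alongside the click, the business context explaining why the sale could or could not happen.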

We have implemented processing chains based on our open source LogIsland product to recover navigation data in real time, while still sending part of it to Google Analytics. The company’s website is not modified, because we use the same tag manager as Google Analytics (Google Tag Manager): the event-sending instructions feed both Google and the Data Lake transparently. This has allowed our customers to make their websites much more responsive (very close to what Amazon does) and to make the most of sales opportunities. It is essential to detect that a user has a purchase intention but cannot act on it.
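Conceptually, the tag manager fans each event out to two destinations: an aggregate-friendly subset for the analytics vendor, and the full raw event for the Data Lake. The sketch below mimics that fan-out; the field names are hypothetical, not the actual Google or LogIsland wire formats.

```python
# Hypothetical fan-out: one navigation event, two destination payloads.
# Field names are illustrative only.

def fan_out(event: dict) -> dict:
    """Build the two payloads a tag-manager rule would send."""
    return {
        # Trimmed, aggregate-friendly subset for the analytics vendor.
        "analytics": {"t": event["type"], "page": event["page"]},
        # Full raw event, kept unmodified for the Data Lake.
        "datalake": dict(event, raw=True),
    }

payloads = fan_out({"type": "pageview", "page": "/cart", "user": "u42"})
```

The asymmetry is deliberate and mirrors the point above: the vendor sees a partial view, while the Data Lake keeps every raw field for later crossing and enrichment.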

We have implemented these projects with our LogIsland tool, which can analyze logs for web analytics as well as cyber security, IoT, and the Factory of the Future (also called Industry 4.0). We can deploy this solution on any classic Hadoop Big Data infrastructure. Other implementations are possible, but ours has the advantage of great ease of use (with turnkey components), and it is more extensible (Java plugins), free, and open source.


We use Kafka for event and alert management and Spark Streaming for real-time analysis of logs and other data. Thanks to its dedicated plugins, LogIsland integrates seamlessly with your Hadoop Big Data cluster and your Elasticsearch search engine, without extra dependencies. LogIsland can collect your logs through NiFi, Logstash, or any other ETL tool, dump them into your Data Lake, index them for search, analyze them on the fly, and turn them into events or alerts. LogIsland can even retrieve events directly over standard IoT protocols or, as we saw above, via a tag manager for web analytics. It scales thanks to Spark, its performance is excellent, and its architecture offers huge potential.
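A LogIsland job is described declaratively: a Spark engine, one or more Kafka-backed streams, and a chain of processors applied to each record. The YAML fragment below sketches that shape; the component class names and keys are from memory of the open source project and may differ between versions, so treat it as an illustration rather than a runnable configuration.

```yaml
# Illustrative sketch of a LogIsland job definition (keys/classes may vary by version).
engine:
  component: com.hurence.logisland.engine.spark.KafkaStreamProcessingEngine
  configuration:
    spark.app.name: web_analytics_job
  streamConfigurations:
    - stream: parsing_stream
      component: com.hurence.logisland.stream.spark.KafkaRecordStreamParallelProcessing
      configuration:
        kafka.input.topics: logisland_raw      # raw navigation events
        kafka.output.topics: logisland_events  # parsed, enriched events
      processorConfigurations:
        - processor: parser
          component: com.hurence.logisland.processor.SplitText   # turn raw lines into records
        - processor: indexer
          # index the resulting events into Elasticsearch
          component: com.hurence.logisland.processor.elasticsearch.BulkAddElasticsearch
```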

