I attended XebiCon’18, a conference on data engineering organised by Xebia. It took place at the Palais Brongniart, a beautiful venue in the center of Paris where the historical Paris stock exchange was located. This annual conference hosted about 1500 people attending technical talks across eight rooms. Unfortunately, the multi-track format forced me to choose a limited list of talks to attend. This blog post provides highlights of some of them. If you are interested in a broader view, the general trends are covered in a short blog post I wrote on my company's website.
Damien Gasparina from Confluent talked about Kafka best practices and common mistakes. If you only remember one thing from this talk: take the time to read the documentation! Everything is in there, and you need a good understanding of how Kafka works in order to configure it for production.
The default parameters are often not suitable for your needs. For instance, carefully set the values of acks and min.insync.replicas to avoid data loss. I was surprised by the number of attendees who use Kafka in production but didn't know these two parameters.
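To make the two parameters concrete, here is a minimal sketch of the kind of reliability-oriented settings discussed in the talk. The exact values are illustrative assumptions, not universal recommendations; check them against your own durability requirements.

```python
# Producer-side settings (names follow the official Kafka documentation).
producer_config = {
    # Wait until all in-sync replicas have acknowledged each write.
    "acks": "all",
    # Retry transient failures instead of silently dropping records.
    "retries": 2147483647,
}

# Topic-level settings. With replication.factor=3 and
# min.insync.replicas=2, a write acknowledged with acks=all
# survives the loss of one broker.
topic_config = {
    "replication.factor": 3,
    "min.insync.replicas": 2,
}
```

The key interaction is that `acks=all` alone is not enough: without `min.insync.replicas` raised above 1, a write can still be "fully acknowledged" by a single surviving replica.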
He also talked about exactly-once semantics and how to rely on Kafka to retry requests automatically in case of failure. This last point is also presented in Uber's blog post Building Reliable Reprocessing and Dead Letter Queues with Apache Kafka.
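The retry pattern described in that post can be sketched as follows: a failed message is republished to a chain of retry topics, and once the attempts are exhausted it lands in a dead-letter queue. The topic names here are hypothetical, chosen only for illustration.

```python
# Hypothetical topic names for the retry / dead-letter-queue pattern.
RETRY_TOPICS = ["orders-retry-1", "orders-retry-2"]
DEAD_LETTER_TOPIC = "orders-dlq"

def next_topic(attempts: int) -> str:
    """Pick the topic a failed message should be republished to,
    based on how many times it has already been attempted."""
    if attempts < len(RETRY_TOPICS):
        return RETRY_TOPICS[attempts]
    # Too many failures: park the message for manual inspection.
    return DEAD_LETTER_TOPIC
```

A consumer of each retry topic typically waits a (growing) delay before reprocessing, so the chain also acts as a backoff mechanism.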
Bid Optimization on Google Adwords with Deep Learning
Sandra Pietrowska, Romain Ardiet and Victor Landeau presented how they predicted the costs and gains of buying keywords on Google AdWords. The prediction is based on two scores: one measuring the quality of the keyword and one for the bids (it is a second-price auction).
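As a reminder of the mechanism behind that second score, a second-price auction awards the keyword to the highest bidder, who then pays only the second-highest bid. A minimal sketch:

```python
def second_price_auction(bids: dict) -> tuple:
    """Winner is the highest bidder, but pays the second-highest bid."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    # With a single bidder, the winner pays their own bid.
    price_paid = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, price_paid

# "a" wins with a bid of 2.0 but pays b's bid of 1.5.
result = second_price_auction({"a": 2.0, "b": 1.5, "c": 1.0})
```

This is why predicting the bids of competitors matters: your cost is set by them, not by your own bid.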
They first implemented a baseline and then considered three models: ARIMA, gradient boosting, and a deep learning model. They decided to focus on the deep learning model because they considered it had the greatest margin for improvement.
The "deep learning" model is composed by a reccurent neural network which handles the temporal features, and an artificial neural network for the other features. These two networks are combined as input to a classification layer. They used clustering to train a model (maybe also to change the hyperparameters) for each cluster.
They implemented it with DL4J. They argued it was not possible to use TensorFlow because Docker was not available on their old CentOS Linux distribution. I think the main reason is that Xebia is accustomed to using libraries that leverage the JVM.
They used Zeppelin and Google Data Studio to monitor the models. Ansible was used to redeploy the configuration repeatedly, since their sandbox was reset every week to clean everything up (to free memory, for instance).
Romain Sagean gave a talk on Zeppelin, which he used as a tool to monitor machine learning models. He argued that implementing a dashboard with Zeppelin is easier than with Kibana: Kibana requires you to install and maintain the whole Elastic stack, whereas Zeppelin is like a Jupyter notebook in which you can hide the code to deliver only the information you want. It comes with built-in visualisation components and connections to databases. Jupyter notebooks also suffer from memory-consumption issues, since an inactive notebook does not free its RAM until you restart the kernel. With Zeppelin, you can schedule cron jobs to stop a notebook after a period of inactivity. Another benefit of Zeppelin is its rights management system, which offers security out of the box. Finally, the possibility to switch between several languages inside the same notebook makes development handy, since you can leverage the strengths of each language.
Aurore De Amaral talked about her experience with Spark NLP, a natural language processing library built on Apache Spark. She pointed out a major flaw of this library in its model for named entity recognition: the algorithm, based on conditional random fields, does not work when the training data is larger than 2 MB and is provided in the CoNLL format.
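For readers unfamiliar with it, the CoNLL 2003 layout mentioned above puts one token per line with its annotations (the last column being the named-entity tag), with sentences separated by blank lines. The sample text below is made up for illustration:

```python
# A made-up fragment in CoNLL 2003 style:
# token  POS  chunk  NER-tag
SAMPLE = """\
Aurore NNP B-NP B-PER
talked VBD B-VP O
in IN B-PP O
Paris NNP B-NP B-LOC
"""

def parse_conll(text):
    """Yield (token, ner_tag) pairs from a CoNLL 2003-style block."""
    for line in text.splitlines():
        if line.strip():
            cols = line.split()
            yield cols[0], cols[-1]

# Keep only the tokens that carry a named-entity tag.
entities = [(tok, tag) for tok, tag in parse_conll(SAMPLE) if tag != "O"]
```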
Spark on Kubernetes
Bruno Bouchahoua talked about how to deploy Apache Spark on Kubernetes. Using this scheduler is still experimental (https://spark.apache.org/docs/latest/running-on-kubernetes.html) but answers a growing demand: Kubernetes is the trending container orchestration tool, and the big data ecosystem is moving towards supporting it. Along with this deployment mode, he presented the other options: IaaS, PaaS, and on-premise. Using Spark serverless breaks the data locality principle popularized by Hadoop. Bruno thinks it is nevertheless the way to go: most organisations have already lost data locality because they rely on cloud providers' storage platforms, such as Amazon S3.
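To give an idea of what this deployment mode looks like in practice, here is a hedged sketch of the main settings involved when submitting Spark to Kubernetes. The configuration key names follow the official Spark documentation; the host, image, and namespace values are made-up placeholders.

```python
# Core settings for submitting a Spark job to Kubernetes.
spark_k8s_conf = {
    # The k8s:// prefix tells spark-submit to target the Kubernetes scheduler.
    "master": "k8s://https://kubernetes.example.com:6443",
    "spark.kubernetes.container.image": "example/spark:2.4.0",
    "spark.executor.instances": "3",
    "spark.kubernetes.namespace": "spark-jobs",
}

def to_submit_args(conf):
    """Render the configuration as spark-submit command-line flags."""
    args = ["--master", conf["master"], "--deploy-mode", "cluster"]
    for key, value in conf.items():
        if key != "master":
            args += ["--conf", f"{key}={value}"]
    return args

args = to_submit_args(spark_k8s_conf)
```

In this mode the driver and executors run as pods, which is why a container image must be provided instead of relying on a pre-installed Spark distribution on each node.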