Denis Balyka
Email: denikmail@gmail.com
Skype: dzianis.balyka
Professional software developer with over 14 years of experience working on a variety of challenging projects and technologies, ranging from cutting-edge Big Data, high-load, and machine learning systems to projects with a more traditional technology stack. Proficient in the most popular modern Hadoop ecosystem libraries and tools (Storm, Spark, HBase, Kafka) and in NoSQL databases.
Personal areas of technical interest are cluster computing, complex event processing, and machine learning. I love to take initiative. Broad experience working off-site as a remote developer as well as on-site (Great Britain, USA, Germany) in English-speaking teams. Proficient in English. Open-source contributor.
Private research into approaches for building high-load event storage with real-time search capabilities.
Currently researching:
- the Luwak framework: free-text search over large volumes of data with a SQL interface and without spending resources on indexing;
- HBase SEP + Elasticsearch: fast asynchronous storing, indexing, and free-text search.
Senior Software Engineer, February 2016 - January 2017
Wargaming.net is the developer and operator of some of the largest online multiplayer games ever built, such as World of Tanks, World of Warplanes, and World of Warships.
The goal of this system was to:
- track gamers’ online activities;
- recommend unique offers based on those activities.
My task was to move various data science prediction models into production use. A Spark ML based framework was created to serve the different stages of the model lifecycle. One of the challenges was adapting the data pipeline to DataFrames:
- all profiles are stored in HBase;
- for testing/debugging we sometimes need to load data from CSV;
- some training data could be obtained from Kafka.
I created a number of libraries to support data conversion from different data sources to DataFrames. For training data filtering, Spark SQL with custom functions was applied.
Core Technologies Used: Spark ML, Spark SQL
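The idea behind those conversion libraries can be illustrated in miniature: each source gets a loader that emits rows in one common record shape, so downstream code is source-agnostic. The sketch below is hypothetical Python with only a CSV loader wired in; the real framework was built on Spark DataFrames, and all names here are illustrative.

```python
# Toy version of the "one row format over many sources" idea.
# Only the CSV loader is implemented; HBase and Kafka loaders
# would plug into the same SOURCES registry.
import csv
import io

def load_csv(text):
    """One of several loaders; all loaders return a list of dict 'rows'."""
    return list(csv.DictReader(io.StringIO(text)))

SOURCES = {"csv": load_csv}  # hypothetical registry of source loaders

def load(source, payload):
    """Dispatch to the right loader; downstream code sees uniform rows."""
    return SOURCES[source](payload)

rows = load("csv", "id,score\n1,0.9\n2,0.4\n")
# rows == [{"id": "1", "score": "0.9"}, {"id": "2", "score": "0.4"}]
```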
My task was to port a Spark-based implementation of the rules workflow to the StreamSets API. As a result, I created a number of components for modeling processing pipelines in StreamSets.
As a side task, I implemented a unified profile with the ability to modify and store JSON-like data. This profile allows quick creation of rules and profile storages without an explicit schema definition.
Core Technologies Used: Drools, StreamSets.
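The schema-less profile idea can be sketched in a few lines (Python for illustration; the actual implementation details differ): updates arrive as JSON-like patches and are deep-merged into the stored profile, so new fields require no schema change.

```python
# Minimal sketch of a schema-less profile store: nested dicts merge
# recursively, scalars are replaced, unknown keys are simply added.
def deep_merge(profile, patch):
    """Recursively merge `patch` into `profile` in place and return it."""
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(profile.get(key), dict):
            deep_merge(profile[key], value)
        else:
            profile[key] = value
    return profile

profile = {"id": 42, "stats": {"battles": 10}}
deep_merge(profile, {"stats": {"wins": 4}, "region": "eu"})
# profile == {"id": 42, "stats": {"battles": 10, "wins": 4}, "region": "eu"}
```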
The initial problem was to port the JSON-based data exchange to Avro. For this purpose I created an Avro schema generator for arbitrary JSON. The next step was to support schema modification: it became possible to merge a number of Avro schemas into one. The final step was to add the Confluent registry as the storage for merged schemas.
Core Technologies Used: Confluent Schema Registry, Scala, Avro.
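The generator’s core idea can be sketched as a recursive walk over a JSON value that emits the corresponding Avro type. The real generator was written in Scala; this Python sketch is illustrative only and ignores unions, empty-array typing, and name collisions.

```python
# Infer a (simplified) Avro schema from a parsed JSON value.
def avro_schema(value, name="Record"):
    if isinstance(value, bool):      # must come before int: bool is an int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        item = avro_schema(value[0], name) if value else "null"
        return {"type": "array", "items": item}
    if isinstance(value, dict):      # JSON object -> Avro record
        return {
            "type": "record",
            "name": name,
            "fields": [
                {"name": k, "type": avro_schema(v, name + "_" + k)}
                for k, v in value.items()
            ],
        }
    raise TypeError(f"unsupported JSON type: {type(value)}")

schema = avro_schema({"user": "alice", "score": 10})
```

Merging two such schemas then amounts to unioning their field lists, which is the schema-modification step described above.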
Senior Software Engineer, November 2015 - April 2016
The project was dedicated to creating a prototype of an MDM system based on current technologies and approaches:
- a graph database as the operational backend;
- the ELK stack as the analytics engine.
My responsibilities were:
- adding slow-query monitoring capabilities to TitanDB and OrientDB (implemented as DB extensions);
- resolving geo coordinates from postal addresses for display on Kibana world-map dashboards (implemented as a NiFi plugin);
- data denormalization logic for indexing complete data objects in Elasticsearch without nested or parent/child relationships (because of Kibana limitations).
The most challenging tasks were:
- performance optimization of Elasticsearch indexing (winning approach: the bulk API + compression + switching off index merging + field mapping optimizations);
- support of denormalization for large datasets (~10 GB), implemented using the local key-value storage MapDB plus parallel execution of some “heavy” jobs.
Core Technologies Used: Elasticsearch, Kibana, MapDB, TitanDB, OrientDB, NiFi.
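The denormalization step can be illustrated with a toy example (field names here are hypothetical): each child object is expanded into a flat, self-contained document that embeds its parent’s fields, so Kibana can aggregate over it without nested or parent/child mappings.

```python
# Flatten a parent/child pair into self-contained documents for indexing.
def denormalize(customer, orders):
    """Produce one flat document per order, embedding the customer fields."""
    docs = []
    for order in orders:
        doc = {f"customer_{k}": v for k, v in customer.items()}
        doc.update(order)  # order fields stay top-level alongside the parent's
        docs.append(doc)
    return docs

docs = denormalize({"id": 1, "name": "ACME"},
                   [{"order_id": 7, "total": 99.5}])
# docs == [{"customer_id": 1, "customer_name": "ACME",
#           "order_id": 7, "total": 99.5}]
```

The trade-off is duplication of the parent fields across documents, which is what made the ~10 GB case need MapDB and parallel jobs.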
Senior Software Engineer, April 2015 - January 2016
IG is the Belarusian software development arm of a large British equities and commodities trader. I was involved in developing an intelligence system for tracking customers’ activities. This is an interesting and technically challenging project that deals primarily with ingesting and processing large amounts of data generated by our customers.
My contributions to the project were as follows:
- development of JSON schema validators (events are processed using these schemas);
- design and development of ETL jobs for importing data from the queue transport with schema versioning;
- design and development of mechanisms for auto-detection of invalid events;
- design and development of components for predicting client behavior.
The prediction system is the most interesting and involved component, and I will describe it in more detail. The key idea is to predict a client’s trading activity based on his or her profile (age, sex, location, etc.). The moment the client registers, the system must apply a set of probabilistic algorithms that rank the likelihood of the client’s future trading activity. Development and tuning of these algorithms is not a trivial task, even when employing the latest machine learning libraries.
To implement the system I chose Spark and implemented the following:
- a model training job: a Spark job with the random forest algorithm that runs once a day and stores the generated model on HDFS;
- a prediction job: it processes event streams from Kafka and applies the produced model to clients’ profiles.
The main challenge of this system was the effective collection of training data. It also took some time to tune the processing of clients’ profiles, especially since multiple data sources were involved.
Core Technologies Used: Hadoop (Hortonworks distribution), Spark, Machine Learning, and Hadoop Security.
Senior Software Engineer, March 2014 - April 2015
The main challenge of this project was the proper collection of gamers’ activities. Trivially, this could be achieved by forcing game developers to include publishing of user-generated events in the code, but that would have required a lot of rework of the existing codebase. Instead, we decided to use the database’s replication journal as the source of all events. Our initial proof of concept confirmed that this was an excellent approach. During the implementation phase we decided to use Tungsten Replicator for DB journal consumption and Kafka as a fast, scalable, and reliable event transport. The main challenge for me was the implementation of the Tungsten plugin and its performance tuning, but the final result was rewarding: the plugin could read from the journal and send 20k events/sec to Kafka.
The second huge challenge was implementation of a distributed rules engine for processing event streams and subsequent generation of recommendations. During the research phase we tried two solutions:
- Elasticsearch with Percolator API;
- HBase with Drools rules engine;
Because of data consistency problems inherent in Elasticsearch, we settled on the HBase and Drools option. We also developed Spark Streaming jobs for processing journal data. As events arrived, a set of Drools-based rules was applied, and the final results were persisted in HBase.
Core Technologies Used: HBase, Spark, Drools.
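As a toy illustration of the streaming rules idea (not the actual Drools/Spark code, and with made-up rule and field names), each incoming event is matched against condition/action rules, and the matches are persisted to a store standing in for HBase:

```python
# Each rule is a (condition, recommendation) pair; events stream through
# `process`, and matched recommendations are written to a key-value store.
rules = [
    (lambda e: e["battles"] >= 100, "veteran_offer"),
    (lambda e: e["last_login_days"] > 30, "comeback_offer"),
]

store = {}  # stands in for an HBase table: player id -> recommendations

def process(event):
    """Apply every rule to one event and persist the matches."""
    matched = [offer for cond, offer in rules if cond(event)]
    if matched:
        store[event["player_id"]] = matched

process({"player_id": "p1", "battles": 150, "last_login_days": 2})
# store == {"p1": ["veteran_offer"]}
```

A production rules engine like Drools adds efficient multi-rule matching (the Rete algorithm) and externally editable rule definitions on top of this basic loop.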
The goal of this project was to create a Tableau-based reporting system for game server administrators that would show network quality within a 5-minute window for each game cluster on a world map.
The game server developers were responsible for developing the functionality to publish events, with all the required information, into a Kafka queue. My goal was to develop a Spark Streaming job for processing those events. The challenge here was to achieve fast GeoIP resolution. The stream of events could be huge, up to 100k/sec, and the GeoIP dataset included 26M ranges. A solution based on a network round trip to an RDBMS would not work, and caching the whole dataset on each Spark processing node would be expensive because of high memory usage. The solution I came up with used the MapDB embedded database, which keeps only a small index (10 MB) in memory while the majority of the data resides on local disk (4 GB). This solution handled 20k IP resolution queries per second on a single Java thread with minimal memory consumption. For data storage we chose HBase, with an Impala external table for ODBC access from Tableau.
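The lookup trick can be sketched without MapDB: keep the ranges sorted by start address and binary-search the starts, so each resolution is O(log n) with no network round trip. MapDB adds a disk-backed B-tree on top of the same idea. The sample data below is made up, with IPs shown as plain integers.

```python
# Range lookup via binary search: find the last range whose start <= ip,
# then check that the ip actually falls inside that range.
import bisect

# (range_start, range_end, location), sorted by range_start; sample data
ranges = [(0, 9, "A"), (10, 19, "B"), (100, 199, "C")]
starts = [r[0] for r in ranges]

def resolve(ip):
    """Return the location for `ip`, or None if it falls in a gap."""
    i = bisect.bisect_right(starts, ip) - 1
    if i >= 0 and ranges[i][0] <= ip <= ranges[i][1]:
        return ranges[i][2]
    return None

resolve(15)   # "B"
resolve(50)   # None: between ranges
```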
As an interesting side note, when Impala was updated to version 1.2, I had to develop a patch for Impala to avoid a very expensive HBase table analysis introduced by the update. During this project I also created my own implementation of a Spark Kafka streaming library and open-sourced it. It is available on GitHub: https://github.com/wgnet/spark-kafka-streaming.
The goal of this project was to load data from Kafka, with pre-processing, into HDFS for subsequent data post-processing. To implement it, I decided to use LinkedIn’s Camus library and extend it with custom data extraction logic. In this design, Kafka becomes a transport for JSON, Avro, and BSON events. Ultimately, all these events can be interpreted as a simple DOM tree, and I used an XPath-like approach for data extraction from the event tree. The job itself was implemented using Spark and was written in Scala. The most complex part of this project was partitioning the resulting data into different directories by time. I solved this using MultipleOutputFormat extensions, which means the data processing takes only one data scan.
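The XPath-like extraction boils down to walking the parsed event tree by path segments. A minimal Python sketch (the real job was Scala, and the path syntax here is illustrative):

```python
# Follow a '/a/b/0/c' style path through a tree of nested dicts and lists,
# which is what a parsed JSON/Avro/BSON event looks like.
def extract(tree, path):
    """Return the value at `path`, or None if any step is missing."""
    node = tree
    for part in path.strip("/").split("/"):
        if isinstance(node, list):
            node = node[int(part)]      # numeric segment indexes a list
        elif isinstance(node, dict):
            node = node.get(part)       # named segment selects a field
        else:
            return None                 # path descends below a scalar
    return node

event = {"user": {"tags": ["vip", "beta"]}}
extract(event, "/user/tags/1")  # "beta"
```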
At Wargaming, a lot of data on game servers is stored as Python dictionary blobs. My task was to implement Spark jobs and user-defined functions for extracting this Python data.
Senior Software Engineer, July 2007 - February 2014
I am only listing here the projects related to the Big Data world, as it is my current area of interest.
The core product that I worked on was a banner network system.
The goal of this project was to collect user activities and feed them to an analytics engine. The system was implemented and deployed to production in 2012. The data flow, largely based on the Lambda architecture pattern, is as follows:
- Nginx + Lua generates files with activities;
- Python scripts read those files and send events to Scribe;
- Scribe sends data to Flume;
- Flume processes the data using Esper and HBase and generates online data marts;
- Flume persists raw data to HDFS;
- batch ETL processes the HDFS data and creates offline data marts.
My responsibilities were:
- implementation of Flume Esper transformers and HBase sinks;
- development of HBase co-processors.
The most challenging part was building top lists in real time. It was not easy to maintain top lists with HTable under high load because of hot-spotting. The solution was to use micro-batches and HBase coprocessors.
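A rough sketch of the micro-batch idea (in Python, with a Counter standing in for the coprocessor-maintained state): events are aggregated locally for a short window and flushed as one batched update, instead of one hot-spotting increment per event.

```python
# Micro-batched top list: count a window's events locally, then merge the
# whole batch into the global top-N in one update.
import heapq
from collections import Counter

TOP_N = 3
top_list = Counter()  # stands in for the server-side (coprocessor) state

def flush(micro_batch):
    """Merge one window's local counts into the global top list."""
    global top_list
    top_list.update(Counter(micro_batch))
    # keep only the N largest entries, as the coprocessor would
    top_list = Counter(dict(heapq.nlargest(TOP_N, top_list.items(),
                                           key=lambda kv: kv[1])))

flush(["a", "a", "b", "c"])
flush(["b", "b", "d"])
# top_list now holds the three largest counters, led by ("b", 3)
```

Trimming to top-N after each merge keeps the stored row small regardless of how many distinct keys pass through, which is what made this workable under high load.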
The goal of this project was to show banners to customers who matched the targeting rules.
My responsibilities were:
- implementation of a targeting rules DSL using ANTLR;
- implementation of a co-processor that returned banners for a given customer profile;
- design and implementation of a decision tree with HBase storage.
Challenges:
- performance tuning of HBase co-processors.
Belarusian State University. Master of Science in Computer Science. 2002.