Professional software developer with over 14 years of experience on a variety of challenging projects and technologies, ranging from cutting-edge Big Data, high-load and machine learning systems to projects with a more traditional technology stack. Proficient in the most popular modern Hadoop ecosystem libraries and tools (Storm, Spark, HBase, Kafka) and in NoSQL databases.
Personal areas of technical interest are cluster computing, complex event processing and machine learning. Love to take initiative. Broad experience working off-site as a remote developer as well as on-site (Great Britain, USA, Germany) in English-speaking teams. Proficient in English. Open-source contributor.
Private research into approaches for building high-load event storage with real-time search capabilities.
- Luwak framework: free-text search over large volumes of data with an SQL interface and without spending resources on indexing;
- HBase SEP + Elasticsearch: fast asynchronous storage, indexing and free-text search.
Senior Software Engineer February 2016 - January 2017
Wargaming.net is the developer and operator of some of the largest online multiplayer games ever built, such as World of Tanks, World of Warplanes, and World of Warships.
The goal of this system was to:
- track gamers’ online activities;
- recommend unique offers based on their activities;
My task was to bring various data science prediction models into production use. A Spark ML based framework was created to serve the different stages of the model lifecycle. One of the challenges was adapting the data pipeline to DataFrames:
- all profiles are stored in HBase;
- for testing/debugging we sometimes need to load data from CSV;
- some training data could be obtained from Kafka.
I created a number of libraries to support data conversion from these different sources to DataFrames. For training data filtering, Spark SQL with custom functions was applied.
Core Technologies Used: Spark ML, Spark SQL
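A minimal sketch of the DataFrame loading and custom-function filtering described above; the CSV path, column names and the activity_score function are illustrative assumptions, not the actual project code.

```scala
import org.apache.spark.sql.SparkSession

object TrainingDataPrep {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("training-data-prep").getOrCreate()

    // CSV source used for testing/debugging; the HBase and Kafka sources were
    // wrapped by similar small libraries that also return DataFrames.
    val profiles = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///profiles/sample.csv") // hypothetical path

    // Custom function registered for Spark SQL filtering of training data.
    spark.udf.register("activity_score", (battles: Int, wins: Int) =>
      if (battles == 0) 0.0 else wins.toDouble / battles)

    profiles.createOrReplaceTempView("profiles")
    val training = spark.sql(
      "SELECT * FROM profiles WHERE activity_score(battles, wins) > 0.4")

    training.write.parquet("hdfs:///training/profiles") // hypothetical output
  }
}
```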
My task was to port a Spark based implementation of the rules workflow to the StreamSets API. As a result, I created a number of components for modeling processing pipelines in StreamSets.
As a side task, I implemented a unified profile with the ability to modify and store JSON-like data. This profile allows quick creation of rules and profile storages without an explicit schema definition.
Core Technologies Used: Drools, StreamSets.
The initial problem was to port JSON based data exchange to Avro. For this purpose, I created an Avro schema generator for arbitrary JSON. The next step was to support schema modification: it became possible to merge a number of Avro schemas into one. The final step was to add the Confluent Schema Registry as storage for the merged schemas.
Core Technologies Used: Confluent Schema Registry, Scala, Avro.
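A simplified sketch of the JSON-to-Avro schema inference idea (objects, strings, numbers and booleans only); the sample record and the registry comment are illustrative, while the real generator also handled nested and repeated structures and schema merging.

```scala
import org.apache.avro.{Schema, SchemaBuilder}
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}
import scala.collection.JavaConverters._

object JsonToAvro {
  // Infers an Avro schema from a single JSON sample.
  def infer(node: JsonNode, name: String): Schema = {
    if (node.isObject) {
      var fields = SchemaBuilder.record(name).fields()
      node.fields().asScala.foreach { e =>
        fields = fields.name(e.getKey).`type`(infer(e.getValue, e.getKey)).noDefault()
      }
      fields.endRecord()
    } else if (node.isTextual) Schema.create(Schema.Type.STRING)
    else if (node.isIntegralNumber) Schema.create(Schema.Type.LONG)
    else if (node.isFloatingPointNumber) Schema.create(Schema.Type.DOUBLE)
    else if (node.isBoolean) Schema.create(Schema.Type.BOOLEAN)
    else Schema.create(Schema.Type.NULL)
  }

  def main(args: Array[String]): Unit = {
    val sample = """{"id": 1, "name": "tank", "premium": true}"""
    val schema = infer(new ObjectMapper().readTree(sample), "Vehicle")
    println(schema.toString(true))
    // Merged schemas were then stored in the Confluent Schema Registry
    // (e.g. via CachedSchemaRegistryClient.register(subject, schema)).
  }
}
```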
Senior Software Engineer November 2015 - April 2016
The project was dedicated to creating a prototype of an MDM system based on current technologies and approaches:
- Graph database as operational backend
- ELK stack as analytics engine
My responsibilities were:
- Adding slow query monitoring capabilities to TitanDB and OrientDB (implemented as database extensions);
- Resolving geo coordinates from postal addresses for display on Kibana world map dashboards (implemented as a NiFi plugin);
- Data denormalization logic for indexing complete data objects in Elasticsearch without nested or parent/child relationships (because of Kibana limitations).
The most challenging tasks were:
- Performance optimization of Elasticsearch indexing (winning approach: bulk functionality + compression + switching off index merging + field mapping optimizations);
- Support for denormalization of large datasets (~10 GB), implemented using the local key-value storage MapDB plus parallel execution of some "heavy" jobs (see the sketch after the technology list).
Core Technologies Used: Elasticsearch, Kibana, MapDB, TitanDB, OrientDB, NiFi.
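A minimal sketch of a MapDB-backed lookup as used during denormalization of the large datasets; the MapDB 3.x API calls and the map/file names are assumptions, not the project code.

```scala
import org.mapdb.{DBMaker, Serializer}

object DenormalizationCache {
  def main(args: Array[String]): Unit = {
    // Disk-backed map: keeps the ~10 GB reference dataset out of the JVM heap
    // while still giving fast lookups to the parallel "heavy" jobs.
    val db = DBMaker.fileDB("denorm-cache.db")
      .fileMmapEnableIfSupported()
      .closeOnJvmShutdown()
      .make()

    val addressById = db.hashMap("addressById", Serializer.STRING, Serializer.STRING)
      .createOrOpen()

    // Populated once from the operational database (illustrative value).
    addressById.put("42", """{"city":"Minsk","country":"BY"}""")

    // During indexing, each denormalization job resolves references through
    // the local map instead of querying the graph database again.
    println(addressById.get("42"))

    db.close()
  }
}
```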
Senior Software Engineer April 2015 - January 2016
IG is a Belarusian software development arm of a large British equities and commodities trader. I was involved in developing an intelligence system for tracking customers’ activities. This is an interesting and technically challenging project that deals primarily with ingesting and processing large amounts of data generated by our customers.
My contributions to the project were as follows:
- development of JSON schema validators (i.e. events are processed using these schemas);
- design and development of ETL jobs for importing data from queue transport with schema versioning;
- design and development of mechanisms for auto-detection of invalid events;
- design and development of components for prediction of client behavior;
The prediction system is the most interesting and involved component, so I will describe it in more detail. The key idea is to predict a client’s trading activities based on his or her profile (age, sex, location, etc.). The moment the client registers, the system must apply a set of probabilistic algorithms that rank the likelihood of the client’s future trading activities. Developing and tuning these algorithms is not a trivial task, even when employing the latest machine learning libraries.
To implement the system I chose Spark and implemented the following:
- Model training job: a Spark job with a random forest algorithm that runs once a day and stores the generated model on HDFS;
- Prediction job: processes event streams from Kafka and applies the produced model to clients’ profiles.
The main challenge of this system is effective collection of training data. It also took some time to tune processing of clients’ profiles, especially since multiple data sources were involved.
Core Technologies Used: Hadoop (Hortonworks distribution), Spark, Machine Learning and Hadoop Security.
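A minimal sketch of the daily training job, using the Spark MLlib random forest API of that era; the feature encoding, paths and parameters are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

object DailyModelTraining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("daily-model-training"))

    // Hypothetical training set: label = "traded within 30 days of registration",
    // features = encoded profile attributes (age, sex, location, ...).
    val training = sc.textFile("hdfs:///ml/training/profiles.csv").map { line =>
      val cols = line.split(',')
      LabeledPoint(cols(0).toDouble, Vectors.dense(cols.drop(1).map(_.toDouble)))
    }

    val model = RandomForest.trainClassifier(
      training,
      numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 50,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 8,
      maxBins = 32,
      seed = 42)

    // The streaming prediction job later loads this model from HDFS and
    // applies it to client profiles arriving from Kafka.
    model.save(sc, "hdfs:///ml/models/trading-activity")
    sc.stop()
  }
}
```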
Senior Software Engineer March 2014 - April 2015
The main challenge of this project was proper collection of gamers’ activities. Trivially this could be achieved by forcing game developers to include publishing of user generated events in the code. But it would have required a lot of rework of the existing codebase. Instead we decided to use the database’s replication journal as a source of all events. Our initial proof-of-concept confirmed that it was an excellent approach. During the implementation phase we decided to use Tungsten Replicator for DB journal consumption and Kafka as fast, scalable and reliable event transport. The main challenge for me was the implementation of Tungsten plugin and its performance tuning. But the final result was rewarding: the plugin could read from the journal and send 20k events / sec to Kafka.
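The Tungsten plugin code itself is not shown here; below is only a sketch of the Kafka producer side, tuned for throughput in the way described above (batching, lingering, compression). The broker address, topic name and serialization are assumptions.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

object JournalEventPublisher {
  // Producer tuned for throughput rather than latency.
  private val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.StringSerializer")
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.StringSerializer")
  props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536")
  props.put(ProducerConfig.LINGER_MS_CONFIG, "20")
  props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy")

  private val producer = new KafkaProducer[String, String](props)

  // Called by the replication-journal consumer for every extracted row event.
  def publish(table: String, rowJson: String): Unit =
    producer.send(new ProducerRecord[String, String]("game-events", table, rowJson))
}
```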
The second huge challenge was implementation of a distributed rules engine for processing event streams and subsequent generation of recommendations. During the research phase we tried two solutions:
- Elasticsearch with Percolator API;
- HBase with Drools rules engine;
Because of data consistency problems inherent in Elasticsearch we settled on HBase and Drools option. We also developed Spark streaming jobs for processing journal data. As events arrived, a set of Drools based rules was applied, and the final results were persisted in HBase.
Core Technologies Used: HBase, Spark, Drools.
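A minimal sketch of applying a Drools rule set inside a Spark Streaming micro-batch, as described above; the event/recommendation classes, KIE session name and the `results` global are illustrative assumptions (the rules themselves and the HBase persistence are not shown).

```scala
import scala.collection.JavaConverters._
import org.apache.spark.streaming.dstream.DStream
import org.kie.api.KieServices

case class GameEvent(playerId: String, eventType: String, value: Double)
case class Recommendation(playerId: String, offer: String)

object RulesProcessing {
  // Applies the Drools rules to every partition of every micro-batch.
  def applyRules(events: DStream[GameEvent]): DStream[Recommendation] =
    events.mapPartitions { partition =>
      // One stateless session per partition; rules come from the classpath
      // KIE container (kmodule.xml); the DRL declares `global java.util.List results`.
      val container = KieServices.Factory.get().getKieClasspathContainer
      val session = container.newStatelessKieSession("recommendations-session")

      partition.flatMap { event =>
        val results = new java.util.ArrayList[Recommendation]()
        session.setGlobal("results", results)
        session.execute(event)
        results.asScala
      }
    }
}
```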
The goal of this project was to create a Tableau based reporting system for game server administrators that would show network quality within a 5-minute window for each game cluster on a world map.
The game server developers were responsible for developing functionality to publish events, with all the required information, into a Kafka queue. My goal was to develop a Spark streaming job for processing those events. The challenge here was to achieve fast GeoIP resolution. The stream of events could be huge, up to 100k events/sec, and the GeoIP dataset included 26M ranges. A solution based on a network round trip to an RDBMS would not work. Caching the whole dataset on each Spark processing node would be expensive because of high memory usage. The solution I came up with uses the MapDB embedded database, which keeps only a small index (10 MB) in memory while the majority of the data resides on local disk (4 GB). This solution produced 20k IP resolution queries per second for a single Java thread with minimal memory consumption. For data storage we chose HBase, with an Impala external table for ODBC access from Tableau.
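A minimal sketch of how such a MapDB range lookup could look; the dataset encoding, file name and MapDB 3.x API calls are assumptions, not the original implementation.

```scala
import org.mapdb.{DBMaker, Serializer}

object GeoIpResolver {
  // File-backed B-tree: only a small index stays in memory, while the 26M
  // ranges live on local disk on each Spark executor node.
  private val db = DBMaker.fileDB("geoip-ranges.db")
    .fileMmapEnableIfSupported()
    .readOnly()
    .make()

  // Key: start of the IP range as an unsigned 32-bit value in a Long.
  // Value: "rangeEnd|countryCode|lat|lon" (illustrative encoding).
  private val ranges = db.treeMap("ranges", Serializer.LONG, Serializer.STRING)
    .createOrOpen()

  private def ipToLong(ip: String): Long =
    ip.split('.').foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)

  // Take the range whose start is the greatest key <= the queried IP,
  // then check that the IP is still inside that range.
  def resolve(ip: String): Option[String] = {
    val key = ipToLong(ip)
    Option(ranges.floorEntry(key)).collect {
      case e if e.getValue.split('|')(0).toLong >= key => e.getValue
    }
  }
}
```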
As an interesting side note, I’d like to add that when Impala was updated to version 1.2, I had to develop a patch for Impala to avoid the very expensive HBase table analysis introduced by the update. During this project I also created my own implementation of a Spark Kafka streaming library and open sourced it. It is available on GitHub here: https://github.com/wgnet/spark-kafka-streaming.
The goal of this project was to load data from Kafka, with pre-processing, into HDFS for subsequent data post-processing. To implement it, I decided to use LinkedIn’s Camus library and extend it with custom data extraction logic. In this design Kafka becomes a transport for JSON, AVRO, and BSON events. Eventually all these events could be interpreted as a simple DOM tree. I used an XPath-like approach for data extraction from the event tree. The job itself was implemented using Spark, and it was written in Scala. The most complex part of this project was partitioning result data into different directories by time. I solved this using MultipleOutputFormat extensions, which means the data processing takes only one scan of the data.
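A minimal sketch of time-based partitioning with a MultipleTextOutputFormat extension (old `mapred` API); the directory layout and the timestamp-in-key convention are assumptions.

```scala
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Routes each record into an hourly directory in a single pass, based on the
// event timestamp carried in the key (epoch millis as text).
class TimePartitionedOutputFormat extends MultipleTextOutputFormat[Text, Text] {
  private val dirFormat = new SimpleDateFormat("'dt='yyyy-MM-dd/'hour='HH")

  override def generateFileNameForKeyValue(key: Text, value: Text, name: String): String =
    s"${dirFormat.format(new Date(key.toString.toLong))}/$name"

  // Drop the timestamp key from the output files; keep only the payload.
  override def generateActualKey(key: Text, value: Text): Text =
    null.asInstanceOf[Text]
}
```

With Spark, such a format can be plugged in via `rdd.saveAsHadoopFile(path, classOf[Text], classOf[Text], classOf[TimePartitionedOutputFormat])`, so all time partitions are written in one scan of the data.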
At Wargaming, a lot of data on game servers is stored in Python dictionary blobs. My task was to implement Spark jobs and user-defined functions for Python data extraction.
Senior Software Engineer July 2007 - February 2014
I am only listing here projects related to the Big Data world as it is my current area of interest.
The core product that I worked on was a banner network system.
The goal of this project was to collect user activities and feed them to an analytics engine. The system was implemented and deployed in production in 2012. The data flow is as follows and it is largely based on the Lambda architecture pattern:
- Nginx+Lua generates files with activities;
- Python scripts read those files and send events to Scribe;
- Scribe sends data to Flume;
- Flume processes the data using Esper and HBase and generates online data marts;
- Flume persists raw data onto HDFS;
- Batch ETL processes HDFS data and creates offline data marts;
My responsibilities were:
- implementation of Flume Esper transformers and HBase sinks;
- development of HBase co-processors;
The most challenging part had to do with building top lists in real time. It was not so easy to maintain top lists with HTable under high load because of hot-spotting. The solution was to use micro batches and HBase coprocessors.
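A sketch of the micro-batch flush idea using the HBase client Increment API (shown with the current client API rather than the HTable API of that era); the table, family and qualifier names are assumptions, and the coprocessor that assembled the actual top lists is not shown.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Increment}
import org.apache.hadoop.hbase.util.Bytes

object TopListCounters {
  // Counts are accumulated locally for a short window and flushed as a few
  // Increment operations instead of one RPC per event, which smooths the
  // write load that otherwise hot-spots the counter rows.
  def flush(batch: Map[String, Long]): Unit = {
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("banner_counters"))
    try {
      batch.foreach { case (bannerId, count) =>
        val inc = new Increment(Bytes.toBytes(bannerId))
        inc.addColumn(Bytes.toBytes("c"), Bytes.toBytes("views"), count)
        table.increment(inc)
      }
    } finally {
      table.close()
      connection.close()
    }
  }
}
```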
The goal of this project was to show banners for customers who matched target rules.
My responsibilities were:
- implementation of a targeting rules DSL using ANTLR;
- implementation of a co-processor that returned banners for a given customer profile;
- performance tuning of HBase co-processors;
Belarusian State University. Master of Science in Computer Science. 2002.