Visualization using D3.js and DataMaps to display attack vectors within a given network. The application takes data from IBM QRadar and, via a RESTful API, passes the query results to D3.js, which geocodes them to lat/lng coordinates and places them on the map. Lines are colorized and filtered on the front end according to the severity level of the attack points.
For the current visualization we’ve used:
On top of the world map we've created an additional layer with GeoJSON encoding China's provinces (including Taiwan). Next, we've created another layer with data associated with each country/province from the base map layer. Then we applied a choropleth to get the visual color distribution across the map.
The main goal was to reproduce the functionality of the Vizabi library in NVD3. For the front end we used Angular.js and modified the original NVD3 code to deliver the following features:
Task:
The company asked us to perform data analysis, data cleaning, data transformation and decision support using a MySQL database.
Solution:
Task:
Build a classifier that can find the localization site of a protein in yeast, based on 8 attributes (features).
Solution:
For classification we constructed a 3-layer artificial neural network (ANN), specifically a feed-forward multilayer perceptron. We used stochastic gradient descent with back-propagation to train the ANN.
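A minimal sketch of this kind of setup, using scikit-learn's MLPClassifier rather than the original implementation; the file name yeast.data and the column layout follow the public UCI yeast dataset and are assumptions here:

```python
# Illustrative sketch only: a feed-forward multilayer perceptron (one hidden layer,
# i.e. three layers in total) trained with SGD + back-propagation on 8 features.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

cols = ["name", "mcg", "gvh", "alm", "mit", "erl", "pox", "vac", "nuc", "site"]
df = pd.read_csv("yeast.data", sep=r"\s+", names=cols)   # assumed file/layout

X = df[cols[1:9]].values        # the 8 numeric attributes
y = df["site"].values           # localization site label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)

clf = MLPClassifier(hidden_layer_sizes=(16,), solver="sgd",
                    learning_rate_init=0.05, max_iter=1000, random_state=42)
clf.fit(scaler.transform(X_train), y_train)
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))
```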
Task:
Build a simple web solution that displays real-time data using Cubism.js with Google BigQuery as the data source.
Solution:
We built simple authorization, an API for working with BigQuery, a front-end controller for managing multiple graphs, and simple static Cubism.js graphs.
Type | Technology |
---|---|
Data visualization, Data engineering, DevOps | Django, AngularJS, Cubism.js, D3.js, Django REST Framework, Google BigQuery API |
Task:
Build a data scraper for a given website and export the data to .csv.
Solution:
We used the Scrapy framework to extract the data from the websites.
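As an illustration of the approach (not the delivered spider), a minimal Scrapy spider with a placeholder URL and selectors; running it with `scrapy runspider spider.py -o items.csv` produces the .csv export:

```python
import scrapy


class ItemSpider(scrapy.Spider):
    """Minimal sketch: crawl a listing page and follow pagination."""
    name = "items"
    start_urls = ["https://example.com/catalog"]        # placeholder URL

    def parse(self, response):
        for card in response.css("div.item"):           # hypothetical selectors
            yield {
                "title": card.css("h2::text").get(),
                "price": card.css("span.price::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```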
Task:
Develop a D3 graph to display paid/unpaid work for men/women depending on the country or countries selected.
Solution:
We used the NVD3 collection of components for D3.js and added responsiveness and cross-browser support.
Task:
Create a reference implementation of Spark MLlib in a churn modeling project.
Solution:
We used Mahout Random Forest classification and Spark MLlib Random Forest regression to predict the probability of customer churn for an online shop.
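A condensed PySpark sketch of the MLlib part (illustrative only; the input path, the "churned" label column and the feature columns are assumptions, and all features are assumed numeric):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("churn-rf").getOrCreate()
df = spark.read.csv("customers.csv", header=True, inferSchema=True)   # placeholder path

feature_cols = [c for c in df.columns if c != "churned"]              # assumed label name
data = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

train, test = data.randomSplit([0.8, 0.2], seed=42)
model = RandomForestClassifier(labelCol="churned", featuresCol="features",
                               numTrees=100).fit(train)

auc = BinaryClassificationEvaluator(labelCol="churned").evaluate(model.transform(test))
print("test AUC:", auc)
```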
Task:
Develop a web app for predicting phenotypic and environmental characteristics of the gram-negative bacterium Escherichia coli (the dataset contains 4502 features: the first 6 correspond to gene ID, strain, medium, environmental and genetic perturbation, and information about the growth rate; the remaining entries correspond to the expression of all genes in the bacterium).
Solution:
In this project we used a set of 223 transcriptional profiling samples from the gram-negative bacterium Escherichia coli, a well-studied organism of great importance to human health and biotechnology. We created a predictor of the bacterial growth attribute using only the expression of the genes as attributes and a regularized regression technique (lasso). The program reports the confidence interval of the prediction using the bootstrapping method. We created four separate SVM classifiers to categorize the strain type, medium type, environmental and gene perturbation, given all the gene transcriptional profiles, and one composite SVM classifier to simultaneously predict medium and environmental perturbations. This composite classifier performs worse than the two individual classifiers together. Finally, the program performs Principal Component Analysis, keeping only 3 principal components as features for the SVM classifier.
Type | Technology |
---|---|
Data science, Data analysis, Machine learning | MATLAB, Bootstrapping, Regularization, Support Vector Machine |
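The project itself was done in MATLAB; as a rough illustration of the lasso + bootstrap idea described above, here is a Python sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(223, 500))        # 223 samples x gene-expression features (synthetic)
y = rng.normal(size=223)               # growth rate (synthetic stand-in)
x_new = rng.normal(size=(1, 500))      # sample whose growth rate we want to predict

preds = []
for _ in range(100):                   # bootstrap resamples
    idx = rng.integers(0, len(y), size=len(y))
    model = Lasso(alpha=0.1, max_iter=5000).fit(X[idx], y[idx])
    preds.append(model.predict(x_new)[0])

lo, hi = np.percentile(preds, [2.5, 97.5])
print(f"95% bootstrap CI for the prediction: [{lo:.3f}, {hi:.3f}]")
```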
Task:
Develop a prediction model for web spam and hyperlink analysis, designed and trained (with provided data) to achieve certain prediction goals.
Solution:
We built a spam/no-spam prediction model for a link analysis company. We used Big Data methods to handle 70+ GB of data. The many text features were preprocessed using TF-IDF, Word2Vec and feature selection methods, and various data cleaning and ETL methods were applied. As a result we built a classifier model, which was deployed as a RESTful web service.
Type | Technology |
---|---|
Data analysis, Data science, Machine learning, Big data | Python, Flask, Django, MLlib, Spark |
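A condensed sketch of the serving side (illustrative only; the real model was trained at scale with Spark MLlib, and the tiny in-line training set, route name and field names here are assumptions):

```python
from flask import Flask, request, jsonify
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-in training data; the production model was trained on 70+ GB of preprocessed text.
texts = ["buy cheap links now", "our latest research article on graph analysis"]
labels = [1, 0]                                   # 1 = spam, 0 = no-spam
model = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json()["text"]
    return jsonify({"spam_probability": float(model.predict_proba([text])[0, 1])})

if __name__ == "__main__":
    app.run()
```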
Task:
We had to visualize sensor data stored in MongoDB with the help of the D3.js visualization library, in accordance with the user's choices on the dashboard.
Solution:
We took the data from the database, applied filters to it and presented it in the web UI dashboard.
Task:
To solve image captchas (recognize numbers and letters from an image).
Solution:
We used machine learning methods to solve the captcha images. We wrote a program that segments the letters and numbers out of the image file and then applies a machine learning classifier. We had a training dataset containing captcha images with the correct answers, which we used to train the model. We then applied the model to the test dataset (images only) and obtained the predicted answers.
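A simplified sketch of that pipeline (illustrative; real captchas need proper contour-based segmentation, so the fixed-width slicing and the captcha geometry below are assumptions):

```python
import numpy as np
from PIL import Image
from sklearn.svm import SVC

N_CHARS, CELL_W, CELL_H = 5, 20, 40            # assumed captcha geometry

def split_chars(path):
    """Cut the captcha into equal-width character cells and flatten each cell."""
    img = np.asarray(Image.open(path).convert("L").resize((N_CHARS * CELL_W, CELL_H)))
    return [img[:, i * CELL_W:(i + 1) * CELL_W].ravel() / 255.0 for i in range(N_CHARS)]

def train(labelled):
    """labelled: list of (image_path, answer_string) pairs from the training set."""
    X, y = [], []
    for path, answer in labelled:
        for cell, ch in zip(split_chars(path), answer):
            X.append(cell)
            y.append(ch)
    return SVC(kernel="rbf").fit(X, y)

def solve(model, path):
    return "".join(model.predict(split_chars(path)))
```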
Task:
The task was to connect to the customer's ticket database through the Zendesk API and generate a dataset for a two-series line chart widget in Geckoboard. The first series of the chart is tickets created per week; the second series is tickets solved per week. Data needs to be generated for a rolling 52-week period if possible.
Solution:
To solve this task we retrieved the number of created and solved tickets per week from the Zendesk API, processed the data in Python, created a push widget in Geckoboard and pushed the data to the widget via its API.
Task:
Solution:
Our main steps were to establish a proper big data pipeline:
Task:
The main goal was to create a web app for optimizing product prices from Shopify store data. A user can sign up with their Shopify account, choose products, optimize their prices and access profit/revenue charts.
Solution:
We used authentication via the Shopify API (the user grants access to their account on shopify.com). Our app then imported all products and offers from Shopify and optimized the product prices. The app re-optimizes prices every day using Celery and displays a revenue/profit line chart built with D3.js for each product. It also saves the previous day's data in a MySQL DB and tracks product visitors using a tracking pixel, saving them in the DB.
Type | Technology | Demo |
---|---|---|
Data analysis, Data science, Data visualization, Data engineering | Flask, SQLAlchemy, Python, D3.js, Shopify API, MySQL, Celery, AngularJS | |
Task:
Automatic scoring of MMA fights according to the draftkings.com rules.
Solution:
We used Python to develop a script implementing the scoring calculation algorithm.
Task:
Connect to VMware vCenter, collect virtual machine data and send it to a specified URL as a JSON or XML file.
Solution:
We used the pyVmomi library and Python scripts to complete the task.
Task:
Display a given metric/statistic for each country, over different historical periods. The metric shall be aggregated according to the zoom level (e.g. world view, continent, region, country). Besides the actual numerical value of the metric (or aggregation), the statistic shall be displayed by painting the countries with different intensities of a color, also defined at runtime.
Solution:
We generated a JSON file with the hierarchical structure of the world from existing TopoJSON. To create invisible region shapes, we took the country shapes and used topojson.mesh(); the same was done for continents. "jquery.anythingzoomer.js" was used to apply a zooming glass to the map.
Task:
To create a chart showing total forms submitted by week, grouped by lead source, trended over the last eight weeks. Combined with the next item so that interactivity is week vs. month, with the ability to select/deselect lead sources to declutter the chart if desired. Also, average form submissions by lead source are calculated based on the week/month interactivity toggle.
D3 code generating chart showing total forms submitted by month, trended over the last six months.
D3 code generating chart showing forms submitted last week (Sun-Sat), grouped by lead source. Code should be based on current Week - 1 so that it can be run against each week's data without code changes.
Solution:
We used Python for data transformation, aggregation, table creation and grouping the data by date. Then we used D3.js to define the X and Y axes, add the legend and format the values.
Task:
Build a simulator for internal use which predicts the label of time series data.
Solution:
We created a console script written in R and Bash for production purposes, which validates predictive models in a specific iterative, customer-defined way. It comprises iterative data splitting, model training, outcome prediction and evaluation of model performance.
Task:
The goal of the project was to predict income for products from eCommerce transactional data.
Solution:
The whole dataset was separated into 7 subsets, with categories associated with different income levels. A separate SVM model was trained for each subset. Predictions were then made accordingly, using the corresponding model for each category in the test dataset.
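A minimal sketch of the per-category modelling (illustrative; the column names "category" and "income" and the use of an RBF-kernel SVM regressor are assumptions):

```python
import pandas as pd
from sklearn.svm import SVR

def train_per_category(train_df, feature_cols):
    """Fit one SVM model per category subset."""
    return {cat: SVR(kernel="rbf").fit(subset[feature_cols], subset["income"])
            for cat, subset in train_df.groupby("category")}

def predict_per_category(models, test_df, feature_cols):
    """Route each test row to the model trained for its category."""
    preds = pd.Series(index=test_df.index, dtype=float)
    for cat, subset in test_df.groupby("category"):
        preds.loc[subset.index] = models[cat].predict(subset[feature_cols])
    return preds
```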
Task:
Build a force directed graph with the following interactive features:
1) the ability to click on a node and have the graph recenter on that node;
2) the ability to filter nodes by Node Definition;
3) the ability to filter edges by Relationship.
Task:
Develop a prediction algorithm optimization in Python, including benchmarking the algorithm before and after tailoring the random forest model.
Solution:
We completed these main steps: feature engineering and normalization; trying different models (Logistic Regression, SVM, Decision Trees, ensemble models, etc.); model optimization (via grid search, feature importance analysis, Pearson correlation calculation, cross-validation, graph analysis, overfitting detection); dataset balancing; and model scoring.
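As an illustration of the grid-search / cross-validation / feature-importance part, a minimal sketch on synthetic stand-in data (the parameter grid and model choice are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=42)  # stand-in data

param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      cv=5, scoring="roc_auc")
search.fit(X, y)

print("best params:", search.best_params_, "CV AUC:", round(search.best_score_, 3))
top = np.argsort(search.best_estimator_.feature_importances_)[::-1][:5]
print("most important feature indices:", top)
```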
Task:
To run Monte Carlo simulations using Big Data methods and Apache Spark.
Solution:
We built a high-speed Monte Carlo simulator running on a large cluster (100 nodes), processing 1000 outputs in parallel.
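A minimal PySpark sketch of the idea (not the production simulator; simulate_once() is a placeholder for the client's model, and the trial count and partitioning are illustrative):

```python
import random
from pyspark.sql import SparkSession

def simulate_once(seed):
    rng = random.Random(seed)
    return sum(rng.gauss(0, 1) for _ in range(1000))   # placeholder for the real model

spark = SparkSession.builder.appName("monte-carlo").getOrCreate()
sc = spark.sparkContext

n_outputs = 1000
results = sc.parallelize(range(n_outputs), numSlices=100).map(simulate_once).collect()
print("mean of simulated outputs:", sum(results) / len(results))
```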
Task:
The task was to extract data from Facebook, Twitter and LinkedIn, put it on an interactive map in a dashboard, and add the ability to filter each social network by clicking its icon.
Solution:
As input we took data from the LinkedIn and Facebook APIs (page fans and feed). As output we displayed fans by country using the Google Maps API and js-marker-cluster, displayed the top 3 countries using radialProgress.js, and displayed LinkedIn and Facebook comments.
Task:
To create a web service for searching open vacancies by category or job title and by worker or company location. It should work in the following way:
Solution:
We built a web scraper to collect vacancies from popular job portals, designed a database architecture linking users' CVs to the scraped vacancies, converted uploaded CV files (.doc, .docx and .pdf) to .txt, and transferred the converted, cleaned CV data to a server running an ML algorithm that finds the appropriate vacancies and displays them on the web page.
Task:
The goal was to create a read/write system that moves data from a US data network provider to HDFS using Spark Streaming.
Solution:
The input data is a set of files containing logs in a certain form. Based on the content of each log it is possible to determine many parameters characterizing it, in particular the subscriber_id. These impressions are then aggregated by Spark Streaming into a data warehouse. Spark Streaming can be preceded by Apache Kafka if it is necessary to accelerate and/or improve processing, transformation and further data transfer. At this stage (aggregation using Spark) the log data are joined on subscriber_id. During this, all files are collected within a 15-minute interval, which is controlled by a config file. The Spark Streaming application creates files in a new directory on each batch window. The aggregated data are written to HDFS and copied to the OSP as gzipped files (using a multi-threaded writing process). The most suitable methods for writing and copying the result data to HDFS & OSP are SFTP or SCP, but this should be tested.
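A condensed sketch of such a job (illustrative only; the parser, HDFS paths and the tab-separated log format are assumptions, and the production pipeline additionally handled the Kafka, gzip and OSP-copy steps):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

BATCH_SECONDS = 15 * 60                            # 15-minute batch window (from config)

def parse_log(line):
    # placeholder parser: assume subscriber_id is the first tab-separated field
    fields = line.split("\t")
    return fields[0], line

sc = SparkContext(appName="log-aggregator")
ssc = StreamingContext(sc, BATCH_SECONDS)

logs = ssc.textFileStream("hdfs:///incoming/logs")   # new log files land here
aggregated = logs.map(parse_log).groupByKey()        # aggregate/join on subscriber_id

# one output directory per batch window, as described above
aggregated.saveAsTextFiles("hdfs:///warehouse/subscribers/batch")

ssc.start()
ssc.awaitTermination()
```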
Task:
Visualize specified data in JSON format which shows percentages of objectives completed in three different categories.
Solution:
We used the NVD3 library to visualize the data in graphs.
Task:
To identify sequences of events that are more predictive than their component parts.
Solution:
We were able to determine that a model that incorporates the sequence (L-C->X) is more predictive than just including L, C and X in the model. As a result we obtained the patterns with the highest frequency and found the percentage of how often they appear in the datasets.
Task:
1) write a program implementing Extreme Learning Machine (ELM) algorithms;
2) create a Scala API for Apache Spark with classification and regression for ELM.
Solution:
As a result we have two Scala APIs for Apache Spark with regression and classification algorithms based on Extreme Learning Machines. The main advantages are:
- high processing speed;
- high accuracy;
- a unique script.
Task:
To classify tweet messages by Life Events.
Solution:
The key strategy adopted in this work is to obtain a relatively clean training dataset from a large quantity of Twitter data while relying on minimal human supervision, sometimes at the sacrifice of recall. To achieve this goal, we rely on a couple of restrictions and manual screenings, such as relying on replies, LDA topic identification and seed screening. Each part of the system depends on the earlier steps. As a result we obtained a Life Event class for each tweet.
Task:
To develop an algorithm for predicting the probability of winning tenders for a construction company.
Solution:
The steps for achieving the goal were:
- Preparing and cleaning historical data
- Filling missing values
- Data normalization
- Model building
- Feature importance analysis
- Model evaluation
- Win prediction
- Probability estimation
Task:
To create an SVM classifier that predicts into which class (interesting/uninteresting) incoming articles fall. The SVM is modeled using text concept features based on LSA or LDA.
Solution:
We used Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA) to extract the hidden concepts of text articles. We did this for articles a user has read (reads) and articles a user hasn't read (unreads). We then used a Support Vector Machine (SVM) to create a classification model that predicts into which class a new, incoming article falls.
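A minimal sketch of the LSA + SVM variant (LDA can be swapped in via sklearn's LatentDirichletAllocation); the tiny in-line article lists are stand-ins for the user's reads/unreads:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC

reads = ["deep learning for genomics", "spark streaming tutorial"]        # interesting
unreads = ["celebrity gossip roundup", "royal wedding fashion recap"]     # uninteresting
texts, labels = reads + unreads, [1, 1, 0, 0]

clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),          # LSA: hidden concepts of the articles
    SVC(kernel="linear"),
).fit(texts, labels)

print(clf.predict(["new tutorial on spark mllib"]))
```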
Task:
To investigate the economic incentives to commit review fraud on the popular review platform Yelp, using two complementary approaches and datasets.
Solution:
Our analysis suggests that a business is more likely to commit positive review fraud when its reputation is weak.
Reviews with extreme ratings are more likely to be filtered: all else equal, 1- and 5-star reviews are roughly 3 percentage points more likely to be filtered than 3-star reviews.
Task:
To implement a simple data process in order to build a graph from matrix calculations and visualize it.
This project aims to take a data source as input (CSV), use it to calculate the structure of the graph via matrices, and then import this structure into a graph database (Neo4j).
Solution:
We built the matrix from CSV files, then visualized the matrix for words, grouped them by forms and built two types of graphs in Neo4j.
Task:
To crawl online shops for coupon data.
Solution:
We used Scrapinghub to easily deploy crawlers and scale them on demand, without needing to worry about servers, monitoring, backups or cron jobs. We also used Crawlera, which prevents IP bans and manages thousands of proxies internally, allowing us to crawl quickly and reliably.
Task:
The task was to:
1. transform and clean Canadian census data;
2. analyze the data by age, gender and race categories, social status, etc.;
3. calculate categories and business ranks;
4. analyze city data by population and ranks.
Solution:
Our main steps were to establish a proper big data pipeline:
1. reading and processing of huge CSV files;
2. transformation and cleaning of the data;
3. data extraction;
4. understanding and visualization of the data for age, gender and race categories and social status distributions;
5. calculation of categories and business ranks.
Task:
We have TDMS files generated from sound recordings in binary format, and a Python program that converts them into a graph.
We are looking to upload multiple TDMS files into HDFS and process them in Python using Spark.
Solution:
The main steps were:
1. Create an AWS cluster
2. Load the TDMS files into the AWS cluster
3. Convert the Python reader into PySpark
4. Clean the data with RDDs
5. Transform the data for graph building
6. Define the properties of the graph
7. Build the graph
8. Create tables and store the data analysis
9. Convert the results to the needed format
10. Send an email alert to the user based on data thresholds
11. Ensure all source data, intermediate data, results and graphs are stored in the cluster
Task:
On this project we had 2 tasks:
1. To develop an algorithm that searches for the best matches of an input in a list of canonical inputs. Each input represents a job title in canonical or any incorrect form. The algorithm should be sensitive to acronyms and able to find seniorities (like senior, junior, etc.) in an input job title. The measure of the algorithm's quality is the similarity score between the input and canonical job titles.
2. To create a simple web application with an input field where the user types a query and which returns a list of three canonical job titles with similarity scores.
Solution:
Our main steps were to establish a proper processing pipeline (a minimal sketch of the similarity-matching step follows the list):
1. get through AJAX from front-end the input phrase;
2. canonization of the input phrase:
2.1. punctuation removing;
2.2. lowercase;
2.3. stop words removing;
2.4. unicode processing;
3. acronyms replacement;
4. seniority detection and replacement where it is necessary;
5. determining of preferable metrics and phonetic algorithms;
6. similarity score calculation based on a specific combination of above algorithms;
7. use-cases processing;
8. finding the top three most similar canonical job titles;
9. displaying these results on the web page.
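A minimal sketch of the matching core (illustrative; the production system combined several string metrics and phonetic algorithms, while this uses only difflib from the standard library, and the acronym map, stop words and canonical list are toy examples):

```python
import re
import difflib

ACRONYMS = {"sr": "senior", "jr": "junior", "swe": "software engineer"}
STOP_WORDS = {"the", "a", "of"}
CANONICAL_TITLES = ["senior software engineer", "data scientist", "project manager"]

def canonize(title):
    title = re.sub(r"[^\w\s]", " ", title.lower())       # punctuation removal + lowercase
    words = [ACRONYMS.get(w, w) for w in title.split() if w not in STOP_WORDS]
    return " ".join(words)

def top_matches(query, k=3):
    q = canonize(query)
    scored = [(difflib.SequenceMatcher(None, q, canonize(c)).ratio(), c)
              for c in CANONICAL_TITLES]
    return sorted(scored, reverse=True)[:k]

print(top_matches("Sr. SWE"))
```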
Task:
Create an informative solution for a MOOC platform to analyze each student's progress during their education.
Solution:
The main dashboard shows the mistakes made by each student, grouped by class (e.g., English Literature) and type (e.g., Grammar, Spelling). Each type can be broken down by module (e.g., Literacy. L.1.1.C). Along with checking an individual student's progress, you can see a chart of mistakes over time and the difference in percentage and absolute terms.
Task:
Build a web application with various dashboards to visualize daily data taken from APIs.
Solution:
We created a separate collection for this kind of data and optimized how it is collected and queried (a minimal aggregation sketch follows the list):
- using greenlet-based requests to send asynchronous requests to Clicky, and Celery to schedule refreshes every 5 minutes;
- as a result, 20-30 seconds are required to get data for about 1k sites, which is much (more than 20x) faster than before. Since requests execute asynchronously, increasing the number of sites won't change the execution time by more than 2 or 3 times (60-90 seconds). We created an index on the key search fields, in our case the date and URL of the site: db.adzilla.createIndex({"date": 1, "URL": 1}); it helped to speed up queries by up to 3x.
Query modification:
- using $project to decrease the number of fields in a single document, leaving only the required data;
- refactoring queries by placing the $match stage before any other operations;
- dividing some complicated queries into a few simpler ones;
- recomposing queries to put the $limit and $sort operations as early as possible;
- using Python instead of queries where it is efficient and possible.
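A minimal pymongo sketch of the optimized query shape (illustrative; the connection string and field names mirror the index example above but are assumptions):

```python
from datetime import datetime
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["analytics"]       # placeholder connection

pipeline = [
    {"$match": {"date": {"$gte": datetime(2016, 1, 1)}}},        # filter before anything else
    {"$sort": {"date": -1}},
    {"$limit": 1000},                                            # limit as early as possible
    {"$project": {"_id": 0, "date": 1, "URL": 1, "visits": 1}},  # keep only required fields
]
for row in db.adzilla.aggregate(pipeline):
    print(row)
```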
Type | Technology |
---|---|
Data science, Data visualization, Data analysis, Big data | D3.js, AngularJS, Flask, Python, MongoDB |
Task:
To produce a detailed consulting report advising the company on the opportunities available to them and the best way to implement those opportunities in the business, from concept to reality.
Solution:
Task:
Implement a compatibility API between users based on their MBTI (Myers-Briggs Type Indicator), numerology, zodiac sign, Chinese zodiac sign and interests data.
Solution:
Task:
To create a zoomable bar chart with the following features:
1. A single web page that will get data from a SQL view
2. Ability to zoom in and out
3. Sort by Project Name and Type
Solution:
As a result we've implemented the following features:
- Reading and parsing the input xlsx file
- Drawing a horizontal bar chart with a timeline
- Different bar categories and a legend
- Filters by program type and project title
- Sorting by program type and project title
- Drag + zoom
Task:
To conduct a machine learning experiment using Azure with 1500 inputs, expected to have a minimum of 20 features each.
Solution:
We have implemented following logic:
1. Convert Excel file to csv format and upload it into AzureML
2. Extract all columns to the left of the Result Rate (PHP) for further prediction
3. Use a Python script to translate all string values into integers, because the Linear Regression and Boosted Decision Tree models can use only numeric values
4. Replace all missing values with 0
5. Split dataset into train set and test set for cross-validation
6. Create Linear Regression and Boosted Decision Tree models
7. Train these two models with the training data
8. Score the trained models with the test data
Task:
To implement collision detection and correction for 2 graphs which are already built with D3.js in JavaScript.
Solution:
We compiled ECMAScript 6 in the browser: http://henryzoo.com/babel.github.io/docs/usage/browser/
Collision detection:
- create a quadtree object;
- call the visit method on each item (https://github.com/d3/d3-3.x-api-reference/blob/master/Quadtree-Geom.md#visit) to check whether it has a collision;
- define a collide method that changes the x,y position if items overlap.
Task:
The company, which has a lot of raw text data, asked us to extract entities from different categories of text:
- messages,
- reviews,
- articles.
The next step was to perform sentiment analysis on this text.
Solution:
The procedure was:
1. Raw data preprocessing
2. Keyword selection
3. Synonym detection
4. Category search for each keyword
5. Sentiment analysis
6. Preparing and polishing tests for the working script
Task:
To build an automated tool that extracts, from a set of UN publications, all the messages that relate to the relationships between urban development (SDG 11) and all the other SDG areas, and then visualizes the results.
Solution:
We followed this algorithm (a minimal keyword-matching sketch follows the list):
1. Extract text data from pdf
2. Add missing punctuation to ease splitting by sentences
3. Split text by sentences
4. For every sentence:
4.1. Apply lemmatizer and stemmer to sentence and keywords to get base form of the words
4.2. Search for SDG keywords in sentence
4.3. Add all found matches to the result list
5. Classify sentences from the result list into 3 types: causal, constraint and recommendation, and detect the direction (A causes B).
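A minimal sketch of steps 4.1-4.3 (illustrative; the keyword lists are toy examples, not the project's real SDG dictionaries):

```python
import re
from nltk.stem import PorterStemmer

SDG_KEYWORDS = {
    "SDG 11": ["urban", "city", "settlement"],
    "SDG 6": ["water", "sanitation"],
}
stemmer = PorterStemmer()

def find_sdg_mentions(sentence):
    """Stem both the sentence and the keywords, then report which SDGs are mentioned."""
    tokens = {stemmer.stem(w) for w in re.findall(r"[a-z]+", sentence.lower())}
    return [sdg for sdg, keywords in SDG_KEYWORDS.items()
            if tokens & {stemmer.stem(k) for k in keywords}]

print(find_sdg_mentions("Sustainable cities need reliable water infrastructure."))
```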
Task:
Download the public datasets from the openFDA community and prepare them for use with Elasticsearch.
Solution:
We use a Python script that makes calls to openFDA and grabs the files from the FTP storage. Then we unzip the JSON files, convert them to an Elasticsearch-readable JSON format (a transformation with a predefined schema) and put the results in an Amazon S3 bucket.
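A simplified sketch of that flow (illustrative only; the download URL, transformation schema and bucket name are placeholders, not the project's actual configuration):

```python
import io
import json
import zipfile
import boto3
import requests

URL = "https://download.open.fda.gov/drug/event/sample.json.zip"   # placeholder URL
BUCKET = "my-openfda-bucket"                                       # placeholder bucket

def transform(record):
    # keep only the fields the Elasticsearch mapping expects (assumed schema)
    return {"id": record.get("safetyreportid"), "date": record.get("receivedate")}

raw = requests.get(URL, timeout=60).content
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    data = json.loads(zf.read(zf.namelist()[0]))

docs = [transform(r) for r in data.get("results", [])]
boto3.client("s3").put_object(
    Bucket=BUCKET,
    Key="openfda/transformed.json",
    Body="\n".join(json.dumps(d) for d in docs).encode("utf-8"),
)
```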
Task:
The company works with physiological time-series signals (e.g. ECG, EEG). They asked us to do DTW, clustering and classification for time series data. Since time-series signals like ECG or EEG are continuous in nature, large-scale distributed and parallel processing using Apache Spark is needed.
Solution:
We created an algorithm for calculating the distance between pairs of time series in parallel, using the following methods (a compact sketch follows the list):
- naive DTW algorithm
- locality-constraint DTW algorithm
- LB_Keogh DTW algorithm
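Compact reference versions of two of these methods (plain Python, for illustration; in the project they were applied to signal pairs in parallel with Spark):

```python
import math

def dtw(a, b):
    """Naive dynamic-time-warping distance between two sequences."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])

def lb_keogh(query, candidate, r=5):
    """LB_Keogh lower bound: compare each query point to the candidate's envelope."""
    total = 0.0
    for i, q in enumerate(query):
        window = candidate[max(0, i - r): i + r + 1]
        lower, upper = min(window), max(window)
        if q > upper:
            total += (q - upper) ** 2
        elif q < lower:
            total += (q - lower) ** 2
    return math.sqrt(total)

print(dtw([1, 2, 3, 4], [1, 2, 2, 3, 4]), lb_keogh([1, 2, 3, 4], [1, 2, 2, 3, 4]))
```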
Task:
The regression analysis task is part of research on Twitter and YouTube users. The task was to work with an existing regression model (a linear regression of three variables), justify the selection of variables, justify why log(score) is selected instead of score, justify why a linear model is better than other forms of regression (logistic, non-linear), and compare it with other methods.
Solution:
Task:
We had to:
1. Develop a custom MCF Attribution Comparison Table for data pulled from GA into Klipfolio
2. Develop an ecommerce product performance report for the same data
Solution:
Table 1: Custom MCF Attribution Comparison Table. The data is pulled from the Google Analytics Conversions -> Attribution -> Model Comparison Tool.
Table 2: Ecommerce product performance report. Data for the table comes from the Google Analytics report Conversions -> Ecommerce -> Product Performance (metrics: product revenue, transactions, product category (3x products), default channel grouping).
Task:
1. create multi-user access to the application;
2. collect logs from the remote server;
3. visualize log data in D3.js graphs.
Solution:
- HTML code with the application and needed JS;
- Bitbucket source code;
- Deployed application on the client's server (additional effort);
- These slides as documentation.
Task:
- upload CSVs to Tableau Online and make dashboards
- make a sample visual report that demonstrates how Tableau can display information that is drawn from different spreadsheets.
Solution:
We prepared the dashboards and provided consultation and a step-by-step manual on how to work in Tableau.
Task:
Draw 2 graphs, one in the upper part of the canvas, the second in the lower part. Each graph is a star. Each graph vertex represents a word.
Solution:
The corresponding vertices of the 2 graphs are connected with lines.
Task:
To choose a few APIs for a prototype and describe why and how they differ from each other in facial recognition and sentiment analysis.
Solution:
We prepared a presentation in which we analyzed current APIs for facial recognition and sentiment analysis. The presentation included the most useful APIs, their descriptions, main features and a comparative analysis.
Task:
To perform customer segmentation using ML methods on ecommerce data.
Solution:
We built an engine that creates recommendations for each customer. It is item-based:
Task:
To apply data mining methods to data about articles from Chinese universities and build a chart with data visualization.
Solution:
We created two visualizations for Chinese universities. In "Chinese_universities_chord" we chose the top 20 universities with intra-collaborations >= 4 and joined their relations with other universities. As a result we got 34 universities and built the visualization for them.
Task:
To predict a person's age and gender using information from their Facebook account. Some of the fields are empty and some have text or numeric values.
Solution:
To achieve the goal we've used such methods:
- Filling NaN values
- Feature normalization
- Grid search
- Feature importance analysis
- Logistic regression
- Pearson correlation coefficient calculation
- Model examination
- Cross-validation
Task:
The task was to create a dashboard with graphs to visualize the results of surveys.
Solution:
We created the dashboard and added authentication and a commenting system.
Task:
To add an inverse sorting order function to the animal matrix.
Task:
To build a dashboard that provides information to e-tailers based on users' interactions with the widget.
Solution:
We built the app and dashboard using AngularJS and D3.js.
Task:
- Map product brand names to the same format;
- Map different color names to 10 main colors;
- Improve free-text search on the e-commerce site.
Solution:
• Create mappings for common variations of brand names;
• Map products without brand information based on title and description;
• Support alternative spellings for complicated brand names (Ferragamo, Furstenburg).
Task:
The task was to help create a model to predict which website visitors are likely to churn within the next two months, using internal customer data and website visit data.
Solution:
We created a customer churn model with detailed instructions and tools for churn prediction.
Type | Technology |
---|---|
Machine learning, Data science | Decision trees, Random forest trees, Logistic regression |
Task:
To analyze and evaluate different FSL tools.
Solution:
We used example data to explain how the different tools in FSL can be run. Most of the same operations are used for the analysis of human fMRI data.
Task:
To create a classifier that takes as input one such match record, with the "winner" field left out, and labels this record as a "win" or a "loss".
Solution:
Our main steps were:
- Load train and test data
- Replace all NaN values with 0
- Sort arrays by column
- Split train array into sTrain and sTest arrays for model train and check
- Split sTrain and sTest arrays into arrays that contains only ‘winner’ column and arrays that contain all but ‘winner’
- Create RandomForestClassifier model
- Train our model with the split data from sTrain
- Predict results of sTest data without ‘winner’
- Compare predicted data with sTest ‘winner’ data
- Calculate AUC and Confusion Matrix
- Predict ‘winner’ from test data
- Write result of prediction into file
Task:
We used the financial data for college programs/departments to build a visualization of summary data for these colleges (the summary budgets of all programs, etc.), with the ability to view detailed data (each program, etc.) for a single year.
Solution:
We used the Tableau platform to build visualizations that show budget changes in different colleges and for different programs in 2011-2015, with the ability to filter by college, program and year. We also added a FOAP comparison and revenue and expense change charts. We are now waiting for the 2016 data.
Task:
The task was to create a Spark app that runs a time series algorithm on the data.
Solution:
Our main steps were:
- Getting the data from Cassandra.
- Cleaning the data.
- Building the Apache Spark Streaming job.
- Calculating the main values: number of likes, comments, reads, shares and the speed of each article.
- Building an SVM model for performance prediction.
- Returning the statistical results to Cassandra.
Task:
To build custom SEO reporting using data from Google Analytics and Google Search Console.
Solution:
We built the following graphs and visuals:
1) Channel breakdown [table] - showing metrics -> sessions, bounce rate, conversions and conversion rate for each showing 3 months and YoY (Google Analytics)
2) Chart showing number of conversions vs conversion rate for organic over 12 months (Google Analytics)
3) Chart showing Clicks & Avg. position over time (Google search console)
4) Table showing top non-brand search queries clicks (top 30)
5) Top ranking keywords in chart (grouped by positions 1-3 | 4 - 10 | 10 - 20 | Over 30) Google Search Console
6) Number of links over time (Google Search Console)
7) Chart showing number of landing pages drawing organic traffic overtime (google search console)
8) Mobile VS Desktop sessions and conversions over time chart (google analytics)
Task:
The task was to convert existing SQL Server database on Windows to Titan on CentOS and write scripts for data migration.
Solution:
We installed Titan (http://titan.thinkaurelius.com/), a graph database running on top of Cassandra, and wrote a script for the Gremlin console (http://s3.thinkaurelius.com/docs/titan/0.5.0/gremlin.html).
Task:
To build an application for the Gas to the Future project. The app is part of the site and includes diagrams and graphs.
Solution:
The main part of the project was building the Sankey diagram. The goal is for the Sankey to update based on the filters applied to it. The idea is that you apply one of the 'continuous filters', which filters out various lines; this then generates JSON, which gets passed to the Sankey.
Task:
The main problem in the task was to transform data, using the client's algorithms, into a format that can be used to draw a stacked bar graph.
Solution:
The main transformations were made with the nest, map and stack functions. After all the data transformations and drawing we got the final stacked bar graph.
Task:
1) What is the estimated lifetime value for each industry by year? Apply various approaches for LTV calculation.
2) Can you map the lifecycle of the accounts grouped by year?
3) Tracking customer behaviour by year of registration
Solution:
Our main steps were (a sketch of the Gamma-Gamma step follows the list):
1. Dataset analysis
2. Calculation of active, new and gone accounts for each year
3. Calculation of the Customer Retention Rate and customer profit for each industry
4. LTV calculation based on simple approaches
5. Segmentation of users and industries by users' activities and frequencies; building a Frequency/Recency matrix
6. Estimating the LTV using the Gamma-Gamma submodel
7. Visualizing customer transaction history
8. Churn prediction and LTV calculation based on churn
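A sketch of step 6, assuming the lifetimes package and an RFM summary table with frequency, recency, T and monetary_value columns (both the tooling and the column names are assumptions):

```python
from lifetimes import BetaGeoFitter, GammaGammaFitter

def estimate_ltv(summary, months=12):
    """Fit BG/NBD for purchase frequency and Gamma-Gamma for monetary value, then LTV."""
    bgf = BetaGeoFitter(penalizer_coef=0.001)
    bgf.fit(summary["frequency"], summary["recency"], summary["T"])

    repeat = summary[summary["frequency"] > 0]          # Gamma-Gamma needs repeat buyers
    ggf = GammaGammaFitter(penalizer_coef=0.001)
    ggf.fit(repeat["frequency"], repeat["monetary_value"])

    return ggf.customer_lifetime_value(
        bgf,
        repeat["frequency"], repeat["recency"], repeat["T"], repeat["monetary_value"],
        time=months, discount_rate=0.01,
    )
```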
Task:
To scrape data from http://autotrader.com and http://autotrader.ca every week.
Solution:
Steps:
Task:
The task was to create a calendar zoomer and save the calendar state after changing the zoom.
Solution:
We have delivered:
- a responsive D3 calendar
- mobile pinch and pan
- zooming
- saving the calendar state between zoom switches
Task:
The task had a research background. Architectures for a real-time intrusion detection system (IDS) using the powerful big data technology Apache Spark together with Spark Streaming, Spark MLlib, Apache Kafka and HBase/Cassandra were to be developed. To compare capabilities, obtained results and performance, the Naive Bayes classifier was used both in the stack of the above-mentioned tools and in the stack of Apache Hadoop, HStreaming and Apache Hive. For attack type prediction on the KDD'99, NSL-KDD, CSIC 2010 HTTP, UNB ISCX and DARPA datasets, several ML algorithms provided by Spark MLlib were used, such as Naive Bayes, random forest, logistic regression, gradient boosted trees, SVM, etc., together with the best ML data processing and training techniques for selecting the best model for each dataset. The proposed system architecture is evaluated with respect to accuracy in terms of true positives (TP) and false positives (FP), with respect to efficiency in terms of processing time, and by comparing results with traditional techniques. The combination of Apache Kafka and Spark Streaming serves as a distributed, fault-tolerant, real-time big data stream processor. Prediction (classification) results were saved in HBase for generating statistics on the IDS's work. A web-based management console renders a number of visualizations using the D3.js library that help users to quickly assess threats and analyze network traffic.
Solution:
Our main steps were:
1. Investigate the current state of the art in the problem area.
2. Define the conditions for implementing a real-time application.
3. Choose the technology stacks.
4. Build the architectures.
5. Prepare ways of implementing the architectures and comparing the approaches.
Type | Technology |
---|---|
Data analysis, Data science, Big data, Machine learning | Python, Scala, Spark, Apache Kafka, Apache Hadoop, Apache HBase, Apache Cassandra, D3.js, Apache Hive |
Task:
To improve a Telegram bot.
Solution:
The solution was to fix 3 issues in the chatbot.
Task:
Obtain relevant business and data insights from an SQL database that contains customer and purchase data for an ecommerce organization.
Solution:
We built custom queries and extracted demographic data, purchasing patterns and product performance from a local MySQL database.
Task:
To build a simple streaming application that analyzes Twitter data.
Solution:
We built an app which captures and processes the live Twitter data stream and does the following tasks:
• capturing the live data
• setting up a stream processing pipeline
• processing the data and extracting insights
• storing the final processing results in a relational database management system
Task:
The main goal was to write a program for text encoding and error checking.
Solution:
We were able to finish the following parts of the project (a minimal edit-distance sketch follows the list):
- Writing a program to compute the interval corresponding to a given word using arithmetic encoding.
- Building a Hidden Markov Model for encoding.
- Writing a program for calculating the string edit distance.
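For the third item, a minimal sketch of the classic dynamic-programming string edit distance (Levenshtein, unit costs):

```python
def edit_distance(a, b):
    """Levenshtein distance with unit insertion/deletion/substitution costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / match
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))   # -> 3
```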
Task:
To create an automated extractor of Twitter data.
Solution:
Our main steps were:
- create a script that logs in as a user, navigates to a specific page and downloads a CSV;
- scrape the data;
- save it to the database as a snapshot.
Task:
To get data from the Facebook API, record it in an Excel file and compute the difference from the previous data.
Task:
Algorithm Requirements
1. Improve NLP and extraction tools used in code.
2. Algorithm will analyze digital textbooks and extract key information from each chapter.
3. Chapters will be properly labeled and separated, matching the content in the digital book.
4. The algorithm will be able to read inputs (DOC, TXT, PDF, ePUB).
5. The end document (output) will act as a study guide, covering the key concepts from each chapter (DOC).
Software Integration and Development:
1) To make the integration easier:
- use cloud hosting (Amazon);
- change the site CMS from Squarespace to MODX.
Solution:
Z-Study Buddy uses Artificial Intelligence to mechanically read digital textbooks. The technology reads and analyzes study textbooks, extracts the key information from each chapter and puts it into a summary document.
Task:
The task was to create a proposal suggesting a solution to the following tasks:
Solution:
We created a document with a proposal that solves the problem and addresses the risks.
Task:
To crawl 10M geotagged records from Flickr / Instagram / Twitter to build a data visualization on the map.
Solution:
We implemented a script that gets photos by geotag (within a bounding square) from https://www.flickr.com/.
Task:
- set up the client's NiFi server and encrypt it;
- set up Kafka + Spark Streaming on Azure;
- configure Kafka topics / producers / consumers;
- run the algorithms on the data from Kafka;
- save the output to HBase.
Solution:
Our main steps were:
- putting the data into Apache NiFi;
- removing duplicates within a certain period;
- creating a producer to send the data to Apache Kafka.