On this project we had 2 tasks:
1. To develop an algorithm that will search the best matches of some input in the list of canonical inputs. Each input represents a job title in canonical or any incorrect form. This algorithm should be sensitive to acronyms and be able to find seniorities (like senior, junior, etc.) in an input job title. The measure of good work of the algorithm is the similarity score between input and canonical job titles.
2. To create a simple Web application which contains an input field where the type his query and returns the list of three canonical job titles with similarity score.
Our main steps were to establish proper big data pipeline:
1. get through AJAX from front-end the input phrase;
2. canonization of the input phrase:
2.1. punctuation removing;
2.3. stop words removing;
2.4. unicode processing;
3. acronyms replacement;
4. seniority detection and replacement where in is necessarily;
5. determining of preferable metrics and phonetic algorithms;
6. similarity score calculation based on a specific combination of above algorithms;
7. use-cases processing;
8. finding of top three the most similar canonical job titles;
9. display these results on the web page.