Big data with Spark Streaming: data from the network to HDFS
The goal is to build a read/write system that takes data from a US data network provider into HDFS using Spark Streaming.
The input data are a set of files containing logs in a fixed format. From the content of each log it is possible to derive a number of parameters characterizing that log, in particular the subscriber_id. These log records (impressions) are then aggregated by Spark Streaming into a data warehouse. Spark Streaming can be fronted by Apache Kafka if the processing, transformation, and onward transfer of the data need to be accelerated or otherwise improved.

During the aggregation step in Spark, the log data are joined on subscriber_id. All files arriving within a 15-minute interval are collected into one batch; the interval length is controlled by a config file (see the first sketch below). On each batch window the Spark Streaming application writes its output files into a new directory.

The aggregated data are written to HDFS and copied to the OSP as gzipped files, using a multi-threaded writing process. The preferred candidate for writing and copying the result data to HDFS and the OSP is SFTP or SCP, but this still needs to be tested.
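As an illustration of the ingestion and batch-interval setup, here is a minimal sketch. The properties file name, its keys (batch.interval.minutes, input.dir), and the assumption that subscriber_id is the first comma-separated field of a log record are all hypothetical, not part of the spec above:

```scala
import java.io.FileInputStream
import java.util.Properties

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object LogIngest {
  def main(args: Array[String]): Unit = {
    // Hypothetical config file; key names are assumptions.
    val props = new Properties()
    props.load(new FileInputStream("ingest.properties"))
    val batchMinutes = props.getProperty("batch.interval.minutes", "15").toLong
    val inputDir     = props.getProperty("input.dir", "hdfs:///landing/logs")

    val conf = new SparkConf().setAppName("LogIngest")
    // The 15-minute batch window is taken from the config file, as described above.
    val ssc = new StreamingContext(conf, Minutes(batchMinutes))

    // Watch the landing directory; each new file is picked up in the next batch.
    val lines = ssc.textFileStream(inputDir)

    // Assumption: subscriber_id is the first comma-separated field of each record.
    val bySubscriber = lines.map { line =>
      val fields = line.split(",", -1)
      (fields(0), line)
    }

    bySubscriber.print() // placeholder sink; real output is shown in the later sketches

    ssc.start()
    ssc.awaitTermination()
  }
}
```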
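If Kafka is placed in front of Spark Streaming, the file source above would be swapped for a direct Kafka stream. A sketch against the spark-streaming-kafka-0-10 integration; the broker address, group id, and topic name are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaSource {
  // Returns a stream of raw log lines read from Kafka instead of the file system.
  def logLines(ssc: StreamingContext): DStream[String] = {
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka01:9092",      // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "log-ingest",        // placeholder group id
      "auto.offset.reset"  -> "latest"
    )
    KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("network-logs"), kafkaParams)
    ).map(_.value) // each Kafka record value is one raw log line
  }
}
```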
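For the join on subscriber_id and the per-batch gzipped output, each batch window can be written to a fresh HDFS directory named after the batch time. A sketch, assuming two streams already keyed by subscriber_id and a hypothetical output root path:

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.streaming.dstream.DStream

// Join two keyed streams on subscriber_id and write each batch window
// to a new HDFS directory as gzip-compressed part files.
def joinAndWrite(impressions: DStream[(String, String)],
                 profiles: DStream[(String, String)],
                 outputRoot: String): Unit = { // outputRoot is a placeholder path
  impressions.join(profiles).foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {
      // New directory per batch window, e.g. .../batch-1500000000000
      val dir = s"$outputRoot/batch-${time.milliseconds}"
      rdd.map { case (subscriberId, (imp, prof)) =>
        s"$subscriberId,$imp,$prof"
      }.saveAsTextFile(dir, classOf[GzipCodec])
    }
  }
}
```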
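The multi-threaded copy to the OSP could look like the following sketch, which pushes each finished gzip file over SFTP from a small thread pool using the JSch client. The host, credentials, remote directory, and pool size are placeholders, and, as noted above, the SFTP-vs-SCP choice still needs testing:

```scala
import java.util.concurrent.Executors

import com.jcraft.jsch.{ChannelSftp, JSch}

// Copy a list of local gzip files to the OSP over SFTP, one thread per transfer.
// Host, user, password, and remote directory are hypothetical placeholders.
def copyToOsp(files: Seq[String], remoteDir: String): Unit = {
  val pool = Executors.newFixedThreadPool(4) // degree of parallelism to be tuned
  files.foreach { local =>
    pool.submit(new Runnable {
      override def run(): Unit = {
        val session = new JSch().getSession("loader", "osp.example.com", 22)
        session.setPassword("secret") // placeholder credentials
        session.setConfig("StrictHostKeyChecking", "no")
        session.connect()
        try {
          val sftp = session.openChannel("sftp").asInstanceOf[ChannelSftp]
          sftp.connect()
          try sftp.put(local, remoteDir) finally sftp.disconnect()
        } finally session.disconnect()
      }
    })
  }
  pool.shutdown()
}
```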