As reporting tools and related technologies evolve, the need to process data in near real time keeps growing. The spread of mobile technology, the heavy use of social media applications, the rise of location-based services, and the ever-increasing rate of daily data production all confirm that large-scale data processing is gaining importance.
One of the most widely used platforms for big data processing is Apache Hadoop. Briefly, Hadoop is a distributed computing framework with two main components: HDFS and MapReduce. HDFS handles data storage across multiple machines, whereas MapReduce handles the processing of that data across the machines of the cluster. Because MapReduce jobs are batch-oriented and take a significant amount of time to execute, they fall short for real-time data processing. This gap is filled by streaming analytics platforms.
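To make the MapReduce model concrete, here is a minimal single-process sketch in Python (not Hadoop's actual Java API): the map phase emits key/value pairs, the framework groups them by key (the shuffle), and the reduce phase aggregates each group. The function names and the word-count task are illustrative assumptions, not part of Hadoop itself.

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: emit (word, 1) for every word, as in the classic word count.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: sum the counts emitted for one word.
    return key, sum(values)

def run(documents):
    # Drive the three phases over a collection of documents.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = run(["big data needs big clusters", "data data everywhere"])
print(counts["data"])  # -> 3
```

On a real cluster, Hadoop runs many mapper and reducer tasks in parallel on different machines and persists intermediate results to disk, which is exactly why a full MapReduce job takes noticeable time to complete.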