Episode 7: Demystifying the buzzwords in Big Data

This week I made the following notes while taking the “Big Data Essentials: HDFS, MapReduce and Spark RDD” course on Coursera [1].

Concepts:

Source: DFS, HDFS, Architecture, Scaling problem

In 2003, Google published a paper on the “Google File System” (GFS) [2]. In it, they described the architecture of a:

  1. scalable
  2. distributed file system
  3. with high fault tolerance
  4. using inexpensive commodity hardware.

In 2005, the Apache Hadoop project developed the Hadoop Distributed File System (HDFS), an open-source implementation of GFS.
While GFS was originally written in C++, HDFS is written in Java. Both are designed to store petabytes of data across multiple nodes and can be used for data-intensive computing.
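
HDFS realizes the fault-tolerance point above through block replication: each block of a file is stored on several DataNodes, controlled by the standard dfs.replication property. A minimal hdfs-site.xml sketch (3 is the usual default, shown here purely for illustration):

    <!-- hdfs-site.xml: store each HDFS block on 3 separate DataNodes,
         so losing any single commodity machine does not lose data -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>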

Source: Setting up Hadoop Cluster on Amazon Cloud

A cluster is a network of computers. A Hadoop cluster is typically composed of three parts [3]: a master node (running the HDFS NameNode, which keeps the file system metadata), worker nodes (running DataNodes, which store the actual blocks of data), and client machines (which load data into the cluster and submit jobs).

Source: UC Davis Bioinformatics Core December 2018 Genome Assembly workshop

GFS (and HDFS after it) is based on a data usage pattern called “write-once-read-many” (WORM), which suits the kind of large-scale data collection Google does and also simplifies the file system’s API.
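
As a minimal sketch of this pattern, the hypothetical Java snippet below writes a file to HDFS once and then reads it back using Hadoop’s standard FileSystem API; the NameNode address and path are illustrative assumptions:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WormExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed local NameNode
            try (FileSystem fs = FileSystem.get(conf)) {
                Path path = new Path("/user/demo/hello.txt");  // hypothetical path

                // Write once: HDFS files are created and written sequentially;
                // existing bytes cannot be randomly overwritten in place.
                try (FSDataOutputStream out = fs.create(path, true)) {
                    out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
                }

                // Read many: any number of readers can stream the file back.
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                    System.out.println(in.readLine());
                }
            }
        }
    }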

In a shared-disk architecture, every node in the cluster has access to the same data stored on a shared disk, and each node has read/write access to it [6].

Source: Parallel Database Systems: The Future of High Performance Database Systems

Thought of the Week:

A couple of days ago I came across an article shared by Yann LeCun (the father of Convolutional Neural Networks (CNNs) and a professor at NYU), titled “The Most Important Court Decision For Data Science and Machine Learning” by Matthew Stewart [4]. The article gives a quick rundown of the famous Authors Guild v. Google case, which dragged on for more than ten years and became the most important case study in the field of artificial intelligence and machine learning. In 2005, the Authors Guild and a group of major US publishers sued Google for using copyrighted books to train its book-search algorithms.

Google Books’ search algorithm is a highly sophisticated machine-learning system for optimizing book search for users.
Recently, I searched for a couple of titles to understand how the algorithm actually works. When I entered general keywords like “lab” or “blood”, it did not rank books purely by word frequency in the existing database of books (that would simply have given me a list of medical textbooks); instead, the algorithm also took into account the popularity of the books in other web searches, which builds on Google’s PageRank algorithm. For example, for “blood”, the book “Bad Blood” by John Carreyrou was among the top results. Google is said to have scanned more than 15 million books and put them on the web [5].
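
To make the PageRank idea concrete, here is a toy power-iteration sketch in Java over a hypothetical four-page link graph; the graph, damping factor, and iteration count are illustrative assumptions, not Google’s actual system:

    import java.util.Arrays;

    public class ToyPageRank {
        public static void main(String[] args) {
            // links[i] lists the pages that page i links to (four hypothetical pages)
            int[][] links = { {1, 2}, {2}, {0}, {0, 2} };
            int n = links.length;
            double d = 0.85;                 // damping factor from the original PageRank paper
            double[] rank = new double[n];
            Arrays.fill(rank, 1.0 / n);      // start from a uniform distribution

            for (int iter = 0; iter < 50; iter++) {
                double[] next = new double[n];
                Arrays.fill(next, (1.0 - d) / n);
                for (int i = 0; i < n; i++) {
                    double share = d * rank[i] / links[i].length;
                    for (int target : links[i]) {
                        next[target] += share; // each page splits its rank among its outlinks
                    }
                }
                rank = next;
            }
            System.out.println(Arrays.toString(rank)); // highest rank = most "popular" page
        }
    }

The real ranking of course combines many more signals, but this is the core idea: a page’s (or book’s) importance is derived from the importance of the pages linking to it.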

Anyhow, the end result of the famous case between the Authors Guild and Google was that the Supreme Court declined to hear the final appeal in 2016, leaving the lower courts’ fair-use rulings in Google’s favor in place and officially letting Google off the hook. The ruling helped establish some important precedents for artificial intelligence, such as affirming that the “use of copyrighted material … to train a discriminative machine-learning algorithm (such as for search purposes) is legal” [4]. How that would fare for generative machine learning, for example models used to produce deep fakes, is still an open question.

Until next time!

References:

[1] Big Data Essentials: HDFS, MapReduce and Spark RDD (Coursera)
[2] The Google File System
[3] Analyzing Google File System and Hadoop Distributed File System
[4] The Most Important Court Decision For Data Science and Machine Learning
[5] Inside the Google Books Algorithm
[6] Shared Nothing v.s. Shared Disk Architectures: An Independent View
[7] Difference Between Memory and Storage in Computers
