HADOOP:
Hadoop has two major components HDFS and MapReduce. HDFS is storage and
MapReduce is programming framework.
Hadoop is a framework which allows us to perform perform parallel and
distributed computations on large data sets. Hadoop has fundamentally two units
- storage and processing. You get HDFS as a storage service with Hadoop to
store large data sets. Basically, HDFS is a distributed file system that allows
you to store large file across the Hadoop cluster.
Hadoop Common: The common utilities that support the other
Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides
high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster
resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of
large data sets.
APACHE
SPARK:
Spark is another execution framework. Like MapReduce, it works with the
file system to distribute your data across the cluster, and process that data
in parallel. Like MapReduce, it also takes a set of instructions from an
application written by a developer. MapReduce was generally coded from Java;
Spark supports not only Java, but also Python and Scala, which is a newer
language that contains some attractive properties for manipulating data.
What are the Spark use cases?
Databricks (a company founded by the creators of Apache Spark) lists the following
cases for Spark:
- Data integration and ETL
- Interactive analytics or business intelligence
- High performance batch computation
- Machine learning and advanced analytics
- Real-time stream processingSCALA:Scala is an acronym for “Scalable Language”. To some, Scala feels like a scripting language. Its syntax is concise and low ceremony; its types get out of the way because the compiler can infer them.Scala is a pure-bred object-oriented language. Conceptually, every value is an object and every operation is a method-call.The language supports advanced component architectures through classes and traits.Many traditional design patterns in other languages are already natively supported. For instance, singletons are supported through object definitions and visitors are supported through pattern matching.Using implicit classes, Scala even allows you to add new operations to existing classes, no matter whether they come from Scala or Java!HIVE:Apache Hive is considered the defacto standard for interactive SQL queries over petabytes of data in Hadoop.Hadoop was built to organize and store massive amounts of data of all shapes, sizes and formats. Because of Hadoop’s “schema on read” architecture, a Hadoop cluster is a perfect reservoir of heterogeneous data—structured and unstructured—from a multitude of sources.Data analysts use Hive to query, summarize, explore and analyze that data, then turn it into actionable business insight.
Feature
|
Description
|
Familiar
|
Query data with a SQL-based
language
|
Fast
|
Interactive response times,
even over huge datasets
|
Scalable and Extensible
|
As data variety and volume
grows, more commodity machines can be added, without a corresponding
reduction in performance
|
Compatible
|
Works with
traditional data integration and data analytics tools.
|
- The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases are comprised of tables, which are made up of partitions. Data can be accessed via a simple query language and Hive supports overwriting or appending data.
- Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.
- Hive supports all the common primitive data formats such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays.MONGO DB:
- MongoDB stores data in flexible, JSON-like documents, meaning fields can vary from document to document and data structure can be changed over time
- The document model maps to the objects in your application code, making data easy to work with
- Ad hoc queries, indexing, and real time aggregation provide powerful ways to access and analyze your data
- MongoDB is a distributed database at its core, so high availability, horizontal scaling, and geographic distribution are built in and easy to use
- MongoDB is free and open-source, published under the GNU Affero General Public License