This program follows a set structure with 6 core courses and 3 electives spread across 29 weeks. It makes you an expert in key technologies related to the Big Data ecosystem. At the end of each core course, you will work on a real-time project to gain hands-on expertise. By the end of the program, you will be ready for seasoned Big Data job roles.
Big Data Architect Masters Program
1. Java Essentials
Introduction to Java
Goal:
In this module, you will learn about Java architecture and the advantages of Java, and develop code using various data types, conditions, and loops.
Objectives:
At the end of this module, you will be able to
• Understand the advantages of Java
• Understand where Java is used
• Understand how memory management is handled in Java
• Create a Java project in Eclipse and execute it
• Implement if..else construct in Java
• Develop codes using various data types in Java
• Implement various loops
Topics:
• Introduction to Java
• Bytecode
• Class Files
• Compilation Process
• Data types and Operations
• If conditions
• Loops - for, while and do while
Hands On/Demo:
• Data Types and Operations
• if Condition
• for..loop
• while..loop
• do..while loop
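For reference, here is a minimal, self-contained Java sketch that ties the hands-on items above together: data types and operations, an if..else condition, and the for, while and do..while loops. The class and variable names are illustrative and not taken from the course material.

```java
public class BasicsDemo {
    public static void main(String[] args) {
        // Data types and operations
        int count = 5;
        double price = 19.99;
        boolean inStock = true;

        // if..else condition
        if (inStock && count > 0) {
            System.out.println("Total: " + (count * price));
        } else {
            System.out.println("Out of stock");
        }

        // for loop
        for (int i = 1; i <= 3; i++) {
            System.out.println("for iteration " + i);
        }

        // while loop
        int w = 0;
        while (w < 3) {
            System.out.println("while iteration " + w);
            w++;
        }

        // do..while loop: body runs at least once
        int d = 0;
        do {
            System.out.println("do..while iteration " + d);
            d++;
        } while (d < 3);
    }
}
```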
Data Handling and Functions
Goal:
In this module, you will learn how to code with arrays, functions and strings using examples and programs.
Objectives:
At the end of this module, you will be able to
• Implement Single and Multi-dimensional array
• Declare and Define Functions
• Call Functions by value and by reference
• Implement Method Overloading
• Use the String data type and the StringBuffer class
Topics:
• Arrays - Single Dimensional and Multidimensional arrays
• Functions
• Function with Arguments
• Function Overloading
• Concept of Static Polymorphism
• String Handling: String and StringBuffer Classes
Hands On/Demo:
• Declaring the arrays
• Accepting data for the arrays
• Calling functions that take arguments, searching the array, and displaying the matching record
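A small illustrative sketch of this module's hands-on items: single- and multi-dimensional arrays, a search function called with arguments, method overloading (static polymorphism), and StringBuffer. All names and sample values are hypothetical.

```java
public class DataHandlingDemo {
    // Method overloading (static polymorphism): same name, different parameter lists
    static int search(int[] data, int key) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == key) return i;
        }
        return -1;
    }

    static int search(String[] data, String key) {
        for (int i = 0; i < data.length; i++) {
            if (data[i].equals(key)) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Single and multi-dimensional arrays
        int[] ids = {101, 102, 103};
        int[][] matrix = {{1, 2}, {3, 4}};

        System.out.println("103 found at index " + search(ids, 103));
        System.out.println("matrix[1][0] = " + matrix[1][0]);

        // String is immutable; StringBuffer is mutable
        String name = "Big";
        StringBuffer sb = new StringBuffer(name);
        sb.append(" Data");
        System.out.println(sb.toString());
    }
}
```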
Object Oriented Programming in Java
Goal:
In this module, you will learn object-oriented programming in Java using classes and objects, along with concepts like abstract, final and static.
Objectives:
At the end of this module, you will be able to
• Implement classes and objects in Java
• Create class constructors
• Overload constructors
• Inherit classes and create sub-classes
• Implement abstract classes and methods
• Use static keyword
Topics:
• OOPS in Java:
o Concept of Object Orientation
o Attributes and Methods
o Classes and Objects
• Methods and Constructors
o Default Constructors
o Constructors with Arguments
o Inheritance
o Abstract
o Final and Static
Hands On/Demo:
• Inheritance
• Overloading
• Overriding
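The following hedged sketch shows the module's core ideas in one place: an abstract class with a constructor, the final and static keywords, inheritance, and method overriding. The class names are illustrative only.

```java
// Abstract base class with a constructor and an abstract method
abstract class Shape {
    protected final String name;      // final: assigned once in the constructor
    static int shapesCreated = 0;     // static: shared across all instances

    Shape(String name) {
        this.name = name;
        shapesCreated++;
    }

    abstract double area();           // must be overridden by subclasses
}

// Inheritance: Circle extends Shape and overrides area()
class Circle extends Shape {
    private final double radius;

    Circle(double radius) {
        super("circle");
        this.radius = radius;
    }

    @Override
    double area() {
        return Math.PI * radius * radius;
    }
}

public class OopsDemo {
    public static void main(String[] args) {
        Shape s = new Circle(2.0);    // upcasting: a Circle is-a Shape
        System.out.println(s.name + " area = " + s.area());
        System.out.println("Shapes created: " + Shape.shapesCreated);
    }
}
```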
Packages and Multi-Threading
Goal:
In this module, you will learn about packages in Java and scope specifiers of Java. You will also learn exception handling and how multi-threading works in Java.
Objectives:
At the end of this module, you will be able to
• Implement interface and use it
• Extend interface with other interface
• Create a package and name it; import packages while creating a new class
• Understand various exceptions
• Handle exception using try catch block
• Handle exception using throw and throws keyword
• Implement threads using thread class and runnable interface
• Understand and implement multithreading
Topics:
• Packages and Interfaces
• Access Specifiers
• Package
• Exception Handling
• Multi-Threading
Hands On/Demo:
• Interfaces
• Packages
• Exception
• Thread
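As an illustration of this module's exception handling and multi-threading topics, here is a minimal sketch using a user-defined interface, a try..catch block, and threads created via the Runnable interface. All names are hypothetical.

```java
// A small interface that a worker implementation can satisfy
interface Task {
    void execute() throws Exception;
}

public class ThreadingDemo {
    public static void main(String[] args) throws InterruptedException {
        Task risky = () -> {
            throw new IllegalStateException("simulated failure");
        };

        // Exception handling with try..catch
        try {
            risky.execute();
        } catch (Exception e) {
            System.out.println("Caught: " + e.getMessage());
        }

        // Multi-threading using the Runnable interface
        Runnable worker = () -> System.out.println(
                "Running in " + Thread.currentThread().getName());
        Thread t1 = new Thread(worker, "worker-1");
        Thread t2 = new Thread(worker, "worker-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```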
Collections
Goal:
In this module, you will learn how to write code with Wrapper classes, Inner classes and Applet programs, and how to use the java.io, java.lang and java.util packages along with the Collections framework.
Objectives:
At the end of this module, you will be able to
• Identify and use important built-in Java packages like java.lang, java.io, java.util, etc.
• Use Wrapper classes
• Understand collections framework
• Implement logic using ArrayList and Vector and Queue
• Use set, HashSet and TreeSet
• Implement logic using Map, HashMap and Hashtable
Topics:
• Wrapper Classes and Inner Classes: Integer, Character, Boolean, Float etc.
• Applet Programs: how to write UI programs with Applet; java.lang, java.io, java.util
• Collections: ArrayList, Vector, HashSet, TreeSet, HashMap, HashTable.
Hands On/Demo:
• Wrapper class
• Collection
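A short sketch of the Collections framework topics above, using wrapper classes, ArrayList, HashSet and HashMap. The sample values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CollectionsDemo {
    public static void main(String[] args) {
        // Wrapper classes: primitives boxed into objects
        Integer boxed = Integer.valueOf(42);
        int unboxed = boxed;                      // auto-unboxing

        // ArrayList: ordered, allows duplicates
        List<String> tools = new ArrayList<>();
        tools.add("Hadoop");
        tools.add("Spark");
        tools.add("Hadoop");

        // HashSet: removes duplicates
        Set<String> unique = new HashSet<>(tools);

        // HashMap: key-value lookups
        Map<String, Integer> releaseYear = new HashMap<>();
        releaseYear.put("Hadoop", 2006);
        releaseYear.put("Spark", 2014);

        System.out.println(unboxed + " " + tools + " " + unique);
        System.out.println("Spark released in " + releaseYear.get("Spark"));
    }
}
```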
2. Big Data Hadoop Certification Training
Understanding Big Data and Hadoop
Learning Objectives: In this module, you will understand what Big Data is, the limitations of the traditional solutions for Big Data problems, how Hadoop solves those Big Data problems, Hadoop Ecosystem, Hadoop Architecture, HDFS, Anatomy of File Read and Write & how MapReduce works.
Topics:
- Introduction to Big Data & Big Data Challenges
- Limitations & Solutions of Big Data Architecture
- Hadoop & its Features
- Hadoop Ecosystem
- Hadoop 2.x Core Components
- Hadoop Storage: HDFS (Hadoop Distributed File System)
- Hadoop Processing: MapReduce Framework
- Different Hadoop Distributions
Hadoop Architecture and HDFS
Learning Objectives: In this module, you will learn Hadoop Cluster Architecture, important configuration files of Hadoop Cluster, Data Loading Techniques using Sqoop & Flume, and how to setup Single Node and Multi-Node Hadoop Cluster.
Topics:
- Hadoop 2.x Cluster Architecture
- Federation and High Availability Architecture
- Typical Production Hadoop Cluster
- Hadoop Cluster Modes
- Common Hadoop Shell Commands
- Hadoop 2.x Configuration Files
- Single Node Cluster & Multi-Node Cluster set up
- Basic Hadoop Administration
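The module's data loading is done with the Hadoop shell commands, Sqoop and Flume; purely as a companion sketch, the same basic HDFS operations can also be issued from Java through the FileSystem API. The paths below are illustrative, and the code assumes a reachable cluster configured via core-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS and other settings from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Roughly equivalent to: hadoop fs -put data.txt /user/edureka/data.txt
        fs.copyFromLocalFile(new Path("data.txt"),
                             new Path("/user/edureka/data.txt"));

        // Roughly equivalent to: hadoop fs -ls /user/edureka
        for (FileStatus status : fs.listStatus(new Path("/user/edureka"))) {
            System.out.println(status.getPath() + " " + status.getLen() + " bytes");
        }

        fs.close();
    }
}
```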
Hadoop MapReduce Framework
Learning Objectives: In this module, you will understand the Hadoop MapReduce framework comprehensively and how MapReduce works on data stored in HDFS. You will also learn advanced MapReduce concepts like Input Splits, Combiner & Partitioner.
Topics:
- Traditional way vs MapReduce way
- Why MapReduce
- YARN Components
- YARN Architecture
- YARN MapReduce Application Execution Flow
- YARN Workflow
- Anatomy of MapReduce Program
- Input Splits, Relation between Input Splits and HDFS Blocks
- MapReduce: Combiner & Partitioner
- Demo of Health Care Dataset
- Demo of Weather Dataset
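For orientation, here is the classic WordCount job as a minimal sketch of the Mapper, Reducer (reused as a Combiner) and driver described above, written against the Hadoop 2.x MapReduce API. The input and output paths are passed on the command line; this is an illustrative sketch, not the course's demo code.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: emits one (word, 1) pair per token in its input split
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word; also usable as a map-side Combiner
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner runs on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```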
Advanced Hadoop MapReduce
Learning Objectives: In this module, you will learn Advanced MapReduce concepts such as Counters, Distributed Cache, MRUnit, Reduce Join, Custom Input Format, Sequence Input Format and XML parsing.
Topics:
- Counters
- Distributed Cache
- MRUnit
- Reduce Join
- Custom Input Format
- Sequence Input Format
- XML file Parsing using MapReduce
Apache Pig
Learning Objectives: In this module, you will learn about Apache Pig, the types of use cases where Pig can be used, the tight coupling between Pig and MapReduce, Pig Latin scripting, Pig running modes, Pig UDFs, Pig Streaming & testing Pig scripts. You will also be working on a healthcare dataset.
Topics:
- Introduction to Apache Pig
- MapReduce vs Pig
- Pig Components & Pig Execution
- Pig Data Types & Data Models in Pig
- Pig Latin Programs
- Shell and Utility Commands
- Pig UDF & Pig Streaming
- Testing Pig scripts with PigUnit
- Aviation use-case in PIG
- Pig Demo of Healthcare Dataset
Apache Hive
Learning Objectives: This module will help you in understanding Hive concepts, Hive Data types, loading and querying data in Hive, running hive scripts and Hive UDF.
Topics:
- Introduction to Apache Hive
- Hive vs Pig
- Hive Architecture and Components
- Hive Metastore
- Limitations of Hive
- Comparison with Traditional Database
- Hive Data Types and Data Models
- Hive Partition
- Hive Bucketing
- Hive Tables (Managed Tables and External Tables)
- Importing Data
- Querying Data & Managing Outputs
- Hive Script & Hive UDF
- Retail use case in Hive
- Hive Demo on Healthcare Dataset
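The course runs HiveQL from the Hive shell and Hive scripts; as a hedged companion example, the same queries can also be submitted from Java over the HiveServer2 JDBC interface. The connection URL, table name and HDFS location below are assumptions for illustration only.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port and database are illustrative
        String url = "jdbc:hive2://localhost:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement()) {

            // External table over data already sitting in HDFS (hypothetical path)
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS retail_sales ("
                    + "txn_id STRING, product STRING, amount DOUBLE) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                    + "LOCATION '/user/edureka/retail'");

            // Query the table and print the aggregated results
            ResultSet rs = stmt.executeQuery(
                    "SELECT product, SUM(amount) FROM retail_sales GROUP BY product");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
            }
        }
    }
}
```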
Advanced Apache Hive and HBase
Learning Objectives: In this module, you will understand advanced Apache Hive concepts such as UDF, Dynamic Partitioning, Hive indexes and views, and optimizations in Hive. You will also acquire in-depth knowledge of Apache HBase, HBase Architecture, HBase running modes and its components.
Topics:
- Hive QL: Joining Tables, Dynamic Partitioning
- Custom MapReduce Scripts
- Hive Indexes and views
- Hive Query Optimizers
- Hive Thrift Server
- Hive UDF
- Apache HBase: Introduction to NoSQL Databases and HBase
- HBase v/s RDBMS
- HBase Components
- HBase Architecture
- HBase Run Modes
- HBase Configuration
- HBase Cluster Deployment
Advanced Apache HBase
Learning Objectives: This module will cover advanced Apache HBase concepts. We will see demos on HBase Bulk Loading & HBase Filters. You will also learn what Zookeeper is all about, how it helps in monitoring a cluster & why HBase uses Zookeeper.
Topics:
- HBase Data Model
- HBase Shell
- HBase Client API
- HBase Data Loading Techniques
- Apache Zookeeper Introduction
- ZooKeeper Data Model
- Zookeeper Service
- HBase Bulk Loading
- Getting and Inserting Data
- HBase Filters
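A minimal sketch of the HBase Client API mentioned above. It assumes an HBase table named `patients` with a column family `info` already exists (both names are hypothetical) and that hbase-site.xml is on the classpath so the ZooKeeper quorum can be discovered.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientDemo {
    public static void main(String[] args) throws Exception {
        // Picks up the ZooKeeper quorum and other settings from hbase-site.xml
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("patients"))) {

            // Insert a cell: row key, column family, qualifier, value
            Put put = new Put(Bytes.toBytes("patient-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("A. Sharma"));
            table.put(put);

            // Read the row back
            Result result = table.get(new Get(Bytes.toBytes("patient-001")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```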
Processing Distributed Data with Apache Spark
Learning Objectives: In this module, you will learn what Apache Spark is, along with SparkContext & the Spark Ecosystem. You will learn how to work with Resilient Distributed Datasets (RDDs) in Apache Spark. You will be running applications on a Spark cluster & comparing the performance of MapReduce and Spark.
Topics:
- What is Spark
- Spark Ecosystem
- Spark Components
- What is Scala
- Why Scala
- SparkContext
- Spark RDD
Oozie and Hadoop Project
Learning Objectives: In this module, you will understand how multiple Hadoop ecosystem components work together to solve Big Data problems. This module will also cover Flume & Sqoop demo, Apache Oozie Workflow Scheduler for Hadoop Jobs, and Hadoop Talend integration.
Topics:
- Oozie
- Oozie Components
- Oozie Workflow
- Scheduling Jobs with Oozie Scheduler
- Demo of Oozie Workflow
- Oozie Coordinator
- Oozie Commands
- Oozie Web Console
- Oozie for MapReduce
- Combining flow of MapReduce Jobs
- Hive in Oozie
- Hadoop Project Demo
- Hadoop Talend Integration
Certification Project
1) Analysis of an Online Book Store
A. Find out the frequency of books published each year. (Hint: A sample dataset will be provided)
B. Find out in which year the maximum number of books were published
C. Find out how many books were published based on ranking in the year 2002.
Sample Dataset Description
The Book-Crossing dataset consists of 3 tables that will be provided to you.
2) Airlines Analysis
A. Find the list of airports operating in India
B. Find the list of airlines having zero stops
C. List the airlines operating with code share
D. Find which country (or territory) has the highest number of airports
E. Find the list of active airlines in the United States
Sample Dataset Description
In this use case, there are 3 datasets: Final_airlines, routes.dat and airports_mod.dat
3. Apache Spark and Scala Certification Training
Introduction to Big Data Hadoop and Spark
Learning Objectives:
- Understand Big Data and its components such as HDFS. You will learn about the Hadoop Cluster Architecture, get an introduction to Spark, and understand the difference between batch processing and real-time processing.
Topics:
- What is Big Data?
- Big Data Customer Scenarios
- Limitations and Solutions of Existing Data Analytics Architecture with Uber Use Case
- How Hadoop Solves the Big Data Problem?
- What is Hadoop?
- Hadoop’s Key Characteristics
- Hadoop Ecosystem and HDFS
- Hadoop Core Components
- Rack Awareness and Block Replication
- YARN and its Advantage
- Hadoop Cluster and its Architecture
- Hadoop: Different Cluster Modes
- Big Data Analytics with Batch & Real-time Processing
- Why Spark is needed?
- What is Spark?
- How Spark differs from other frameworks?
- Spark at Yahoo!
Introduction to Scala for Apache Spark
Learning Objectives:
- Learn the basics of Scala that are required for programming Spark applications. You will also learn about the basic constructs of Scala such as variable types, control structures, collections such as Array, ArrayBuffer, Map, Lists, and many more.
Topics:
- What is Scala?
- Why Scala for Spark?
- Scala in other Frameworks
- Introduction to Scala REPL
- Basic Scala Operations
- Variable Types in Scala
- Control Structures in Scala
- Foreach loop, Functions and Procedures
- Collections in Scala: Array, ArrayBuffer, Map, Tuples, Lists, and more
Hands-on:
- Scala REPL Detailed Demo
Functional Programming and OOPs Concepts in Scala
Learning Objectives:
- In this module, you will learn about object-oriented programming and functional programming techniques in Scala.
Topics:
- Functional Programming
- Higher Order Functions
- Anonymous Functions
- Class in Scala
- Getters and Setters
- Custom Getters and Setters
- Properties with only Getters
- Auxiliary Constructor and Primary Constructor
- Singletons
- Extending a Class
- Overriding Methods
- Traits as Interfaces and Layered Traits
Hands-on:
- OOPs Concepts
- Functional Programming
Deep Dive into Apache Spark Framework
Learning Objectives:
- Understand Apache Spark and learn how to develop Spark applications. At the end, you will learn how to perform data ingestion using Sqoop.
Topics:
- Spark’s Place in Hadoop Ecosystem
- Spark Components & its Architecture
- Spark Deployment Modes
- Introduction to Spark Shell
- Writing your first Spark Job Using SBT
- Submitting Spark Job
- Spark Web UI
- Data Ingestion using Sqoop
Hands-on:
- Building and Running Spark Application
- Spark Application Web UI
- Configuring Spark Properties
- Data ingestion using Sqoop
Playing with Spark RDDs
Learning Objectives:
- Get an insight into Spark RDDs and other RDD-related manipulations for implementing business logic (Transformations, Actions and Functions performed on RDDs).
Topics:
- Challenges in Existing Computing Methods
- Probable Solution & How RDD Solves the Problem
- What is RDD, It’s Operations, Transformations & Actions
- Data Loading and Saving Through RDDs
- Key-Value Pair RDDs
- Other Pair RDDs, Two Pair RDDs
- RDD Lineage
- RDD Persistence
- WordCount Program Using RDD Concepts
- RDD Partitioning & How It Helps Achieve Parallelization
- Passing Functions to Spark
Hands-on:
- Loading data in RDDs
- Saving data through RDDs
- RDD Transformations
- RDD Actions and Functions
- RDD Partitions
- WordCount through RDDs
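The course implements these RDD exercises in Scala; for consistency with the Java examples elsewhere in this program, here is a hedged WordCount sketch using Spark's Java RDD API. The HDFS paths are illustrative.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RddWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddWordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load data into an RDD (input path is illustrative)
        JavaRDD<String> lines = sc.textFile("hdfs:///user/edureka/input.txt");

        // Transformations are lazy; nothing runs until an action is invoked
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        // Action: save the result back to HDFS
        counts.saveAsTextFile("hdfs:///user/edureka/wordcount-output");
        sc.stop();
    }
}
```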
DataFrames and Spark SQL
Learning Objectives:
- In this module, you will learn about Spark SQL, which is used to process structured data with SQL queries. You will cover data frames and datasets in Spark SQL along with the different kinds of SQL operations performed on data frames. You will also learn about Spark and Hive integration.
Topics:
- Need for Spark SQL
- What is Spark SQL?
- Spark SQL Architecture
- SQL Context in Spark SQL
- User Defined Functions
- Data Frames & Datasets
- Interoperating with RDDs
- JSON and Parquet File Formats
- Loading Data through Different Sources
- Spark – Hive Integration
Hands-on:
- Spark SQL – Creating Data Frames
- Loading and Transforming Data through Different Sources
- Stock Market Analysis
- Spark-Hive Integration
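Again hedged into Java rather than the course's Scala, a minimal Spark SQL sketch: creating a DataFrame from a JSON source, registering a temporary view, querying it with SQL, and writing Parquet. The file names and column names (symbol, close) are assumptions for illustration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SparkSqlDemo")
                .master("local[*]")
                .getOrCreate();

        // Create a DataFrame from a JSON source (file path is illustrative)
        Dataset<Row> stocks = spark.read().json("stocks.json");
        stocks.printSchema();

        // Register the DataFrame as a temporary view and query it with SQL
        stocks.createOrReplaceTempView("stocks");
        Dataset<Row> top = spark.sql(
                "SELECT symbol, AVG(close) AS avg_close "
                + "FROM stocks GROUP BY symbol ORDER BY avg_close DESC LIMIT 10");
        top.show();

        // Write the result out in Parquet format
        top.write().mode("overwrite").parquet("top_stocks.parquet");
        spark.stop();
    }
}
```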
Machine Learning using Spark MLlib
Learning Objectives:
- Learn why machine learning is needed, different Machine Learning techniques/algorithms, and Spark MLlib.
Topics:
- Why Machine Learning?
- What is Machine Learning?
- Where Machine Learning is Used?
- Face Detection: USE CASE
- Different Types of Machine Learning Techniques
- Introduction to MLlib
- Features of MLlib and MLlib Tools
- Various ML algorithms supported by MLlib
Deep Dive into Spark MLlib
Learning Objectives:
- Implement various algorithms supported by MLlib such as Linear Regression, Decision Tree, Random Forest and many more.
Topics:
- Supervised Learning - Linear Regression, Logistic Regression, Decision Tree, Random Forest
- Unsupervised Learning - K-Means Clustering & How It Works with MLlib
- Analysis on US Election Data using MLlib (K-Means)
Hands-on:
- Machine Learning MLlib
- K- Means Clustering
- Linear Regression
- Logistic Regression
- Decision Tree
- Random Forest
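As a rough illustration of the K-Means analysis above (the course uses Scala and the US election dataset; the CSV file and feature column names below are placeholders), here is a hedged Spark ML K-Means sketch in Java.

```java
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KMeansSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("KMeansSketch").master("local[*]").getOrCreate();

        // Load a CSV of numeric features (file and column names are illustrative)
        Dataset<Row> raw = spark.read()
                .option("header", "true").option("inferSchema", "true")
                .csv("election_data.csv");

        // Assemble the numeric columns into a single feature vector column
        Dataset<Row> data = new VectorAssembler()
                .setInputCols(new String[]{"feature1", "feature2"})
                .setOutputCol("features")
                .transform(raw);

        // Fit a K-Means model with 3 clusters
        KMeansModel model = new KMeans().setK(3).setSeed(1L).fit(data);
        for (Vector center : model.clusterCenters()) {
            System.out.println("cluster center: " + center);
        }

        // Assign each row to its nearest cluster
        model.transform(data).select("features", "prediction").show(5);
        spark.stop();
    }
}
```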
Understanding Apache Kafka and Apache Flume
Learning Objectives:
- Understand Kafka and its architecture. Also, learn about Kafka Clusters and how to configure different types of Kafka Clusters. Get introduced to Apache Flume, its architecture, and how it is integrated with Apache Kafka for event processing. At the end, learn how to ingest streaming data using Flume.
Topics:
- Need for Kafka
- What is Kafka?
- Core Concepts of Kafka
- Kafka Architecture
- Where is Kafka Used?
- Understanding the Components of Kafka Cluster
- Configuring Kafka Cluster
- Kafka Producer and Consumer Java API
- Need of Apache Flume
- What is Apache Flume?
- Basic Flume Architecture
- Flume Sources
- Flume Sinks
- Flume Channels
- Flume Configuration
- Integrating Apache Flume and Apache Kafka
Hands-on:
- Configuring Single Node Single Broker Cluster
- Configuring Single Node Multi Broker Cluster
- Producing and consuming messages
- Flume Commands
- Setting up Flume Agent
- Streaming Twitter Data into HDFS
Apache Spark Streaming - Processing Multiple Batches
Learning Objectives:
- Work on Spark streaming which is used to build scalable fault-tolerant streaming applications. Also, learn about DStreams and various Transformations performed on the streaming data. You will get to know about commonly used streaming operators such as Sliding Window Operators and Stateful Operators.
Topics:
- Drawbacks in Existing Computing Methods
- Why Streaming is Necessary?
- What is Spark Streaming?
- Spark Streaming Features
- Spark Streaming Workflow
- How Uber Uses Streaming Data
- Streaming Context & DStreams
- Transformations on DStreams
- Describe Windowed Operators and why they are useful
- Important Windowed Operators
- Slice, Window and ReduceByWindow Operators
- Stateful Operators
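A hedged sketch of a DStream word count with a windowed operator, using Spark Streaming's Java API and a socket source instead of the course's data sources. The host, port and durations are illustrative (for example, feed the socket with `nc -lk 9999`).

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]");
        // Micro-batches of 10 seconds
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Text stream from a socket source (host and port are illustrative)
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        JavaDStream<String> words =
                lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());

        // Windowed operator: counts over the last 60 seconds, sliding every 10 seconds
        JavaPairDStream<String, Integer> windowedCounts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKeyAndWindow((a, b) -> a + b,
                        Durations.seconds(60), Durations.seconds(10));

        windowedCounts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```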
Apache Spark Streaming - Data Sources
Learning Objectives:
- In this module, you will learn about the different streaming data sources such as Kafka and Flume. At the end of the module, you will be able to create a Spark Streaming application.
Topics:
- Apache Spark Streaming: Data Sources
- Streaming Data Source Overview
- Apache Flume and Apache Kafka Data Sources
- Example: Using a Kafka Direct Data Source
- Perform Twitter Sentiment Analysis Using Spark Streaming
Hands-on:
- Different Streaming Data Sources
In-class Project
Learning Objectives:
- Work on an end-to-end Financial domain project covering all the major concepts of Spark taught during the course.
Spark GraphX (Self-Paced)
Learning Objectives:
- In this module, you will be learning the key concepts of Spark GraphX programming and operations along with different GraphX algorithms and their implementations.
4. Apache Cassandra Certification Training
Introduction to Big Data, and Cassandra
Goal: In this module, you will get a brief introduction to Big Data and how it creates problems for traditional database management systems (RDBMS). You will also learn how Cassandra solves these problems and understand Cassandra’s features.
Skills:
- Basic concepts of Cassandra
Objectives:
At the end of this module, you will be able to
- Explain what is Big Data
- List the Limitations of RDBMS
- Define NoSQL and it’s Characteristics
- Define CAP Theorem
- Learn Cassandra
- List the Features of Cassandra
- Get a Tour of Edureka’s VM
Topics:
- Introduction to Big Data and Problems caused by it
- 5V – Volume, Variety, Velocity, Veracity and Value
- Traditional Database Management System
- Limitations of RDBMS
- NoSQL databases
- Common characteristics of NoSQL databases
- CAP theorem
- How Cassandra solves the Limitations?
- History of Cassandra
- Features of Cassandra
Hands On:
- Edureka VM tour
Cassandra Data Model
Goal: In this module, you will learn about Database Model and similarities between RDBMS and Cassandra Data Model. You will also understand the key Database Elements of Cassandra and learn about the concept of Primary Key.
Skills:
- Data Modelling in Cassandra
- Data Structure Design
Objectives:
At the end of this module, you will be able to
- Explain what Database Modelling is and its Features
- Describe the Different Types of Data Models
- List the Difference between RDBMS and Cassandra Data Model
- Define Cassandra Data Model
- Explain Cassandra Database Elements
- Implement Keyspace Creation, Updating and Deletion
- Implement Table Creation, Updating and Deletion
Topics:
- Introduction to Database Model
- Understand the analogy between RDBMS and Cassandra Data Model
- Understand following Database Elements: Cluster, Keyspace, Column Family/Table, Column
- Column Family Options
- Columns
- Wide Rows, Skinny Rows
- Static and dynamic tables
Hands-On:
- Creating Keyspace
- Creating Tables
Cassandra Architecture
Goal: Gain knowledge of architecting and creating Cassandra Database Systems. In addition, learn about the complex inner workings of Cassandra such as Gossip Protocol, Read Repairs and so on.
Skills:
• Cassandra Architecture
Objectives: At the end of this module, you will be able to:
• Explain the Architecture of Cassandra
• Describe the Different Layers of Cassandra Architecture
• Learn about Gossip Protocol
• Describe Partitioning and Snitches
• Explain Vnodes and How Read and Write Path works
• Understand Compaction, Anti-Entropy and Tombstone
• Describe Repairs in Cassandra
• Explain Hinted Handoff
Topics:
• Cassandra as a Distributed Database
• Key Cassandra Elements
a. Memtable
b. Commit log
c. SSTables
• Replication Factor
• Data Replication in Cassandra
• Gossip protocol – Detecting failures
• Gossip: Uses
• Snitch: Uses
• Data Distribution
• Staged Event-Driven Architecture (SEDA)
• Managers and Services
• Virtual Nodes: Write path and Read path
• Consistency level
• Repair
• Incremental repair
Deep Dive into Cassandra Database
Goal: In this module you will learn about Keyspace and its attributes in Cassandra. You will also create Keyspace, learn how to create a Table and perform operations like Inserting, Updating and Deleting data from a table while using CQLSH.
Skills:
• Database Operations
• Table Operations
Objectives: At the end of this module, you will be able to:
• Describe Different Data Types Used in Cassandra
• Explain Collection Types
• Describe What are CRUD Operations
• Implement Insert, Select, Update and Delete of various elements
• Implement Various Functions Used in Cassandra
• Describe Importance of Roles and Indexing
• Understand tombstones in Cassandra
Topics:
• Replication Factor
• Replication Strategy
• Defining columns and data types
• Defining a partition key
• Recognizing a partition key
• Specifying a descending clustering order
• Updating data
• Tombstones
• Deleting data
• Using TTL
• Updating a TTL
Hands-on/Demo
• Create Keyspace in Cassandra
• Check Created Keyspace in System_Schema.Keyspaces
• Update Replication Factor of Previously Created Keyspace
• Drop Previously Created Keyspace
• Create A Table Using cqlsh
• Create A Table Using UUID & TIMEUUID
• Create A Table Using Collection & UDT Column
• Create Secondary Index On a Table
• Insert Data Into Table
• Insert Data into Table with UUID & TIMEUUID Columns
• Insert Data Using COPY Command
• Deleting Data from Table
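The hands-on above is driven from cqlsh; as a companion sketch only, the same CQL can be executed from Java with the DataStax Java driver (the 3.x API is assumed; the keyspace, table and sample data are made up).

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraCrudDemo {
    public static void main(String[] args) {
        // Contact point is illustrative; the driver defaults to port 9042
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Keyspace with SimpleStrategy and replication factor 1 (single-node setup)
        session.execute("CREATE KEYSPACE IF NOT EXISTS edureka WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");

        // Table with a UUID primary key, then an insert that also sets a TTL
        session.execute("CREATE TABLE IF NOT EXISTS edureka.users ("
                + "id uuid PRIMARY KEY, name text, city text)");
        session.execute("INSERT INTO edureka.users (id, name, city) "
                + "VALUES (uuid(), 'Asha', 'Pune') USING TTL 86400");

        // Read the rows back
        ResultSet rs = session.execute("SELECT id, name, city FROM edureka.users");
        for (Row row : rs) {
            System.out.println(row.getUUID("id") + " " + row.getString("name")
                    + " " + row.getString("city"));
        }

        session.close();
        cluster.close();
    }
}
```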
Node Operations in a Cluster
Goal: Learn how to add nodes in Cassandra and configure nodes using the “cassandra.yaml” file. Use Nodetool to remove a node and restore it back into service. In addition, using the Nodetool repair command, learn the importance of repair and how the repair operation functions.
Skills:
• Node Operations
Objectives: At the end of this module, you will be able to:
• Explain Cassandra Nodes
• Understand Seed Nodes
• Configure Seed Nodes using cassandra.yaml file
• Add/bootstrap a node in a Cluster
• Use Nodetool utility to decommission a node from the cluster
• Remove a Dead Node from a Cluster
• Describe the need to repair Nodes
• Use Nodetool repair command
Topics:
• Cassandra nodes
• Specifying seed nodes
• Bootstrapping a node
• Adding a node (Commissioning) in Cluster
• Removing (Decommissioning) a node
• Removing a dead node
• Repair
• Read Repair
• What’s new in incremental repair
• Run a Repair Operation
• Cassandra and Spark Implementation
Hands On:
• Commissioning a Node
• Decommissioning a Node
• Nodetool Commands
Managing and Monitoring the Cluster
Goal: The key aspects of monitoring Cassandra are the resources used by each node, response latencies to requests, requests to offline nodes, and the compaction process. In this module, learn to use various monitoring tools in Cassandra such as Nodetool and JConsole.
Skills:
• Clustering
Objectives: At the end of this module, you will be able to:
• Describe the various monitoring tools available
• Implement nodetool utility to manage a cluster
• Use JConsole to monitor JMX statistics
• Understand OpsCenter tool
Topics:
• Cassandra monitoring tools
• Logging
• Tailing
• Using Nodetool Utility
• Using JConsole
• Learning about OpsCenter
• Runtime Analysis Tools
Hands On:
• JMX and Jconsole
• OpsCenter
Backup & Restore and Performance Tuning
Goal: In this Module you will learn about the importance of Backup and Restore functions in Cassandra and Create Snapshots in Cassandra. You will learn about Hardware selection and Performance Tuning (Configuring Log Files) in Cassandra. You will also learn about Cassandra integration with various other frameworks.
Skills:
• Performance tuning
• Cassandra Design Principles
• Backup and Restoration
Objectives: At the end of this module, you’ll be able to:
• Learn backup and restore functionality and its importance
• Create a snapshot using Nodetool utility
• Restore a snapshot
• Understand how to choose the right balance of the following resources: memory, CPU, disks, number of nodes, and network.
• Understand all the logs created by Cassandra
• Explain the purpose of different log files
• Configure the log files
• Learn about Performance Tuning
• Integration with Spark and Kafka
Topics:
• Creating a Snapshot
• Restoring from a Snapshot
• RAM and CPU recommendations
• Hardware choices
• Selecting storage
• Types of Storage to Avoid
• Cluster connectivity, security and the factors that affect distributed system performance
• End-to-end performance tuning of Cassandra clusters against very large data sets
• Load balance and streams
Hands On:
• Creating Snapshots
• Integration with Kafka
• Integration with Spark
Hosting Cassandra Database on Cloud
Goal: In this Module you will learn about Design, Implementation, and on-going support of Cassandra Operational Data. Finally, you will learn how to Host a Cassandra Database on Cloud.
Skills:
• Security
• Design Implementation
• On-going support of Cassandra Operational Data
Objectives: At the end of this module, you’ll be able to:
• Security
• Learn about DataStax
• Create an End-to-End Project using Cassandra
• Implement a Cassandra Database on Cloud
Topics:
• Security
• Ongoing Support of Cassandra Operational Data
• Hosting a Cassandra Database on Cloud
Hands On:
• Hosting Cassandra Database on Amazon Web Services
5. Talend for Data Integration and Big Data
Talend – A Revolution in Big Data
Learning Objectives: In this module of the Talend Training, you will get an overview of ETL technologies and the reason why Talend is referred to as the next-generation leader in Big Data integration. You will be introduced to the various products offered by Talend Corporation to date and their relevance to Data Integration and Big Data. Further, you will learn about TOS (Talend Open Studio), its architecture and GUI, and how to install TOS.
Skills:
- Core ETL concepts
- Talend products and their features
- Design and implementation of Talend Open Studio
Topics:
- Working with ETL
- Rise of Big Data
- Role of Open Source ETL Technologies in Big Data
- Comparison with other market leader tools in ETL domain
- Importance of Talend (Why Talend)
- Talend and its Products
- Introduction of Talend Open Studio
- TOS for Data Integration
- GUI of TOS with Demo
Hands-on/Demo:
- Creating a basic job
Working with Talend Open Studio for DI
Learning Objectives: In this module of the Talend course, you will learn to work with the various types of data sources and target systems supported by Talend, work with metadata, and read/write popular CSV/delimited and fixed-width files. You will connect to a database to read/write/update data, read complex source systems like Excel and XML, and use some of the basic components like tLogRow and tMap in TOS.
Skills:
- Create jobs with different components and link them
- Read and write files of various format
- Work with Database
Topics:
- Launching Talend Studio
- Working with different workspace directories
- Working with projects
- Creating and executing jobs
- Connection types and triggers
- Most frequently used Talend components [tJava, tLogRow, tMap]
- Read & Write Various Types of Source/Target Systems
- Working with files [CSV, XLS, XML, Positional]
- Working with databases [MySQL DB]
- Metadata management
Hands-on/Demo/Use-case:
- Creating a Business Model
- Adding Components to a Job
- Connecting the Components
- Reading and writing Delimited File
- Reading and writing Positional File
- Reading and writing XML and Xls/Xlsx Files
- Connecting Database(MySQL)
- Retrieving Schema from the Database
- Reading from Database Metadata
- Retrieving data from a file and inserting it into the Database
- Deleting data from Database
- Working with Logs and Error
Basic Transformations in Talend
Learning Objectives: In this module of Talend Training, you will understand Data Mapping and Transformations using TOS. In addition, you will learn how to filter and join various Data Sources using lookups and search and sort through them.
Skills:
- Create and use context variables
- Mapping and Transformations
- Work with components like tFilter, tJoin, tSortRow, tReplicate, tSplit, Lookup
Topics:
- Context Variables
- Using Talend components
- tJoin
- tFilter
- tSortRow
- tAggregateRow
- tReplicate
- tSplit
- Lookup
- tRowGenerator
- Accessing job level/ component level information within the job
- SubJob (using tRunJob, tPreJob, tPostJob)
Hands-on/Demo/Use-case:
- Embedding Context Variables
- Adding different environments
- Data Mapping using tMap
- Using functions in Talend
- tJava
- tSortRow
- tAggregateRow
- tReplicate
- tFilter
- tSplit
- tRowGenerator
- Perform Lookup operations using tJoin
- Creating SubJob (using tRunJob, tPreJob, tPostJob)
Advance Transformations and Executing Jobs remotely in Talend
Learning Objectives: In this module of the Talend Certification training, you will understand transformations and the various steps involved in looping jobs in Talend, ways to search for files in a directory, and how to process them in a sequence. You will also learn to work with FTP connections, export and import jobs, run jobs remotely, and parameterize them from the command line.
Skills:
- Use various file components like tFileList, tFileCopy, tFileExists, tFileDelete, tFileArchive
- Handle logs and errors
- Cast data types using tConvert and tMap expression builder
- Iterate components using tLoop
- Store and retrieve files from FTP
- Remotely access Talend
Topics:
- Various components of file management (like tFileList, tFileArchive, tFileTouch, tFileDelete)
- Error Handling [tWarn, tDie]
- Type Casting (convert datatypes among source-target platforms)
- Looping components (like tLoop, tForeach)
- Using FTP components (like tFTPFileList, tFTPFileExists, tFTPGet, tFTPPut)
- Exporting and Importing Talend jobs
- How to schedule and run Talend DI jobs externally (using Command line)
- Parameterizing a Talend job from command line
Hands-on/Demo/Use-case:
- Implementing File Management (like tFileList, tFileArchive, tFileTouch, tFileDelete)
- Type Casting (tConvert and tMap(using Expression Builder))
- Looping components (like tLoop, tForeach)
- Using FTP components (like tFTPFileList, tFTPFileExists, tFTPGet, tFTPPut)
- Exporting and Importing Talend Jobs
- Parameterizing a Talend Job from command line
Big Data and Hadoop with Talend
Learning Objectives: In this module of Talend Training, you will learn about Big Data and Hadoop concepts, such as HDFS (Hadoop Distributed File System) Architecture, MapReduce, leveraging Big Data through Talend and Talend & Big Data Integration. Learn to set up and use the Talend Open Studio for Big Data. In addition, you will learn to use Big Data connectors in TOS (Talend offers some 800+ connectors for Big Data environment) and access Hadoop Ecosystem from Talend.
Skills:
- Understand scope of Talend Open Studio for Big Data
- Integrate Hadoop HDFS and Talend
- Use Hadoop operations like Map and Aggregate through TOS Big Data
- Perform multiple analyses and store results in HDFS
Topics:
- Big Data and Hadoop
- HDFS and MapReduce
- Benefits of using Talend with Big Data
- Integration of Talend with Big Data
- HDFS commands Vs Talend HDFS utility
- Big Data setup using Hortonworks Sandbox in your personal computer
- Explaining the TOS for Big Data Environment
Hands-on/Demo/Use-case:
- Creating a Project and a Job
- Adding Components in a Job
- Connecting to HDFS
- `Putting` files on HDFS
- Using tMap, tAggregate functions
Hive in Talend
Learning Objectives: In this module of Talend Certification Training, you will learn Hive concepts and the setup of Hive environment in Talend. You will learn how to use Hive Big Data connectors in TOS and implement Use Cases using Hive in Talend.
Skills:
- Integrate Hive with TOS Big Data
- Perform complex Hive queries in Talend
Topics:
- Hive and It’s Architecture
- Connecting to Hive Shell
- Set connection to Hive database using Talend
- Create Hive Managed and external tables through Talend
- Load and Process Hive data using Talend
- Transform data from Hive using Talend
Hands-on/Demo/Use-case:
- Process and transform data from Hive
- Load data from HDFS & Local File Systems to Hive Table using Hive Shell
- Execute the HiveQL query using Talend
Pig and Kafka in Talend
Learning Objectives: In this module of the Talend course, you will learn Pig concepts, the setup of the Pig environment in Talend, and the Pig Big Data connectors in TOS for Big Data, and implement use cases using Pig in Talend. You will also be given an insight into Apache Kafka, its architecture, and its integration with Talend through a real-life use case.
Skills:
- Integrate Talend projects with Pig and Kafka
- Use Pig for scripting and Kafka for streaming jobs in TOS Big Data
- Use TOS Big Data for running Pig and Kafka along with DI, Hadoop HDFS, and Hive
Topics:
- Pig Environment in Talend
- Pig Data Connectors
- Integrate Personalized Pig Code into a Talend job
- Apache Kafka
- Kafka Components in TOS for Big data
Hands-on/Demo/Use-case:
- Use Pig and Kafka connectors in Talend
End to End Project in Talend
Learning Objectives: In this module of Talend Training, you will be developing a Project using Talend DI and Talend BD with MySQL, Hadoop, HDFS, Hive, Pig, and Kafka.
6. Apache Kafka Certification Training
Introduction to Big Data and Apache Kafka
Goal: In this module, you will understand where Kafka fits in the Big Data space, and Kafka Architecture. In addition, you will learn about Kafka Cluster, its Components, and how to Configure a Cluster
Skills:
- Kafka Concepts
- Kafka Installation
- Configuring Kafka Cluster
Objectives: At the end of this module, you should be able to:
- Explain what is Big Data
- Understand why Big Data Analytics is important
- Describe the need of Kafka
- Know the role of each Kafka Components
- Understand the role of ZooKeeper
- Install ZooKeeper and Kafka
- Classify different type of Kafka Clusters
- Work with Single Node-Single Broker Cluster
Topics:
- Introduction to Big Data
- Big Data Analytics
- Need for Kafka
- What is Kafka?
- Kafka Features
- Kafka Concepts
- Kafka Architecture
- Kafka Components
- ZooKeeper
- Where is Kafka Used?
- Kafka Installation
- Kafka Cluster
- Types of Kafka Clusters
- Configuring Single Node Single Broker Cluster
Hands on:
- Kafka Installation
- Implementing Single Node-Single Broker Cluster
Kafka Producer
Goal: Kafka Producers send records to topics. The records are sometimes referred to as Messages. In this Module, you will work with different Kafka Producer APIs.
Skills:
- Configure Kafka Producer
- Constructing Kafka Producer
- Kafka Producer APIs
- Handling Partitions
Objectives:
At the end of this module, you should be able to:
- Construct a Kafka Producer
- Send messages to Kafka
- Send messages Synchronously & Asynchronously
- Configure Producers
- Serialize Using Apache Avro
- Create & handle Partitions
Topics:
- Configuring Single Node Multi Broker Cluster
- Constructing a Kafka Producer
- Sending a Message to Kafka
- Producing Keyed and Non-Keyed Messages
- Sending a Message Synchronously & Asynchronously
- Configuring Producers
- Serializers
- Serializing Using Apache Avro
- Partitions
Hands On:
- Working with Single Node Multi Broker Cluster
- Creating a Kafka Producer
- Configuring a Kafka Producer
- Sending a Message Synchronously & Asynchronously
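A minimal Java sketch of the producer topics above: configuring a producer, then sending one message synchronously and one asynchronously with a callback. The broker address, topic name, keys and values are illustrative.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class SimpleProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // broker address is illustrative
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");                            // wait for full acknowledgement

        Producer<String, String> producer = new KafkaProducer<>(props);

        // Synchronous send: block until the broker acknowledges the record
        RecordMetadata meta = producer
                .send(new ProducerRecord<>("orders", "order-1", "laptop"))
                .get();
        System.out.println("Written to partition " + meta.partition()
                + " at offset " + meta.offset());

        // Asynchronous send: the callback fires when the send completes or fails
        producer.send(new ProducerRecord<>("orders", "order-2", "phone"),
                (metadata, exception) -> {
                    if (exception != null) exception.printStackTrace();
                    else System.out.println("Async write at offset " + metadata.offset());
                });

        producer.flush();
        producer.close();
    }
}
```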
Kafka Consumer
Goal: Applications that need to read data from Kafka use a Kafka Consumer to subscribe to Kafka topics and receive messages from these topics. In this module, you will learn to construct Kafka Consumer, process messages from Kafka with Consumer, run Kafka Consumer and subscribe to Topics
Skills:
- Configure Kafka Consumer
- Kafka Consumer API
- Constructing Kafka Consumer
Objectives: At the end of this module, you should be able to:
- Perform Operations on Kafka
- Define Kafka Consumer and Consumer Groups
- Explain how Partition Rebalance occurs
- Describe how Partitions are assigned to Kafka Broker
- Configure Kafka Consumer
- Create a Kafka consumer and subscribe to Topics
- Describe & implement different Types of Commit
- Deserialize the received messages
Topics:
- Consumers and Consumer Groups
- Standalone Consumer
- Consumer Groups and Partition Rebalance
- Creating a Kafka Consumer
- Subscribing to Topics
- The Poll Loop
- Configuring Consumers
- Commits and Offsets
- Rebalance Listeners
- Consuming Records with Specific Offsets
- Deserializers
Hands-On:
- Creating a Kafka Consumer
- Configuring a Kafka Consumer
- Working with Offsets
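A matching consumer sketch: configuring a consumer group, subscribing to a topic, the poll loop, and a manual synchronous commit. The broker address, group id and topic are placeholders, and the poll(Duration) call assumes a Kafka 2.x client.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");     // broker address is illustrative
        props.put("group.id", "order-readers");               // consumer group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");             // commit offsets manually

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("orders"));

        try {
            // The poll loop: fetch batches of records and process them
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
                consumer.commitSync();   // synchronous commit after processing the batch
            }
        } finally {
            consumer.close();
        }
    }
}
```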
Kafka Internals
Goal: Apache Kafka provides a unified, high-throughput, low-latency platform for handling real-time data feeds. Learn more about tuning Kafka to meet your high-performance needs.
Skills:
- Kafka APIs
- Kafka Storage
- Configure Broker
Objectives:
At the end of this module, you should be able to:
- Understand Kafka Internals
- Explain how Replication works in Kafka
- Differentiate between In-sync and Out-of-sync Replicas
- Understand the Partition Allocation
- Classify and Describe Requests in Kafka
- Configure Broker, Producer, and Consumer for a Reliable System
- Validate System Reliabilities
- Configure Kafka for Performance Tuning
Topics:
- Cluster Membership
- The Controller
- Replication
- Request Processing
- Physical Storage
- Reliability
- Broker Configuration
- Using Producers in a Reliable System
- Using Consumers in a Reliable System
- Validating System Reliability
- Performance Tuning in Kafka
Hands On:
- Create topic with partition & replication factor 3 and execute it on multi-broker cluster
- Show fault tolerance by shutting down 1 Broker and serving its partition from another broker
Kafka Cluster Architectures & Administering Kafka
Goal: A Kafka Cluster typically consists of multiple brokers to maintain load balance. ZooKeeper is used for managing and coordinating Kafka brokers. Learn about Kafka Multi-Cluster Architectures, Kafka Brokers, Topics, Partitions, Consumer Groups, Mirroring, and ZooKeeper Coordination in this module.
Skills:
- Administer Kafka
Objectives:
At the end of this module, you should be able to
- Understand Use Cases of Cross-Cluster Mirroring
- Learn Multi-cluster Architectures
- Explain Apache Kafka’s MirrorMaker
- Perform Topic Operations
- Understand Consumer Groups
- Describe Dynamic Configuration Changes
- Learn Partition Management
- Understand Consuming and Producing
- Explain Unsafe Operations
Topics:
- Use Cases - Cross-Cluster Mirroring
- Multi-Cluster Architectures
- Apache Kafka’s MirrorMaker
- Other Cross-Cluster Mirroring Solutions
- Topic Operations
- Consumer Groups
- Dynamic Configuration Changes
- Partition Management
- Consuming and Producing
- Unsafe Operations
Hands on:
- Topic Operations
- Consumer Group Operations
- Partition Operations
- Consumer and Producer Operations
Kafka Monitoring and Kafka Connect
Goal: Learn about the Kafka Connect API and Kafka Monitoring. Kafka Connect is a scalable tool for reliably streaming data between Apache Kafka and other systems.
Skills:
- Kafka Connect
- Metrics Concepts
- Monitoring Kafka
Objectives: At the end of this module, you should be able to:
- Explain the Metrics of Kafka Monitoring
- Understand Kafka Connect
- Build Data pipelines using Kafka Connect
- Understand when to use Kafka Connect vs Producer/Consumer API
- Perform File source and sink using Kafka Connect
Topics:
- Considerations When Building Data Pipelines
- Metric Basics
- Kafka Broker Metrics
- Client Monitoring
- Lag Monitoring
- End-to-End Monitoring
- Kafka Connect
- When to Use Kafka Connect?
- Kafka Connect Properties
Hands on:
- Kafka Connect
Kafka Stream Processing
Goal: Learn about the Kafka Streams API in this module. Kafka Streams is a client library for building mission-critical real-time applications and microservices, where the input and/or output data is stored in Kafka Clusters.
Skills:
- Stream Processing using Kafka
Objectives:
At the end of this module, you should be able to:
- Describe What is Stream Processing
- Learn Different types of Programming Paradigm
- Describe Stream Processing Design Patterns
- Explain Kafka Streams & Kafka Streams API
Topics:
- Stream Processing
- Stream-Processing Concepts
- Stream-Processing Design Patterns
- Kafka Streams by Example
- Kafka Streams: Architecture Overview
Hands on:
- Kafka Streams
- Word Count Stream Processing
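The canonical word-count topology is a good sketch of the Streams API covered here. The application id and topic names below are assumptions, and the input and output topics must already exist on the cluster.

```java
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Read lines from the input topic, split into words, group by word, and count
        KStream<String, String> lines = builder.stream("text-input");
        KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\s+")))
                .groupBy((key, word) -> word)
                .count();

        // Write the running counts to the output topic
        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Close the topology cleanly on shutdown
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```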
Integration of Kafka With Hadoop, Storm and Spark
Goal: In this module, you will learn about Apache Hadoop, Hadoop Architecture, Apache Storm, Storm Configuration, and Spark Ecosystem. In addition, you will configure Spark Cluster, Integrate Kafka with Hadoop, Storm, and Spark.
Skills:
- Kafka Integration with Hadoop
- Kafka Integration with Storm
- Kafka Integration with Spark
Objectives:
At the end of this module, you will be able to:
- Understand What is Hadoop
- Explain Hadoop 2.x Core Components
- Integrate Kafka with Hadoop
- Understand What is Apache Storm
- Explain Storm Components
- Integrate Kafka with Storm
- Understand What is Spark
- Describe RDDs
- Explain Spark Components
- Integrate Kafka with Spark
Topics:
- Apache Hadoop Basics
- Hadoop Configuration
- Kafka Integration with Hadoop
- Apache Storm Basics
- Configuration of Storm
- Integration of Kafka with Storm
- Apache Spark Basics
- Spark Configuration
- Kafka Integration with Spark
Hands On:
- Kafka integration with Hadoop
- Kafka integration with Storm
- Kafka integration with Spark
Integration of Kafka With Talend and Cassandra
Goal: Learn how to integrate Kafka with Flume, Cassandra and Talend.
Skills:
- Kafka Integration with Flume
- Kafka Integration with Cassandra
- Kafka Integration with Talend
Objectives:
At the end of this module, you should be able to:
- Understand Flume
- Explain Flume Architecture and its Components
- Setup a Flume Agent
- Integrate Kafka with Flume
- Understand Cassandra
- Learn Cassandra Database Elements
- Create a Keyspace in Cassandra
- Integrate Kafka with Cassandra
- Understand Talend
- Create Talend Jobs
- Integrate Kafka with Talend
Topics:
- Flume Basics
- Integration of Kafka with Flume
- Cassandra Basics such as KeySpace and Table Creation
- Integration of Kafka with Cassandra
- Talend Basics
- Integration of Kafka with Talend
Hands On:
- Kafka demo with Flume
- Kafka demo with Cassandra
- Kafka demo with Talend
Kafka In-Class Project
Goal: In this module, you will work on a project that gathers messages from multiple sources.
Scenario:
In the E-commerce industry, you must have seen how frequently the catalog changes. The deadliest problem companies face is: “How do we keep our inventory and prices consistent?”
Prices appear in several places on Amazon, Flipkart or Snapdeal: the Search page, the Product Description page, and ads on Facebook/Google. You will find mismatches in price and availability between them. From the user’s point of view this is very disappointing: the user spends extra time finding a better product and, in the end, does not purchase it simply because of the inconsistency.
Here you have to build a system that is consistent in nature. For example, if you are getting product feeds either through flat files or an event stream, you have to make sure you don’t lose any events related to a product, especially inventory and price.
Price and availability should always be consistent, because the product may already be sold, the seller may not want to sell it anymore, or there may be some other reason. However, attributes like name and description don’t cause as much noise if they are not updated on time.
Problem Statement
You are given a set of sample products. You have to consume them and push the products to Cassandra/MySQL once they arrive in the consumer. You have to save the below-mentioned fields in Cassandra.
1. PogId
2. Supc
3. Brand
4. Description
5. Size
6. Category
7. Sub Category
8. Country
9. Seller Code
In MySQL, you have to store
1. PogId
2. Supc
3. Price
4. Quantity
Certification Project
This Project enables you to gain Hands-On experience on the concepts that you have learned as part of this Course.
You can email the solution to our Support team within 2 weeks from the Course Completion Date. Edureka will evaluate the solution and award a Certificate with a Performance-based Grading.
Problem Statement:
You are working for a website techreview.com that provides reviews for different technologies. The company has decided to include a new feature in the website which will allow users to compare the popularity or trend of multiple technologies based on Twitter feeds. They want this comparison to happen in real time. So, as a big data developer of the company, you have been tasked with implementing the following:
• Near Real Time Streaming of the data from Twitter for displaying the last minute's count of people tweeting about a particular technology.
• Store the twitter count data into Cassandra.
7. Big Data Masters Program Capstone Project
Project Details
Retail Case Study
The capstone project will provide you with a business case. You will need to solve this by applying all the skills you’ve learned in the courses of the master’s program. This Capstone project will require you to apply the following skills:
• Data Modelling in Cassandra
• Using Kafka as real time messaging system
• Stream data from different sources using Spark
• Analysing Data using Spark
• Leveraging NoSQL database such as Cassandra as a part of data storage strategy
• Using MapReduce for analysis of the data
• Data Warehousing and Data exploration using Hive