Tuesday, July 4, 2017

Hadoop Basics




HADOOP:


Hadoop has two major components HDFS and MapReduce. HDFS is storage and MapReduce is programming framework.


Hadoop is a framework which allows us to perform perform parallel and distributed computations on large data sets. Hadoop has fundamentally two units - storage and processing. You get HDFS as a storage service with Hadoop to store large data sets. Basically, HDFS is a distributed file system that allows you to store large file across the Hadoop cluster.


 


Hadoop Common: The common utilities that support the other Hadoop modules.


Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.


Hadoop YARN: A framework for job scheduling and cluster resource management.


Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.


 


APACHE SPARK:


Spark is another execution framework. Like MapReduce, it works with the file system to distribute your data across the cluster, and process that data in parallel. Like MapReduce, it also takes a set of instructions from an application written by a developer. MapReduce was generally coded from Java; Spark supports not only Java, but also Python and Scala, which is a newer language that contains some attractive properties for manipulating data.


 


What are the Spark use cases?


Databricks (a company founded by the creators of Apache Spark) lists the following cases for Spark:


  • Data integration and ETL
  • Interactive analytics or business intelligence
  • High performance batch computation
  • Machine learning and advanced analytics
  • Real-time stream processing
     
    SCALA:
    Scala is an acronym for “Scalable Language”. To some, Scala feels like a scripting language. Its syntax is concise and low ceremony; its types get out of the way because the compiler can infer them.
    Scala is a pure-bred object-oriented language. Conceptually, every value is an object and every operation is a method-call.
    The language supports advanced component architectures through classes and traits.
    Many traditional design patterns in other languages are already natively supported. For instance, singletons are supported through object definitions and visitors are supported through pattern matching.
    Using implicit classes, Scala even allows you to add new operations to existing classes, no matter whether they come from Scala or Java!
     
    HIVE:
    Apache Hive is considered the defacto standard for interactive SQL queries over petabytes of data in Hadoop.
    Hadoop was built to organize and store massive amounts of data of all shapes, sizes and formats. Because of Hadoop’s “schema on read” architecture, a Hadoop cluster is a perfect reservoir of heterogeneous data—structured and unstructured—from a multitude of sources.
    Data analysts use Hive to query, summarize, explore and analyze that data, then turn it into actionable business insight.
     


Feature
Description
Familiar
Query data with a SQL-based language
Fast
Interactive response times, even over huge datasets
Scalable and Extensible
As data variety and volume grows, more commodity machines can be added, without a corresponding reduction in performance
Compatible
Works with traditional data integration and data analytics tools.


 


  • The tables in Hive are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units. Databases are comprised of tables, which are made up of partitions. Data can be accessed via a simple query language and Hive supports overwriting or appending data.
  • Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory. Data within partitions can be further broken down into buckets.
  • Hive supports all the common primitive data formats such as BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING, TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays.
     
    MONGO DB:


  • MongoDB stores data in flexible, JSON-like documents, meaning fields can vary from document to document and data structure can be changed over time
  • The document model maps to the objects in your application code, making data easy to work with
  • Ad hoc queries, indexing, and real time aggregation provide powerful ways to access and analyze your data
  • MongoDB is a distributed database at its core, so high availability, horizontal scaling, and geographic distribution are built in and easy to use
  • MongoDB is free and open-source, published under the GNU Affero General Public License
     
     
     

Friday, April 7, 2017

Save raw data in a sheet in existing excel workbook

I want to import a CSV file and save it in a Sheet in my existing excel workbook






Sub ImportRawData()


'Import Raw data


Const strFileName = "C:\Data\RawData.xlsx" '<<--Change this secotion for all region
    Dim wbkS As Workbook
    Dim wshS As Worksheet
    Dim wshT As Worksheet
   
    'Delete any Raw Data  if present previously.
    Application.DisplayAlerts = False
    For Each Sheet In ActiveWorkbook.Worksheets
          If Sheet.Name = "RawData" Then
          Sheet.Delete
          End If
    Next Sheet
    Application.DisplayAlerts = True
   
    'Insert RawData to new sheet
    Set wshT = Worksheets.Add(After:=Worksheets(Worksheets.Count))
    Set wbkS = Workbooks.Open(Filename:=strFileName)
    Set wshS = wbkS.Worksheets(1)
    wshS.UsedRange.Copy Destination:=wshT.Range("A1")
    wbkS.Close SaveChanges:=False
    ActiveSheet.Name = "RawData"
   
    'Save the Last Row and Last Column of raw Data into a varible
   
    Dim lC As Long
    Dim LR As Long
    With ThisWorkbook.Sheets("RawData")
        lC = .Cells(1, .Columns.Count).End(xlToLeft).Column
        LR = .Cells(.Rows.Count, "A").End(xlUp).Row
    End With
   
    'Secretly save the values in a location in Automation sheet invisible to users.
   
    Sheets("Automation").Select
    ThisWorkbook.Sheets("Automation").Range("B34").Value = lC
    ThisWorkbook.Sheets("Automation").Range("B36").Value = LR
    Sheets("Automation").Select
      
End Sub

Finding difference two columns untill empty cells encountered


This piece of code will find difference between two column until it finds empty cell.




Sheets("Collateral Type Codes").Select
Sheets("Collateral Type Codes").Cells(3, 7).Value = "Difference"
For i = 4 To 100000
If IsEmpty(Sheets("Collateral Type Codes").Cells(i, 2)) Then
GoTo Label2
End If
Sheets("Collateral Type Codes").Cells(i, 7).Value = Sheets("Collateral Type Codes").Cells(i, 3).Value - Sheets("Collateral Type Codes").Cells(i, 4).Value
Next i

Label2:
ActiveSheet.Range("G3").Select
    Selection.AutoFilter
    ActiveSheet.Columns("D:D").AutoFilter Field:=1, Criteria1:="<>0", Operator:=xlFilterValues

Creating a Pivot table using VBA

We are trying to create a vba script, where we are taking input for the Row lable, Column label,  and Value field from a metadata sheet. The Raw data is a separate sheet from which we are generating the pivots.


Below are the steps:




Sub CreatePivot()


Dim PSheet As Worksheet
Dim DSheet As Worksheet
Dim PCache As PivotCache
Dim PTable As PivotTable
Dim PRange As Range
Dim lastRow As Long
Dim lastCol As Long



'Delete Preivous Pivot Table Worksheet &amp; Insert a New Blank


Worksheet With Same Name
On Error Resume Next
Application.DisplayAlerts = False
Worksheets("Collateral Type Codes").Delete
Sheets("RawData").Select
Sheets.Add After:=ActiveSheet
ActiveSheet.Name = "Collateral Type Codes"
Application.DisplayAlerts = True
Set PSheet = Worksheets("Collateral Type Codes")
Set DSheet = Worksheets("RawData")



'Define Data Range


lastRow = DSheet.Cells(Rows.Count, 1).End(xlUp).Row
lastCol = DSheet.Cells(1, Columns.Count).End(xlToLeft).Column
Set PRange = DSheet.Cells(1, 1).Resize(lastRow, lastCol)



'Define Pivot Cache


Set PCache = ActiveWorkbook.PivotCaches.Create _
(SourceType:=xlDatabase, SourceData:=PRange). _
CreatePivotTable(TableDestination:=PSheet.Cells(2, 2), _
TableName:="MerivalPivotTable")

Dim Col1 As String
Dim Row1 As String
Dim RowArray() As String
Dim ColArray() As String
Dim Value As String
Dim ValueArray() As String
Dim Func

Row1 = Sheets("PivotMetaData").Cells(6, 3).Value
Col1 = Sheets("PivotMetaData").Cells(6, 4).Value
Value = Sheets("PivotMetaData").Cells(6, 5).Value
Func = Sheets("PivotMetaData").Cells(6, 6).Value

RowArray = Split(Row1, ",")
ColArray = Split(Col1, ",")
ValArray = Split(Value, ",")
Set PTable = PCache.CreatePivotTable(TableDestination:=PSheet.Cells(1, 1), TableName:="MerivalPivotTable")



'Insert Row Fields


For i = 0 To UBound(RowArray)
With ActiveSheet.PivotTables("MerivalPivotTable").PivotFields(RowArray(i))
 .Orientation = xlRowField
 .Position = i + 1
End With
Next I



'Insert Column Fields


For j = 0 To UBound(ColArray)
With ActiveSheet.PivotTables("MerivalPivotTable").PivotFields(ColArray(j))
 .Orientation = xlColumnField
 .Position = j + 1
End With
Next j



'Insert Data Field


If Func = "Sum" Then
For k = 0 To UBound(ValArray)
With ActiveSheet.PivotTables("MerivalPivotTable").PivotFields(ValArray(k))
 .Orientation = xlDataField
 .Position = k + 1
 .Function = xlSum
 .NumberFormat = "#,##0"
 End With
Next k
End If



If Func = "Count" Then
For k = 0 To UBound(ValArray)
With ActiveSheet.PivotTables("MerivalPivotTable").PivotFields(ValArray(k))
 .Orientation = xlDataField
 .Position = k + 1
 .Function = xlCount
 .NumberFormat = "#,##0"
 End With
Next k
End If



If Func = "Average" Then
For k = 0 To UBound(ValArray)
With ActiveSheet.PivotTables("MerivalPivotTable").PivotFields(ValArray(k))
 .Orientation = xlDataField
 .Position = k + 1
 .Function = xlAverage
 .NumberFormat = "#,##0"
 End With
Next k
End If



If Func = "Max" Then
For k = 0 To UBound(ValArray)
With ActiveSheet.PivotTables("MerivalPivotTable").PivotFields(ValArray(k))
 .Orientation = xlDataField
 .Position = k + 1
 .Function = xlMax
 .NumberFormat = "#,##0"
End With
Next k
End If



If Func = "Min" Then
For k = 0 To UBound(ValArray)
With ActiveSheet.PivotTables("MerivalPivotTable").PivotFields(ValArray(k))
 .Orientation = xlDataField
 .Position = k + 1
 .Function = xlMin
 .NumberFormat = "#,##0"
End With
Next k
End If



If Func = "Product" Then
For k = 0 To UBound(ValArray)
With ActiveSheet.PivotTables("MerivalPivotTable").PivotFields(ValArray(k))
 .Orientation = xlDataField
 .Position = k + 1
 .Function = xlProduct
 .NumberFormat = "#,##0"
End With
Next k

End If

How to export few sheets of current workbook into a new workbook:

Here we are trying to export few sheets from our current workbook and save it in a new workbook. The new workbook should automatically replace any already available workbook (if present).


Sub ExportSheets()


Application.DisplayAlerts = False
ThisWorkbook.Sheets(Array("Sheet1", "Sheet2", "Sheet3")).Move
ActiveWorkbook.SaveAs "
C:\users\AB\NewWorkbook.xlsx", FileFormat:=51, AccessMode:=xlExclusive, ConflictResolution:=Excel.XlSaveConflictResolution.xlLocalSessionChanges
ActiveWorkbook.Close (True)



End Sub


Here we are using the move option, which means it will cut the sheets from the existing workbook and paste it in a new workbook and save it by the name NewWorkbook.xlsx.


The clause :
ConflictResolution:=Excel.XlSaveConflictResolution.xlLocalSessionChanges
ensures that the sheet will always overwrite any existing file by the same name.


And as we are keeping Application.DisplayAlerts = False
Excel will not prompt users with messages.

Tuesday, April 4, 2017

Import a CSV file into excel without User prompt

I was working in an end to end automation project, where I required to import a CSV file into my existing excel workbook and name it Raw Data. based on this Raw Data sheet I intend to Draw multiple pivots. This importing of CSV and generating pivots required to be iterated for nine different regions. This is a series of Blogs through which I shall be explaining the entire automation.


So without any much delay lets start with the initial part of importing of CSV file into our excel.


Sub ImportCSV()
   
    Const strFileName = "C:\Users\AB\Downloads\Sample.csv"
    Dim wbkS As Workbook
    Dim wshS As Worksheet
    Dim wshT As Worksheet
   
    'Delete any Raw Data if present previously.
    Application.DisplayAlerts = False
    For Each Sheet In ActiveWorkbook.Worksheets
          If Sheet.Name = "RawData" Then
          Sheet.Delete
          End If
    Next Sheet
    Application.DisplayAlerts = True
   
    'Insert RawData from Sample.csv
    Set wshT = Worksheets.Add(After:=Worksheets(Worksheets.Count))
    Set wbkS = Workbooks.Open(Filename:=strFileName)
    Set wshS = wbkS.Worksheets(1)
    wshS.UsedRange.Copy Destination:=wshT.Range("A1")
    wbkS.Close SaveChanges:=False
    ActiveSheet.Name = "RawData"
End Sub



By this code sample we are importing Sample.csv from a specified location and placing it in our existing workbook and naming the sheet as RawData. We basically opening Sample.csv, copying data from it and then closing it.




Checking if the CSV file exists of not:





Sub ImportCSV()
   
    Const strFileName = "C:\Users\AB\Downloads\Sample.csv"

 
    If Dir(strFileName) <> "" Then
   
       
    Dim wbkS As Workbook
    Dim wshS As Worksheet
    Dim wshT As Worksheet
   
    'Delete any Raw Data if present previously.
    Application.DisplayAlerts = False
    For Each Sheet In ActiveWorkbook.Worksheets
          If Sheet.Name = "RawData" Then
          Sheet.Delete
          End If
    Next Sheet
    Application.DisplayAlerts = True
   
    'Insert RawData from Sample.csv
    Set wshT = Worksheets.Add(After:=Worksheets(Worksheets.Count))
    Set wbkS = Workbooks.Open(Filename:=strFileName)
    Set wshS = wbkS.Worksheets(1)
    wshS.UsedRange.Copy Destination:=wshT.Range("A1")
    wbkS.Close SaveChanges:=False
    ActiveSheet.Name = "RawData"
     
   
Else
    Exit Sub
End If
      
End Sub